nimforum mirror - Noob question: proper way to read binary files byte by byte

vimal73700 [2019-02-24T15:40:40+01:00]

Hi all,

What is the best way to read a binary file byte by byte?

import os, streams

var fs_pos = 0
var fs = newFileStream(paramStr(1), fmRead)

while true:
  var one_char = fs.readChar()
  echo one_char
  if (one_char == '\0'):
    echo "breaking at " & $fs_pos
    break
  fs_pos += 1

streams.readChar() returns the same '\0' for a null byte as it does at EOF, so the loop above cannot tell the two apart. Please advise.

r3c [2019-02-24T17:15:39+01:00]

check this out

mashingan [2019-02-24T17:28:14+01:00]

Use atEnd to check whether the stream has ended, and use getPosition to get the current position.

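import os, streams

# newFileStream returns nil if the file cannot be opened
var fs = newFileStream(paramStr(1), fmRead)
if fs.isNil: quit("cannot open " & paramStr(1))

# atEnd tells EOF apart from a '\0' byte in the data
while not fs.atEnd:
  var one_char = fs.readChar()
  echo one_char

fs.close()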
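Another way to tell a null byte from EOF, sketched here as an alternative to atEnd, is to rely on the byte count returned by readData: it returns 0 once nothing is left to read. The proc name countBytes and the test string are made up for illustration.

```nim
import streams

proc countBytes(s: Stream): tuple[total, zeros: int] =
  ## Reads `s` one byte at a time. A 0 byte is counted as data,
  ## it is never mistaken for the end of the stream.
  var b: uint8
  # readData returns the number of bytes actually read,
  # so anything other than 1 here means the stream is exhausted.
  while s.readData(addr b, 1) == 1:
    inc result.total
    if b == 0: inc result.zeros

when isMainModule:
  # Hypothetical in-memory input standing in for a binary file.
  let s = newStringStream("a\x00b\x00")
  let r = countBytes(s)
  echo r.total, " bytes, ", r.zeros, " zero bytes"
  s.close()
```

The same proc accepts a FileStream, since it only depends on the Stream interface.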

cblake [2019-02-24T20:57:11+01:00]

You can also use memfiles. There, reading and writing are the same as accessing memory. Besides being possibly simpler, since it presents an "as if you already read the whole file into a buffer" view, it may also be much more efficient, especially for byte-at-a-time operation, where other APIs might do a lot of behind-the-scenes work on a per-IO basis. Of course, to be usable as a MemFile, the data needs to be random access (e.g. on the disk, as opposed to a network socket, pipe, or some other unseekable input).
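As a rough sketch of that "file as memory" view, the following maps a file read-only and scans it by index. The proc name countZeroBytes is hypothetical, and the early return for empty files is an assumption on my part, since mapping a zero-length file may fail on some systems.

```nim
import memfiles, os

proc countZeroBytes(path: string): int =
  ## Maps the file read-only and scans it like an in-memory array.
  # Assumption: skip empty files rather than try to map zero bytes.
  if getFileSize(path) == 0: return 0
  var mf = memfiles.open(path, mode = fmRead)
  defer: mf.close()
  # MemFile exposes the mapping as `mem` (a raw pointer) and `size`.
  let data = cast[ptr UncheckedArray[byte]](mf.mem)
  for i in 0 ..< mf.size:   # stay strictly inside the mapped range
    if data[i] == 0: inc result

when isMainModule:
  if paramCount() >= 1:
    echo countZeroBytes(paramStr(1))
```

A null byte here is just another array element, so the EOF ambiguity from the original question does not arise; the loop bound comes from the file size.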

vimal73700 [2019-02-27T06:13:23+01:00]

Thank you for your wonderful suggestions. I ended up using memfiles, as suggested by @cblake, which reduced the runtime significantly. If I may ask, does the code below load the entire file content into memory?

import os, streams, memfiles

var fs = newMemMapFileStream(paramStr(1), fmRead)
while not fs.atEnd:
  echo fs.readChar()

mashingan [2019-02-27T09:59:40+01:00]

A memfile basically loads all the content into memory, correct me if I'm wrong.

federico3 [2019-02-27T10:56:25+01:00]

No, memfiles use the memory mapping mechanism provided by the OS (e.g. mmap). https://nim-lang.org/docs/memfiles.html

cblake [2019-02-27T12:46:28+01:00]

A short reply like this may be inadequate to explain virtual memory mechanisms if you have never heard of them before. That said, if you have heard of them in the past and forgotten, this may help.

The newMemMapFileStream will call memfiles.open with default flags. Default flags typically just look up the size of the file and create an address range in your process. When an address in that range is page faulted by the virtual memory hardware, the OS kernel's "page fault handler" will (transparently to your process) load 4k (or possibly larger) "pages" of memory, on demand, with the file contents for the corresponding spot. This is all fairly portable behavior.

So, in light of that, at the beginning of your loop, nothing will be "loaded". By the end of the loop, as much will be loaded as can fit in the RAM of your machine. The actual fact of the matter of "being loaded" depends upon the sometimes highly dynamic competition for physical memory among all the programs on a system. This is generally also true of any buffered IO mechanism when "swap files" or "page files" or partitions are enabled. Each page will be loaded for a little while or your program cannot make progress, but by the time you get to the end of your loop if the file is gigantic/larger than RAM or if some other process is demanding a lot of RAM then the beginning may no longer be "resident" in RAM.

Certain operating systems allow you to "tune" the on-demand loading behavior with "flag" arguments to the API that sets up this "auto-loading" mechanism. For example, Linux allows you to specify MAP_POPULATE, which will, in effect, pre-load the whole file into RAM before your program loop runs, without your program making the CPU dereference any of those file data addresses. You may want to do this, for example, if the persistent backing store is a magnetic spinning disk, the file is small, and you want to avoid "seeking" the disk head around. Similarly, on Unix, there are also the madvise/posix_madvise interfaces, which let a program advise the OS that memory accesses are likely to be sequential (your case) or random, or even specify certain ranges as candidates for preloading. These little tweaks tend to be very non-portable, though, and the default behavior probably does what you want.

If it does not do what you want, note that MemMapFileStream does not (presently) support adding "flags" to the OS mapping calls. I did recently improve the memfiles.open interface to allow just that. You might like the non-stream API better anyway. You can cast[ptr UncheckedArray[char]] the MemFile's pointer (its mem field) and just use the file as an array of bytes if you like. You do have to be careful not to overrun the end of the file. Another recent addition I got in allows toOpenArray(cast[ptr UncheckedArray[char]](ThePointer), 0, TheFileSize-1) style passing of such data to Nim procs expecting openArray[char] parameters.
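A minimal sketch of the toOpenArray style of passing might look like the following. The proc names countByte and countInFile are hypothetical, the empty-file check is an assumption, and it needs a Nim version recent enough to have toOpenArray for ptr UncheckedArray.

```nim
import memfiles, os

proc countByte(data: openArray[char], target: char): int =
  ## An ordinary proc that knows nothing about files or mappings.
  for c in data:
    if c == target: inc result

proc countInFile(path: string, target: char): int =
  # Assumption: skip empty files rather than try to map zero bytes.
  if getFileSize(path) == 0: return 0
  var mf = memfiles.open(path, mode = fmRead)
  defer: mf.close()
  let p = cast[ptr UncheckedArray[char]](mf.mem)
  # The last index is size-1, so the view never runs past the end of the file.
  result = countByte(toOpenArray(p, 0, mf.size - 1), target)

when isMainModule:
  if paramCount() >= 1:
    echo countInFile(paramStr(1), '\0')
```

The openArray is only a view onto the mapping, so it should not be used after the MemFile has been closed.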

Mirror of forum.nim-lang.org

4680 :: Noob question: proper way to read binary files byte by byte