I am very new to Nim and pulling my hair out trying to read a large file into a buffer in chunks and hash each chunk's content, like so:
import streams, murmur, strutils

const size = 1_048_576

var
  i = open("input")
  buf: array[size, char]
  fhash: BiggestInt = 0

while i.readBuffer(buf.addr, size) > 0:
  fhash += hash(buf)

echo fhash
i.close()
This obviously doesn't work, as hash expects a string, not an array of char. Casting the buffer to a string doesn't work either, since a string has a different internal representation (a terminating null character plus a length field). Reading the whole file into a string is not an option for my use case, as some of the files are hundreds of MBs to GBs in size and I am trying to write a memory-efficient algorithm. There is a closed thread that discusses a similar challenge, but to be honest, the tensor stuff there reads like Chinese to me. I would appreciate an easy-to-understand, easy-to-implement solution. If there is an existing library that does this sort of conversion, that's cool too.
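For reference, the only direct conversion I can think of is copying the raw bytes into a freshly allocated string, roughly like the sketch below (toString is a hypothetical helper, not from any library; n stands for the number of valid bytes in the buffer), but that costs an extra copy per chunk:

proc toString(buf: openArray[char]; n: int): string =
  ## Hypothetical helper: copy the first `n` bytes of `buf` into a
  ## real Nim string, so it gets a proper length field.
  result = newString(n)
  if n > 0:
    copyMem(addr result[0], unsafeAddr buf[0], n)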
Here is how I would do it:
import streams, murmur, strutils

const size = 1_048_576

var
  i = open("input")
  buf = newString(size)  # a string buffer satisfies hash(), unlike array[char]
  fhash: BiggestInt = 0

while true:
  buf.setLen(size)       # restore full capacity before each read
  let n = i.readChars(buf, 0, size)
  if n <= 0: break
  buf.setLen(n)          # trim so only the bytes actually read get hashed
  fhash += hash(buf)

echo fhash
i.close()
Hope this helps :)
+= is usually a bad way of mixing hash values. MurmurHash seems to use

  hash ← hash XOR k
  hash ← (hash ROL r2)
  hash ← hash × m + n

as the mixing step: https://en.wikipedia.org/wiki/MurmurHash
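A minimal Nim sketch of that mixing step, as one way to combine the per-chunk hashes (the constants r2, m and n are taken from the 32-bit MurmurHash3 pseudocode on that Wikipedia page; narrowing the chunk hash to uint32 is an assumption here, not part of the murmur package's API):

proc rol32(x, r: uint32): uint32 =
  ## Rotate-left on 32 bits.
  (x shl r) or (x shr (32'u32 - r))

proc mix(hash, k: uint32): uint32 =
  ## Murmur3-style mixing step: xor in the new value, rotate, multiply-add.
  const r2 = 13'u32
  const m  = 5'u32
  const n  = 0xe6546b64'u32
  result = hash xor k
  result = rol32(result, r2)
  result = result * m + n

With something like this, each chunk would be folded in with fhash = mix(fhash, chunkHash) instead of fhash += hash(buf).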
Changed to the following. I know it's not exactly how Murmur does it, but it seems to work reliably, producing duplicate lists identical to those from the previous version without the extra math.
fhash = fhash xor BiggestInt(hash(buf))
Thanks again @Araq for the suggestion. If anything, it made the code a lot cleaner and easier to understand.