As a toy project to learn a bit of Nim, I decided to create a hash calculator for files. I noticed there was an MD5 hash algorithm in the standard library, so it seemed straightforward. What I discovered is that for files larger than ~250 MB my implementation returns incorrect hash values. It's quite possible I'm doing something stupid, but it does return correct hashes for files under 250 MB. Also, I'm using the 32-bit version on Windows with MinGW, so a data type overflow seems like a reasonable assumption. Any insight would be welcome.
import md5
import os

proc calculateMD5Incremental(filename: string): string =
  const blockSize: int = 8192
  var
    c: MD5Context
    d: MD5Digest
    f: File
    bytesRead: int = 0
    buffer: array[blockSize, char]
    byteTotal: int = 0
  # read chunks of the file, calling update until all bytes have been read
  try:
    f = open(filename)
    md5Init(c)
    bytesRead = f.readBuffer(buffer.addr, blockSize)
    while bytesRead > 0:
      byteTotal += bytesRead
      md5Update(c, buffer, bytesRead)
      bytesRead = f.readBuffer(buffer.addr, blockSize)
    md5Final(c, d)
  except IOError:
    echo("File not found.")
  finally:
    if f != nil:
      close(f)
  result = $d

if paramCount() > 0:
  let arguments = commandLineParams()
  echo("MD5: ", calculateMD5Incremental(arguments[0]))
else:
  echo("Must pass filename.")
  quit(-1)
I'm trying to debug it. I remember hitting the exact same problem, but couldn't find the cause back then (and apparently forgot to report it): https://github.com/def-/nimutils/blob/e97cde616193a2092642a7150080a2a6ea1489bc/md5sum.nim
Edit: My current test is a file of binary zeros: size 268435455 works, size 268435456 doesn't, and the latter is exactly 256 MiB.
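For anyone who wants to reproduce this, a hypothetical helper (not from the thread) that writes a file of binary zeros; a small size is used here so it runs quickly, swap in the boundary sizes from above to reproduce the failure:

```nim
# Write `size` bytes of binary zeros to `filename`.
# (The post's boundary sizes are 268435455 and 268435456 bytes.)
proc writeZeros(filename: string, size: int64) =
  var f = open(filename, fmWrite)
  defer: close(f)
  var buf: array[8192, char]            # arrays are zero-initialized in Nim
  var remaining = size
  while remaining > 0:
    let n = int(min(remaining, buf.len.int64))
    discard f.writeBuffer(buf.addr, n)
    remaining -= n

writeZeros("zeros.bin", 1_048_576)      # 1 MiB; use 268_435_456 to reproduce
```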
It's indeed an overflow, looking at the state:
at byte 268435454:
STATE: -1889439815 -371250275 2058418518
COUNT: 2147483632 0
BUFFER: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
at byte 268435455:
STATE: -1889439815 -371250275 2058418518
COUNT: 2147483640 0
BUFFER: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
at byte 268435456:
STATE: -64320197 -902710823 -963762803
COUNT: -2147483648 1
BUFFER: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I guess that has to do with int32s being used instead of uint32s. Not sure how to fix it yet.
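The arithmetic matches the dumps above; a small sketch, reading COUNT as the message-length-in-bits counter that MD5 maintains:

```nim
# COUNT stores the message length in *bits*. With a signed 32-bit counter,
# the bit count overflows at 2^31 bits = 2^28 bytes = 256 MiB, which is
# exactly the boundary observed above.
let bytes = 268_435_456'i64            # first failing size, 256 MiB
let bits = bytes * 8                   # 2_147_483_648 = 2^31
doAssert bits > int32.high             # no longer fits in a signed 32-bit int
echo cast[int32](uint32(bits))         # wraps to -2147483648, as in the dump
```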
Edit2: Made a small change, now it works up to 536870912, but not 536870913, strange:
if c.count[0] <% (len shl 3): c.count[1] = c.count[1] +% 1'i32
Edit3: Converted the entire md5 module to unsigned integers; it works fine now (and is a bit faster). Maybe there is a bug in one of the non-overflowing signed integer operations (+%, -%, ...): https://github.com/Araq/Nim/pull/1847
Wow. Thanks for the quick fix!
It's definitely better. Now it returns correct hash values up to ~2 GB. Anything larger fails with "Error: unhandled exception: over- or underflow [OverflowError]". I much prefer this over an incorrect result!
The other thing is that this implementation is pretty slow. As a comparison I found an equivalent MD5 calculation for Go (ref: https://www.socketloop.com/tutorials/how-to-generate-checksum-for-file-in-go), which on a 2 GB file runs in 16 seconds compared to 120 seconds for the Nim version.
The Go version also doesn't seem to have the data type overflow issue; I ran it on a ~6 GB file with no problems.
Do you compile with nim -d:release c? For me Nim's md5 module is nearly as fast as md5sum from GNU coreutils.
Edit: I don't get an overflow. Are you compiling for 32 bit? If so, I think it's just your byteTotal that's overflowing; make it an int64 instead and the rest should work fine without any changes.
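A minimal sketch of that suggestion, assuming a 32-bit target where Nim's default `int` is 32 bits:

```nim
# Widening just the accumulator to int64 is enough: a running byte total
# wraps past 2 GiB in a 32-bit int, but the per-read counts still fit in
# a plain int.
var byteTotal: int64 = 0
let bytesRead = 8192                   # one full 8 KiB buffer per read
for _ in 1 .. 300_000:                 # simulate reading ~2.4 GB
  byteTotal += bytesRead
echo byteTotal                         # 2457600000, beyond 32-bit range
```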
Edit2: 4.4 GB file: GNU md5sum: 12.555 s, Nim md5 module: 16.944 s, Go checksum: 40.417 s
No, I did not. Compiling with -d:release makes a world of difference! It is now faster than the Go test app, as your numbers confirm.
Re: overflow: correct, I am compiling for 32 bit, and it was byteTotal overflowing. One note for newbies: in release mode integer types will silently wrap around on overflow unless you compile with overflow checking enabled (--overflowChecks:on).
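For illustration, a sketch of the difference (the exception is named OverflowError in the Nim of this thread and OverflowDefect in current Nim; the silent wrap is what happens with checks off):

```nim
# With overflow checking enabled (the default in debug builds), signed
# integer overflow raises instead of silently wrapping.
var x = high(int32)
try:
  x += 1'i32                           # one past int32.high
  echo "silently wrapped to ", x       # reached only with checks off
except OverflowDefect:                 # named OverflowError in older Nim
  echo "overflow detected"
```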
Any plans to implement the more secure SHA hashing algorithms in the standard library? I found a SHA1 port by OnionHammer (https://github.com/onionhammer/sha1), but it doesn't support the "update" semantics you've implemented, which make incremental calculation possible.
Thank you def!
I guess you could extend OnionHammer's SHA1 so it's usable in the same way as MD5.
Another path is to wrap an existing C SHA1 library. For example OpenSSL has SHA1: https://www.openssl.org/docs/crypto/sha.html
You could do something similar to this, wrapping SHA1_Init, SHA1_Update, and SHA1_Final instead (compile with nim -d:ssl c): https://github.com/def-/nim-unsorted/blob/master/sha1.nim
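A hedged sketch of such a wrapper, assuming the declarations from OpenSSL's <openssl/sha.h> and linking against libcrypto; the proc name `calculateSHA1Incremental` is made up to mirror the MD5 version above, and this mirrors its Init/Update/Final shape:

```nim
import std/strutils

{.passl: "-lcrypto".}                  # assumes OpenSSL's libcrypto is installed

const ShaDigestLength = 20

type ShaCtx {.importc: "SHA_CTX", header: "openssl/sha.h".} = object

proc SHA1_Init(c: ptr ShaCtx): cint
  {.importc, header: "openssl/sha.h".}
proc SHA1_Update(c: ptr ShaCtx, data: pointer, len: csize_t): cint
  {.importc, header: "openssl/sha.h".}
proc SHA1_Final(md: ptr uint8, c: ptr ShaCtx): cint
  {.importc, header: "openssl/sha.h".}

proc calculateSHA1Incremental(filename: string): string =
  ## Incrementally hash a file, same shape as the MD5 version above.
  var
    ctx: ShaCtx
    digest: array[ShaDigestLength, uint8]
    buffer: array[8192, char]
  let f = open(filename)
  defer: close(f)
  discard SHA1_Init(ctx.addr)
  var bytesRead = f.readBuffer(buffer.addr, buffer.len)
  while bytesRead > 0:
    discard SHA1_Update(ctx.addr, buffer.addr, csize_t(bytesRead))
    bytesRead = f.readBuffer(buffer.addr, buffer.len)
  discard SHA1_Final(digest[0].addr, ctx.addr)
  for b in digest:                     # hex-encode the 20-byte digest
    result.add(toHex(int(b), 2).toLowerAscii)
```

Note that newer OpenSSL releases deprecate the low-level SHA1_* functions in favour of the EVP interface, so treat this as a sketch of the shape rather than a recommendation.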