As a toy project to learn a bit of Nim, I decided to create a hash calculator for files. I noticed there was an MD5 hash algorithm in the standard library, so it seemed straightforward. What I discovered is that for files larger than ~250 MB my implementation returns incorrect hash values. It's quite possible I'm doing something stupid, but it does return correct hashes for files under 250 MB. Also, I'm using the 32-bit version on Windows with MinGW, so a data type overflow seems like a reasonable assumption. Any insight would be welcome.
import md5
import os

proc calculateMD5Incremental(filename: string): string =
  const blockSize: int = 8192
  var
    c: MD5Context
    d: MD5Digest
    f: File
    bytesRead: int = 0
    buffer: array[blockSize, char]
    byteTotal: int = 0
  # read chunks of the file, calling update until all bytes have been read
  try:
    f = open(filename)
    md5Init(c)
    bytesRead = f.readBuffer(buffer.addr, blockSize)
    while bytesRead > 0:
      byteTotal += bytesRead
      md5Update(c, buffer, bytesRead)
      bytesRead = f.readBuffer(buffer.addr, blockSize)
    md5Final(c, d)
  except IOError:
    echo("File not found.")
  finally:
    if f != nil:
      close(f)
  result = $d

if paramCount() > 0:
  let arguments = commandLineParams()
  echo("MD5: ", calculateMD5Incremental(arguments[0]))
else:
  echo("Must pass filename.")
  quit(-1)
I'm trying to debug it. I remember hitting the exact same problem, but couldn't find the cause back then (and apparently forgot to report it): https://github.com/def-/nimutils/blob/e97cde616193a2092642a7150080a2a6ea1489bc/md5sum.nim
Edit: My current test is a file of binary zeros: size 268435455 works, size 268435456 doesn't, and the latter is exactly 256 MiB.
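For anyone who wants to reproduce this, a hypothetical helper (not from the thread) that writes a file of binary zeros; a small size is used here so it runs quickly, swap in the boundary sizes from above to reproduce the failure:

```nim
# Write `size` bytes of binary zeros to `filename`.
# (The post's boundary sizes are 268435455 and 268435456 bytes.)
proc writeZeros(filename: string, size: int64) =
  var f = open(filename, fmWrite)
  defer: close(f)
  var buf: array[8192, char]            # arrays are zero-initialized in Nim
  var remaining = size
  while remaining > 0:
    let n = int(min(remaining, buf.len.int64))
    discard f.writeBuffer(buf.addr, n)
    remaining -= n

writeZeros("zeros.bin", 1_048_576)      # 1 MiB; use 268_435_456 to reproduce
```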
It's indeed an overflow, looking at the state:
at byte 268435454:
STATE: -1889439815 -371250275 2058418518
COUNT: 2147483632 0
BUFFER: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
at byte 268435455:
STATE: -1889439815 -371250275 2058418518
COUNT: 2147483640 0
BUFFER: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
at byte 268435456:
STATE: -64320197 -902710823 -963762803
COUNT: -2147483648 1
BUFFER: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I guess that has to do with int32s being used instead of uint32s. Not sure how to fix it yet.
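The arithmetic matches the dumps above; a small sketch, reading COUNT as the message-length-in-bits counter that MD5 maintains:

```nim
# COUNT stores the message length in *bits*. With a signed 32-bit counter,
# the bit count overflows at 2^31 bits = 2^28 bytes = 256 MiB, which is
# exactly the boundary observed above.
let bytes = 268_435_456'i64            # first failing size, 256 MiB
let bits = bytes * 8                   # 2_147_483_648 = 2^31
doAssert bits > int32.high             # no longer fits in a signed 32-bit int
echo cast[int32](uint32(bits))         # wraps to -2147483648, as in the dump
```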
Edit2: Made a small change, now it works up to 536870912, but not 536870913, strange:
if c.count[0] <% (len shl 3): c.count[1] = c.count[1] +% 1'i32
Edit3: Converted the entire md5 module to unsigned integers; it works fine now (and is a bit faster). Maybe there is a bug in one of the non-overflowing signed integer operations (+%, -%, ...): https://github.com/Araq/Nim/pull/1847
Wow. Thanks for the quick fix!
It's definitely better. Now it returns correct hash values up to ~2 GB. Anything larger fails with "Error: unhandled exception: over- or underflow [OverflowError]". I much prefer this over an incorrect result!
The other thing is that this implementation is pretty slow. As a comparison I found an equivalent MD5 calculation for Go (ref: https://www.socketloop.com/tutorials/how-to-generate-checksum-for-file-in-go), which on a 2 GB file runs in 16 seconds compared to 120 seconds for the Nim version.
The Go version also doesn't seem to have the data type overflow issue; I ran it on a ~6 GB file with no problems.
Do you compile with nim -d:release c? For me Nim's md5 module is nearly as fast as md5sum from GNU coreutils.
Edit: I don't get an overflow. Are you compiling for 32 bit? If so, I think it's just your byteTotal that's overflowing; make it an int64 instead and the rest should work fine without any changes.
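A minimal sketch of that suggestion, assuming a 32-bit target where Nim's default `int` is 32 bits:

```nim
# Widening just the accumulator to int64 is enough: a running byte total
# wraps past 2 GiB in a 32-bit int, but the per-read counts still fit in
# a plain int.
var byteTotal: int64 = 0
let bytesRead = 8192                   # one full 8 KiB buffer per read
for _ in 1 .. 300_000:                 # simulate reading ~2.4 GB
  byteTotal += bytesRead
echo byteTotal                         # 2457600000, beyond 32-bit range
```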
Edit2: 4.4 GB file: GNU md5sum: 12.555 s, Nim md5 module: 16.944 s, Go checksum: 40.417 s
No, I did not. Compiling with -d:release makes a world of difference! It is now faster than the Go test app, as your numbers confirm.
Re: overflow: correct, I am compiling for 32 bit, and it was byteTotal overflowing. One note for newbies: in release mode integer types will silently wrap around on overflow unless you compile with overflow checking enabled (--overflowChecks:on).
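For illustration, a sketch of the difference (the exception is named OverflowError in the Nim of this thread and OverflowDefect in current Nim; the silent wrap is what happens with checks off):

```nim
# With overflow checking enabled (the default in debug builds), signed
# integer overflow raises instead of silently wrapping.
var x = high(int32)
try:
  x += 1'i32                           # one past int32.high
  echo "silently wrapped to ", x       # reached only with checks off
except OverflowDefect:                 # named OverflowError in older Nim
  echo "overflow detected"
```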
Any plans to implement the more secure SHA hashing algorithms in the standard library? I found a SHA1 port by OnionHammer (https://github.com/onionhammer/sha1), but it doesn't support the "update" semantics you've implemented, which make incremental calculation possible.
Thank you def!
I guess you could extend OnionHammer's SHA1 so it's usable in the same way as MD5.
Another path is to wrap an existing C SHA1 library. For example OpenSSL has SHA1: https://www.openssl.org/docs/crypto/sha.html
You could do something similar to this, wrapping SHA1_Init, SHA1_Update, and SHA1_Final instead (compile with nim -d:ssl c): https://github.com/def-/nim-unsorted/blob/master/sha1.nim
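A hedged sketch of such a wrapper, assuming the declarations from OpenSSL's <openssl/sha.h> and linking against libcrypto; the proc name `calculateSHA1Incremental` is made up to mirror the MD5 version above, and this mirrors its Init/Update/Final shape:

```nim
import std/strutils

{.passl: "-lcrypto".}                  # assumes OpenSSL's libcrypto is installed

const ShaDigestLength = 20

type ShaCtx {.importc: "SHA_CTX", header: "openssl/sha.h".} = object

proc SHA1_Init(c: ptr ShaCtx): cint
  {.importc, header: "openssl/sha.h".}
proc SHA1_Update(c: ptr ShaCtx, data: pointer, len: csize_t): cint
  {.importc, header: "openssl/sha.h".}
proc SHA1_Final(md: ptr uint8, c: ptr ShaCtx): cint
  {.importc, header: "openssl/sha.h".}

proc calculateSHA1Incremental(filename: string): string =
  ## Incrementally hash a file, same shape as the MD5 version above.
  var
    ctx: ShaCtx
    digest: array[ShaDigestLength, uint8]
    buffer: array[8192, char]
  let f = open(filename)
  defer: close(f)
  discard SHA1_Init(ctx.addr)
  var bytesRead = f.readBuffer(buffer.addr, buffer.len)
  while bytesRead > 0:
    discard SHA1_Update(ctx.addr, buffer.addr, csize_t(bytesRead))
    bytesRead = f.readBuffer(buffer.addr, buffer.len)
  discard SHA1_Final(digest[0].addr, ctx.addr)
  for b in digest:                     # hex-encode the 20-byte digest
    result.add(toHex(int(b), 2).toLowerAscii)
```

Note that newer OpenSSL releases deprecate the low-level SHA1_* functions in favour of the EVP interface, so treat this as a sketch of the shape rather than a recommendation.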