Hi Nim devs and users,
I'm a huge fan of Nim. I used it a while back and was blown away. I forgot most of what I learned at the time, but I'm getting back into it and stumbled across an odd (potential) issue. I have a file with a million lines (each looking basically like "1000018335904674186_719985749172374695,119429087,1").
Running "wc -l" on it takes almost no time at all:
$ time wc -l head.txt
1000000 head.txt
real 0m0.071s
user 0m0.058s
sys 0m0.012s
Doing the same with this Nim code takes ~3.5 seconds:
var count = 0
for line in stdin.lines:
  count += 1
echo count
Any idea why? I'm using these flags to compile the code:
nim c --cc:clang --passC:-mcpu=native -d:release preprocess.nim
This sure comes up a lot: http://forum.nim-lang.org/t/1151
You can also see my experiments with writing wc -l with a few methods: https://github.com/def-/nim-unsorted/blob/master/wcl.nim
Basically, the way Nim counts newlines is not the same as wc -l; you have to put the code into a proc, because the compiler optimizes procs better than top-level code; and in current devel the lines iterator is faster.
This should be a bit faster, but still not as fast as wc -l:
proc main =
  var
    buf: array[4096, char]
    count = 0
    size = 4096
  while size > 0:
    size = stdin.readBuffer(addr buf[0], 4096)
    for i in 0 ..< size:
      if buf[i] == '\l':
        inc count
  echo count

main()
I guess you could check how wc actually does it, maybe we can get some performance improvements from there.
def: I guess you could check how wc actually does it, maybe we can get some performance improvements from there.
The above code is actually a bit faster for me than wc -l (OS X 10.10.3, with an SSD drive).
stdin.lines does something like "read char-by-char until we get a newline, copy the line into whatever buffer we have (? not sure of the semantics of yield), then read char-by-char until we get another newline." You end up doing a lot of copying from the file when all you really wanted was the number of lines. stdin.lines was sped up a lot by fgetc_unlocked, but the overall approach is what matters.
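For illustration, the per-character reading described above can be sketched like this (a simplified stand-in, not the actual stdlib implementation; countLinesCharByChar is an invented name):

```nim
# A simplified sketch of per-character line counting, roughly what the
# approach described above boils down to (NOT the actual stdlib code).
proc countLinesCharByChar(f: File): int =
  while not f.endOfFile:
    if f.readChar() == '\n':
      inc result
```

Note that this sketch even skips the per-line string copy that stdin.lines also pays for; the per-character calls alone already make it much slower than reading into a big buffer.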
The last time this came up, I looked into how wc did this, because I was curious. The key things that I remember it doing differently from the naive C / Nim approach:
wc is a well-optimized, mature (and single-purpose) program. That said, the naive Go approach (http://play.golang.org/p/_Nar8-uBDs) is also fast, but this is because it does buffering behind the scenes.
FWIW, in testing stdin.lines, my own wrapper of GNU readline(), Go, and other approaches, on large enough files I still found them all to be roughly equivalent in time because the task is I/O-bound.
Thanks to each of you for the detailed responses - especially so given that I now realise this exact question has been asked before. My prior attempts at searching the forum must have failed me.
The posited explanations for why wc would be faster make sense. That said, I built Nim from HEAD and recompiled the same code and it ran in ~0.4s -- more or less exactly the same as wc! Guess I will live on the bleeding edge :)
Varriount: although it might make memory usage a tad unpredictable
You don't have to map the whole file, just a page at a time.
Varriount: Actually, would it be feasible to replace the I/O implementation of the stdlib with a memory-map backend, rather than one based on C standard library functions?
Sure; others have done such things, e.g., https://www.sqlite.org/mmap.html
However, you need more control to optimize something like wc -l, where you want to map in a page of memory followed by a \n sentinel and then search for \n until you hit the sentinel. And for the last page of the file, you need to either switch to a length-limited search or copy the data to another buffer where you can plant the sentinel, since you can't write into the mapped page.
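To make the simpler of the two options concrete, here is a minimal sketch of the length-limited variant using Nim's memfiles module (countNewlinesMapped is an invented name; this scans every byte up to the file size instead of planting a sentinel):

```nim
# A minimal sketch of the length-limited search over a memory-mapped
# file (the alternative to the sentinel trick described above).
import memfiles

proc countNewlinesMapped(path: string): int =
  var mf = memfiles.open(path)   # maps the whole file read-only
  defer: mf.close()
  let p = cast[ptr UncheckedArray[char]](mf.mem)
  for i in 0 ..< mf.size:        # bounded by the file size, no sentinel needed
    if p[i] == '\n':
      inc result
```

The sentinel version would instead search unconditionally for '\n' and only check the bound on the last page, which saves one comparison per byte in the inner loop.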
Just saw
http://lemire.me/blog/2017/02/14/how-fast-can-you-count-lines/
A speed increase by a factor of 10 from SIMD instructions is not bad, and it is a relatively simple SIMD example.
Maybe some time in the future that will be possible in Nim too...
[EDIT]
Well, maybe it is already possible using
https://forum.nim-lang.org/t/212#17578 https://github.com/jcosborn/qex/tree/master/src/simd
@def, I'm curious about your example here: https://github.com/def-/nim-unsorted/blob/master/wcl.nim
When you cast to string with
var
  size = 4096
  buf = cast[ptr string](alloc0(size))
Doesn't a string have a length, a capacity, and a data field? So the space available for the data must be less than 4096 bytes. Wouldn't we then read past the allocated memory with

size = file.readBuffer(buf, 4096)

?

@Stefan_Salewski, you can also just call the libc memchr (which is what the current memfiles code does to delimit lines, aka slices, until string conversion). A good memchr will do the SSE/AVX internally. For example, this program:
import memfiles, os, times
var f = memfiles.open(paramStr(1))
var cnt = 0
let t0 = epochTime()
for x in memSlices(f): inc(cnt) # The action here is just one line
let dt = (epochTime() - t0) * 1e9
echo "cnt: ", cnt, " ns: ", dt, " ns/cnt: ", dt / float(cnt),
     " B/ns: ", float(f.size) / dt
f.close
On my Linux machine this Nim program runs about as fast as the memchrcount version in Daniel Lemire's post (which is about 4x the "naive"/"strawman" approach). While Lemire gets another 3x speed-up over the memchr approach with vector code, that also loses the ability to do anything besides count newlines. Since some kind of processing beyond mere counting is usually desired, the memchr approach is probably what most people would want.
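For comparison, calling libc's memchr directly from Nim over a memory-mapped file might look like this (the importc wrapper and the name countNewlinesMemchr are my own sketch, not part of the memfiles API):

```nim
import memfiles

# Thin wrapper around the C library's memchr.
proc memchr(s: pointer, c: cint, n: csize_t): pointer
  {.importc: "memchr", header: "<string.h>".}

proc countNewlinesMemchr(path: string): int =
  var mf = memfiles.open(path)
  defer: mf.close()
  var p = mf.mem
  var remaining = mf.size
  while remaining > 0:
    let hit = memchr(p, cint(ord('\n')), csize_t(remaining))
    if hit == nil: break         # no more newlines in the mapping
    inc result
    # advance just past the newline we found
    let advance = cast[int](hit) - cast[int](p) + 1
    p = cast[pointer](cast[int](p) + advance)
    dec remaining, advance
```

This keeps the whole inner loop inside libc, so a vectorized memchr gives most of the SIMD benefit while still telling you where each newline is.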