Hi Nim devs and users,
I'm a huge fan of Nim. I used it a while back and was blown away. I forgot most of what I learned at the time, but I'm getting back into it and stumbled across an odd (potential) issue. I have a file with a million lines (each looking basically like "1000018335904674186_719985749172374695,119429087,1").
Running "wc -l" on it takes almost no time at all:
$ time wc -l head.txt
1000000 head.txt
real 0m0.071s
user 0m0.058s
sys 0m0.012s
Doing the same with this Nim code takes ~3.5 seconds:
var count = 0
for line in stdin.lines:
  count += 1
echo count
Any idea why? I'm using these flags to compile the code:
nim c --cc:clang --passC:-mcpu=native -d:release preprocess.nim
This sure comes up a lot: http://forum.nim-lang.org/t/1151
You can also see my experiments with writing wc -l with a few methods: https://github.com/def-/nim-unsorted/blob/master/wcl.nim
Basically, the way Nim counts newlines is not the same as wc -l; you have to put the code into a proc, because the compiler optimizes procs better than top-level code; and in current devel the lines iterator is faster.
This should be a bit faster, but still not as fast as wc -l:
proc main =
  var
    buf: array[4096, char]
    count = 0
    size = 4096
  while size > 0:
    size = stdin.readBuffer(addr buf[0], 4096)
    for i in 0 ..< size:
      if buf[i] == '\l':
        inc count
  echo count

main()
I guess you could check how wc actually does it, maybe we can get some performance improvements from there.
def: I guess you could check how wc actually does it, maybe we can get some performance improvements from there.
The above code is actually a bit faster for me than wc -l (OS X 10.10.3, with an SSD drive).
stdin.lines does something like "read char-by-char until we get a newline, copy the line into whatever buffer we have (? not sure of the semantics of yield), then read char-by-char until we get another newline." You end up doing a lot of copying from the file when all you really wanted was the number of lines. stdin.lines was sped up a lot by fgetc_unlocked, but the overall approach is what matters.
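For illustration, the per-character reading described above can be sketched like this (a simplified stand-in, not the actual stdlib implementation; countLinesCharByChar is an invented name):

```nim
# A simplified sketch of per-character line counting, roughly what the
# approach described above boils down to (NOT the actual stdlib code).
proc countLinesCharByChar(f: File): int =
  while not f.endOfFile:
    if f.readChar() == '\n':
      inc result
```

Note that this sketch even skips the per-line string copy that stdin.lines also pays for; the per-character calls alone already make it much slower than reading into a big buffer.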
The last time this came up, I looked into how wc did this, because I was curious. The key things that I remember it doing differently from the naive C / Nim approach:
wc is a well-optimized, mature (and single-purpose) program. That said, the naive Go approach (http://play.golang.org/p/_Nar8-uBDs) is also fast, but this is because it does buffering behind the scenes.
FWIW, in testing stdin.lines, my own wrapper of GNU readline(), Go, and other approaches, on large enough files I still found them all to be roughly equivalent in time because the task is I/O-bound.
Thanks to each of you for the detailed responses - especially so given that I now realise this exact question has been asked before. My prior attempts at searching the forum must have failed me.
The posited explanations for why wc would be faster make sense. That said, I built Nim from HEAD and recompiled the same code and it ran in ~0.4s -- more or less exactly the same as wc! Guess I will live on the bleeding edge :)
Varriount: although it might make memory usage a tad unpredictable
You don't have to map the whole file, just a page at a time.
Varriount: Actually, would it be feasible to replace the I/O implementation of the stdlib with a memory-map backend, rather than one based on C standard library functions?
Sure; others have done such things, e.g., https://www.sqlite.org/mmap.html
However, you need more control to optimize something like wc -l, where you want to map in a page of memory followed by a \n sentinel and then search for \n until you hit the sentinel. And for the last page of the file, you need to either switch to a length-limited search or copy the data to another buffer where you can plant the sentinel, since you can't write into the mapped page.
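To make the simpler of the two options concrete, here is a minimal sketch of the length-limited variant using Nim's memfiles module (countNewlinesMapped is an invented name; this scans every byte up to the file size instead of planting a sentinel):

```nim
# A minimal sketch of the length-limited search over a memory-mapped
# file (the alternative to the sentinel trick described above).
import memfiles

proc countNewlinesMapped(path: string): int =
  var mf = memfiles.open(path)   # maps the whole file read-only
  defer: mf.close()
  let p = cast[ptr UncheckedArray[char]](mf.mem)
  for i in 0 ..< mf.size:        # bounded by the file size, no sentinel needed
    if p[i] == '\n':
      inc result
```

The sentinel version would instead search unconditionally for '\n' and only check the bound on the last page, which saves one comparison per byte in the inner loop.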
Just saw
http://lemire.me/blog/2017/02/14/how-fast-can-you-count-lines/
A speed increase by a factor of 10 from SIMD instructions is not bad, and it is a relatively simple SIMD example.
Maybe some time in the future that will be possible in Nim too...
[EDIT]
Well, maybe it is already possible using
https://forum.nim-lang.org/t/212#17578 https://github.com/jcosborn/qex/tree/master/src/simd
@def, I'm curious about your example here: https://github.com/def-/nim-unsorted/blob/master/wcl.nim
When you cast to string with
var
  size = 4096
  buf = cast[ptr string](alloc0(size))
Doesn't a string have a length, a capacity, and a data field? So the space available for the data must be less than 4096 bytes. Wouldn't we then read past the allocated memory with

size = file.readBuffer(buf, 4096)

?

@Stefan_Salewski, you can also just call the libc memchr (which is what the current memfiles code does to delimit lines, aka slices, until string conversion). A good memchr will do the SSE/AVX internally. For example, this program:
import memfiles, os, times
var f = memfiles.open(paramStr(1))
var cnt = 0
let t0 = epochTime()
for x in memSlices(f): inc(cnt) # The action here is just one line
let dt = (epochTime() - t0) * 1e9
echo "cnt: ", cnt, " ns: ", dt, " ns/cnt: ", dt / float(cnt),
     " B/ns: ", float(f.size) / dt
f.close
On my Linux machine this Nim program runs about as fast as the memchrcount version in Daniel Lemire's post (which is about 4x the "naive"/"strawman" approach). While Lemire gets another 3x speed-up over the memchr approach with vector code, that also loses the ability to do anything besides count newlines. Since some kind of processing beyond mere counting is usually desired, the memchr approach is probably what most people would want.
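For comparison, calling libc's memchr directly from Nim over a memory-mapped file might look like this (the importc wrapper and the name countNewlinesMemchr are my own sketch, not part of the memfiles API):

```nim
import memfiles

# Thin wrapper around the C library's memchr.
proc memchr(s: pointer, c: cint, n: csize_t): pointer
  {.importc: "memchr", header: "<string.h>".}

proc countNewlinesMemchr(path: string): int =
  var mf = memfiles.open(path)
  defer: mf.close()
  var p = mf.mem
  var remaining = mf.size
  while remaining > 0:
    let hit = memchr(p, cint(ord('\n')), csize_t(remaining))
    if hit == nil: break         # no more newlines in the mapping
    inc result
    # advance just past the newline we found
    let advance = cast[int](hit) - cast[int](p) + 1
    p = cast[pointer](cast[int](p) + advance)
    dec remaining, advance
```

This keeps the whole inner loop inside libc, so a vectorized memchr gives most of the SIMD benefit while still telling you where each newline is.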