Hi,
I wrote a wc program, the first version using FileStream but it's very slow, so I change to posix's open/read.
In the comparison with coreutils's wc, it's also slow.
Below is my code.
import os, posix, strutils, times
proc main(): void =
let start = epochTime()
var buffer = newString(16384).TaintedString
let p = addr(string(buffer)[0])
for i in 1..paramCount():
let fd = open(paramStr(i), 0, O_RDONLY)
if fd == -1:
stderr.writeln("Open file failed: $1" % $strerror(errno))
continue
defer: discard close(fd)
var lcount = 0
while true:
let bytes = read(fd, p, 16384)
case bytes:
of 0: break
of -1:
stderr.writeln("Read file failed: $1" % $strerror(errno))
break
else:
for idx in 0..(bytes-1):
let c = buffer[idx]
case c:
of '\L': lcount += 1
of '\r': lcount += 1
else: discard
echo lcount
stderr.writeln("$1 seconds" % $(epochTime() - start))
main()
When test 52MB file in a cloud host, Elapsed times are:
4. for c in string: 101ms
for c in buffer[0..(bytes-1)]:
case c:
of '\L': lcount += 1
of '\r': lcount += 1
else: discard
5. change switch statement to if statement: 42ms
for idx in 0..(bytes-1):
let c = buffer[idx]
if c == '\L' or c == '\r':
lcount += 1
6. If I replace the loop to:
for idx in 0..(bytes-1):
lcount += 1
It's very fast (of cause it's wrong): 11ms
So seems string iteration is very slow. How to optimise it?
No, string iteration isn't slow, it's that when you do buffer[0..(bytes)-1], you create a copy that is allocated on the heap and garbage collected later.
FileStream isn't designed for speed, but to conform to a polymorphic interface.
wc -l counts only newlines, not carriage returns.
The following code is about as fast as wc -l for me on Linux and a bit faster on OS X:
proc main(file: File) =
var buffer: array[8192, char]
var lines = 0
while true:
let n = readChars(file, buffer, 0, len(buffer))
for i in 0..n-1:
if buffer[i] == '\L':
lines += 1
if n == 0:
break
echo lines
main(stdin)
Compile with -d:release for max speed.
It's faster than all, 34ms on my machine. but wc -l just 14ms (11ms is sys time on above).
I read coreutils's wc source and found it used memchr. I rewrote the C source's loop in nimcache, recompiled with gcc -O3, it just 14ms!
And how to +/- a pointer value in Nim? like this:
char* p = NULL;
p ++;
p += 5;
int n = p - NULL;
You can also look at memfiles.memSlices in the Nim standard library. (You may need the github devel branch.)
The below code is only 1.06X slower than "wc -l" for me (1.17 ms vs 1.10 ms for L3 cached input with average line length 35 characters and 80,000 lines. Your longer run times suggest larger inputs that are probably in main memory and so your results should be even closer together).
import memfiles
when isMainModule:
proc main() =
var counter = 0
var inp = memfiles.open("/dev/stdin") # Map stdin on Unix
for slc in memSlices(inp): inc(counter)
echo counter
memSlices() uses memchr internally. So, you can look at lib/pure/memfiles.nim to see how to call it directly from Nim rather than hacking on the nimcache .c files as well as how to do some pointer arithmetic. :-)
In terms of absolute maximum performance for this particular problem, if line lengths are typically very short then something like Jehan's approach may be faster. The function call overhead to invoke memchr can cost more than is saved via the SSE/AVX optimizations in a good libc memchr. It just depends upon both how good the libc is as well as features of your data that may be hard to know ahead of time (e.g. distribution of line lengths).
cblake: In terms of absolute maximum performance for this particular problem, if line lengths are typically very short then something like Jehan's approach may be faster.
Yes, I tested with /usr/shared/dict/words cat-ted together a few dozen times; i.e., pretty short lines. That may be why I got better results.
lijie: And how to +/- a pointer value in Nim?
This is not directly supported, because (1) pointer math is inherently unsafe and error-prone and (2) whether pointer math is even actually faster is a question of the compiler and CPU; you cannot blindly assume that it is. If you absolutely need it, you can do it via casts and unsigned arithmetic, as shown in this post.