nimforum mirror - String iteration performance.

lijie (orginal) [2015-09-09T06:34:41+02:00] view original

Hi,

I wrote a wc program, the first version using FileStream but it's very slow, so I change to posix's open/read.

In the comparison with coreutils's wc, it's also slow.

Below is my code.


import os, posix, strutils, times

proc main(): void =
  let start = epochTime()
  var buffer = newString(16384).TaintedString
  let p = addr(string(buffer)[0])
  
  for i in 1..paramCount():
    let fd = open(paramStr(i), 0, O_RDONLY)
    if fd == -1:
      stderr.writeln("Open file failed: $1" % $strerror(errno))
      continue
    
    defer: discard close(fd)
    
    var lcount = 0
    
    while true:
      let bytes = read(fd, p, 16384)
      case bytes:
      of 0: break
      of -1:
        stderr.writeln("Read file failed: $1" % $strerror(errno))
        break
      else:
        for idx in 0..(bytes-1):
          let c = buffer[idx]
          case c:
          of '\L': lcount += 1
          of '\r': lcount += 1
          else: discard
    
    echo lcount
  
  stderr.writeln("$1 seconds" % $(epochTime() - start))

main()

When test 52MB file in a cloud host, Elapsed times are:

coreutils's wc: 11ms

FileStream: 590ms

posix version (above code): 72ms

4. for c in string: 101ms


for c in buffer[0..(bytes-1)]:
  case c:
  of '\L': lcount += 1
  of '\r': lcount += 1
  else: discard

5. change switch statement to if statement: 42ms


for idx in 0..(bytes-1):
  let c = buffer[idx]
  if c == '\L' or c == '\r':
    lcount += 1

6. If I replace the loop to:


for idx in 0..(bytes-1):
  lcount += 1

It's very fast (of cause it's wrong): 11ms

So seems string iteration is very slow. How to optimise it?

Jehan (orginal) [2015-09-09T08:23:12+02:00] view original

No, string iteration isn't slow, it's that when you do buffer[0..(bytes)-1], you create a copy that is allocated on the heap and garbage collected later.

FileStream isn't designed for speed, but to conform to a polymorphic interface.

wc -l counts only newlines, not carriage returns.

The following code is about as fast as wc -l for me on Linux and a bit faster on OS X:

proc main(file: File) =
  var buffer: array[8192, char]
  var lines = 0
  while true:
    let n = readChars(file, buffer, 0, len(buffer))
    for i in 0..n-1:
      if buffer[i] == '\L':
        lines += 1
    if n == 0:
      break
  echo lines

main(stdin)

Compile with -d:release for max speed.

lijie (orginal) [2015-09-09T13:29:46+02:00] view original

It's faster than all, 34ms on my machine. but wc -l just 14ms (11ms is sys time on above).

I read coreutils's wc source and found it used memchr. I rewrote the C source's loop in nimcache, recompiled with gcc -O3, it just 14ms!

And how to +/- a pointer value in Nim? like this:

char* p = NULL;
p ++;
p += 5;
int n = p - NULL;

cblake (orginal) [2015-09-09T14:37:42+02:00] view original

You can also look at memfiles.memSlices in the Nim standard library. (You may need the github devel branch.)

The below code is only 1.06X slower than "wc -l" for me (1.17 ms vs 1.10 ms for L3 cached input with average line length 35 characters and 80,000 lines. Your longer run times suggest larger inputs that are probably in main memory and so your results should be even closer together).

import memfiles
when isMainModule:
  proc main() =
    var counter = 0
    var inp = memfiles.open("/dev/stdin")       # Map stdin on Unix
    for slc in memSlices(inp): inc(counter)
    echo counter

memSlices() uses memchr internally. So, you can look at lib/pure/memfiles.nim to see how to call it directly from Nim rather than hacking on the nimcache .c files as well as how to do some pointer arithmetic. :-)

In terms of absolute maximum performance for this particular problem, if line lengths are typically very short then something like Jehan's approach may be faster. The function call overhead to invoke memchr can cost more than is saved via the SSE/AVX optimizations in a good libc memchr. It just depends upon both how good the libc is as well as features of your data that may be hard to know ahead of time (e.g. distribution of line lengths).

Jehan (orginal) [2015-09-09T14:46:44+02:00] view original

cblake: In terms of absolute maximum performance for this particular problem, if line lengths are typically very short then something like Jehan's approach may be faster.

Yes, I tested with /usr/shared/dict/words cat-ted together a few dozen times; i.e., pretty short lines. That may be why I got better results.

lijie: And how to +/- a pointer value in Nim?

This is not directly supported, because (1) pointer math is inherently unsafe and error-prone and (2) whether pointer math is even actually faster is a question of the compiler and CPU; you cannot blindly assume that it is. If you absolutely need it, you can do it via casts and unsigned arithmetic, as shown in this post.

lijie (orginal) [2015-09-09T16:43:30+02:00] view original

Thank you @cblake and @Jehan, very useful for me.

Mirror of forum.nim-lang.org

1622 :: String iteration performance.