I benchmarked Ruby vs Nimrod vs C for splitting strings from a 100_000_000-line file: https://gist.github.com/anonymous/11390832
If you want to try it, run ruby split.rb first; it will create the junkdata.txt file used by the three scripts.
Results (on an SSD): C: 14 seconds, Ruby: 78 seconds, Nimrod: 159 seconds (compiled with -d:release).
How can I speed this up in Nimrod? I do a lot of splitting and such, so fast is nice...
import strutils

proc fastSubStr(dest: var string; src: string, a, b: int) =
  # once the stdlib uses TR macros, these things should not be necessary
  template `+!`(src, a): expr = cast[pointer](cast[int](cstring(src)) + a)
  setLen(dest, b-a+1)
  copyMem(addr dest[0], src+!a, b-a+1)

proc main =
  const fn = "junkdata.txt"
  var data1 = newStringOfCap(40)
  var data2 = newStringOfCap(100)
  for line in lines(fn):
    let pos = line.find('\t')
    assert pos >= 0
    fastSubStr(data1, line, 0, pos-1)
    fastSubStr(data2, line, pos+1, line.high)

main()
Your code got it down to 25 seconds. Thank you. I will have to study it.
One thing, this:
import strutils

proc main =
  const fn = "junkdata.txt"
  var x = 0
  for line in lines(fn):
    x.inc(1)
  echo x

main()
takes 23 seconds. I don't think that's right: wc -l takes 1 second, and this takes 0.3 seconds:
var SIZE = 8192
var buf = cast[ptr string](alloc0(SIZE))
var s = open("junkdata.txt", fmRead)
var rlen = 0
var totlen = 0
while true:
  rlen = s.readBuffer(buf, SIZE)
  if rlen == 0: break
  totlen.inc(rlen)
echo totlen
Of course, I don't know how to access "buf"; I need to learn the voodoo. But something seems weird with for line in lines(fn).
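For reference, here is a sketch of one way to access the buffer without the cast: declare it as a plain char array and index into it. This version counts '\n' bytes, which is roughly what wc -l does (the names and the 8 KB size are just illustrative):

const SIZE = 8192
var
  buf: array[0..SIZE-1, char]        # a plain buffer; no casts needed
  f = open("junkdata.txt", fmRead)
  lineCount = 0
while true:
  let rlen = f.readBuffer(addr buf[0], SIZE)   # bytes actually read
  if rlen == 0: break
  for i in 0 .. rlen-1:
    if buf[i] == '\n': inc lineCount
close(f)
echo lineCount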
Have you noticed that the C version is not doing the same thing as the Ruby/Nimrod versions? The C sscanf function using %s will stop at the first whitespace; it won't go up to the tab character the way the higher-level versions do. In fact, if I uncomment the printf in the loop I get:
1) this -> is
2) this -> is
3) this -> is
…
which is different from the Ruby/Nimrod versions, which split at the first tab character.
Anyway, C code for such read loops will always win because it avoids memory allocations and copies, which is also what Araq's version tries to do. On my machine, the incorrect C version takes 32s and Araq's version takes 37s. The following version, which doesn't avoid the copies, is still quite fast at 49s:
import strutils

proc main() =
  let fn = "junkdata.txt"
  var
    data1 = newStringOfCap(70)
    data2 = newStringOfCap(2)
    j, pos: int
  for line in lines(fn):
    pos = line.find('\t')
    data1 = line[0 .. <pos]
    data2 = line[pos+1 .. <line.len]
    inc j
    if j < 100:
      echo j, " ", data1, " -> ", data2

main()
But then, my correct version of the program in C would be this, which takes 27s on my machine:
#include <stdio.h>

int main() {
    char *data1, *data2, *trail;
    char line[100];
    int j = 0;
    FILE *fp = fopen("junkdata.txt", "rb");
    if (!fp) {
        printf("Couldn't open file for reading\n");
        return 0;
    }
    while (fgets(line, 99, fp) != NULL) {
        data1 = data2 = line;
        while (*data2 != '\t') data2++;
        *data2++ = 0;
        trail = data2;
        while (*trail != '\n') trail++;
        *trail = 0;
        j++;
        if (j < 100) {
            printf("%d) %s -> %s\n", j, data1, data2);
        }
    }
    printf("%d\n", j);
}
A quick way to speed it up significantly is to use line.split('\t') rather than line.split("\t") (i.e., splitting on characters rather than strings).
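For example (a minimal sketch; the counter and echo are only there to show the result):

import strutils

var j = 0
for line in lines("junkdata.txt"):
  let parts = line.split('\t')   # char separator: avoids the slower string-separator path
  inc j
  if j <= 3: echo parts[0], " -> ", parts[1]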
There is a fairly big inefficiency in the splitting implementation for string separators: a substring is allocated for every character of the string being split. This can be avoided by using a substring-matching routine that skips that allocation, e.g.:
proc substrmatch(s: string, sub: string, start: int): bool {.inline.} =
  let k = len(sub)
  if start + k > len(s):
    return false
  for i in 0 .. <k:
    if s[start+i] != sub[i]:
      return false
  return true

iterator split*(s: string, sep: string): string =
  ## Splits the string `s` into substrings using a string separator.
  ##
  ## Substrings are separated by the string `sep`.
  var last = 0
  if len(s) > 0:
    while last <= len(s):
      var first = last
      while last < len(s) and not substrmatch(s, sep, last):
        inc(last)
      yield substr(s, first, last-1)
      inc(last, sep.len)
For long strings and separators with length > 1, one may want to look at Boyer-Moore or KMP for additional speedup.
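To give a flavor of the KMP direction, here is a sketch (the kmpTable/kmpFind names are hypothetical, and this is not a drop-in replacement for substrmatch): the failure table lets the scan advance through the text without ever re-reading a text character.

proc kmpTable(sub: string): seq[int] =
  ## result[i] = length of the longest proper prefix of sub[0..i]
  ## that is also a suffix of it
  result = newSeq[int](len(sub))   # zero-initialized, so result[0] = 0
  var k = 0
  for i in 1 .. len(sub)-1:
    while k > 0 and sub[i] != sub[k]: k = result[k-1]
    if sub[i] == sub[k]: inc k
    result[i] = k

proc kmpFind(s, sub: string, start: int): int =
  ## index of the first occurrence of sub in s at or after start, or -1
  if len(sub) == 0: return start
  let fail = kmpTable(sub)
  var k = 0                        # chars of sub matched so far
  for i in start .. len(s)-1:
    while k > 0 and s[i] != sub[k]: k = fail[k-1]
    if s[i] == sub[k]: inc k
    if k == len(sub): return i - k + 1
  return -1

A real split iterator would build the table once per separator and reuse it across the whole scan; recomputing it on every call, as here, is only for illustration.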
Joe, the loop that increments x while reading the lines of the file takes 23 seconds because each of those lines probably has to be created as a string (I'm guessing here).
"wc -l" is probably just traversing the file contents and looking for line terminators, much like the buffer-counting sketch above: no need to create lines, allocate memory, etc. "wc -l" is faster because it's highly optimized for that use case.