I benchmarked Ruby vs Nimrod vs C for splitting strings from a 100_000_000-line file: https://gist.github.com/anonymous/11390832
If you want to try it, run ruby split.rb first; it will create the junkdata.txt file used by the three scripts.
Results (on an SSD): C: 14 seconds, Ruby: 78 seconds, Nimrod: 159 seconds (compiled with -d:release).
How can I speed this up in Nimrod? I do a lot of splitting and such, so fast is nice...
import strutils

proc fastSubStr(dest: var string; src: string, a, b: int) =
  # once the stdlib uses TR macros, these things should not be necessary
  template `+!`(src, a): expr = cast[pointer](cast[int](cstring(src)) + a)
  setLen(dest, b-a+1)
  copyMem(addr dest[0], src+!a, b-a+1)

proc main =
  const fn = "junkdata.txt"
  var data1 = newStringOfCap(40)
  var data2 = newStringOfCap(100)
  for line in lines(fn):
    let pos = line.find('\t')
    assert pos >= 0
    fastSubStr(data1, line, 0, pos-1)
    fastSubStr(data2, line, pos+1, line.high)

main()
Your code got it down to 25 seconds. Thank you. I will have to study it.
One thing, this:
import strutils

proc main =
  const fn = "junkdata.txt"
  var x = 0
  for line in lines(fn):
    x.inc(1)
  echo x

main()
takes 23 seconds. I don't think that's right: wc -l takes 1 second, and this takes 0.3 seconds:
var SIZE = 8192
var buf = cast[ptr string](alloc0(SIZE))
var s = open("junkdata.txt", fmRead)
var rlen = 0
var totlen = 0
while true:
  rlen = s.readBuffer(buf, SIZE)
  if rlen == 0: break
  totlen.inc(rlen)
echo totlen
Of course, I don't know how to access "buf"; I need to learn the voodoo. But something seems weird with for line in lines(fn).
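For reference, here is a sketch of one way to access the buffer without the cast: declare it as a plain char array and index into it. This version counts '\n' bytes, which is roughly what wc -l does (the names and the 8 KB size are just illustrative):

const SIZE = 8192
var
  buf: array[0..SIZE-1, char]        # a plain buffer; no casts needed
  f = open("junkdata.txt", fmRead)
  lineCount = 0
while true:
  let rlen = f.readBuffer(addr buf[0], SIZE)   # bytes actually read
  if rlen == 0: break
  for i in 0 .. rlen-1:
    if buf[i] == '\n': inc lineCount
close(f)
echo lineCount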
Have you noticed that the C version is not doing the same thing as the Ruby/Nimrod versions? The C sscanf function using %s will stop at the first whitespace; it won't go up to the tab character the way the higher-level versions do. In fact, if I uncomment the printf in the loop I get:
1) this -> is
2) this -> is
3) this -> is
…
which is different from the Ruby/Nimrod versions, which split at the first tab character.
Anyway, C code for such read loops will always win because it avoids memory allocations and copies, which is also what Araq's version tries to do. On my machine, the incorrect C version takes 32s and Araq's version takes 37s. The following version, which doesn't avoid the copies, is still quite fast at 49s:
import strutils

proc main() =
  let fn = "junkdata.txt"
  var
    data1 = newStringOfCap(70)
    data2 = newStringOfCap(2)
    j, pos: int
  for line in lines(fn):
    pos = line.find('\t')
    data1 = line[0 .. <pos]
    data2 = line[pos+1 .. <line.len]
    inc j
    if j < 100:
      echo j, " ", data1, " -> ", data2

main()
But then, my correct version of the program in C would be this, which takes 27s on my machine:
#include <stdio.h>

int main() {
    char *data1, *data2, *trail;
    char line[100];
    int j = 0;
    FILE *fp = fopen("junkdata.txt", "rb");
    if (!fp) {
        printf("Couldn't open file for reading\n");
        return 0;
    }
    while (fgets(line, 99, fp) != NULL) {
        data1 = data2 = line;
        while (*data2 != '\t') data2++;
        *data2++ = 0;
        trail = data2;
        while (*trail != '\n') trail++;
        *trail = 0;
        j++;
        if (j < 100) {
            printf("%d) %s -> %s\n", j, data1, data2);
        }
    }
    printf("%d\n", j);
}
A quick way to speed it up significantly is to use line.split('\t') rather than line.split("\t") (i.e., splitting on characters rather than strings).
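For example (a minimal sketch; the counter and echo are only there to show the result):

import strutils

var j = 0
for line in lines("junkdata.txt"):
  let parts = line.split('\t')   # char separator: avoids the slower string-separator path
  inc j
  if j <= 3: echo parts[0], " -> ", parts[1]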
There is a fairly big inefficiency in the splitting implementation for string separators: a substring is allocated for every character of the string being split. This can be avoided by using a substring-matching routine that skips that allocation, e.g.:
proc substrmatch(s: string, sub: string, start: int): bool {.inline.} =
  let k = len(sub)
  if start + k > len(s):
    return false
  for i in 0 .. <k:
    if s[start+i] != sub[i]:
      return false
  return true

iterator split*(s: string, sep: string): string =
  ## Splits the string `s` into substrings using a string separator.
  ##
  ## Substrings are separated by the string `sep`.
  var last = 0
  if len(s) > 0:
    while last <= len(s):
      var first = last
      while last < len(s) and not substrmatch(s, sep, last):
        inc(last)
      yield substr(s, first, last-1)
      inc(last, sep.len)
For long strings and separators with length > 1, one may want to look at Boyer-Moore or KMP for additional speedup.
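To give a flavor of the KMP direction, here is a sketch (the kmpTable/kmpFind names are hypothetical, and this is not a drop-in replacement for substrmatch): the failure table lets the scan advance through the text without ever re-reading a text character.

proc kmpTable(sub: string): seq[int] =
  ## result[i] = length of the longest proper prefix of sub[0..i]
  ## that is also a suffix of it
  result = newSeq[int](len(sub))   # zero-initialized, so result[0] = 0
  var k = 0
  for i in 1 .. len(sub)-1:
    while k > 0 and sub[i] != sub[k]: k = result[k-1]
    if sub[i] == sub[k]: inc k
    result[i] = k

proc kmpFind(s, sub: string, start: int): int =
  ## index of the first occurrence of sub in s at or after start, or -1
  if len(sub) == 0: return start
  let fail = kmpTable(sub)
  var k = 0                        # chars of sub matched so far
  for i in start .. len(s)-1:
    while k > 0 and s[i] != sub[k]: k = fail[k-1]
    if s[i] == sub[k]: inc k
    if k == len(sub): return i - k + 1
  return -1

A real split iterator would build the table once per separator and reuse it across the whole scan; recomputing it on every call, as here, is only for illustration.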
Joe, the loop that increments x while reading the lines of the file takes 23 seconds because each of those lines probably has to be created as a string (I'm guessing here).
"wc -l" is probably just traversing the file contents and looking for line terminators, much like the buffer-counting sketch above: no need to create lines, allocate memory, etc. "wc -l" is faster because it's highly optimized for that use case.