Hi,
I am new to nimrod, and find it really, really great.
Text processing matters to me, so I created a quick program in nimrod to search a 32MB, 600K+ line log file for occurrences of a particular string.
Here is my program:
[start program ff1.nim] import strutils
[end program ff1.nim]
The program is compiled with the following command line:
C:\programs\nimrod\bin\nimrod.exe c -d:release ff1.nim
The program compiles without errors or warnings. BTW, I am using mingw gcc 4.8.1 as my C compiler.
When I run it against the log file, the average time over 3 runs is 2.932 seconds on my machine (an old quad-core Intel i3 with 4GB RAM running Windows 7 Ultimate 32-bit).
By comparison, nimrod is even slower than tcl, lua, gawk, c#, netrexx, rebol, python and perl. Even a naive C implementation using strstr on each line clocks in at 0.44 seconds, more than 6 times faster than nimrod.
I use the exact same input file when testing all other languages and utilities, so if there is any difference in timing, it should be down to the language.
Am I doing anything wrong with my usage of the strutils' find command?
Please don't take this as a criticism or disrespect of the amazing work that has been put into creating nimrod. I'm just trying to make sense of the difference with C and other interpreted languages.
Kris
I'll see what I can do to optimize things later today.
Lastly, it would be really helpful if we had your different implementations (especially the C implementation) as well as a sample log file to test on.
I think that strutils.find is currently using Boyer-Moore (not 100% sure, would have to look at the code again). This means that it needs to precompute the offset table, which takes much too long when the string that is to be searched is short.
It would probably be a good idea to replace the implementation with one that works better for short strings or use an algorithm that avoids the overhead in general (e.g., Rabin-Karp with a simple and fast hash function).
@Varriount, yes, I am using the -d:release option in the compilation command line. I will also try out the 2 other suggestions you've made, but after testing for a larger buffer size as per Araq's post. As for the code and the test file, I'll try to gather it all up and post it ASAP.
@Jehan, I'm not sure which algorithm the GCC C library strstr function uses, but it seems pretty quick to me.
@Araq (Andreas?), I think the buffer size could ultimately be the difference. Would you know how the current nimrod default buffer compares against the GCC's standard C library buffer?
BTW and OT, if anyone is interested on fast literal string searching under Windows, the Windows built-in command findstr.exe gives excellent performance. On this above micro-benchmark, findstr.exe clocks in at 0.181 seconds on my machine, more than 2.4 times faster than the naive C implementation.
Kris
kris: I'm not sure which algorithm the GCC C library strstr function uses, but it seems pretty quick to me.
Glibc uses a variant of the two-way string matching algorithm by Crochemore and Perrin (original paper here), which uses constant space rather than O(len(sub)) or worse. It also optimizes performance for short strings.
Edit: Here's a simple implementation of Rabin-Karp that statistically outperforms the strutils version for short strings:
proc needle_match(haystack, needle: string, start: int): bool {.inline.} =
for i in 0..len(needle)-1:
if haystack[start+i] != needle[i]:
return false
return true
proc find*(haystack, needle: string, start: int = 0): int =
# Simple Rabin-Karp search. We use the sum of the ordinal
# values of the characters as the hash function.
#
# This algorithm works generally well on human-readable text,
# but performance can degenerate due to the rather simple hash
# function on strings that encode more regular patterns,
# especially when the alphabet used is very small and
# permutations of ``needle`` frequently occur in ``haystack``
# as a consequence.
#
# The implementation is intended for relatively small values
# of ``len(haystack)``, where preprocessing ``needle`` is not
# worth the effort.
if len(haystack) < len(needle) + start:
return -1
case len(needle)
of 0:
return start
of 1:
let ch = needle[0]
for i in start..haystack.high:
if haystack[i] == ch:
return i
return -1
of 2:
let ch1 = needle[0]
let ch2 = needle[1]
for i in start..haystack.high-1:
if haystack[i] == ch1 and haystack[i+1] == ch2:
return i
return -1
else:
var needle_hash = 0 # hash value of needle
var window_hash = 0 # rolling hash value of the window
var p = start # position of current window
for ch in needle:
needle_hash += int(ch)
for i in start..start+len(needle)-1:
window_hash += int(haystack[i])
while true:
if window_hash == needle_hash:
if needle_match(haystack, needle, p):
return p
p += 1
if p + len(needle) > len(haystack):
break
window_hash += int(haystack[p+len(needle)-1])-int(haystack[p-1])
return -1
when isMainModule:
assert(find("alpha", "a") == 0)
assert(find("beta", "a") == 3)
assert(find("gamma", "a") == 1)
assert(find("gamma", "b") < 0)
assert(find("foo", "oo") == 1)
assert(find("bar", "ro") < 0)
assert(find("foobar", "ar") == 4)
assert(find("foobar", "oo") == 1)
assert(find("foobar", "fo") == 0)
assert(find("haystack", "hay") == 0)
assert(find("haystack", "sta") == 3)
assert(find("haystack", "needle") < 0)
assert(find("cbacbabc", "abc") == 5)
Jehan,
Terribly bad of me not to acknowledge your post of the Rabin-Karp implementation in Nimrod earlier. Thanks!
Kris