nimforum mirror - Is strutils find slow?

kris (orginal) [2014-05-19T11:35:18+02:00] view original

Hi,

I am new to nimrod, and find it really, really great.

Text processing matters to me, so I created a quick program in nimrod to search a 32MB, 600K+ line log file for occurrences of a particular string.

Here is my program:

[start program ff1.nim] import strutils

for line in stdin.lines:

if strutils.find(line, "02 01 08 45") > -1:: echo(line)

[end program ff1.nim]

The program is compiled with the following command line:

C:\programs\nimrod\bin\nimrod.exe c -d:release ff1.nim

The program compiles without errors or warnings. BTW, I am using mingw gcc 4.8.1 as my C compiler.

When I run it against the log file, the average time over 3 runs is 2.932 seconds on my machine (an old quad-core Intel i3 with 4GB RAM running Windows 7 Ultimate 32-bit).

By comparison, nimrod is even slower than tcl, lua, gawk, c#, netrexx, rebol, python and perl. Even a naive C implementation using strstr on each line clocks in at 0.44 seconds, more than 6 times faster than nimrod.

I use the exact same input file when testing all other languages and utilities, so if there is any difference in timing, it should be down to the language.

Am I doing anything wrong with my usage of the strutils' find command?

Please don't take this as a criticism or disrespect of the amazing work that has been put into creating nimrod. I'm just trying to make sense of the difference with C and other interpreted languages.

Kris

Varriount (orginal) [2014-05-19T13:11:25+02:00] view original

Ok, here's a checklist for optimization:

Are you compiling with "-d:release" ? This turns off debugging information, and tells the C compiler to optimize as much as possible.

Try putting your code, along with its variables, in a 'main' procedure. This speeds up the program due to optimizations that the compiler can perform.

Try changing the garbage collector used. The default garbage collector is optimized for responsiveness, not throughput, which means that memory is dynamically allocated and deallocated as close to an objects creation and destruction.

I'll see what I can do to optimize things later today.

Lastly, it would be really helpful if we had your different implementations (especially the C implementation) as well as a sample log file to test on.

Jehan (orginal) [2014-05-19T13:31:01+02:00] view original

I think that strutils.find is currently using Boyer-Moore (not 100% sure, would have to look at the code again). This means that it needs to precompute the offset table, which takes much too long when the string that is to be searched is short.

It would probably be a good idea to replace the implementation with one that works better for short strings or use an algorithm that avoids the overhead in general (e.g., Rabin-Karp with a simple and fast hash function).

Araq (orginal) [2014-05-19T13:33:23+02:00] view original

system.lines in the devel branch uses a decent buffer size (I hope) and so should be faster and this is likely the bottleneck here.

kris (orginal) [2014-05-19T17:50:48+02:00] view original

@Varriount, yes, I am using the -d:release option in the compilation command line. I will also try out the 2 other suggestions you've made, but after testing for a larger buffer size as per Araq's post. As for the code and the test file, I'll try to gather it all up and post it ASAP.

@Jehan, I'm not sure which algorithm the GCC C library strstr function uses, but it seems pretty quick to me.

@Araq (Andreas?), I think the buffer size could ultimately be the difference. Would you know how the current nimrod default buffer compares against the GCC's standard C library buffer?

BTW and OT, if anyone is interested on fast literal string searching under Windows, the Windows built-in command findstr.exe gives excellent performance. On this above micro-benchmark, findstr.exe clocks in at 0.181 seconds on my machine, more than 2.4 times faster than the naive C implementation.

Kris

Jehan (orginal) [2014-05-19T20:46:09+02:00] view original

kris: I'm not sure which algorithm the GCC C library strstr function uses, but it seems pretty quick to me.

Glibc uses a variant of the two-way string matching algorithm by Crochemore and Perrin (original paper here), which uses constant space rather than O(len(sub)) or worse. It also optimizes performance for short strings.

Edit: Here's a simple implementation of Rabin-Karp that statistically outperforms the strutils version for short strings:


proc needle_match(haystack, needle: string, start: int): bool {.inline.} =
  for i in 0..len(needle)-1:
    if haystack[start+i] != needle[i]:
      return false
  return true

proc find*(haystack, needle: string, start: int = 0): int =
  # Simple Rabin-Karp search. We use the sum of the ordinal
  # values of the characters as the hash function.
  #
  # This algorithm works generally well on human-readable text,
  # but performance can degenerate due to the rather simple hash
  # function on strings that encode more regular patterns,
  # especially when the alphabet used is very small and
  # permutations of ``needle`` frequently occur in ``haystack``
  # as a consequence.
  #
  # The implementation is intended for relatively small values
  # of ``len(haystack)``, where preprocessing ``needle`` is not
  # worth the effort.
  
  if len(haystack) < len(needle) + start:
    return -1
  case len(needle)
  of 0:
    return start
  of 1:
    let ch = needle[0]
    for i in start..haystack.high:
      if haystack[i] == ch:
        return i
    return -1
  of 2:
    let ch1 = needle[0]
    let ch2 = needle[1]
    for i in start..haystack.high-1:
      if haystack[i] == ch1 and haystack[i+1] == ch2:
        return i
    return -1
  else:
    var needle_hash = 0 # hash value of needle
    var window_hash = 0 # rolling hash value of the window
    var p = start       # position of current window
    for ch in needle:
      needle_hash += int(ch)
    for i in start..start+len(needle)-1:
      window_hash += int(haystack[i])
    while true:
      if window_hash == needle_hash:
        if needle_match(haystack, needle, p):
          return p
      p += 1
      if p + len(needle) > len(haystack):
        break
      window_hash += int(haystack[p+len(needle)-1])-int(haystack[p-1])
    return -1

when isMainModule:
  assert(find("alpha", "a") == 0)
  assert(find("beta", "a") == 3)
  assert(find("gamma", "a") == 1)
  assert(find("gamma", "b") < 0)
  assert(find("foo", "oo") == 1)
  assert(find("bar", "ro") < 0)
  assert(find("foobar", "ar") == 4)
  assert(find("foobar", "oo") == 1)
  assert(find("foobar", "fo") == 0)
  assert(find("haystack", "hay") == 0)
  assert(find("haystack", "sta") == 3)
  assert(find("haystack", "needle") < 0)
  assert(find("cbacbabc", "abc") == 5)

kris (orginal) [2014-05-27T17:43:06+02:00] view original

Jehan,

Terribly bad of me not to acknowledge your post of the Rabin-Karp implementation in Nimrod earlier. Thanks!

Kris

Mirror of forum.nim-lang.org

447 :: Is strutils find slow?