nimforum mirror - strutils find

adrianv (orginal) [2019-03-21T12:53:33+01:00] view original

playing around with text search (strutils.find) I realized, that the nim find implementation uses a Boyer-Moore algorithm. This is nice in many cases, but there are scenarios where a naive brute force search is faster (small search patterns and/or a simple one shot search like find("nim is a great programming language", "great")). The situation gets worse, when a stupid user like me uses the find repeatedly without precalculating the SkipTable. - I know stupid programmers write stupid code.

Now I tried to find a better find algorithm. After some googling my first stop was a Knuth-Morris-Pratt, but in all my benchmarks a simple brute force find was better (Maybe someone can show me an artificial search string and pattern where KMP shines? ). Than I had a genius idea and "invented" a rolling hash search. After implementing it, I googled again and found out that I re-"invented" a naive form of the Rabin-Karp algorithm :) But to my astonishment my naive implementation works very well. In all my tests, it is always better than the naive brute force search. In many cases it beats the BM and when it looses it is not so bad. But I want to make some more tests to be sure. Can someone show me some real world test where BM shines - my current observation is the longer the pattern, the better is BM.

When it turns out like I guess, I would like to suggest the following for the strutils find:

find a cut off size for patterns where BM is likely to be better than RK

for smaller patterns use the RK

for longer patterns use the BM

what do you think ?

Araq (orginal) [2019-03-21T14:49:57+01:00] view original

Well one easy heuristic is to only use the lookup table if the search substring's length is >= 4 and otherwise use the bruteforce algorithm. It's of course tricky to figure out a good criterion for a standard library implementation. PRs are welcome.

dataman (orginal) [2019-03-21T14:53:20+01:00] view original

There is awesome String Matching Algorithms Research Tool.

adrianv (orginal) [2019-03-25T12:30:58+01:00] view original

@dataman thanks that was a really helpful link.

@araq I tested my simple hash algorithm against brute forth and BM with that smart tool. It always wins against brute forth. Against BM it wins always for pattern lengths under 16. Given that the BM has always an overhead for the first search I would use a heuristic with pattern under 16 or search text under 100 char.

Mirror of forum.nim-lang.org

4730 :: strutils find