I'm using Nim's XML parser to split the Wikipedia database dump (50 GB) into individual records, with 1 record = 1 Wikipedia article. This works and is very fast. However, when searching the records with a regex, things slow down considerably: blazing fast XML parsing, super slow regex. At the current speed it will take over 10 days to process the file; by comparison, awk can process it in 8 hours. The program keeps a running count of the number of matches of searchRe within each article (article[TEXT]) using findIter() from the nre module:
let searchRe = re"archive[.]org/w?e?b?/?[0-9]{1,14}/|[{][{][ ]?[Ww]ayback"
for ss in article[TEXT].findIter(searchRe):
  artcount = artcount + 1
Is there a faster way using regex? I'm not stuck on regex, but it's the only method I know. I'm aware of pegs, but I wouldn't know how to translate this to pegs, or whether it would be any faster.
You could try the re module instead of nre, or write your own parser using parseutils: http://nim-lang.org/docs/parseutils.html
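For the hand-rolled route, a rough sketch of what that could look like (this is my own approximation of your regex using strutils and parseutils, the proc name is invented, and I haven't measured whether it's actually faster):

import parseutils, strutils

proc countArchiveLinks(text: string): int =
  ## Counts occurrences of roughly the same two patterns as the regex:
  ## archive.org/w?e?b?/?[0-9]{1,14}/  and  {{ ?[Ww]ayback
  var i = 0
  while i < text.len:
    if text.continuesWith("archive.org/", i):
      var k = i + len("archive.org/")
      # optional 'w', 'e', 'b' and '/', mirroring w?e?b?/?
      for c in ['w', 'e', 'b', '/']:
        if k < text.len and text[k] == c: inc k
      # 1..14 digits followed by '/'
      let digits = skipWhile(text, Digits, k)
      if digits in 1..14 and k + digits < text.len and text[k + digits] == '/':
        inc result
      i = k
    elif text.continuesWith("{{", i):
      var k = i + 2
      if k < text.len and text[k] == ' ': inc k
      if text.continuesWith("Wayback", k) or text.continuesWith("wayback", k):
        inc result
      i = k
    else:
      inc i

No idea whether this actually beats PCRE; the point is just that the two literal prefixes make a regex-free scan fairly straightforward to write.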
Another idea is that too many strings are being allocated.
Do you have the full code somewhere, or another way to reproduce the slowdown? It would be interesting to look at with callgrind, for example.
Source: http://pastebin.com/RR7fnksY
Sample Wikipedia XML (7 MB unzipped): http://bachlab.balbach.net/wikipedia.zip
Yep, much faster without creating strings, just using the regular re module. The only differences are import re instead of import nre, plus:
echo "checking ", article[TITLE]
var pos = -1
while pos < article[TEXT].len:
pos = article[TEXT].find(mySearchRe, pos + 1)
if pos == -1: break
inc artcount
echo " found ", artcount
This notably does not allocate any data structures.
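For anyone who wants to try the loop standalone, here is a minimal self-contained sketch of the same idea (countMatches is just a name I made up; the regex is the one from the original post):

import re

let searchRe = re"archive[.]org/w?e?b?/?[0-9]{1,14}/|[{][{][ ]?[Ww]ayback"

proc countMatches(text: string, pattern: Regex): int =
  # re.find returns the start index of the next match or -1,
  # so no match strings are allocated while counting
  var pos = -1
  while pos < text.len:
    pos = text.find(pattern, pos + 1)
    if pos == -1: break
    inc result

echo countMatches("see https://web.archive.org/web/20160101/ and {{Wayback", searchRe)  # prints 2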
Time without regexp: 0.107 s
Time with nre.findIter: 34.687 s
Time with re.find: 0.113 s
Looks fine now. Full code here with some minor style fixes: https://gist.github.com/def-/c36ce30571bdfcd08d5902fb88cbcd5b
Awesome, thanks! Comparing with awk:
Nim (speed optimized): 0:00.90
GNU awk: 0:01.69
Nearly 90% faster. My faith in Nim's advantage is restored :)
This comment is a bit off topic, but related to regex parsers.
As Nim is still pre-1.0, shouldn't we pick the fastest regex parser as the default and give it the default name?
Since both nre and re libs use PCRE as the regex engine, that suggests they should be similar in speed.
So maybe nre needs some optimisation, or maybe it's a matter of understanding how best to use nre, or just improved documentation, or ...?
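For instance (pure speculation on my part, not benchmarked), one usage difference is that findIter constructs a RegexMatch for every hit; counting via nre.find and only reading the match bounds would look roughly like this:

import nre, options

proc countMatchesNre(text: string, pattern: Regex): int =
  var pos = 0
  while pos <= text.len:
    let m = text.find(pattern, pos)
    if m.isNone: break
    inc result
    # advance past the match; the patterns discussed here never match
    # an empty string, so this always makes progress
    pos = m.get.matchBounds.b + 1

Whether that actually avoids the slowdown is an open question without profiling.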
Well, if a truly fast regex engine were needed, it probably shouldn't be based on backtracking in the first place.
Of course, that also means it couldn't handle backreferences, and the non-linear run time of backtracking doesn't matter most of the time anyway.