I'm using Nim's XML parser to split the Wikipedia database dump (50 GB) into individual records, with 1 record = 1 Wikipedia article. This works and is very fast. However, when searching the records with a regex, things slow down considerably: blazing fast XML parsing, super slow regex. At the current speed it will take over 10 days to process the file; by comparison, awk can process it in 8 hours. The program keeps a running count of the number of matches of searchRe within each article (article[TEXT]) using findIter() from the nre module:
let searchRe = re"archive[.]org/w?e?b?/?[0-9]{1,14}/|[{][{][ ]?[Ww]ayback"
for ss in article[TEXT].findIter(searchRe):
  artcount = artcount + 1
Is there a faster way using regex? I'm not stuck on regex, but it's the only method I know. I'm aware of pegs, but I wouldn't know how to translate this to pegs, or whether it would be any faster.
You could try the re module instead of nre, or write your own parser using parseutils: http://nim-lang.org/docs/parseutils.html
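For the hand-rolled route, a rough sketch of what that could look like (this is my own approximation of your regex using strutils and parseutils, the proc name is invented, and I haven't measured whether it's actually faster):

import parseutils, strutils

proc countArchiveLinks(text: string): int =
  ## Counts occurrences of roughly the same two patterns as the regex:
  ## archive.org/w?e?b?/?[0-9]{1,14}/  and  {{ ?[Ww]ayback
  var i = 0
  while i < text.len:
    if text.continuesWith("archive.org/", i):
      var k = i + len("archive.org/")
      # optional 'w', 'e', 'b' and '/', mirroring w?e?b?/?
      for c in ['w', 'e', 'b', '/']:
        if k < text.len and text[k] == c: inc k
      # 1..14 digits followed by '/'
      let digits = skipWhile(text, Digits, k)
      if digits in 1..14 and k + digits < text.len and text[k + digits] == '/':
        inc result
      i = k
    elif text.continuesWith("{{", i):
      var k = i + 2
      if k < text.len and text[k] == ' ': inc k
      if text.continuesWith("Wayback", k) or text.continuesWith("wayback", k):
        inc result
      i = k
    else:
      inc i

No idea whether this actually beats PCRE; the point is just that the two literal prefixes make a regex-free scan fairly straightforward to write.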
Another idea is that too many strings are being allocated.
Do you have the full code somewhere, or another way to reproduce the slowdown? It would be interesting to look at with callgrind, for example.
Source: http://pastebin.com/RR7fnksY
Sample Wikipedia XML (7 MB unzipped): http://bachlab.balbach.net/wikipedia.zip
Yep, much faster without creating strings, just using the regular re module. The only differences are import re instead of import nre, plus:
echo "checking ", article[TITLE]
var pos = -1
while pos < article[TEXT].len:
pos = article[TEXT].find(mySearchRe, pos + 1)
if pos == -1: break
inc artcount
echo " found ", artcount
This notably does not allocate any data structures.
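For anyone who wants to try the loop standalone, here is a minimal self-contained sketch of the same idea (countMatches is just a name I made up; the regex is the one from the original post):

import re

let searchRe = re"archive[.]org/w?e?b?/?[0-9]{1,14}/|[{][{][ ]?[Ww]ayback"

proc countMatches(text: string, pattern: Regex): int =
  # re.find returns the start index of the next match or -1,
  # so no match strings are allocated while counting
  var pos = -1
  while pos < text.len:
    pos = text.find(pattern, pos + 1)
    if pos == -1: break
    inc result

echo countMatches("see https://web.archive.org/web/20160101/ and {{Wayback", searchRe)  # prints 2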
Time without regexp: 0.107 s
Time with nre.findIter: 34.687 s
Time with re.find: 0.113 s
Looks fine now. Full code here with some minor style fixes: https://gist.github.com/def-/c36ce30571bdfcd08d5902fb88cbcd5b
Awesome, thanks! Comparing with awk:
Nim (speed optimized): 0:00.90
GNU awk: 0:01.69
Nearly 90% faster. My faith in Nim's advantage is restored :)
This comment is a bit off topic, but related to regex parsers.
As Nim is still pre-1.0, shouldn't we pick the fastest regex parser as the default and give it the default name?
Since both nre and re libs use PCRE as the regex engine, that suggests they should be similar in speed.
So maybe nre needs some optimisation, or maybe it's a matter of understanding how best to use nre, or just improved documentation, or ...?
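For instance (pure speculation on my part, not benchmarked), one usage difference is that findIter constructs a RegexMatch for every hit; counting via nre.find and only reading the match bounds would look roughly like this:

import nre, options

proc countMatchesNre(text: string, pattern: Regex): int =
  var pos = 0
  while pos <= text.len:
    let m = text.find(pattern, pos)
    if m.isNone: break
    inc result
    # advance past the match; the patterns discussed here never match
    # an empty string, so this always makes progress
    pos = m.get.matchBounds.b + 1

Whether that actually avoids the slowdown is an open question without profiling.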
Well, if a truly fast regex engine were needed, it probably shouldn't be based on backtracking in the first place.
Of course, that also means it couldn't handle backreferences, and the non-linear run time of backtracking doesn't matter most of the time anyway.