I'm trying to use Nim to write a fast filter to convert MediaWiki XML files (such as the one you can download with the complete text of Wikipedia) into plain, un-marked-up text.
When I try this code on the French Wikipedia download ( https://dumps.wikimedia.org/frwiki/20150119/frwiki-20150119-pages-articles.xml.bz2 ), with its input being piped from bunzip2 and its output being redirected to a file, I find that it slows down dramatically after a few minutes, after writing 500 MB or so. I ran it overnight and found it nearly stalled after 600 MB. Its memory usage also seems to steadily increase. This looks like a memory leak.
The output at first is blazingly fast and makes me feel great about trying Nim. The fact that it seems to suddenly hit a wall makes me feel less great.
Am I doing something wrong, or is the garbage collector doing something wrong, and what can I do about it?
The code is at: http://pastebin.com/WHfDduAC
I don't know enough about the garbage collector to find out whether something is leaking. It sure looked that way to me with your code and I didn't see you doing anything weird. A simple benchmark that causes this would be nice. I suspect the stringstream.
A few lessons about speeding this up; the main one is that allocating strings is expensive.
I made a few changes in this direction, and now (after fixing the allocation issues) the main slowdown seems to be PCRE. I got some pointers as to what was slow like this:
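To make that concrete, here is a rough sketch of the kind of change I mean (the proc and variable names are made up for illustration, they are not from your program): instead of building and returning a fresh string per article, keep one buffer around and only reset its length.

import strutils

# Allocates a new result string on every call:
proc cleanArticleSlow(text: string): string =
  result = ""
  for line in text.splitLines:
    result.add(line)
    result.add('\n')

# Reuses a single buffer across calls:
var buffer = newStringOfCap(1 shl 16)  # allocated once

proc cleanArticleFast(text: string) =
  buffer.setLen(0)  # keep the existing allocation, just reset the length
  for line in text.splitLines:
    buffer.add(line)
    buffer.add('\n')
  # process `buffer` here instead of returning a copy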
$ nim -d:release --lineDir:on --debuginfo c wiki
$ gdb ./wiki
(gdb) run < foo.xml > foo
^C
Program received signal SIGINT, Interrupt.
0x00007ffff75eb6a4 in pcre_exec () from /lib64/libpcre.so.1
(gdb) bt
#0 0x00007ffff75eb6a4 in pcre_exec () from /lib64/libpcre.so.1
#1 0x000000000040182e in find_120726 (s=s@entry=0x7ffff4f31040, pattern=0x7ffff7f44050, start=26871)
at /media/nim/lib/impure/re.nim:215
#2 0x00000000004137c8 in filterwikitext_122083 (text=0x7ffff4f31040) at /home/deen/wiki.nim:125
#3 0x00000000004140fb in handlearticle_122462 (article=0x7fffffffd8a0) at /home/deen/wiki.nim:168
#4 0x00000000004142a5 in readmediawikixml_122479 (input=<optimized out>, filename=0x416150 <TMP99>)
at /home/deen/wiki.nim:200
#5 0x0000000000414681 in NimMain () at /media/nim/lib/system.nim:2338
#6 0x0000000000400e3d in main (argc=<optimized out>, args=<optimized out>, env=<optimized out>)
at /media/nim/lib/system.nim:2345
(gdb) c
Repeating this a few times is good enough to see if something specific is taking all the time.
Changes to the parsexml module were necessary as well to prevent copies and new memory allocations: https://github.com/Araq/Nim/pull/2171
The final code is here: https://gist.github.com/def-/53146adcb48b7624dae8
All the changes I made: https://gist.github.com/def-/53146adcb48b7624dae8/revisions
The code is fast enough that bunzip2 now reaches 100% CPU on my system. Before my changes bunzip2 started at about 20% CPU, then went down to 10% after a minute (and so on). I hope this helps.
Edit: I used this file because it was faster to download for me: http://ftp.fi.muni.cz/pub/wikimedia/dewiki/20120719/dewiki-20120719-pages-articles.xml.bz2
Nice! Being able to consume the output of bunzip2 at 100% CPU is an ideal that I had no idea would be so achievable.
I'll try it with these changes and see how much of it is due to allocating strings unnecessarily, and how much is due to new allocations in the XML parser (for which I think I'll wait until the fix is merged into Nim's master branch).
You can probably tell from things like that crazy textBuffer.delete call that I was a bit worried about string allocations, but I didn't know enough Nim tricks to avoid them. I've definitely learned some things from your changes.
But now the cheeky follow-up question: why is allocating strings such a problem?
I'm trying to write this code in Nim as an improvement over a Python script ( http://medialab.di.unipi.it/wiki/Wikipedia_Extractor ) that's kind of slow and doesn't give the output I want anyway. The Python script does get through the file eventually -- I recall it taking less than a day for the French data -- and Python doesn't even let you be clever about allocating strings. It seems that Nim eventually slows down to be slower than Python would be. Why does allocating strings naively cause this to happen?
The Python script does get through the file eventually -- I recall it taking less than a day for the French data
Just tried on the French data, takes about 9 minutes with the new program version and parsexml optimizations.
Something may be broken in the compiler; maybe it's the garbage collector, or somehow the memory gets overly fragmented by the many string allocations. An even smaller test case to reproduce this would help.
I'm not sure yet if your program version slows down for me. Maybe I just didn't have the patience to let it run long enough yet. So far memory stays below 80 MB for me, bunzip2 keeps being utilized at around 3-6%, and I'm at 600 MB of output already. I'm using the current devel version of the compiler on Linux, x86_64, and compiling with -d:release and gcc as the C compiler. Are you on an old Nim compiler version, maybe?
Edit: Ok, output seems to slow down around 700 MB. Not seeing increased memory usage though.
Edit2: Still slowly running at about 1.2 GB. I guess it will finish if you give it all day.
It seems that Nim eventually slows down to be slower than Python would be. Why does allocating strings naively cause this to happen?
How am I supposed to know when you don't even answer my questions? ;-)
@def It slows down (I tested with pv in the chain of shell pipes, which makes testing this trivial). I tried all the different GCs of the compiler and there were some differences, but none of them prevented it from slowing down to at least half of its initial speed. It does not really use a lot of memory though, just a few MB, so any leak would have to be really small, I guess.
I was mostly testing just the XML parser, i.e. just reading all the XML without the processing. I am interested in that because I am going to write something that reads a lot of XML (>200k files of 1-2 MB each), and I love your work on speeding up the xmlparser!
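For reference, this is roughly how I tested; the file and binary names are placeholders, and on the other runs I swapped --gc:markAndSweep for --gc:boehm and the default refc:
$ nim c -d:release --gc:markAndSweep wiki2text.nim
$ bzip2 -d < pages-articles.xml.bz2 | ./wiki2text | pv > /dev/null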
import streams, parsexml, times

proc readMediaWikiXML(input: Stream, filename="<input>") =
  var xml: XmlParser
  xml.open(input, filename, options={reportWhitespace})
  var t = epochTime()
  var i = 0
  while true:
    inc i
    if i > 1_000_000:
      i = 0
      echo epochTime() - t
      t = epochTime()
    xml.next()
    case xml.kind
    of xmlEof:
      break
    else:
      discard
  xml.close

when isMainModule:
  readMediaWikiXML(newFileStream(stdin))
On my system with the devel branch this starts at 0.5 seconds per 1 million iterations, then slows down to 5 seconds or more. With the xmlparse-performance branch from my PR it actually speeds up from 0.5 seconds to 0.2 seconds and stays there for the entire file. The reason is that I removed the need to allocate new memory and run the GC.
But why does the regular version get so much slower? The main difference is that my.c in parsexml keeps being nilled and reallocated. This should of course be slower, but not keep getting slower with time.
Edit: Removing this one line in parsexml.nim makes the problem disappear: my.c = my.a # save for later
The code slows down because my.a is not set to the empty string at the beginning of parsing a tag, but setLen(my.a, 0) is called. While this reuses the my.a buffer, any time my.c = my.a is done, copyString() (the internal routine to perform the string copy) copies the entire buffer, not just the part that is occupied by the string. That is, once my.a has grown to a large size (because a large tag was parsed), the my.c = my.a assignment will take a disproportionate amount of time (and also waste space). You can tell that when profiling because most time is actually spent in the libc function bzero().
If you replace setLen(my.a, 0) in parsexml.rawGetTok() with my.a = "", the problem disappears.
The proper solution would probably be to compact a string when copying it.
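To see this outside of parsexml, here is a minimal standalone sketch, with a and c standing in for the internal my.a and my.c buffers. Whether the slowdown actually reproduces depends on the compiler version and GC, since it relies on the copyString() behaviour described above, so treat the numbers as illustrative only.

import times

proc bench(resetWithSetLen: bool) =
  var a = ""
  for _ in 0 ..< 10_000_000:
    a.add('x')           # one huge "tag" makes the buffer grow
  var c = ""
  let start = epochTime()
  for _ in 0 ..< 1_000:
    if resetWithSetLen:
      a.setLen(0)        # keeps the big allocation around
    else:
      a = ""             # drops it, so later copies stay small
    a.add("tiny")
    c = a                # corresponds to `my.c = my.a` in parsexml
  echo resetWithSetLen, ": ", epochTime() - start, " s"

bench(true)   # the setLen(my.a, 0) variant
bench(false)  # the my.a = "" variant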
It turns out I don't know how to compile a development version of Nim, so I'm still kind of stuck here without the fix.
I tried checking out def-'s branch and following the usual compilation instructions, but I got:
lib/pure/osproc.nim(647, 4) Error: undeclared identifier: 'defer'
What should I do to get the fix to xmlparse?
$ git clone -b xmlparse-performance https://github.com/def-/Nim
$ cd Nim
$ git clone --depth 1 git://github.com/nim-lang/csources
$ cd csources && sh build.sh
$ cd ..
$ bin/nim c koch
$ ./koch boot -d:release
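Once that finishes, you can compile the program with the freshly built compiler, e.g. (assuming the source file is called wiki2text.nim and adjusting paths as needed):
$ export PATH=$PWD/bin:$PATH
$ nim c -d:release wiki2text.nim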
Nice. Built myself a version of Nim from that branch, and managed to get something that chugs through the Dutch Wikipedia in 20 minutes. I'll say again that this is way faster than I ever expected it to be.
However, I think this data is revealing more ways to crash the XML parser.
In the Dutch data, I have to disregard an invalid entity that it reads as a character number that can't be converted to UTF-8 (&# followed by a seven-digit number and no semicolon).
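As a hedged sketch of the range-check part of that workaround (the proc is made up for illustration; it is not part of parsexml or of my program): only hand a numeric character reference to toUTF8 if its value is inside the Unicode range, and drop it otherwise.

import unicode, strutils

proc decodeCharRef(digits: string): string =
  ## `digits` is the number after "&#", e.g. "233" or "2190508"
  let code = parseInt(digits)
  if code in 0 .. 0x10FFFF:
    result = toUTF8(Rune(code))
  else:
    result = ""            # disregard invalid references like the seven-digit one above

echo decodeCharRef("233")      # é
echo decodeCharRef("2190508")  # empty: dropped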
In the French data -- I haven't figured out why yet, but I eventually get a SIGSEGV. Here's what it says after 26 seconds, when compiled without -d:release:
Traceback (most recent call last)
wiki2text.nim(217) wiki2text
wiki2text.nim(202) readMediaWikiXML
wiki2text.nim(169) handleArticle
parsexml.nim(152) filterHTML
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
You said you ran it on the French data, right? Any thoughts why this would be happening? Am I encountering an underlying bug in xmlparse, or a bug in the new branch?
You said you ran it on the French data, right? Any thoughts why this would be happening?
I'm just trying again on the French data. No crash after a few minutes of running, on my xmlparse-performance branch with the code from https://gist.github.com/def-/53146adcb48b7624dae8 and no compiler arguments (no -d:release). The data I use is this one: ftp://ftpmirror.your.org/pub/wikimedia/dumps/frwiki/20150119/frwiki-20150119-pages-articles.xml.bz2 and I'm calling it with bzip2 -d < frwiki-20150119-pages-articles.xml.bz2 | ./wiki2text > foo.
Are you using another wiki2text.nim? The line numbers don't seem to correspond to what I see.
Edit: I do get another interesting crash though:
Traceback (most recent call last)
wiki.nim(215) wiki
wiki.nim(200) readMediaWikiXML
wiki.nim(168) handleArticle
wiki.nim(97) filterHTML
parsexml.nim(604) next
parsexml.nim(592) getTok
parsexml.nim(585) rawGetTok
parsexml.nim(337) parseEntity
unicode.nim(135) toUTF8
Error: unhandled exception: value out of range: 2190508 [RangeError]
Edit2: Now with a fix: https://github.com/Araq/Nim/pull/2253
Sorry, yeah, I'm running the version I mentioned that's on Github: https://github.com/rspeer/wiki2text
The crash you describe is the one I mentioned that happens in the Dutch data.