Hi all, I'm new to Nim, but from what I've seen so far, I like it a lot. I really hope Nim's community and adoption will grow.
I was wondering if I could use Nim for some specific tasks I have to deal with, one of which is parsing (huge) XML files. I've looked through the online documentation and started with something reasonably simple. I got the SwissProt.xml file, 109 MB of uncompressed XML, which can be downloaded from: http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/www/repository.html
I decided to extract only the content of the <Species> element, along with the "id" attribute of its parent and the element tag, and write the output to a text file. I'm forced to use stream mode and on-the-fly parsing, because my real target XML files contain no newlines and are larger than 10 GB.
Here's the code:
import streams, parsexml, times
var filename = "SwissProt.xml"
var s = newFileStream(filename, fmRead)
if s == nil: quit("cannot open the file " & filename)
var attrkey = ""
var attrval = ""
var elemstart = ""
var data = ""
var line = ""
let time = cpuTime()
let f = open("nim_output.csv", fmWrite)
var x: XmlParser
open(x, s, filename)
while true:
  # walk through XML
  case x.kind
  of xmlAttribute:
    attrkey = x.attrKey
    if attrkey == "id":
      attrval = x.attrValue
  of xmlElementStart:
    elemstart = x.elementName
  of xmlCharData:
    data = x.charData
    if elemstart == "Species":
      line = attrval & ";" & elemstart & ";" & data
      f.writeLine(line)
  of xmlEof: break # end of file reached
  else: discard # ignore other events
  x.next()
echo "Time taken: ", cpuTime() - time
x.close()
The output is like this:
100K_RAT;Species;Rattus norvegicus (Rat)
104K_THEPA;Species;Theileria parva
108_LYCES;Species;Lycopersicon esculentum (Tomato)
10KD_VIGUN;Species;Vigna unguiculata (Cowpea)
110K_PLAKN;Species;Plasmodium knowlesi
11S3_HELAN;Species;Helianthus annuus (Common sunflower)
(...)
However, I'm not that happy with the performance. I'm using Nim 1.2 with no special compilation options. On my laptop (a fairly old Dell Latitude 5480, Intel Core i5 7th gen, 8 GB RAM, Windows 10) execution takes around 20-21 secs, while the same task in Python 3.8, using the lxml library for XML parsing and its iterparse construct, takes around 8-9 secs. I don't expect Nim to be necessarily faster (lxml is a quite popular and efficient library written in C), but maybe within the same order of magnitude. Am I coding this the wrong way for this kind of task? I also tried replacing copy-assignments like attrval = x.attrValue with shallowCopy, but I didn't gain much (possibly 1 sec). Is there a fast way to walk through XML nodes / elements (especially when I need to skip most of them)? Thank you.
What can help as well is putting the code into a proc, so that the variables become local instead of global, which gives the compiler more optimisation opportunities, like so (I also changed some stylistic things :D):
import streams, parsexml, times
proc main() =
  const filename = "SwissProt.xml"
  let s = newFileStream(filename, fmRead)
  if s == nil: quit("cannot open the file " & filename)
  var
    attrkey, attrval, elemstart, data, line: string
    x: XmlParser
  let
    time = cpuTime()
    f = open("nim_output.csv", fmWrite)
  open(x, s, filename)
  while true:
    # walk through XML
    case x.kind
    of xmlAttribute:
      attrkey = x.attrKey
      if attrkey == "id":
        attrval = x.attrValue
    of xmlElementStart:
      elemstart = x.elementName
    of xmlCharData:
      data = x.charData
      if elemstart == "Species":
        line = attrval & ";" & elemstart & ";" & data
        f.writeLine(line)
    of xmlEof: break # end of file reached
    else: discard # ignore other events
    x.next()
  echo "Time taken: ", cpuTime() - time
  x.close()
main()
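For anyone who wants to try the event loop without downloading the 109 MB file, here is a minimal sketch that runs the same logic on a tiny in-memory document. The sample XML is made up (it only assumes that the real file nests a <Species> element inside an element carrying an "id" attribute), and `extractSpecies` is a hypothetical helper name, not part of parsexml.

```nim
import streams, parsexml

# Made-up miniature input; the layout of the real SwissProt.xml is assumed
# to be similar (an element with an "id" attribute wrapping <Species>).
const sample = """<root>
<Entry id="100K_RAT"><Species>Rattus norvegicus (Rat)</Species></Entry>
<Entry id="104K_THEPA"><Species>Theileria parva</Species></Entry>
</root>"""

proc extractSpecies(xml: string): seq[string] =
  ## Collects "id;Species;text" lines using the same event loop as above.
  var
    attrval, elemstart: string
    x: XmlParser
  open(x, newStringStream(xml), "sample.xml")
  while true:
    case x.kind
    of xmlAttribute:
      # remember the most recent "id" attribute
      if x.attrKey == "id":
        attrval = x.attrValue
    of xmlElementStart:
      # start tag of an element without attributes, e.g. <Species>
      elemstart = x.elementName
    of xmlCharData:
      if elemstart == "Species":
        result.add(attrval & ";" & elemstart & ";" & x.charData)
    of xmlEof: break # end of input reached
    else: discard    # ignore other events
    x.next()
  x.close()

for line in extractSpecies(sample):
  echo line
```

Note that elements with attributes are reported as xmlElementOpen followed by xmlAttribute events, while xmlElementStart is only emitted for elements without attributes, which is why the loop sees the "id" value before it sees Species.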
FWIW, for me doofenstein's version ran in 1.24 seconds on an 11 year old laptop on Linux on your input and a gcc-10.3 PGO build ran in 1.05 seconds (with the data you linked to in your thread-opening post). On more modern server HW it ran in 0.4 seconds.
So, chances are high that if it's still taking 2 seconds for you then you have more to learn about compile strategies to get fast run-times.
Re: PGO there is more info in this thread.
And welcome!
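On the compile-strategy point: a small sketch, assuming the usual -d:release / -d:danger defines, that makes a program report which mode it was built in next to its timing, so a benchmark number is never quoted without it. `buildMode` is a made-up helper name, the file name in the comment is a placeholder, and the loop is only a stand-in workload.

```nim
import times

# Build with e.g. `nim c -d:release buildmode.nim` (placeholder file name);
# without -d:release or -d:danger this reports "debug".
proc buildMode(): string =
  ## Hypothetical helper: names the mode this binary was compiled in.
  when defined(danger):
    result = "danger"
  elif defined(release):
    result = "release"
  else:
    result = "debug"

let t0 = cpuTime()
var total = 0
for i in 1 .. 10_000: # stand-in workload
  total += i
echo "Time taken: ", cpuTime() - t0, " (", buildMode(), " build), sum = ", total
```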
Hmm, looks like opt:speed might interfere with debugging
Can you show such an example (ideally reduced)?
I could see how that would be the case with -d:nimStackTraceOverride --import:libbacktrace --debugger:native (see also: --passc:"-fno-omit-frame-pointer -fno-optimize-sibling-calls" refs https://github.com/status-im/nim-libbacktrace/pull/10), but how would it affect --stacktrace:on, which uses instrumentation that shouldn't be affected by backend optimization options?
(see also https://github.com/nim-lang/Nim/pull/13582 which relates to this topic)
IMHO if -d:debug stays the default, the final message should very clearly and explicitly mention -d:release.
However my vote is to change the default to -d:release as @Araq suggested. Experienced users can always shift to -d:debug when needed.
Maybe a command like "nim faster" which would just give a short synopsis of -d:release and -d:danger. It has been my experience that people do poke at command line options with "--help and friends" even when they don't read the man pages or properly search for results on Google, the manual, or Stack Overflow.
If adopted, adding it to tutorials should also be encouraged: "Remember: to make your program run more quickly do 'nim faster'". That's something people are much more likely to notice and remember than the extensive list of specific relevant options.
The problem with `nim faster` is that it hijacks a slot presently used to select between backends: `nim c` vs. `nim js` vs. `nim cpp`.
What about:
Hint: 80712 lines; 1.693s; 147.246MiB peakmem [SuccessX]
proj: file.nim; out: file [SuccessX]
***SLOW, DEBUG BUILD***; -d:release to make code run faster. [BuildMode]
Also, @xigoi, there currently is no -d:debug to stay with. So, you almost certainly meant "stay with non-release mode by default", but I think everyone interpreted your post correctly based on your 2nd & 3rd sentences. :-)
SLOW, DEBUG BUILD; -d:release to make code run faster. [BuildMode]
Yeah, let's please use this solution.
I assume that stacktraces will be unaffected, but it depends on what you mean by debugging. For me, I consider debugging to be mostly using gdb with the native debugging compiler option. But if you're talking about print debugging and using nim's stacktraces, that should be fine.
From here again, it says -O1, -O2, -O3, -Ofast all have the potential to affect native debugging (most likely if using the native debugging compiler option and trying to step through with gdb).
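To see which of these switches actually ended up in a given binary, something like the following sketch may help. It relies on `compileOption` from the system module and assumes that "stacktrace" and "opt" are accepted as option names there; `describeBuild` is a hypothetical helper.

```nim
proc describeBuild(): string =
  ## Hypothetical helper: reports two compile-time switches of this binary.
  when compileOption("stacktrace"):
    result = "stacktrace:on"
  else:
    result = "stacktrace:off"
  when compileOption("opt", "speed"):
    result.add " opt:speed"
  elif compileOption("opt", "size"):
    result.add " opt:size"
  else:
    result.add " opt:none"

echo describeBuild()
```

This only shows what the Nim compiler was asked to do; whether gdb stepping is still comfortable at a given backend optimization level is a separate question.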