nimforum mirror - XML parsing performance

tcheran [2021-04-24T15:42:34+02:00]

Hi to all, I'm new to Nim, but from what I've seen so far, I like it a lot. I really hope Nim's community and adoption will grow. I was wondering if I could use Nim for some specific tasks I have to deal with, and one of these is parsing (huge) XML files. I've looked a bit at the online documentation and I started with something reasonably simple. I got the SwissProt.xml file, a 109 MB uncompressed XML file that can be downloaded from: http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/www/repository.html I decided to extract only the content of the element <Species>, along with the "id" attribute of its parent and the element tag, and write the output to a text file. I'm forced to use stream mode and on-the-fly parsing, because my real target XML files contain no newlines and are more than 10 GB in size.

Here's the code:

import streams, parsexml, times
var filename = "SwissProt.xml"

var s = newFileStream(filename, fmRead)
if s == nil: quit("cannot open the file " & filename)

var attrkey = ""
var attrval = ""
var elemstart = ""
var data = ""
var line = ""

let time = cpuTime()
let f = open("nim_output.csv", fmWrite)
var x: XmlParser
open(x, s, filename)
while true:
  #walk through XML
  case x.kind
  of xmlAttribute:
    attrkey = x.attrKey
    if attrkey == "id":
      attrval = x.attrValue
  
  of xmlElementStart:
    elemstart = x.elementName
  
  of xmlCharData:
    data = x.charData
    if elemstart == "Species":
      line = attrval & ";" & elemstart & ";" & data
      f.writeLine(line)
  of xmlEof: break # end of file reached
  else: discard # ignore other events
  x.next()
echo "Time taken: ", cpuTime() - time
x.close()
f.close() # explicitly flush and close the output file

The output is like this:

100K_RAT;Species;Rattus norvegicus (Rat) 104K_THEPA;Species;Theileria parva 108_LYCES;Species;Lycopersicon esculentum (Tomato) 10KD_VIGUN;Species;Vigna unguiculata (Cowpea) 110K_PLAKN;Species;Plasmodium knowlesi 11S3_HELAN;Species;Helianthus annuus (Common sunflower)

(...)

However, I'm not that happy with the performance. I'm using Nim 1.2 with no special compilation options. On my laptop (a quite old DELL Latitude 5480, Intel Core i5 7th gen, 8 GB RAM, Windows 10) execution takes around 20-21 secs, while the same task in Python 3.8, using the lxml library for XML parsing and its iterparse construct, takes around 8-9 secs. I don't expect Nim to be necessarily faster (lxml is a quite popular and efficient library written in C), but maybe in the same order of magnitude. Am I coding the wrong way for this kind of task? I also tried to replace copy-assignments like attrval = x.attrValue with shallowCopy, but I didn't gain that much (possibly 1 sec). Is there a fast way to walk through XML nodes / elements (especially when I need to skip most of them)? Thank you.

SolitudeSF [2021-04-24T17:49:16+02:00]

-d:danger

tcheran [2021-04-24T20:00:40+02:00]

WOW! Fantastic !!! With the -d:danger option the execution time of the posted code dropped to slightly more than 3 secs. If applied to the version using shallowCopy, even less, a bit over 2 secs! Thank you SolitudeSF. P.S. I just hope that applying -d:danger did not modify the space-time continuum! ;)

ynfle [2021-04-24T20:27:28+02:00]

-d:danger removes bounds checks and other safety mechanisms; see https://nim-lang.org/docs/nimc.html#additional-compilation-switches. You could use any of -d:release, --opt:speed, -d:lto. Also try changing the garbage collector to --gc:arc (though I'd suggest using the latest version of Nim, which is 1.4.6, as there have been many bug fixes along the way).
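To see what a given set of switches actually did, a program can report its own build mode. The following is a minimal sketch, not something from the thread: the name buildMode is made up for illustration, and it assumes the conventional release and danger define symbols plus the built-in compileOption check.

```nim
# Hypothetical helper: reports how this binary was compiled.
# `buildMode` is an illustrative name, not part of the standard library.
const buildMode =
  when defined(danger): "danger"
  elif defined(release): "release"
  else: "debug"

echo "build mode: ", buildMode
# compileOption is evaluated at compile time and reflects the active switches.
echo "bound checks: ", compileOption("boundChecks")
echo "overflow checks: ", compileOption("overflowChecks")
```

Compiling the same file once with no options and once with -d:danger should make the difference in enabled checks visible.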

tcheran [2021-04-24T20:45:57+02:00]

Thank you for your suggestions ynfle. Yeah, I probably should move to 1.4.6 and learn more about compile options.

doofenstein [2021-04-24T22:01:40+02:00]

what can help as well is putting the code into a function, so that the variables become local instead of global, which gives the compiler more optimisation opportunities, like so (I also changed some stylistic things :D):

import streams, parsexml, times

proc main() =
  const filename = "SwissProt.xml"
  
  let s = newFileStream(filename, fmRead)
  if s == nil: quit("cannot open the file " & filename)
  
  var
    attrkey, attrval, elemstart, data, line: string
    x: XmlParser
  
  let
    time = cpuTime()
    f = open("nim_output.csv", fmWrite)
  
  open(x, s, filename)
  while true:
    #walk through XML
    case x.kind
    of xmlAttribute:
      attrkey = x.attrKey
      if attrkey == "id":
        attrval = x.attrValue
    
    of xmlElementStart:
      elemstart = x.elementName
    
    of xmlCharData:
      data = x.charData
      if elemstart == "Species":
        line = attrval & ";" & elemstart & ";" & data
        f.writeLine(line)
    of xmlEof: break # end of file reached
    else: discard # ignore other events
    x.next()
  echo "Time taken: ", cpuTime() - time
  x.close()
  f.close() # explicitly flush and close the output file
main()
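On the question of skipping most nodes cheaply: the loop above copies every attribute key, element name and text node into a variable, even for events that are then ignored. A possible variant, sketched here under assumptions, compares the parser's fields in place and copies only on a match. The proc name extractSpecies, its name parameter and the sample document are hypothetical, and it has not been timed against the versions above. Like the original, it keeps the most recent "id" attribute seen on any element.

```nim
import streams, parsexml

proc extractSpecies(input: Stream, name = "input.xml"): seq[string] =
  ## Collects "id;Species;text" lines, copying strings only for matches.
  ## (Hypothetical helper for illustration.)
  var
    x: XmlParser
    currentId = ""      # last "id" attribute value seen
    inSpecies = false   # true while inside a <Species> element
  open(x, input, name)
  while true:
    case x.kind
    of xmlAttribute:
      if x.attrKey == "id":         # compare in place, copy only on match
        currentId = x.attrValue
    of xmlElementStart, xmlElementOpen:
      inSpecies = x.elementName == "Species"
    of xmlElementEnd:
      inSpecies = false
    of xmlCharData:
      if inSpecies:
        result.add(currentId & ";Species;" & x.charData)
    of xmlEof: break
    else: discard                   # ignore other events
    x.next()
  x.close()                         # also closes the input stream

# Small in-memory sample standing in for SwissProt.xml.
let sample = newStringStream("""<root><Entry id="100K_RAT" class="STANDARD"><AC>Q61294</AC><Species>Rattus norvegicus (Rat)</Species></Entry></root>""")
for line in extractSpecies(sample):
  echo line
```

For a multi-gigabyte file the lines would be written to the output file as they are found rather than collected in a seq; the seq is used here only to keep the sketch easy to check.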

tcheran [2021-04-25T10:18:29+02:00]

You're a gentle community with rookies... I like it. :)

cblake [2021-04-25T13:06:31+02:00]

FWIW, doofenstein's version ran in 1.24 seconds for me on an 11-year-old laptop on Linux, using the data you linked to in your thread-opening post, and a gcc-10.3 PGO build ran in 1.05 seconds. On more modern server HW it ran in 0.4 seconds.

So, chances are high that if it's still taking 2 seconds for you then you have more to learn about compile strategies to get fast run-times.

Re: PGO there is more info in this thread.

And welcome!

enthus1ast [2021-04-25T16:04:41+02:00]

@Araq this comes up very often lately; maybe a compile-time hint à la "for a release build, compile with -d:release" would be a good idea.

Araq [2021-04-25T17:50:00+02:00]

It already says "Debug build", nobody reads it. The basic tutorials also mention -d:release.

timothee [2021-04-26T22:34:59+02:00]

Hmm, looks like opt:speed might interfere with debugging

Can you show such an example (ideally reduced)?

I could see how that would be the case with -d:nimStackTraceOverride --import:libbacktrace --debugger:native (see also: --passc:"-fno-omit-frame-pointer -fno-optimize-sibling-calls" refs https://github.com/status-im/nim-libbacktrace/pull/10), but how would it affect --stacktrace:on, which uses instrumentation that shouldn't be affected by backend optimization options ?

(see also https://github.com/nim-lang/Nim/pull/13582 which relates to this topic)

didlybom [2021-04-27T00:29:26+02:00]

IMHO if -d:debug stays the default, the final message should very clearly and explicitly mention -d:release.

However my vote is to change the default to -d:release as @Araq suggested. Experienced users can always shift to -d:debug when needed.

xigoi [2021-04-27T06:35:31+02:00]

I vote for staying with -d:debug. You debug a program much more often than release it. And most compiled languages do it like this.

lscrd [2021-04-27T09:50:08+02:00]

I agree. And if we choose release mode as default, some people will complain that there is not enough assistance when an error occurs.

cumulonimbus [2021-04-27T10:01:33+02:00]

Maybe a command like "nim faster" which would just give a short synopsis of -d:release and -d:danger. It has been my experience that people do poke at command line options with "--help and friends" even when they don't read the man pages or properly search for results on google/manual/stackoverflow.

If adopted, it should also be encouraged to add it to tutorials: "Remember: to make your program run more quickly do 'nim faster'" - that's something that people are much more likely to notice and remember than the extensive list of specific relevant options.

cblake [2021-04-27T11:35:23+02:00]

The problem with nim faster is that it hijacks a slot presently used to select between backends: nim c vs. nim js vs. nim cpp.

What about:


Hint: 80712 lines; 1.693s; 147.246MiB peakmem [SuccessX]
proj: file.nim; out: file [SuccessX]
***SLOW, DEBUG BUILD***; -d:release to make code run faster. [BuildMode]

Also, @xigoi, there currently is no -d:debug to stay with. So, you almost certainly meant "stay with non-release mode by default", but I think everyone interpreted your post correctly based on your 2nd&3rd sentences. :-)

Araq [2021-04-27T11:40:57+02:00]

SLOW, DEBUG BUILD; -d:release to make code run faster. [BuildMode]

Yeah, let's please use this solution.

jyapayne [2021-04-27T15:42:55+02:00]

I assume that stacktraces will be unaffected, but it depends on what you mean by debugging. For me, I consider debugging to be mostly using gdb with the native debugging compiler option. But if you're talking about print debugging and using nim's stacktraces, that should be fine.

From here again, it says -O1, -O2, -O3, -Ofast all have the potential to affect native debugging (most likely if using the native debugging compiler option and trying to step through with gdb).

cblake [2021-04-27T16:21:08+02:00]

This PR might be a good start. It is fundamentally an empirical question whether people will pay attention more (to anything, really). Might be worth backporting to expand the range/immediacy of the psychological testing. :-)

Mirror of forum.nim-lang.org

7848 :: XML parsing performance