nimforum mirror - Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

markebbert (orginal) [2019-08-17T20:31:29+02:00] view original

Hello,

I'm new to Nim, but was tempted to give it a go because I've heard it has the simplicity of Python and the speed of C. I sat down to write my first Nim script on last week, where I mimicked a script I had already written in Python. I was excited to see just how fast Nim would be. The script's purpose was to traverse a .vcf (Variant Call Format) file that is used to store genetic mutations from DNA sequencing. This particular .vcf file had ~300k lines and ~15k columns.

My original Python script took ~74min to parse the file, perform some simple logic, and print results to an output file. I was shocked that my Nim implementation took ~404min! I originally assumed I was abusing Nim in some way, but using the profiler, I was surprised to learn that 53% of the time was spent in strutil::split (I was splitting by tabs; 't'). I then wrote a minimal example in Nim, where all I do is read through the file and split lines on tabs, just to make sure I wasn't missing something. It still took ~400 minutes. Python takes ~30min using the "same" minimal example. I then learned that Nim's split iterator might speed things up a bit, so I tried that. It definitely helped, but still took ~178min. A Groovy version took ~11 minutes.

With all of the promise of Nim, I'm surprised that this basic function is so slow compared to Python. I'm mostly curious what's causing the difference. Can anyone explain? I've seen related posts here, but they were fairly old and didn't really delve into why.

I realize that Nim is still young, and I'm not trying to be critical, but I was surprised, given everything I've heard about Nim's speed. Am I missing something? Any upcoming updates that will address this?

Really appreciate any feedback.

Here are my three different versions:

Nim (using split; ~400min)

import zip/gzipfiles
import strutils

block:
  let vcf = newGzFileStream("../my_file.vcf.gz")
  
  var
    line: string
    toks: seq[string]
    sample_toks: seq[string]
  
  while not vcf.atEnd():
    line = vcf.readLine()
    
    if not line.startsWith('#'):
      
      toks = line.strip().split('\t')
        
        # Only need sample columns
        for i,sample in toks[9..<toks.len]:
            sample_toks = sample.split(':')

Nim (split iterator; ~178min)

import zip/gzipfiles
import strutils

block:
  let vcf = newGzFileStream("../my_file.vcf.gz")
  
  var
    line: string
    genotype: string
    i: int
  
  i = 0
  while not vcf.atEnd():
    line = vcf.readLine()
    
    if not line.startsWith('#'):
      
      for col in split(line, '\t'):
        
        # Only need sample columns
        if i >= 9:
          for fmt in split(col, ':'):
            genotype = fmt
            
            # Only need the first element (GT)
            break
        
        inc(i)

Python (~30min)

import sys
import gzip

def main(vcf):
    vcf = gzip.open(vcf, 'r')
    for line in vcf:
        if not line.startswith('#'):
            
            toks = line.strip().split('\t')
            
            for index,sample in enumerate(toks[9:]):
                sample_toks = sample.split(':')


if __name__ == "__main__":
    main(sys.argv[1])

Groovy (~11min)


import java.util.zip.GZIPInputStream
import java.io.FileInputStream
import java.util.ArrayList

def gzInputStream = new GZIPInputStream(new FileInputStream("../my_file.vcf.gz"))

gzInputStream.withReader { Reader reader ->
    def i = 0
    def geno
    reader.eachLine { String line ->
        if(!line.startsWith("#")){
            line.split().each{ col ->
                i = 0
                if(i >= 9){
                    geno = col.split(':')
                }
            }
        }
    }
}

Stefan_Salewski (orginal) [2019-08-17T21:29:01+02:00] view original

Have you compiled your Nim program with option -d:release or -d:danger?

Just to ensure that it is not a default debug built.

Of course Nim should be not slower than Python, but I guess it may be not really faster for plain string split. The reason is, that these basic operations are generally executed by optimized C code in Python.

siloamx (orginal) [2019-08-18T00:37:39+02:00] view original

This is not an idiomatic Python code. You can easily make your Python code faster using list comprehensions. Python loops outside the comprehensions are slow. You should prefer using comprehesions and map/filter functions always when it's possible.

import sys import gzip

def main(vcf):: vcf = gzip.open(vcf, 'r') toks = [ line.strip().split('t') for line in vcf if not line.startswith('#')]
: sample_toks = [ sample.split(':') for sample in toks[9:] ]
if __name__ == "__main__":: main(sys.argv[1])

Araq (orginal) [2019-08-18T09:49:00+02:00] view original

I use lexbase and strscans modules and never had performance problems. Parsing with split is naive, it hardly works, so we never really optimized it.

siloamx (orginal) [2019-08-18T10:39:29+02:00] view original

Maybe

'Parsing with split is naive'

but it will be very common esp. among users who migrate from Python/Javascript. It would be nice if you could optimize it.

cblake (orginal) [2019-08-18T11:10:56+02:00] view original

I agree with @siloamx. I would also point out that for many less compiler/parsing-sophisticated programmers, splitting is conceptually simple enough to make all the difference. Such programmers may never even have heard of "lexing". This is just to re-express @Araq's point about it being "naive" in maybe a slightly more accomodating way. Probably, they should learn, but maybe they want to focus on some simple problem. Formats aren't always in a strict/pure table format amenable to parsecsv.

I have some routines in cligen/mslice (https://github.com/c-blake/cligen) that may help for this. Some example usage is in examples/cols.nim. In the mmap mode of cols, I get ~50% the run-time of GNU awk for field splitting, though I have admittedly never tried with 15,000 columns.

One final, related point is that gunzip can be very slow at decompressing and is single-threaded. If the average column width is ~20 Bytes, 300k*15k*20 =~ 90 GB. Working with the uncompressed file directly or using something that can decompress in parallel like Zstd (https://en.wikipedia.org/wiki/Zstandard) may help a lot, especially if you have 4-12 idle cores as many do these days. On Linux, you should be able to just

let f = popen("pzstd -cdqp8 < myfile.zs".cstring, "r".cstring)

(You may have to, just once, zcat file.gz | pzstd -f -p8 -19 > file.zs and, of course, this will not help if you have a pile of .gz files only to be processed once.)

Araq (orginal) [2019-08-18T13:14:48+02:00] view original

You can optimize it all you want, in the end it's still naive because it's wrong: What if data spans multiple lines? What if there is some escape mechanism via \? What if the data can be encoded via %xx (byte in hex). There are not many file formats around that have no escape/quoting mechanisms. Either you write a lexer/parser or you use a premade library for the format in question.

cblake (orginal) [2019-08-18T14:22:31+02:00] view original

In fact, that VCF format does use \ escapes (https://samtools.github.io/hts-specs/VCFv4.3.pdf). So, basically all the above code examples are indeed wrong, and @Araq's advice is the right general advice (I did say they should probably learn).

That said, I also still think there are many situations both quick & dirty and otherwise where simple splitting makes sense and is not wrong. Why else provide it in strutils?

Araq (orginal) [2019-08-18T15:04:27+02:00] view original

I suspect what's really slow here is Nim's IO or unzipping and this should definitely be looked into and fixed.

Well, yes, strutils is for quick & dirty hacking, not for "quick runtimes". And optimizing splits is easier said than done, effectively you need 2 different versions so that split continues to work at compile-time. And then it still adds maintainance costs. Plus in the end all you do is to make somebody's bugs run faster.

cblake (orginal) [2019-08-18T15:18:27+02:00] view original

On slightly closer inspection of that spec, it seems that backslash quoting only happens in the ## comment sections ignored above. So, maybe they aren't buggy after all if that data is not important to the calculation in question.

A further point along the lines of "if we're going to provide split, maybe we should provide faster variants" is that the times when it is going to matter most will be for huge inputs where the columns may well have a more regular nature like numbers and specifically forbid more complex lexical structure. There it might A) be correct, B) performance might matter a lot in human terms, and C) the programmer might mostly be non-sophisticated with regard to even terminology like "lexing" or "vectorized memchr". This all seems to be the case with this VCF thing, but I feel like it's come up quite a few times over the years. It may not always be "bugs running faster". :-)

Anyway, I don't think "to force people to learn new terminology/techniques" is a very welcoming answer. So, I tried to provide something more welcoming. Even if their parsing is sloppy & error prone, I think naive programmers facing consequent errors on their own data sets rather than complaining about Nim library performance is better optics for Nim.

All that said, I think we agree 100% that we probably need more information from @markebbert to help him any more with his actual problem. Maybe it is IO. Maybe he didn't even compile with -d:danger. If he's on Linux, I would suggest him decompressing first and trying my mmap versions. 90 GB/(100 MB/s) =~ 900 seconds =~ 15 minutes. Heck, some people even have 90GB of RAM. :-)

markebbert (orginal) [2019-08-22T04:28:13+02:00] view original

Thanks @jyapayne. I think we (i.e., you) are really close. The first character of each line that exceeds the buffer was getting cut off (or maybe if the prior line exceeded buffer?). Looks like we're off by one at the same spot. I believe:

buffer.add data[pos+1 ..< pos+bufSize]

should be:

buffer.add data[pos ..< pos+bufSize]  # Without '+1'

Looks like pos already gets set to last + 1 in the prior conditional statement. What I'm not sure about is how this should affect the following statement. Should that be pos-1?

I plan to play more with @cblakes code, but I'm curious to see this through, out of curiosity.

Using your updated code, combined with my minor change, the script parses my large VCF in just under 12min, which is just a bit longer than the Groovy code. Sounds like popen from @cblakes will speed that up by decompressing in a second thread, but that Nim and Groovy are fairly similar unzipping and parsing in a single thread. Or did I misunderstand something?

brentp (orginal) [2019-08-22T15:12:52+02:00] view original

just to add my 0.02 here as I happened on this and I have avoided using gzip stuff from nim as it is too slow. Mark (hi!), I know you linked to hts-nim but that will give you multi-threaded decompression for bgzipped files and a VCF parser. If you want to parse the VCF yourself, you can use hts_lines which will do multi-threaded decompression and give you an iterator of lines.

I know this side-steps your original problem and it would be great to see the Stream stuff faster in nim, but, since you probably have htslib on your system, you can use hts-nim. I thought about wrapping zlib.h or bgzip.h directly with a lines iterator in its own module but havent done so yet.

jyapayne (orginal) [2019-08-22T16:28:44+02:00] view original

@markebbert Good catch! Thanks for debugging :P

Now that you mention it, the whole section

if data[last] == '\l':
  buffer.add data[pos+1 ..< pos+bufSize]
else:
  buffer.add data[pos ..< pos+bufSize]

can just be replaced with

buffer.add data[pos ..< pos+bufSize]

pos there should always be the right thing since it is adjusted when a newline is found.

@brentp Nice library! Multithreaded gzip decompression and a nice parser. Looks like it might be exactly what @markebbert needs.

markebbert (orginal) [2019-08-22T17:55:15+02:00] view original

Thanks @jyapayne. It's nice to have this code I can look back on for future use.

@brentp, fancy seeing you around here. I blame you for starting me on this little journey. :-P

Thanks for pointing me towards the multi-threaded .gz decompression in hts-nim. I will definitely use that. Did you read through all of the posts by @cblake? I'm intrigued by his libraries, and particularly intrigued by Zstd. I think Zstd is something we need to use in our field.

brentp (orginal) [2019-08-22T20:29:28+02:00] view original

hi @markebbert, re Zstd and htslib, see: https://github.com/samtools/htslib/issues/530

ping me if you have any questions on hts-nim.

cdunn2001 (orginal) [2019-08-26T00:17:17+02:00] view original

The slowness of the Nim streams library is frustrating, but I've learned to avoid FileStream. I simply readAll() into memory and then use StringStream, where the unbuffered implementation is fine.

mratsim (orginal) [2019-08-26T09:56:46+02:00] view original

We have an alternative streams implementation that uses memfiles for speed: nim-faststreams

cblake (orginal) [2019-08-27T20:03:20+02:00] view original

It looks like nim-faststreams is just to bridge the API gap between mmap IO and streams interfaces which is nice and all, but won't help for startProcess or other popen-like contexts where in this discussion streams slowness was a problem. { @mratsim didn't say it would, exactly, but I thought I should clarify (unless I'm wrong). }

More importantly, perhaps to @markebbert's surprise, it turns out that the bioinformatics guys are being smarter than pigz with this bgzip tool which supports parallel decompression. That very same example file that's been my running example since @jyapayne mentioned it is in fact bgzip d. It can decompress in only 11.1 seconds on the same machine with 4 threads. (I got it down to 3.5 seconds with over 25 threads but that same Zstd example gets down to 2.2s using similarly many threads). So, an mSlices + popen("bgzip -d --threads 4 < ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz") version should be able to go in roughly 1 minute for that file.

As @brentp surely knows, his hts-nim library actually supports a set_threads and read_line that can deliver that, too. It's entirely possible @markebbert could have been getting that much faster decompress time all along (with python and groovy as well via a popen analogue). Or maybe his .gz is not from bgzip and he couldn't. The way for him to tell is to run file on the .gz and see if it has "Blocked GNU Zip Format (BGZF; gzip compatible), block length" instead of simply "gzip compressed data".

Of course, the compression ratio will still be 12x worse than Zstd. Besides the obvious space waste, rather than a tiny 45-75 MB/s you would need backing IO of 172..542 MB/s (depending on 4 or 25 threads). On the high end many thread case, you're looking at either a fast SSD or a RAID array of spinning disks to feed that analysis pipeline (assuming zero cost parsing, anyway, which I know his example calculation did not have, but that assumption simplifies since he didn't really show a full calculation).

brentp (orginal) [2019-10-03T18:43:10+02:00] view original

I finally wrote a quick wrapper for gzfile that looks a lot like nim's File.

You can use like

import gzfile
import strutils

var vcf:GZFile
doAssert vcf.open("test.vcf.gz")

for line in vcf.lines:
  if not line.startsWith('#'):
    # do something
    echo line

this also works for non-gzipped files. and supports writing as well. available via:

nimble install https://github.com/brentp/nim-gzfile

Mirror of forum.nim-lang.org

5103 :: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?