I am exploring the use of Nim in bioinformatics (love what I've seen so far). I wrote a couple of tests for typical tasks such as parsing files. I noticed that iterating over a file line-by-line and splitting strings in my tests is slower compared to other languages, which was quite a surprise. I'm using stable version 0.11.2 from the Arch Linux repositories.
The test file I've been using is in vcf format with ~2500 tab-separated fields per (non-comment) line. Decreasing the number of fields to something like 100, or even just iterating line-by-line, does not severely affect the results - Nim is still the slowest.
# about 968 Mb download
TESTFILE=testfile.vcf
curl ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | gzip -dc | head -n 100000 > $TESTFILE

./vcftest $TESTFILE && \
awk -f vcftest.awk $TESTFILE && \
julia vcftest.jl $TESTFILE && \
./vcftestcpp $TESTFILE && \
python3 vcftest.py $TESTFILE
Nim elapsed 12.1297550201416 seconds
Awk elapsed 4 seconds
Julia elapsed 10.404911041259766 seconds
C++ elapsed 9.16342 seconds
Python elapsed 7.627493143081665 seconds
I've uploaded my code snippets (Nim, Python 3, Julia, C++11, awk) to pastebin for reference:
Is my code idiomatic enough? I read this thread, but it looks like that change wasn't included in 0.11.2. Compiling the latest git version as of today, the Nim results improved to ~10 s, which is good, but I think there is still room for improvement.
Given the overall performance of Nim and the simplicity of the syntax, Nim could be a killer in bioinformatics!
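Since the pastebin snippets aren't reproduced here, here is a hypothetical sketch (names and details assumed, not the original code) of the kind of test being discussed: split every non-comment line on tabs and look at the last field, so the work can't be optimised away.

import strutils
import times
import os

# Hypothetical reconstruction of the benchmark, not the actual pastebin code.
proc run(fname: string) =
  for line in lines(fname):
    if line.len == 0 or line[0] == '#':   # skip VCF header/comment lines
      continue
    let fields = line.split('\t')          # the hot spot discussed below
    if fields[fields.high] == "hello":     # touch a field so split() isn't dead code
      echo fields[fields.high]

let start = epochTime()
run(paramStr(1))
stderr.write("Nim elapsed ", epochTime() - start, " seconds\n")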
It's not about I/O; I/O actually takes up only a small part of the time.
Most of the time is being spent in line.split('\t'), which does lots of memory allocations. There is at least one inefficiency here that leads to memory for each field being allocated twice.
Also, this is a case where the mark-and-sweep GC is faster than the reference-counting GC.
Fixing the performance problem in split() and switching to mark-and-sweep GC makes Nim roughly 40% faster than Python 3 on my machine.
Keep in mind that for allocation-heavy code, most languages will have comparable performance, since the memory allocator will dominate runtime.
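To make the double allocation concrete, here is a rough paraphrase (assumed for illustration, not the actual stdlib source) of the pattern split() used at the time: the iterator allocates each field with substr(), and appending it to the result seq copies it a second time.

iterator splitFields(s: string, sep: char): string =
  var first = 0
  while first <= s.len:
    var last = first
    while last < s.len and s[last] != sep:
      inc last
    yield substr(s, first, last - 1)   # allocation 1: the field string itself
    first = last + 1

proc splitCopying(s: string, sep: char): seq[string] =
  result = @[]
  for field in splitFields(s, sep):
    result.add(field)                  # allocation 2: copied again into the seq

Switching the GC is a compile-time option, e.g. nim c -d:release --gc:markAndSweep vcftest.nim (with vcftest.nim standing in for the test program).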
if line[line.rfind('\t'):] == 'hello':
kirbyfan64sos: This is unnecessarily slow in any language.
I expect that any actual application would process all the fields, so this trick wouldn't work then.
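For reference, and with that caveat in mind, the same shortcut written in Nim (hypothetical helper name) would look roughly like this:

import strutils

# Compares only the text after the last tab; if there is no tab at all,
# rfind returns -1 and the whole line is compared.
proc lastFieldIsHello(line: string): bool =
  let idx = line.rfind("\t")
  result = line[idx + 1 .. ^1] == "hello"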
Make sure you compile with -d:release
It is MUCH slower otherwise, due to all the extra checking that gets done.
If you can read all of the file into memory, then one of the following two approaches is fairly fast (timed on Windows 7; Linux may differ). Note that this is just getting lines of data, not the splitting by tabs mentioned previously:
if open(F, fname):
  for line in F.readAll().splitLines():
or
if open(F, fname):
  let x = F.readAll()
  for line in x.splitLines():
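A self-contained sketch of the first approach (file and proc names assumed), doing only the line iteration that the comparison refers to, might look like:

import strutils
import times
import os

proc runReadAll(fname: string) =
  var f: File
  if open(f, fname):
    # Read everything at once, then split into lines in memory.
    for line in f.readAll().splitLines():
      if line == "hello":   # placeholder work so the loop isn't optimised away
        echo line
    close(f)

let start = epochTime()
runReadAll(paramStr(1))
stderr.write("Nim elapsed ", epochTime() - start, " seconds\n")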
The root cause (as I understand it) of the slow reading by line is that under the hood, Nim handles files with three possible line delimiters: CR, CRLF, LF (old Mac, Windows, Linux/Unix).
For me on Windows, using the lines() iterator or readLine() is just over 10 times slower than the two methods above, which is significant for LARGE files.
Thank you for your suggestions. The Nim program was indeed compiled with -d:release. Reading the whole file into memory with readAll() is not feasible, since the file size in some cases can be >100 GB. I don't believe the split() bit is the problem here. I added the "compare to last field" check just to make sure that the whole thing wasn't removed by compiler optimization; as @Jehan notes, this was just a placeholder for any type of work. Modifying the code to just iterate through the file without splitting (and comparing the whole line to something like "hello", for the reasons given above) confirms what @jlp765 claims: Nim is more than two times slower than any of the other languages.
Well, that's a bummer.
No, the latter test was using 0.11.2. Apparently things have happened since: a fresh compile gives performance some 25% behind C++ but ahead of the rest.
Edit:
Timings for simply iterating over file (with string comparison):
Nim (stable) elapsed 2.314162969589233 seconds
Awk elapsed 1 seconds
Julia elapsed 1.0157310962677002 seconds
C++ elapsed 0.24666 seconds
Python elapsed 0.9296040534973145 seconds
Nim (devel) elapsed 0.3295872211456299 seconds
But indeed splitting is still slow:
Nim (stable) elapsed 20.50117897987366 seconds
Awk elapsed 4 seconds
Julia elapsed 13.904340028762817 seconds
C++ elapsed 12.4989 seconds
Python elapsed 11.655924320220947 seconds
Nim (devel) elapsed 15.48185396194458 seconds
nvill: I don't believe the split() bit is the problem here.
I profiled it (and then modified the system library locally to improve performance). I'm absolutely positive that it is the problem; reading the file takes only a fraction of the overall time. The primary problem is that memory for a string is allocated for each field when it is created, then the string is copied into the result seq, which creates (and allocates memory for) another copy of the string.
Thanks for your efforts! This last version was actually slower than the standard lines() using the git version compiler. Note that execution fails with SIGSEGV: Illegal storage access. (Attempt to read from nil?) when using --gc:markandsweep.
Devel compiler
Nim elapsed new 16.41312313079834 seconds
Nim elapsed old 16.05750608444214 seconds
Stable
Nim elapsed new 16.30394983291626 seconds
Nim elapsed old 18.26562213897705 seconds
I'm sure this will arouse the ire of Araq
Not really, I don't do line based parsing + split (especially not when I care about performance), that's just a hack for when you cannot write a real lexer. :P
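As a loose illustration of that point (not Araq's code, and still line-based): scanning field boundaries avoids allocating a string per field and lets the caller decide which fields to materialise. The iterator name below is made up for the example.

# Yields (first, last) index pairs for each separated field without allocating anything.
iterator fieldBounds(line: string, sep: char): tuple[first, last: int] =
  var start = 0
  for i in 0 .. line.len:
    if i == line.len or line[i] == sep:
      yield (start, i - 1)
      start = i + 1

Only the fields that are actually needed then get sliced out of the line, which is essentially the same idea as the lazysplit suggestion further down.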
The implementation of lines in the devel branch has already been much improved over the old version. If you want more speedup for I/O, you can use the memfiles module, which gives you memory-mapped access to files:
import strutils
import times
import os
from memfiles import open, lines

proc fastsplit(s: string, sep: char): seq[string] =
  var count = 1
  for ch in s:
    if ch == sep:
      count += 1
  result = newSeq[string](count)
  var fieldNum = 0
  var start = 0
  for i in 0..high(s):
    if s[i] == sep:
      result[fieldNum] = s[start..i-1]
      start = i+1
      fieldNum += 1
  result[fieldNum] = s[start..^1]

proc run(fname: string) =
  var file = memfiles.open(fname, fmRead)
  for line in memfiles.lines(file):
    if line[line.low] == '#':
      continue
    let fields = line.fastsplit('\t')
    if fields[fields.high] == "hello":
      echo fields[fields.high]

proc main() =
  let start = epochTime()
  run(paramStr(1))
  let stop = epochTime()
  stderr.write("Nim elapsed ", stop-start, " seconds\n")

main()
For comparison, here is a version of run() that uses the parsecsv module instead:

import streams
import parsecsv

proc run(fname: string) =
  var infile = newFileStream(fname, fmRead)
  var vcf: CsvParser
  open(vcf, infile, fname, separator = '\t')
  while readRow(vcf):
    if vcf.row[vcf.row.low][0] == '#':
      continue
    if vcf.row[vcf.row.high] == "hello":
      echo vcf.row[vcf.row.high]
And these are the results:

Nim (stable) elapsed 20.65725302696228 seconds
Nim (devel) elapsed 15.37827706336975 seconds
Nim csv (stable) elapsed 3.765568017959595 seconds
Nim csv (devel) elapsed 4.059754848480225 seconds
Again, very impressive results.
It would be nice to have split create a "lazy array"; call it lazysplit.
It would use something like @Jehan's fastsplit but just get and store the delimiter positions. So you would call it like:
let fields = line.lazysplit('\t')
and it would return something that responds to [] (or an iterator) and only copies the data to a string when referenced (using its stored positions). So if you only use a small x% of the fields, you get a big speedup.
I bet @Jehan could code that up :-)
@joe: This is actually not particularly hard, though you have to be careful with shallow semantics. You can squeeze out some extra performance in the following example by using shallow strings, but then you have to be extra careful with aliasing. Also, if you use all or most of the fields, this is actually going to be slower. Only worth the effort if you actually do use only a small fraction of the fields.
import strutils
import times
import os

type
  LazyStringSeqEntry = tuple[first, last: int, cached: string]
  LazyStringSeq = object {.shallow.}
    whole: string
    parts: seq[LazyStringSeqEntry]

proc `[]`(ls: LazyStringSeq, i: int): string =
  var p: seq[LazyStringSeqEntry]
  shallowCopy p, ls.parts
  if p[i].cached.isNil:
    p[i].cached = ls.whole[p[i].first..p[i].last]
  result = p[i].cached

template len(ls: LazyStringSeq): int = len(ls.parts)
template high(ls: LazyStringSeq): int = high(ls.parts)
template low(ls: LazyStringSeq): int = 0

proc lazysplit(s: string, sep: char): LazyStringSeq =
  var count = 1
  for ch in s:
    if ch == sep:
      count += 1
  result = LazyStringSeq(whole: s, parts: newSeq[LazyStringSeqEntry](count))
  var fieldNum = 0
  var start = 0
  for i in 0..high(s):
    if s[i] == sep:
      result.parts[fieldNum].first = start
      result.parts[fieldNum].last = i-1
      start = i+1
      fieldNum += 1
  result.parts[fieldNum].first = start
  result.parts[fieldNum].last = high(s)

proc run(fname: string) =
  for line in lines(fname):
    if line[line.low] == '#':
      continue
    let fields = line.lazysplit('\t')
    if fields[fields.high] == "hello":
      echo fields[fields.high]

proc main() =
  let start = epochTime()
  run(paramStr(1))
  let stop = epochTime()
  stderr.write("Nim elapsed ", stop-start, " seconds\n")

main()