I am exploring the use of Nim in bioinformatics (love what I've seen so far). I wrote a couple of tests for typical tasks such as parsing files. I noticed that iterating over a file line-by-line and splitting strings in my tests is slower compared to other languages, which was quite a surprise. I'm using stable version 0.11.2 from the Arch Linux repositories.
The test file I've been using is in vcf format with ~2500 tab-separated fields per (non-comment) line. Decreasing the number of fields to something like 100, or even just iterating line-by-line, does not severely affect the results - Nim is still the slowest.
# about 968 Mb download
TESTFILE=testfile.vcf
curl ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | gzip -dc | head -n 100000 > $TESTFILE

./vcftest $TESTFILE && \
awk -f vcftest.awk $TESTFILE && \
julia vcftest.jl $TESTFILE && \
./vcftestcpp $TESTFILE && \
python3 vcftest.py $TESTFILE
Nim elapsed 12.1297550201416 seconds
Awk elapsed 4 seconds
Julia elapsed 10.404911041259766 seconds
C++ elapsed 9.16342 seconds
Python elapsed 7.627493143081665 seconds
I've uploaded my code snippets (Nim, Python 3, Julia, C++11, awk) to pastebin for reference:
Is my code idiomatic enough? I read this thread, but it looks like that change wasn't included in 0.11.2. Compiling the latest git version as of today, the Nim results improved to ~10 s, which is good, but I think there is still room for improvement.
Given the overall performance of Nim and the simplicity of the syntax, Nim could be a killer in bioinformatics!
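Since the pastebin snippets aren't reproduced here, here is a hypothetical sketch (names and details assumed, not the original code) of the kind of test being discussed: split every non-comment line on tabs and look at the last field, so the work can't be optimised away.

import strutils
import times
import os

# Hypothetical reconstruction of the benchmark, not the actual pastebin code.
proc run(fname: string) =
  for line in lines(fname):
    if line.len == 0 or line[0] == '#':   # skip VCF header/comment lines
      continue
    let fields = line.split('\t')          # the hot spot discussed below
    if fields[fields.high] == "hello":     # touch a field so split() isn't dead code
      echo fields[fields.high]

let start = epochTime()
run(paramStr(1))
stderr.write("Nim elapsed ", epochTime() - start, " seconds\n")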
It's not about I/O; I/O actually takes up only a small part of the time.
Most of the time is being spent in line.split('\t'), which does lots of memory allocations. There is at least one inefficiency here that leads to memory for each field being allocated twice.
Also, this is a case where the mark-and-sweep GC is faster than the reference-counting GC.
Fixing the performance problem in split() and switching to mark-and-sweep GC makes Nim roughly 40% faster than Python 3 on my machine.
Keep in mind that for allocation-heavy code, most languages will have comparable performance, since the memory allocator will dominate runtime.
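To make the double allocation concrete, here is a rough paraphrase (assumed for illustration, not the actual stdlib source) of the pattern split() used at the time: the iterator allocates each field with substr(), and appending it to the result seq copies it a second time.

iterator splitFields(s: string, sep: char): string =
  var first = 0
  while first <= s.len:
    var last = first
    while last < s.len and s[last] != sep:
      inc last
    yield substr(s, first, last - 1)   # allocation 1: the field string itself
    first = last + 1

proc splitCopying(s: string, sep: char): seq[string] =
  result = @[]
  for field in splitFields(s, sep):
    result.add(field)                  # allocation 2: copied again into the seq

Switching the GC is a compile-time option, e.g. nim c -d:release --gc:markAndSweep vcftest.nim (with vcftest.nim standing in for the test program).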
if line[line.rfind('\t'):] == 'hello':
kirbyfan64sos: This is unnecessarily slow in any language.
I expect that any actual application would process all the fields, so this trick wouldn't work then.
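For reference, and with that caveat in mind, the same shortcut written in Nim (hypothetical helper name) would look roughly like this:

import strutils

# Compares only the text after the last tab; if there is no tab at all,
# rfind returns -1 and the whole line is compared.
proc lastFieldIsHello(line: string): bool =
  let idx = line.rfind("\t")
  result = line[idx + 1 .. ^1] == "hello"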
Make sure you compile with -d:release
It is MUCH slower otherwise, due to all the extra checking that gets done.
If you can read all of the file into memory, then one of the following two approaches is fairly fast (timed on Windows 7; Linux may differ). Note that this is just getting lines of data, not the splitting by tabs mentioned previously:
if open(F, fname):
  for line in F.readAll().splitLines():
or
if open(F, fname):
  let x = F.readAll()
  for line in x.splitLines():
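A self-contained sketch of the first approach (file and proc names assumed), doing only the line iteration that the comparison refers to, might look like:

import strutils
import times
import os

proc runReadAll(fname: string) =
  var f: File
  if open(f, fname):
    # Read everything at once, then split into lines in memory.
    for line in f.readAll().splitLines():
      if line == "hello":   # placeholder work so the loop isn't optimised away
        echo line
    close(f)

let start = epochTime()
runReadAll(paramStr(1))
stderr.write("Nim elapsed ", epochTime() - start, " seconds\n")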
The root cause (as I understand it) of the slow reading by line is that under the hood, Nim handles files with three possible line delimiters: CR, CRLF, LF (old Mac, Windows, Linux/Unix).
For me on Windows, using the lines() iterator or readLine() is just over 10 times slower than the two methods above, which is significant for LARGE files.
Thank you for your suggestions. The Nim program was indeed compiled with -d:release. Reading the whole file into memory with readAll() is not feasible, since the file size in some cases can be >100 GB. I don't believe the split() bit is the problem here. I added the "compare to last field" check just to make sure that the whole thing wasn't removed by compiler optimization; as @Jehan notes, this was just a placeholder for any type of work. Modifying the code to just iterate through the file without splitting (and comparing the whole line to something like "hello", for the reasons given above) confirms what @jlp765 claims: Nim is more than two times slower than any of the other languages.
Well, that's a bummer.
No, the latter test was using 0.11.2. Apparently things have happened since: a fresh compile gives performance some 25% behind C++ but ahead of the rest.
Edit:
Timings for simply iterating over file (with string comparison):
Nim (stable) elapsed 2.314162969589233 seconds
Awk elapsed 1 seconds
Julia elapsed 1.0157310962677002 seconds
C++ elapsed 0.24666 seconds
Python elapsed 0.9296040534973145 seconds
Nim (devel) elapsed 0.3295872211456299 seconds
But indeed splitting is still slow:
Nim (stable) elapsed 20.50117897987366 seconds
Awk elapsed 4 seconds
Julia elapsed 13.904340028762817 seconds
C++ elapsed 12.4989 seconds
Python elapsed 11.655924320220947 seconds
Nim (devel) elapsed 15.48185396194458 seconds
nvill: I don't believe the split() bit is the problem here.
I profiled it (and then modified the system library locally to improve performance). I'm absolutely positive that it is the problem; reading the file takes only a fraction of the overall time. The primary problem is that memory for a string is allocated for each field when it is created, then the string is copied into the result seq, which creates (and allocates memory for) another copy of the string.
Thanks for your efforts! This last version was actually slower than the standard lines() using the git version compiler. Note that execution fails with SIGSEGV: Illegal storage access. (Attempt to read from nil?) when using --gc:markandsweep.
Devel compiler
Nim elapsed new 16.41312313079834 seconds
Nim elapsed old 16.05750608444214 seconds
Stable
Nim elapsed new 16.30394983291626 seconds
Nim elapsed old 18.26562213897705 seconds
I'm sure this will arouse the ire of Araq
Not really, I don't do line based parsing + split (especially not when I care about performance), that's just a hack for when you cannot write a real lexer. :P
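As a loose illustration of that point (not Araq's code, and still line-based): scanning field boundaries avoids allocating a string per field and lets the caller decide which fields to materialise. The iterator name below is made up for the example.

# Yields (first, last) index pairs for each separated field without allocating anything.
iterator fieldBounds(line: string, sep: char): tuple[first, last: int] =
  var start = 0
  for i in 0 .. line.len:
    if i == line.len or line[i] == sep:
      yield (start, i - 1)
      start = i + 1

Only the fields that are actually needed then get sliced out of the line, which is essentially the same idea as the lazysplit suggestion further down.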
The implementation of lines in the devel branch has already been much improved over the old version. If you want more speedup for I/O, you can use the memfiles module, which gives you memory-mapped access to files:
import strutils
import times
import os
from memfiles import open, lines

proc fastsplit(s: string, sep: char): seq[string] =
  var count = 1
  for ch in s:
    if ch == sep:
      count += 1
  result = newSeq[string](count)
  var fieldNum = 0
  var start = 0
  for i in 0..high(s):
    if s[i] == sep:
      result[fieldNum] = s[start..i-1]
      start = i+1
      fieldNum += 1
  result[fieldNum] = s[start..^1]

proc run(fname: string) =
  var file = memfiles.open(fname, fmRead)
  for line in memfiles.lines(file):
    if line[line.low] == '#':
      continue
    let fields = line.fastsplit('\t')
    if fields[fields.high] == "hello":
      echo fields[fields.high]

proc main() =
  let start = epochTime()
  run(paramStr(1))
  let stop = epochTime()
  stderr.write("Nim elapsed ", stop-start, " seconds\n")

main()
For comparison, here is a version of run() that uses the parsecsv module instead:

import streams
import parsecsv

proc run(fname: string) =
  var infile = newFileStream(fname, fmRead)
  var vcf: CsvParser
  open(vcf, infile, fname, separator = '\t')
  while readRow(vcf):
    if vcf.row[vcf.row.low][0] == '#':
      continue
    if vcf.row[vcf.row.high] == "hello":
      echo vcf.row[vcf.row.high]
And these are the results:

Nim (stable) elapsed 20.65725302696228 seconds
Nim (devel) elapsed 15.37827706336975 seconds
Nim csv (stable) elapsed 3.765568017959595 seconds
Nim csv (devel) elapsed 4.059754848480225 seconds
Again, very impressive results.
It would be nice to have split create a "lazy array"; call it lazysplit.
It would use something like @Jehan's fastsplit but just get and store the delimiter positions. So you would call it like:
let fields = line.lazysplit('\t')
and it would return something that responds to [] (or an iterator) and only copies the data to a string when referenced (using its stored positions). So if you only use a small x% of the fields, you get a big speedup.
I bet @Jehan could code that up :-)
@joe: This is actually not particularly hard, though you have to be careful with shallow semantics. You can squeeze out some extra performance in the following example by using shallow strings, but then you have to be extra careful with aliasing. Also, if you use all or most of the fields, this is actually going to be slower. Only worth the effort if you actually do use only a small fraction of the fields.
import strutils
import times
import os

type
  LazyStringSeqEntry = tuple[first, last: int, cached: string]
  LazyStringSeq = object {.shallow.}
    whole: string
    parts: seq[LazyStringSeqEntry]

proc `[]`(ls: LazyStringSeq, i: int): string =
  var p: seq[LazyStringSeqEntry]
  shallowCopy p, ls.parts
  if p[i].cached.isNil:
    p[i].cached = ls.whole[p[i].first..p[i].last]
  result = p[i].cached

template len(ls: LazyStringSeq): int = len(ls.parts)
template high(ls: LazyStringSeq): int = high(ls.parts)
template low(ls: LazyStringSeq): int = 0

proc lazysplit(s: string, sep: char): LazyStringSeq =
  var count = 1
  for ch in s:
    if ch == sep:
      count += 1
  result = LazyStringSeq(whole: s, parts: newSeq[LazyStringSeqEntry](count))
  var fieldNum = 0
  var start = 0
  for i in 0..high(s):
    if s[i] == sep:
      result.parts[fieldNum].first = start
      result.parts[fieldNum].last = i-1
      start = i+1
      fieldNum += 1
  result.parts[fieldNum].first = start
  result.parts[fieldNum].last = high(s)

proc run(fname: string) =
  for line in lines(fname):
    if line[line.low] == '#':
      continue
    let fields = line.lazysplit('\t')
    if fields[fields.high] == "hello":
      echo fields[fields.high]

proc main() =
  let start = epochTime()
  run(paramStr(1))
  let stop = epochTime()
  stderr.write("Nim elapsed ", stop-start, " seconds\n")

main()