Hi everyone,
I am trying to parse a file (file.bim) with 21,283,105 lines line by line, apply a filter, and then save the results to a Table, and I would like to know if my following solution can be optimized further:
import std/tables
import std/strutils

proc parse_bim(bim: string, chrom: int): Table[string, seq[string]] =
  var
    snp: seq[string] = @[]
    a1: seq[string] = @[]
    a2: seq[string] = @[]
  for line in lines bim:
    let x = line.split('\t')
    if parseInt(x[0]) == chrom:
      snp.add(x[1])
      a1.add(x[4])
      a2.add(x[5])
  let vld_hash = {"SNP": snp, "A1": a1, "A2": a2}.toTable
  return vld_hash

let bim_file = parse_bim("file.bim", 1) # 1,653,512 lines are on chrom 1
This takes ~8 s in Nim and ~13 s in Python.
Compiled with: nim c -d:release -d:danger --gc:arc parse_file.nim
Thanks!
My feeling is that you are using a few operations, none of which is very fast. split() allocates a few strings, which takes some time, and add() for sequences also takes some time. Maybe your parseInt() is not really necessary; might a plain string compare work?
Note that parsing files was discussed in the Manning book, and I recently added a section to my book too: https://ssalewski.de/nimprogramming.html#_parsing_data_files_in_parallel. And I think there was a blog post called "Fast CSV parsing" or so many years ago, but I was not able to find it again.
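The plain-string-compare idea can be sketched like this. This is a minimal, hypothetical example (the proc name `filterByChrom` and the sample data are made up for illustration): the chromosome number is converted to a string once, outside the loop, so each line only needs a cheap string comparison instead of a parseInt call.

```nim
import std/strutils

proc filterByChrom(rows: seq[string], chrom: int): seq[string] =
  let key = $chrom              # convert the target chromosome once
  for line in rows:
    let x = line.split('\t')
    if x[0] == key:             # plain string compare, no parseInt per line
      result.add x[1]           # keep the SNP column

# Two fake .bim-style lines for demonstration.
let sample = @["1\trs1\t0\t100\tA\tG", "2\trs2\t0\t200\tC\tT"]
doAssert filterByChrom(sample, 1) == @["rs1"]
```

Note this still pays for the full split(); it only removes the integer parsing from the hot path.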
import parseutils, tables

proc parse_bim(bim: string, chrom: int): Table[string, seq[string]] =
  # Note: `bim` here is the file *contents*, not a file name.
  var
    snp: seq[string] = @[]
    a1: seq[string] = @[]
    a2: seq[string] = @[]
    i = 0
  while i < bim.len:
    var x: int
    i += parseInt(bim, x, i)
    if x == chrom:
      i += skipWhile(bim, {'\t'}, i)
      var a, b, c: string
      i += parseUntil(bim, a, '\t', i)          # column 1: SNP
      i += skipWhile(bim, {'\t'}, i)
      i += skipUntil(bim, '\t', i)              # skip column 2
      i += skipWhile(bim, {'\t'}, i)
      i += skipUntil(bim, '\t', i)              # skip column 3
      i += skipWhile(bim, {'\t'}, i)
      i += parseUntil(bim, b, '\t', i)          # column 4: A1
      i += skipWhile(bim, {'\t'}, i)
      i += parseUntil(bim, c, {'\n', '\r'}, i)  # column 5: A2
      snp.add a
      a1.add b
      a2.add c
    else:
      i += skipUntil(bim, {'\n', '\r'}, i)
    i += skipWhile(bim, {'\n', '\r'}, i)
  let vld_hash = {"SNP": snp, "A1": a1, "A2": a2}.toTable
  return vld_hash

let bim_file = parse_bim(readFile("file.bim"), 1) # 1,653,512 lines are on chrom 1
This is my attempt, which should cut down significantly on the number of allocations. Theoretically you could preallocate the seqs if you have some heuristic like average line length.
The whole code isn't very pretty, though once views are stabilised and become integrated into the stdlib, it should be possible to write code very similar to your original one with very few allocations as well.
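The preallocation idea above can be sketched with newSeqOfCap from the system module, assuming you can estimate the number of matching rows (the count below is taken from the original post's comment about chrom 1):

```nim
# Hedged sketch: reserve capacity up front so the seqs never reallocate
# while growing. `expected` is an estimate, not something computed here.
let expected = 1_653_512            # from the original post: rows on chrom 1
var
  snp = newSeqOfCap[string](expected)
  a1  = newSeqOfCap[string](expected)
  a2  = newSeqOfCap[string](expected)

snp.add "rs123"                     # adds within capacity never reallocate
doAssert snp.len == 1
```

If no good estimate exists, file size divided by an average line length gives a rough upper bound, which is still better than growing from zero.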
Thanks for the link. I think I was actually remembering this one:
https://nim-lang.org/blog/2017/05/25/faster-command-line-tools-in-nim.html https://www.reddit.com/r/programming/comments/6dct6e/faster_command_line_tools_in_nim/
It is not by Mr. Felsing, as I had initially supposed. But I cannot remember the details; I may read it again later. And well, there is also something from Miran: https://narimiran.github.io/2021/01/11/nim-parsing.html
@Stefan_Salewski - could be. This topic arises constantly. I was figuring "chrom" was short for "chromosome" and that LLN13's context was bioinformatics.
@LLN13, FWIW, this is a concrete realization of my interpretation of Stefan's abstract ideas that may help you. I find it clearer (subjective-ish) than doofenstein's. It may also be faster if data lines are much wider than the 6 columns you access. I haven't even tried to measure it, but speed really depends upon your data files.
import std/strutils

proc parse_bim(bim: string, chrom: int):
    (seq[string], seq[string], seq[string]) =
  let chrom = $chrom & "\t"
  for line in lines bim:
    if line.startsWith(chrom):
      var i = 0                         # only split needed columns
      for col in line.split('\t'):
        if i == 1: result[0].add col
        elif i == 4: result[1].add col
        elif i == 5: result[2].add col; break
        inc i

let (snp, a1, a2) = parse_bim("file.bim", 1)
A common "mistake" by people coming from Python: tables/dictionaries are actually quite an expensive data structure to create, while tuples are more or less free. So unless you need to add keys at runtime, you should probably stick with a tuple (or just a normal object, depending on what you're doing with it afterwards). Comparing speeds between Python and Nim requires you to write your code in idiomatic style in both languages; otherwise you won't get the speed of the actual language.
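A minimal sketch of the difference (the type name `BimCols` and the field names are made up for illustration): a named tuple is a fixed-layout value with no hashing or bucket allocation, while building a Table hashes every key and allocates bucket storage.

```nim
import std/tables

# A named tuple: fields are resolved at compile time, no hashing involved.
type BimCols = tuple[snp, a1, a2: seq[string]]

var asTuple: BimCols                       # essentially free to create
asTuple.snp.add "rs1"

# The Table equivalent hashes "SNP" and allocates bucket storage.
let asTable = {"SNP": @["rs1"]}.toTable

doAssert asTuple.snp[0] == asTable["SNP"][0]
```

Field access on the tuple (`asTuple.snp`) compiles to a plain offset, whereas `asTable["SNP"]` hashes the key on every lookup.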
I'd be curious to see how fast a parser generator like npeg would fare in a scenario like this. Do you have a test file I could try it on?
Note that parsing files was discussed in the Manning book, and I recently added a section to my book too[...]
What about using streams and an iterator?
import streams

proc pgmRead(strm: FileStream, filePos: int = 0): auto =
  var data: char
  strm.setPosition(filePos)
  return iterator: (int, char) =
    while not strm.atEnd:
      let idx = getPosition(strm)
      read(strm, data)
      yield (idx, data)
    strm.close()
Full code for parsing PGM: https://gist.github.com/ingoogni/b422eaa251e9c9192071be94d8269de9
Note that parsing files was discussed in the Manning book
Thanks for mentioning! In case you want to have a look at the example code from Nim in Action: https://github.com/dom96/nim-in-action-code/tree/master/Chapter6/WikipediaStats.
Happy to answer any questions about it.
@PMunch - my hunch about bioinformatics was right. A test file I was using is called "1kg_phase1_all.bim". I got mine out of here, but I think this file is widely distributed, as "1kg_phase1_all" turns up a lot of hits. The data seems to be produced by a program called "plink", specifically to be easily "split parsed". The md5sum of the data I used is 4087b5280f40a93025d5d26d777ce6e8, with 1228635806 bytes in 39728178 lines.
Using that, I re-did the "likely Python being ported" to this:
def parse_bim(bim, chrom):
    snp = []; a1 = []; a2 = []
    for line in open(bim):
        x = line.split('\t')
        if x[0] == chrom:
            snp.append(x[1]); a1.append(x[4]); a2.append(x[5])
    return (snp, a1, a2)

assert len(parse_bim("file.bim", "1")[1]) == 3007196
That took about 10.5 seconds on my test machine.
Then I optimized the Python a little to:
def rowsFor(bim, chrom):
    snp = []; a1 = []; a2 = []; n = 0
    found = False               # plink emits sorted by chromosome
    start = chrom + "\t"
    for line in open(bim):
        if line.startswith(start):
            found = True
            x = line.strip().split('\t')
            yield (x[1], x[4], x[5], n)
            n += 1
        elif found: break

for (snp, a1, a2, n) in rowsFor("file.bim", "1"):
    if n == 3007195: print(snp, a1, a2)  # print last
This ran in about 1.78 s on Python 3.9.12 (and 1.57 s on Python 2.7.18). { Yes, yes... I have heard for many years how someday Py3 would catch up to Py2 in performance. It now seems that day may never come, but this is the Nim Forum, so moving on... }
I then optimized my original Nim suggestion (tuned for a beginner) to this:
import std/memfiles, system/ansi_c

iterator rowsFor(bim, chrom: string):
    (MemSlice, MemSlice, MemSlice, int) =
  var result: (MemSlice, MemSlice, MemSlice, int)
  let chrom = chrom & "\t"
  var found = false                 # plink emits sorted by chromosome
  let (p, n) = (cast[pointer](chrom[0].addr), chrom.len)
  var mf = memfiles.open(bim)
  var line: MemFile                 # dummy obj to use memSlices on
  for ln in mf.memSlices:
    if ln.size > n and cmemcmp(ln.data, p, n.csize_t) == 0:
      found = true
      line.mem = ln.data; line.size = ln.size
      var i = 0                     # only split needed columns
      for col in line.memSlices('\t'):
        if i == 1: result[0] = col
        elif i == 4: result[1] = col
        elif i == 5: result[2] = col; break
        inc i
      yield result
      inc result[3]
    elif found: break               # all found in one block of lines
  mf.close

for (snp, a1, a2, n) in "file.bim".rowsFor("1"):
  if n == 3_007_195: echo $snp," ",$a1," ",$a2  # print last
which runs in 0.124s (--mm:arc -d:danger) on the same machine - over 14X faster than the fastest Python. A next step might be parallel parsing, but this is already about 84x faster than the original Python.
I don't know if @LLN13 is still reading any of this, but how I usually like to put it is: Nim responds well to optimization effort. I would be curious to see @PMunch's npeg/parser generator ideas play out if this test data file inspires him. :-)
Thank you very much for testing. A version doing parallel processing with ORC would still be nice -- not because your version is not already fast enough, but because people care a lot about parallel processing these days.
As you know, my attempts with ORC and parallel processing were a bit disappointing; the parallel version was slower: https://ssalewski.de/nimprogramming.html#_parsing_data_files_in_parallel
Well, only with ORC. But I think ORC was advertised a lot in a blog post some months ago, so people may use it. Unfortunately Mr. treeform missed the point: https://forum.nim-lang.org/t/9045 Much later I got a message from Mr. Yardanico: https://github.com/StefanSalewski/NimProgrammingBook/commit/dfb8a76467631cd17b85034e5ac5784a81e1f8e0#r71445238 But I am working on other stuff currently, and I still have no idea what the --tlsEmulation:off he mentioned really does. I know that some people reported speed issues with ORC and --threads:on. Unfortunately I still do not know IF the new view types are planned for Nim 2.0; maybe they will fix all that?
@Stefan_Salewski - Below is a program that parses data like your book's csvdata.txt. I use a cligen API (which I did for someone in another thread) to assist splitting a file into max-size chunks/parts, plus some MSlice utility code.
import std/[os, times], cligen/[mfile, mslice]

type
  Res = tuple[subMax: int; line: MSlice]
  ThrDat = tuple[part: MSlice; resP: ptr Res]

proc work(td: ThrDat) {.thread.} =
  var r: Res            # LOCAL RESULT AVOIDS CACHE THRASH
  var eoNum = 0         # LOCAL END-OF-NUM AVOIDS CACHE THRASH
  for ln in td.part.mSlices('\n'):
    if ln.len > 0 and ln[0] != '#':
      var i = 0
      for col in ln.mSlices(','):
        if i == 5:      # sample calc: line with max pop
          let pop = col.parseInt(eoNum)
          if pop > r.subMax: r.subMax = pop; r.line = ln
        inc i
  td.resP[] = r         # COPY ANSWER BACK

proc parse(mf: MFile, p: int): Res =
  var ts = newSeq[Thread[ThrDat]](p)
  var res = newSeq[Res](p)
  let parts = p.nSplit(mf.toMSlice)
  for i in 0 ..< parts.len:
    createThread ts[i], work, (parts[i], res[i].addr)
  joinThreads ts
  for r in res:
    if r.subMax > result.subMax: result = r

let t0 = epochTime()
let mf = mopen("csvdata.txt")
if mf == nil: quit "cannot open csvdata.txt", 1
let r = parse(mf, max(1, paramCount()))
echo epochTime() - t0," ",r.line  #; mf.close #if you want
4-core i7-6700k at 4.7 GHz; times on one of your 1e6.int files { Linux 5.17.3, gcc-11.3, Nim devel from last night (2a53fe07f), --threads:on --mm:arc -d:useMalloc -d:danger }:
$ ./prog 1
0.0376124 Qepwn,Nfinmgasi,Onudsd,Wxhn,35825.40,4999991
$ ./prog {1..4}
0.0109856 Qepwn,Nfinmgasi,Onudsd,Wxhn,35825.40,4999991
So, 37.61/10.985 = 3.42X faster using 4 vs 1 core, about 85.6% of the ideal 4X, which is not bad. Parallel scale-up is rarely perfect. I see similar scaling on a 16-core AMD 2950X box I tried (0.06801 s vs 0.005863 s = 11.6x faster = 72.5% * 16).
If you need an object at the end, you could always lift your parser into some new parseRow routine and apply that to $r.line. As is so often the case, whether not creating many unused intermediate strings counts as "cheating" or "optimization" all depends upon broader context. Skipping "really unneeded" work and saving answers (e.g. caching/pre-parsing/etc.) are both worth teaching, though.
You're welcome. It's mostly an adaptation of cligen/examples/linect.nim or adix/tests/wf.nim. Relative to your book's "read whole file, then chunk" idea, my nSplit does one less "pass over the data" by leveraging mmap/random access to the file. Both approaches in the large rely on statistical regularity of record sizes to avoid switching overhead from more fine-grained parallel dispatch.
The parsing in general (from Linux perf) seems to be ~50% just memchr (AVX2-vectorized). That is usually close to the best one can get, though in this case it is only about 4 GB/s (about 10% of my DIMM bandwidth).
It probably bears re-mention here that pre-parsing into random access binary formats pays more perf dividends for re-analysis of the same data, making "parsing performance" a less than perfect example. DB systems have done binary formats for 60 years with query languages "for non-programmers" (that "for" has always struck me as a real stretch..LOL). For programmer consumers, you can get most/all that perf with little work from something like nio or Vindaar's nimHDF5. Irregularities of DNA strings (that began this discussion) do tend to limit those benefits.