Hi everyone,
I am trying to parse a file (file.bim) with 21,283,105 lines line by line, apply a filter, and then save the results to a Table, and I would like to know if my following solution can be optimized further:
import std/tables
import std/strutils

proc parse_bim(bim: string, chrom: int): Table[string, seq[string]] =
  var
    snp: seq[string] = @[]
    a1: seq[string] = @[]
    a2: seq[string] = @[]
  for line in lines bim:
    let x = line.split('\t')
    if parseInt(x[0]) == chrom:
      snp.add(x[1])
      a1.add(x[4])
      a2.add(x[5])
  let vld_hash = {"SNP": snp, "A1": a1, "A2": a2}.toTable
  return vld_hash

let bim_file = parse_bim("file.bim", 1) # 1,653,512 lines are on chrom 1
This takes ~8 s in Nim and ~13 s in Python.
Compiled with: nim c -d:release -d:danger --gc:arc parse_file.nim
Thanks!
My feeling is that you are using a few operations, none of which is very fast. split() allocates a few strings, which takes some time, and add() for sequences also takes some time. Maybe your parseInt() is not really necessary; might a plain string compare work?
Note that parsing files was discussed in the Manning book, and I recently added a section to my book too: https://ssalewski.de/nimprogramming.html#_parsing_data_files_in_parallel. And I think there was a blog post called "Fast CSV parsing" or so many years ago, but I was not able to find it again.
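The plain-string-compare idea can be sketched like this. This is a minimal, hypothetical example (the proc name `filterByChrom` and the sample data are made up for illustration): the chromosome number is converted to a string once, outside the loop, so each line only needs a cheap string comparison instead of a parseInt call.

```nim
import std/strutils

proc filterByChrom(rows: seq[string], chrom: int): seq[string] =
  let key = $chrom              # convert the target chromosome once
  for line in rows:
    let x = line.split('\t')
    if x[0] == key:             # plain string compare, no parseInt per line
      result.add x[1]           # keep the SNP column

# Two fake .bim-style lines for demonstration.
let sample = @["1\trs1\t0\t100\tA\tG", "2\trs2\t0\t200\tC\tT"]
doAssert filterByChrom(sample, 1) == @["rs1"]
```

Note this still pays for the full split(); it only removes the integer parsing from the hot path.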
import parseutils, tables

proc parse_bim(bim: string, chrom: int): Table[string, seq[string]] =
  # Note: `bim` here is the file *contents*, not a file name.
  var
    snp: seq[string] = @[]
    a1: seq[string] = @[]
    a2: seq[string] = @[]
    i = 0
  while i < bim.len:
    var x: int
    i += parseInt(bim, x, i)
    if x == chrom:
      i += skipWhile(bim, {'\t'}, i)
      var a, b, c: string
      i += parseUntil(bim, a, '\t', i)          # column 1: SNP
      i += skipWhile(bim, {'\t'}, i)
      i += skipUntil(bim, '\t', i)              # skip column 2
      i += skipWhile(bim, {'\t'}, i)
      i += skipUntil(bim, '\t', i)              # skip column 3
      i += skipWhile(bim, {'\t'}, i)
      i += parseUntil(bim, b, '\t', i)          # column 4: A1
      i += skipWhile(bim, {'\t'}, i)
      i += parseUntil(bim, c, {'\n', '\r'}, i)  # column 5: A2
      snp.add a
      a1.add b
      a2.add c
    else:
      i += skipUntil(bim, {'\n', '\r'}, i)
    i += skipWhile(bim, {'\n', '\r'}, i)
  let vld_hash = {"SNP": snp, "A1": a1, "A2": a2}.toTable
  return vld_hash

let bim_file = parse_bim(readFile("file.bim"), 1) # 1,653,512 lines are on chrom 1
This is my attempt, which should cut down significantly on the number of allocations. Theoretically you could preallocate the seqs if you have some heuristic like average line length.
The whole code isn't very pretty, though once views are stabilised and become integrated into the stdlib, it should be possible to write code very similar to your original one with very few allocations as well.
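The preallocation idea above can be sketched with newSeqOfCap from the system module, assuming you can estimate the number of matching rows (the count below is taken from the original post's comment about chrom 1):

```nim
# Hedged sketch: reserve capacity up front so the seqs never reallocate
# while growing. `expected` is an estimate, not something computed here.
let expected = 1_653_512            # from the original post: rows on chrom 1
var
  snp = newSeqOfCap[string](expected)
  a1  = newSeqOfCap[string](expected)
  a2  = newSeqOfCap[string](expected)

snp.add "rs123"                     # adds within capacity never reallocate
doAssert snp.len == 1
```

If no good estimate exists, file size divided by an average line length gives a rough upper bound, which is still better than growing from zero.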
Thanks for the link. I think I was actually remembering this one:
https://nim-lang.org/blog/2017/05/25/faster-command-line-tools-in-nim.html https://www.reddit.com/r/programming/comments/6dct6e/faster_command_line_tools_in_nim/
It is not by Mr. Felsing, as I had initially supposed. But I cannot remember the details; I may read it again later. And well, there is also something from Miran: https://narimiran.github.io/2021/01/11/nim-parsing.html
@Stefan_Salewski - could be. This topic arises constantly. I was figuring "chrom" was short for "chromosome" and that LLN13's context was bioinformatics.
@LLN13, FWIW, this is a concrete realization of my interpretation of Stefan's abstract ideas that may help you. I find it clearer (subjective-ish) than doofenstein's. It may also be faster if data lines are much wider than the 6 columns you access. I haven't even tried to measure it, but speed really depends upon your data files.
import std/strutils

proc parse_bim(bim: string, chrom: int):
    (seq[string], seq[string], seq[string]) =
  let chrom = $chrom & "\t"
  for line in lines bim:
    if line.startsWith(chrom):
      var i = 0                         # only split needed columns
      for col in line.split('\t'):
        if i == 1: result[0].add col
        elif i == 4: result[1].add col
        elif i == 5: result[2].add col; break
        inc i

let (snp, a1, a2) = parse_bim("file.bim", 1)
A common "mistake" by people coming from Python: tables/dictionaries are actually quite an expensive data structure to create, while tuples are more or less free. So unless you need to add keys at runtime, you should probably stick with a tuple (or just a normal object, depending on what you're doing with it afterwards). Comparing speeds between Python and Nim requires you to write your code in idiomatic style in both languages; otherwise you won't get the speed of the actual language.
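A minimal sketch of the difference (the type name `BimCols` and the field names are made up for illustration): a named tuple is a fixed-layout value with no hashing or bucket allocation, while building a Table hashes every key and allocates bucket storage.

```nim
import std/tables

# A named tuple: fields are resolved at compile time, no hashing involved.
type BimCols = tuple[snp, a1, a2: seq[string]]

var asTuple: BimCols                       # essentially free to create
asTuple.snp.add "rs1"

# The Table equivalent hashes "SNP" and allocates bucket storage.
let asTable = {"SNP": @["rs1"]}.toTable

doAssert asTuple.snp[0] == asTable["SNP"][0]
```

Field access on the tuple (`asTuple.snp`) compiles to a plain offset, whereas `asTable["SNP"]` hashes the key on every lookup.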
I'd be curious to see how fast a parser generator like npeg would fare in a scenario like this. Do you have a test file I could try it on?
Note that parsing files was discussed in the Manning book, and I recently added a section to my book too[...]
What about using streams and an iterator?
import streams

proc pgmRead(strm: FileStream, filePos: int = 0): auto =
  var data: char
  strm.setPosition(filePos)
  return iterator: (int, char) =
    while not strm.atEnd:
      let idx = getPosition(strm)
      read(strm, data)
      yield (idx, data)
    strm.close()
Full code for parsing PGM: https://gist.github.com/ingoogni/b422eaa251e9c9192071be94d8269de9
Note that parsing files was discussed in the Manning book
Thanks for mentioning! In case you want to have a look at the example code from Nim in Action: https://github.com/dom96/nim-in-action-code/tree/master/Chapter6/WikipediaStats.
Happy to answer any questions about it.
@PMunch - my hunch about bioinformatics was right. A test file I was using is called "1kg_phase1_all.bim". I got mine out of here, but I think this file is widely distributed, as "1kg_phase1_all" turns up a lot of hits. The data seems to be produced by a program called "plink", specifically to be easily "split parsed". The md5sum of the data I used is 4087b5280f40a93025d5d26d777ce6e8, with 1228635806 bytes in 39728178 lines.
Using that, I re-did the "likely Python being ported" to this:
def parse_bim(bim, chrom):
    snp = []; a1 = []; a2 = []
    for line in open(bim):
        x = line.split('\t')
        if x[0] == chrom:
            snp.append(x[1]); a1.append(x[4]); a2.append(x[5])
    return (snp, a1, a2)

assert len(parse_bim("file.bim", "1")[1]) == 3007196
That took about 10.5 seconds on my test machine.
Then I optimized the Python a little to:
def rowsFor(bim, chrom):
    snp = []; a1 = []; a2 = []; n = 0
    found = False               # plink emits sorted by chromosome
    start = chrom + "\t"
    for line in open(bim):
        if line.startswith(start):
            found = True
            x = line.strip().split('\t')
            yield (x[1], x[4], x[5], n)
            n += 1
        elif found: break

for (snp, a1, a2, n) in rowsFor("file.bim", "1"):
    if n == 3007195: print(snp, a1, a2)  # print last
This ran in about 1.78 s on Python 3.9.12 (and 1.57 s on Python 2.7.18). { Yes, yes... I have heard for many years how someday Py3 would catch up to Py2 in performance. It now seems that day may never come, but this is the Nim Forum, so moving on... }
I then optimized my original Nim suggestion (tuned for a beginner) to this:
import std/memfiles, system/ansi_c

iterator rowsFor(bim, chrom: string):
    (MemSlice, MemSlice, MemSlice, int) =
  var result: (MemSlice, MemSlice, MemSlice, int)
  let chrom = chrom & "\t"
  var found = false                 # plink emits sorted by chromosome
  let (p, n) = (cast[pointer](chrom[0].addr), chrom.len)
  var mf = memfiles.open(bim)
  var line: MemFile                 # dummy obj to use memSlices on
  for ln in mf.memSlices:
    if ln.size > n and cmemcmp(ln.data, p, n.csize_t) == 0:
      found = true
      line.mem = ln.data; line.size = ln.size
      var i = 0                     # only split needed columns
      for col in line.memSlices('\t'):
        if i == 1: result[0] = col
        elif i == 4: result[1] = col
        elif i == 5: result[2] = col; break
        inc i
      yield result
      inc result[3]
    elif found: break               # all found in one block of lines
  mf.close

for (snp, a1, a2, n) in "file.bim".rowsFor("1"):
  if n == 3_007_195: echo $snp," ",$a1," ",$a2  # print last
which runs in 0.124s (--mm:arc -d:danger) on the same machine - over 14X faster than the fastest Python. A next step might be parallel parsing, but this is already about 84x faster than the original Python.
I don't know if @LLN13 is still reading any of this, but how I usually like to put it is: Nim responds well to optimization effort. I would be curious to see @PMunch's npeg/parser generator ideas play out if this test data file inspires him. :-)
Thank you very much for testing. A version doing parallel processing with ORC would still be nice -- not because your version is not already fast enough, but because people care a lot about parallel processing these days.
As you know, my attempts with ORC and parallel processing were a bit disappointing; the parallel version was slower: https://ssalewski.de/nimprogramming.html#_parsing_data_files_in_parallel
Well, only with ORC. But I think ORC was advertised a lot in a blog post some months ago, so people may use it. Unfortunately Mr. treeform missed the point: https://forum.nim-lang.org/t/9045 Much later I got a message from Mr. Yardanico: https://github.com/StefanSalewski/NimProgrammingBook/commit/dfb8a76467631cd17b85034e5ac5784a81e1f8e0#r71445238 But I am working on other stuff currently, and I still have no idea what the --tlsEmulation:off he mentioned really does. I know that some people reported speed issues with ORC and --threads:on. Unfortunately I still do not know IF the new view types are planned for Nim 2.0; maybe they will fix all that?
@Stefan_Salewski - Below is a program that parses data like your book's csvdata.txt. I use a cligen API (which I did for someone in another thread) to assist splitting a file into max-size chunks/parts, plus some MSlice utility code.
import std/[os, times], cligen/[mfile, mslice]

type
  Res = tuple[subMax: int; line: MSlice]
  ThrDat = tuple[part: MSlice; resP: ptr Res]

proc work(td: ThrDat) {.thread.} =
  var r: Res            # LOCAL RESULT AVOIDS CACHE THRASH
  var eoNum = 0         # LOCAL END-OF-NUM AVOIDS CACHE THRASH
  for ln in td.part.mSlices('\n'):
    if ln.len > 0 and ln[0] != '#':
      var i = 0
      for col in ln.mSlices(','):
        if i == 5:      # sample calc: line with max pop
          let pop = col.parseInt(eoNum)
          if pop > r.subMax: r.subMax = pop; r.line = ln
        inc i
  td.resP[] = r         # COPY ANSWER BACK

proc parse(mf: MFile, p: int): Res =
  var ts = newSeq[Thread[ThrDat]](p)
  var res = newSeq[Res](p)
  let parts = p.nSplit(mf.toMSlice)
  for i in 0 ..< parts.len:
    createThread ts[i], work, (parts[i], res[i].addr)
  joinThreads ts
  for r in res:
    if r.subMax > result.subMax: result = r

let t0 = epochTime()
let mf = mopen("csvdata.txt")
if mf == nil: quit "cannot open csvdata.txt", 1
let r = parse(mf, max(1, paramCount()))
echo epochTime() - t0," ",r.line  #; mf.close #if you want
4-core i7-6700k at 4.7 GHz; times on one of your 1e6.int files { Linux 5.17.3, gcc-11.3, Nim devel from last night (2a53fe07f), --threads:on --mm:arc -d:useMalloc -d:danger }:
$ ./prog 1
0.0376124 Qepwn,Nfinmgasi,Onudsd,Wxhn,35825.40,4999991
$ ./prog {1..4}
0.0109856 Qepwn,Nfinmgasi,Onudsd,Wxhn,35825.40,4999991
So, 37.61/10.985 = 3.42X faster using 4 vs 1 core, about 85.6% of the ideal 4X, which is not bad. Parallel scale-up is rarely perfect. I see similar scaling on a 16-core AMD 2950X box I tried (0.06801 s vs 0.005863 s = 11.6x faster = 72.5% * 16).
If you need an object at the end, you could always lift your parser into some new parseRow routine and apply that to $r.line. As is so often the case, whether not creating many unused intermediate strings counts as "cheating" or "optimization" all depends upon broader context. Skipping "really unneeded" work and saving answers (e.g. caching/pre-parsing/etc.) are both worth teaching, though.
You're welcome. It's mostly an adaptation of cligen/examples/linect.nim or adix/tests/wf.nim. Relative to your book's "read whole file, then chunk" idea, my nSplit does one less "pass over the data" by leveraging mmap/random access to the file. Both approaches in the large rely on statistical regularity of record sizes to avoid switching overhead from more fine-grained parallel dispatch.
The parsing in general (from Linux perf) seems to be ~50% just memchr (AVX2-vectorized). That is usually close to the best one can get, though in this case it is only about 4 GB/s (about 10% of my DIMM bandwidth).
It probably bears re-mention here that pre-parsing into random access binary formats pays more perf dividends for re-analysis of the same data, making "parsing performance" a less than perfect example. DB systems have done binary formats for 60 years with query languages "for non-programmers" (that "for" has always struck me as a real stretch..LOL). For programmer consumers, you can get most/all that perf with little work from something like nio or Vindaar's nimHDF5. Irregularities of DNA strings (that began this discussion) do tend to limit those benefits.