nimforum mirror - CSV file parsing

Stefan_Salewski [2022-03-25T21:15:49+01:00]

Yesterday I wrote the section about CSV file parsing with regex, pegs, split() and parseutils. It is not really that interesting; I think I read similar material years ago in various blog posts, and it was also sketched in the Manning book. But it took only a few hours to write, and maybe it is still interesting for beginners. Unfortunately I do not really understand the large performance difference of the last parallel example for default GC and ARC/ORC. I don't think that my code is that bad or wrong. Is the threadpool just not that well designed for ARC? Should I try the threads module with channels instead? Well, maybe I will remove that section again; it was not much work to write, and it is not really necessary. https://ssalewski.de/nimprogramming.html#_parsing_data_files_in_parallel

treeform [2022-03-25T23:07:45+01:00]

I don't know if this will work, but here are some ideas to speed this up:

I don't know if threading is a good match for single-file CSV parsing. Rather than having one 1 TB CSV file, I would generate a number of GB-sized CSV files and process one file per thread.

Most of the time is probably spent reading the file from disk; make sure you are not timing that part.

You split the work into 1000-line chunks; maybe a better idea is to split the file into N chunks, where N is the number of cores, so there is no real task switching.

Splitting strings into smaller strings is probably going to be slow. I would keep the original buffer and pass indexes into the buffer where you want threads to start and stop reading.

Using splitLines and anything else that creates strings that are not part of the final output is probably also slowish.

I don't think it's possible to split a CSV file if it has string escaping... as you never know whether a given position is inside a quoted data string or not.

A big slowdown in my CSV library is parsing the header and potentially reordering the columns; I don't think your parser does that.
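
The two suggestions above (N chunks, and indexes into one shared buffer) can be sketched roughly as follows. This is only a minimal sketch, not code from the book or from tabby: `chunkBounds` is a hypothetical helper, and it assumes that the whole file is already in one string and that no field contains an embedded line break (the escaping problem mentioned above).

```nim
# Sketch only: `chunkBounds` is a hypothetical helper, not a library proc.
proc chunkBounds(buf: string; n: int): seq[Slice[int]] =
  ## Splits `buf` into at most `n` index ranges. Every range ends at a
  ## '\n' or at the end of the buffer. No substrings are created.
  if buf.len == 0 or n <= 0:
    return
  let target = (buf.len + n - 1) div n   # bytes per chunk, rounded up
  var first = 0
  while first < buf.len:
    var last = min(first + target - 1, buf.high)
    # Extend the chunk to the next '\n' so that no line is cut in half.
    while last < buf.high and buf[last] != '\n':
      inc last
    result.add(first .. last)
    first = last + 1

when isMainModule:
  let data = "a,1\nbb,22\nccc,333\ndddd,4444\n"
  for s in chunkBounds(data, 2):
    echo s.a, " .. ", s.b
```

Each range could then be handed to a worker together with the shared buffer, so the worker only reads between its start and stop index.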

What is your opinion on my general purpose CSV parsing lib?

https://github.com/treeform/tabby

Stefan_Salewski [2022-03-26T07:56:03+01:00]

> Unfortunately I do not really understand the large performance difference of the last parallel example for default GC and ARC/ORC.

That was my only question. I hid it a bit, for good reasons :-).

Most of your other remarks seem to be true, and I mentioned some of the points in my text already. Remember that it is mostly a book for kids, maybe from 14 to 19 years -- this section is from the last part called "advanced Nim", but presenting a state-of-the-art CSV parser would be too much. My intent was just to sketch how tasks can be parallelized, and at the same time compare regex, peg, split() and parseutils parsing. Of course you know all that, but not all beginners do. The problem is that I currently cannot give the kids any hint why ARC/ORC is much slower in this case. I have to admit that I have not yet started thinking about it at all -- I wrote the text and examples of that section on Thursday, and did three hours of proofreading yesterday. Some minutes before I intended to ship that section, I did a few final tests, including compiling with ARC, and got those disappointing results. Maybe that is a good motivation for me to learn some more about ARC, move semantics and all that. Until now I have only used ARC and its destructors, but never cared too much about the details.

Stefan_Salewski [2022-03-26T09:45:37+01:00]

Yes, removing splitLines() helps a lot; the runtime is now down to 100 ms. So I can leave that section in the book. The question is still why it is so slow with splitLines() and ARC.

The new code is something like:

# nim c --threads:on --gc:arc -d:release t.nim

import std/threadpool
import std/parseutils
from strutils import splitLines
const
  FileName = "csvdata.txt"
  BlockSize = 1024 * 1024 - 1

type
  Res = object
    dist, count, state, vreg: string
    area: float
    pop: int

proc candidate(lines: string): Res =
  var res: Res
  #result = new Res
  result = Res(area: NegInf, pop: int.low)
  var l: string
  var j: int
  while j < lines.len: # run until the chunk is consumed; an empty line no longer ends the loop early
    j += parseUntil(lines, l, '\n', j) + 1
  #for l in lines.splitLines():
    if l.len > 0 and l[0] != '#': # skip first two and all other comment lines
      var i: int
      i += parseUntil(l, res.dist, ',', i) + 1
      i += parseUntil(l, res.count, ',', i) + 1
      i += parseUntil(l, res.state, ',', i) + 1
      i += parseUntil(l, res.vreg, ',', i) + 1
      i += parseFloat(l, res.area, i) + 1
      i += parseInt(l, res.pop, i) + 1
      if res.pop > result.pop:
        result = res

proc main =
  var flowVarSeq: seq[FlowVar[Res]]
  var f: File = open(FileName, fmRead)
  while not f.endOfFile:
    var buf = newString(BlockSize)
    let res = f.readChars(buf)
    if f.endOfFile:
      buf.setLen(res)
    else:
      var i = res - 1
      while buf[i] != '\n':
        dec(i)
      inc(i)
      f.setFilePos(i - res, fspCur)
      buf.setLen(buf.len + i - res)
    flowVarSeq.add(spawn candidate(buf))
    assert buf[buf.high] == '\n'
    #buf.setLen(BlockSize)
  
  f.close
  
  var final = Res(area: NegInf, pop: int.low)
  for c in flowVarSeq:
    let h = ^c
    if h.pop > final.pop:
      final = h
  
  echo "Final result:", final

main()

Stefan_Salewski [2022-03-26T11:41:19+01:00]

Thinking about it, I am just wondering if iterators that are returning strings have to do an allocation for each yield? As strings in Nim have value semantics, one single initial allocation may be enough? Or maybe I am wrong.

Looking at https://github.com/nim-lang/Nim/blob/version-1-6/lib/system/io.nim#L915

may support my guess:

  var f = open(filename, bufSize=8000)
  try:
    var res = newStringOfCap(80)
    while f.readLine(res): yield res
  finally:
    close(f)

That looks like a reuse of the string res to me. So maybe the splitLines() iterator does not reuse strings and so may be slow for ARC? From my memory, splitLines() uses substr(), and I guess substr() may allocate?
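
One way to sidestep the question of per-line allocations is to not create line strings at all and to work with index ranges into the existing buffer instead. The following is a minimal sketch of that idea, not the code from the book: `lineBounds` and `countDataLines` are hypothetical names, and only `skipUntil` from std/parseutils is a real library proc. Whether allocation is really what makes splitLines() slow under ARC remains a guess.

```nim
import std/parseutils

iterator lineBounds(data: string): Slice[int] =
  ## Hypothetical helper: yields the index range of every line in `data`,
  ## without the terminating '\n'. An empty line gives a range with b < a.
  var pos = 0
  while pos < data.len:
    let n = skipUntil(data, '\n', pos)  # number of chars before the next '\n'
    yield pos .. pos + n - 1
    pos += n + 1                        # step over the '\n' itself

proc countDataLines(data: string): int =
  ## Counts the lines that are neither empty nor comments starting with '#'.
  for b in lineBounds(data):
    if b.a <= b.b and data[b.a] != '#':
      inc result

when isMainModule:
  echo countDataLines("# header\na,1\n\nb,2\n")
```

The field parsers used in the code above (parseUntil, parseFloat, parseInt) all take a start index, so they could be applied to the buffer directly starting at b.a.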

Stefan_Salewski [2022-03-26T13:25:22+01:00]

Fun fact: this user observed the same factor-of-5 slowdown for ARC:

https://forum.nim-lang.org/t/6968

Stefan_Salewski [2022-03-26T17:59:37+01:00]

As noted in the other thread, with -d:useMalloc we can fix the issue for --gc:arc and get runtimes of about 120 ms, equal to refc.

But for --gc:orc, -d:useMalloc does not help that much. For the initial code version, and for the code from above with splitLines() replaced by parseUntil, I get runtimes from 200 to 310 ms. The large variance is interesting; it seems to occur only for ORC and was reported by others as well, see https://forum.nim-lang.org/t/7755#49311.
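
To put numbers on such a variance, one could repeat the measurement several times inside one process and report the fastest and slowest run. A minimal sketch, assuming a hypothetical `timeRuns` helper; `dummyWork` is only a stand-in for the real parsing proc, and the result says nothing about the cause of the variance.

```nim
import std/[monotimes, times]

proc timeRuns(runs: int; work: proc ()): tuple[minMs, maxMs: int64] =
  ## Hypothetical helper: calls `work` `runs` times and returns the
  ## shortest and longest wall-clock time in milliseconds.
  result = (minMs: int64.high, maxMs: int64.low)
  for run in 1 .. runs:
    let start = getMonoTime()
    work()
    let ms = (getMonoTime() - start).inMilliseconds
    result.minMs = min(result.minMs, ms)
    result.maxMs = max(result.maxMs, ms)

when isMainModule:
  proc dummyWork() =
    # Stand-in workload; replace it with the code to be measured.
    var s = 0
    for i in 1 .. 1000: s += i
    doAssert s > 0
  let (fastest, slowest) = timeRuns(5, dummyWork)
  echo "min: ", fastest, " ms, max: ", slowest, " ms"
```

If `work` contains only the parsing and not the file read, the disk access mentioned earlier in the thread stays out of the timing.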
