What is your opinion on my general purpose CSV parsing lib?
Unfortunately I do not really understand the large performance difference between the default GC and ARC/ORC in the last parallel example.
That was my only question. I hid it a bit, for good reasons :-).
Most of your other remarks seem to be true, and I mentioned some of the points in my text already. Remember that it is mostly a book for kids, maybe from 14 to 19 years old. This section is from the last part, called "advanced Nim", but presenting a state-of-the-art CSV parser would be too much. My intent was just to sketch how tasks can be parallelized, and at the same time to compare regex, peg, split() and parseutils parsing. Of course you know all that, but not all beginners do. The problem is that I currently cannot give the kids any hint why ARC/ORC is much slower in this case. I have to admit that I have not yet started thinking about it at all: I wrote the text and examples of that section on Thursday, and did three hours of proofreading yesterday. A few minutes before I intended to ship that section, I ran a few final tests, including compiling with ARC, and got those disappointing results. Maybe that is a good motivation for me to learn some more about ARC, move semantics and all that. Until now I have only used ARC and its destructors, but never cared too much about the details.
Yes, removing splitLines() helps a lot, the runtime is now down to 100 ms. So I can leave that section in the book. The question is still why it is so slow with splitLines() and ARC.
The new code is something like:
# nim c --threads:on --gc:arc -d:release t.nim
import std/threadpool
import std/parseutils
from strutils import splitLines

const
  FileName = "csvdata.txt"
  BlockSize = 1024 * 1024 - 1

type
  Res = object
    dist, count, state, vreg: string
    area: float
    pop: int

proc candidate(lines: string): Res =
  var res: Res
  #result = new Res
  result = Res(area: NegInf, pop: int.low)
  var l: string
  var j: int
  while true:
    j += parseUntil(lines, l, '\n', j) + 1
    if l.len == 0:
      break
    #for l in lines.splitLines():
    if l.len > 0 and l[0] != '#': # skip first two and all other comment lines
      var i: int
      i += parseUntil(l, res.dist, ',', i) + 1
      i += parseUntil(l, res.count, ',', i) + 1
      i += parseUntil(l, res.state, ',', i) + 1
      i += parseUntil(l, res.vreg, ',', i) + 1
      i += parseFloat(l, res.area, i) + 1
      i += parseInt(l, res.pop, i) + 1
      if res.pop > result.pop:
        result = res

proc main =
  var flowVarSeq: seq[FlowVar[Res]]
  var buf = newString(BlockSize)
  var f: File = open(FileName, fmRead)
  while not f.endOfFile:
    var buf = newString(BlockSize)
    let res = f.readChars(buf)
    if f.endOfFile:
      buf.setLen(res)
    else:
      var i = res - 1
      while buf[i] != '\n':
        dec(i)
      inc(i)
      f.setFilePos(i - res, fspCur)
      buf.setLen(buf.len + i - res)
    flowVarSeq.add(spawn candidate(buf))
    assert buf[buf.high] == '\n'
    #buf.setLen(BlockSize)
  f.close
  var final = Res(area: NegInf, pop: int.low)
  for c in flowVarSeq:
    let h = ^c
    if h.pop > final.pop:
      final = h
  echo "Final result:", final

main()
Thinking about it, I am just wondering whether iterators that return strings have to do an allocation for each yield. As strings in Nim have value semantics, one single initial allocation may be enough? Or maybe I am wrong.
Looking at https://github.com/nim-lang/Nim/blob/version-1-6/lib/system/io.nim#L915
may support my guess:
var f = open(filename, bufSize=8000)
try:
  var res = newStringOfCap(80)
  while f.readLine(res): yield res
finally:
  close(f)
That looks like a reuse of the string res to me. So maybe the splitLines() iterator does not reuse strings and so may be slow for ARC? From memory, splitLines() uses substr(), and I guess substr() may allocate?
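To test that guess rather than argue about it, something like the following minimal sketch could be compiled once per memory management mode and the timings compared. The proc names and the synthetic input are made up for illustration and are not the csvdata.txt from above. Whether either loop really allocates per line depends on the stdlib implementation and on the allocator, so this only puts the two loop shapes side by side.

```nim
import std/[strutils, parseutils, monotimes, times]

proc countWithSplitLines(data: string): int =
  # The iterator yields each line as a separate string value.
  for line in data.splitLines():
    if line.len > 0:
      inc result

proc countWithParseUntil(data: string): int =
  # One string variable is declared outside the loop and
  # assigned by parseUntil on every pass.
  var line: string
  var pos = 0
  while pos < data.len:
    pos += parseUntil(data, line, '\n', pos) + 1
    if line.len > 0:
      inc result

when isMainModule:
  # Hypothetical synthetic input with the same field layout as the Res object.
  let data = repeat("dist,count,state,vreg,1.5,1000\n", 200_000)

  var start = getMonoTime()
  let a = countWithSplitLines(data)
  echo "splitLines: ", (getMonoTime() - start).inMilliseconds, " ms"

  start = getMonoTime()
  let b = countWithParseUntil(data)
  echo "parseUntil: ", (getMonoTime() - start).inMilliseconds, " ms"

  # Both loops must see the same number of non-empty lines.
  doAssert a == b
```

This is single threaded on purpose, so it would have to be wrapped in spawn calls like the program above to see whether the difference only shows up with threads.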
Funny fact, this user observed the same factor 5 slowdown for ARC:
As noted in the other thread, with -d:useMalloc we can fix the issue for --gc:arc and get runtimes of about 120 ms, equal to refc.
But for --gc:orc, -d:useMalloc does not help that much. For the initial code version, and for the code from above with splitLines() replaced by parseUntil, I get runtimes from 200 to 310 ms. The large variance is interesting; it seems to occur only for ORC, and was reported by others as well, see https://forum.nim-lang.org/t/7755#49311.