I'm attempting to process a very large CSV file (300 GB). About 66 GB into the file, a row raises an EInvalidCsv/CsvError exception and the program exits. I'd like to catch this error, ignore it, and continue processing with the next row. I tried to catch/discard it, but that doesn't work:
import std/[os, streams, strutils, parsecsv]

var x: CsvParser
var s = newFileStream(paramStr(1), fmRead)
open(x, s, paramStr(1))
while readRow(x):
  try:
    if len(x.row) != 17:
      continue
    else:
      echo join(x.row, "|")
  except CsvError:
    discard
Is there a good way to ignore errors and continue processing the CSV file?
You may be able to restructure your code like this:
import std/[os, syncio, streams, strutils, parsecsv]

var x: CsvParser
open(x, newFileStream(stdin), paramStr(1))
while true:
  try:
    if not readRow(x): break   # readRow is now inside the try
    if len(x.row) == 17:
      echo join(x.row, "|")
  except CsvError:
    discard
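With this structure, readRow itself is inside the try, so a CsvError raised mid-parse is actually caught instead of escaping the loop. Note also that this version reads from stdin (the paramStr(1) argument is only used by parsecsv for error messages), so you would invoke it as something like ./prog mydata.csv < mydata.csv.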
You may also find https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim useful, or its same-folder cousin c2dsv.c for the Nim-impaired. While those tools do log errors, they do not raise any exceptions, so depending upon how corrupt your CSV is they may work better (or worse) for you.
The idea of those is to transform quoted-escaped, RFC 4180-ish CSV input into an output soundly parsable with a regular stdin.lines loop and a split iterator on a hard TAB ('\t') or whatever delimiter you pick. If they do work for you, then on most multi-core setups (which you probably have if you have 300 GB of data), you get a little parallelism almost for free from the 2-stage pipeline, since the transformation runs on one core while the line-breaking/field-breaking runs on a second. Alternatively, if you happen to have 300 GB of spare space for the output, then once you can soundly break on newlines you can frame the file into 8 parts (or however many cores you have) with nSplit, which was inspired by an older Forum thread.
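To make that pipeline concrete, here is a minimal sketch of the downstream consumer, assuming the data has already been run through c2tsv (the program name filter17 in the comment is hypothetical):

import std/strutils

# Run as: c2tsv < big.csv | ./filter17
for line in stdin.lines:          # framing on newlines is now sound
  let fields = line.split('\t')   # hard TAB, as emitted by c2tsv
  if fields.len == 17:
    echo fields.join("|")

Since c2tsv does the quote-unescaping on one core while this loop does the line/field breaking on another, the two stages overlap naturally.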
cligen also has some zero-copy interfaces for this kind of IO that might help, as recently discussed here.