For some applications, it is necessary to read large (10-100 GB+) compressed files line by line.
Moreover, it would be useful for the program to decide whether a file is compressed (ends in ".gz") and, accordingly, open it as a File or as a GZFile.
Specifically, being able to write the following function would be very useful and much easier than playing with streams:
proc myopen(filename: string, mode: FileMode = fmRead): File =
  # Look for ".gz" at the end of "filename"
  # to decide if a File or a GZFile should be returned
  if filename.endsWith(".gz"):   # endsWith is from std/strutils
    gzopen(filename, mode)
  else:
    open(filename, mode)
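For what it's worth, until a gzopen like the one above exists, the same dispatch can be sketched today by returning a Stream instead of a File. This is only a sketch, assuming the zip/gzipfiles module from the "zip" nimble package; myOpenStream is a hypothetical name:

```nim
import std/[streams, strutils]
import zip/gzipfiles   # from the "zip" nimble package

# Hypothetical helper: pick the stream type based on the extension.
proc myOpenStream(filename: string): Stream =
  if filename.endsWith(".gz"):
    newGZFileStream(filename)        # decompresses transparently
  else:
    newFileStream(filename, fmRead)

# Usage: read any file, compressed or not, line by line.
for line in myOpenStream("data.txt.gz").lines:
  echo line
```

This works because GZFileStream and FileStream both inherit from Stream, so the streams module's lines iterator applies to either.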
The zip/gzipfiles module offers the possibility to create GZFileStream objects, but I'd argue that having to work with streams makes this harder on new Nim users than it could (should?) be.
Having GZFile objects (that behave like File) and a gzopen proc (that behaves like open does for files) would make this case much easier to handle.
I think this would be very useful to the many data scientists attracted to Nim; these people, who already use Python a lot, are a big pool of potential Nim enthusiasts.
Thanks. I looked at nimarchive. I cannot see how this can be a simpler approach than using zip/gzipfiles and Streams.
Do you have sample code using nimarchive to read gzip files line by line?
I put together a GZipInputStream implementation here: https://gist.github.com/aboisvert/c08e63727d0a3c5de53afa04498e9a90
You can use it to read gzip files line by line like this:
let filestream = newFileStream(filename, fmRead)
let gzip = newGZipInputStream(filestream)
for s in gzip.lines:
  echo s
Hope it helps.