I'm using the following approach to read a file line by line. This is a bogus example that only returns each line, but you can imagine I am doing some work on each line.
```nim
import std/strutils  # for strip

iterator line_iterator*(input_filename: string): string =
  var infile: File
  if open(infile, input_filename):
    defer: close(infile)
    while not endOfFile(infile):
      # Do some real work here
      yield infile.readLine.strip
```
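Usage looks like this (the filename `data.txt` is just for illustration):

```nim
for line in line_iterator("data.txt"):
  echo line
```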
I would like to do exactly the same, but reading lines from a gzip-compressed file (e.g. file.gz).
I'm trying to use `import zip/gzipfiles` and then use File, GzFile, GzFileStream... but I'm failing.
What would my code above look like if I was iterating over the lines of a .gz file?
I have found this answer to a similar question, but is there a way to avoid using Streams?
Open your .gz file with the os procs, read it completely into a string buffer, and use zlib's uncompress.
I recommend you just use streams to avoid the extra uncompressed buffer (and maybe even the result buffer if you only work line by line).
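For reference, here is a minimal sketch of a stream-based version, assuming the `zip` nimble package (`nimble install zip`) and its `zip/gzipfiles` module; `newGzFileStream` opens the .gz file as an ordinary `Stream`, so the iterator body barely changes:

```nim
import zip/gzipfiles          # nimble install zip
import std/[streams, strutils]

iterator gz_line_iterator*(input_filename: string): string =
  ## Yields stripped lines from a gzip-compressed text file.
  let stream = newGzFileStream(input_filename)
  defer: stream.close()
  var line: string
  # readLine returns false at end of stream
  while stream.readLine(line):
    # Do some real work here
    yield line.strip
```

Because the data is decompressed in chunks as the stream is read, memory use stays roughly constant regardless of the file size.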
Storing the whole file in memory will not work well for this application, where the input file could be 10 or 100 GB :/
I need an iterator (which I will write) that reads the gzipped file line by line and yields each meaningful chunk (DNA sequences with their associated names and qualities) on the fly.
Since you mention DNA seqs, you might want to take a look at hts-nim. It's easy to get htslib on Unix systems, and not too difficult on Windows using MSYS2.
You could probably write an iterator around its BGZ type.
Hi @mratsim!
zip/gzipfiles works properly, but only for .gz files. In my case I already get a gzipped string from an HTTP response, and unfortunately the zlib uncompress you recommended doesn't work on that string; I get: Error: unhandled exception: zlib version mismatch! [ZlibStreamError]
Please help, thank you!