I'm using the following approach to read a file line by line. This is a bogus example that only returns each line, but you can imagine I am doing some work on each line.
```nim
import std/strutils  # for strip

iterator line_iterator*(input_filename: string): string =
  var infile: File
  if open(infile, input_filename):
    defer: close(infile)
    while not endOfFile(infile):
      # Do some real work here
      yield infile.readLine.strip
```
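Usage looks like this (the filename `data.txt` is just for illustration):

```nim
for line in line_iterator("data.txt"):
  echo line
```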
I would like to do exactly the same, but reading lines from a gzip-compressed file (e.g. file.gz).
I'm trying to use `import zip/gzipfiles` and then use File, GzFile, GzFileStream... but I'm failing.
What would my code above look like if I was iterating over the lines of a .gz file?
I have found this answer to a similar question, but is there a way to avoid using Streams?
Open your .gz file with the os procs, read it completely into a string buffer, and use zlib's uncompress.
I recommend you just use streams to avoid the extra uncompressed buffer (and maybe even the result buffer if you only work line by line).
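For reference, here is a minimal sketch of a stream-based version, assuming the `zip` nimble package (`nimble install zip`) and its `zip/gzipfiles` module; `newGzFileStream` opens the .gz file as an ordinary `Stream`, so the iterator body barely changes:

```nim
import zip/gzipfiles          # nimble install zip
import std/[streams, strutils]

iterator gz_line_iterator*(input_filename: string): string =
  ## Yields stripped lines from a gzip-compressed text file.
  let stream = newGzFileStream(input_filename)
  defer: stream.close()
  var line: string
  # readLine returns false at end of stream
  while stream.readLine(line):
    # Do some real work here
    yield line.strip
```

Because the data is decompressed in chunks as the stream is read, memory use stays roughly constant regardless of the file size.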
Storing the whole file in memory will not work well for this application, where the input file could be 10 or 100 GB :/
I need an iterator (which I will write) that reads the gzipped file line by line and yields each meaningful chunk (DNA sequences with their associated names and qualities) on the fly.
Since you mention DNA seqs, you might want to take a look at hts-nim. It's easy to get htslib on Unix systems, and not too difficult on Windows using MSYS2.
You could probably write an iterator around its BGZ type.
Hi @mratsim!
zip/gzipfiles works properly, but only for .gz files. In my case I already get a gzipped string from an HTTP response, and unfortunately the zlib uncompress you recommended doesn't work on that string; I get: Error: unhandled exception: zlib version mismatch! [ZlibStreamError]
Please help, thank you!