I have a program that so far can read its input either from stdin or from a file given on the command line.
It is parsed line by line in an iterator, as follows:
iterator parseFastqs(input: File): Fastq =
  var
    nameLine: string
    nucLine: string
    quaLine: string
  while not input.endOfFile:
    nameLine = input.readLine()
    nucLine = input.readLine()
    discard input.readLine()
    quaLine = input.readLine()
    yield makeFastq(nameLine, nucLine, quaLine)
This iterator is then passed to a procedure that processes the Fastq records further, so it appears that I need to wrap it in a closure iterator. I use the following solution found on Stack Overflow:
# https://stackoverflow.com/a/35697846
template toClosure(inlineIterator): auto =
  ## Wrap an inline iterator in a first-class closure iterator.
  iterator closureIterator: type(inlineIterator) {.closure.} =
    for elem in inlineIterator:
      yield elem
  closureIterator
(I was not able to make it a closure iterator directly. See https://stackoverflow.com/q/48260487/1878788)
So after processing the command-line options, I proceed as follows:
when isMainModule:
  let args = docopt(doc)
  var
    inFastqFilename: Value
    # [declare other variables]
  # [process args and set variable]
  let inputFqs =
    if not inFastqFilename:
      toClosure(parseFastqs(stdin))
    else:
      let fh = open($inFastqFilename)
      toClosure(parseFastqs(fh))
  # [pass inputFqs and other variables to the function that does the main processing]
Now I want to be able to process gzipped input, still either from stdin or from a file passed on the command line. I wanted to use this: https://github.com/nim-lang/zip/blob/master/zip/gzipfiles.nim
This seems to be based on streams rather than files. The two objects (File and Stream) seem to have things in common in their interface, such as the possibility of being parsed line by line using readLine. In other respects, however, they differ, which makes them not completely interchangeable: with a File, I can use endOfFile, while with a Stream, I need to replace this with atEnd.
So, first questions: What are streams, what are they used for?
The documentation is not really helpful for me:
This module provides a stream interface and two implementations thereof: the FileStream and the StringStream which implement the stream interface for Nim file objects (File) and strings. Other modules may provide other implementations for this standard stream interface.
Anyway, I tried to use streams instead of files, and my code seemed to work provided I changed endOfFile to atEnd:
let inputFqs =
  if not inFastqFilename:
    toClosure(parseFastqs(newFileStream(stdin)))
  else:
    let fh = open($inFastqFilename)
    toClosure(parseFastqs(newFileStream(fh)))
But when I try to add gzipped input to the mix:
let inputFqs =
  if not inFastqFilename:
    if openGZ:
      # Not sure this is the correct way to deal with
      # the absence of a newGZFileStream accepting a File:
      toClosure(parseFastqs(newGZFileStream("stdin")))
    else:
      toClosure(parseFastqs(newFileStream(stdin)))
  else:
    if openGZ: # Error on this line
      toClosure(parseFastqs(newGZFileStream($inFastqFilename)))
    else:
      let fh = open($inFastqFilename)
      toClosure(parseFastqs(newFileStream(fh)))
I run into trouble:
Error: type mismatch: got (iterator (): Fastq{.closure, locks: 0.}) but expected 'iterator (): Fastq{.closure, gcsafe, locks: 0.}'
So here is my second question: What is happening, and what can I do about it?
So, first questions: What are streams, what are they used for?
A Stream is an abstraction for anything that provides an input stream of bytes. A stream of bytes is just a sequence of bytes, obtained one at a time. On top of this, many operations can be implemented, such as reading full lines.
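To make this a bit more concrete, here is a minimal sketch (not taken from your program) in which one proc reads lines from any Stream. The helper name countLines is hypothetical and only used for illustration, and an in-memory StringStream stands in for a file:

```nim
import streams

# Hypothetical helper for illustration: counts the lines in any Stream.
proc countLines(s: Stream): int =
  var line = ""
  # readLine(s, line) returns false once no more data is available.
  while s.readLine(line):
    inc result

# An in-memory stream stands in for a file here.
let inMemory = newStringStream("@read1\nACGT\n+\nIIII\n")
echo countLines(inMemory)
inMemory.close()
```

The same countLines proc should accept a FileStream, or a stream from another module, without changes, because it only relies on the common Stream interface.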
So here is my second question: What is happening, and what can I do about it?
I'm not going to try to explain everything that's happening in your code but instead provide you with a working (reduced) example of how I would go about doing something similar to what you're doing, which is basically trying to pass iterators to procs. (Hopefully I understood your need correctly.)
import os, streams, strutils, zip/gzipfiles

# Convert a stream into an iterator of lines
proc linesIterator(stream: Stream): iterator(): string =
  result = iterator(): string =
    while not stream.atEnd:
      yield stream.readLine()

# Convert an iterator of strings into a seq of strings
# (contrived illustration of passing an iterator to a proc)
proc readAllLines(iter: iterator(): string): seq[string] =
  result = newSeq[string]()
  for line in iter():
    result.add(line)

# Main code
if paramCount() < 1:
  echo "Usage: stream_example [filename]"
  quit(1)

let filename = paramStr(1)
let stream: Stream =
  # endsWith also copes with names shorter than three characters
  if filename.endsWith(".gz"):
    newGZFileStream(filename)
  else:
    newFileStream(filename, fmRead)
if stream == nil:
  echo "Unable to open file: " & filename
  quit(1)

let lines = linesIterator(stream)
let allLines = readAllLines(lines)
for line in allLines:
  echo line
stream.close()
As you can see above, the stream can be either a FileStream or a GzFileStream. The rest of the program doesn't care, because both are streams.
Note that the linesIterator proc returns an iterator instead of being an iterator itself. (It is a closure iterator because it captures variables from the environment, namely the stream parameter, but we don't have to annotate this explicitly; Nim figures it out implicitly.)
PS: You'll need nimble to compile this program because of the dependency on the zip package.
@bli
Perhaps the NimData package will be useful for you. This package has the ability to read data from a gzip archive.
Thanks for your help. Your example helped me get rid of the need to use the toClosure template, and the code using a GZFileStream compiles. (I also had to use newGZFileStream("/dev/stdin") when reading from stdin.)
It seems that the code working with streams is a bit slower than the version that worked on files. Is this expected?
It seems that when I read gzipped data, I have "something more" in the stream than when the input is first decompressed outside of Nim. I noticed this because I found data corresponding to an extra empty Fastq in my output when dealing with gzipped input. I tried to confirm this by inserting assertions:
proc fastqParser(stream: Stream): iterator(): Fastq =
  result = iterator(): Fastq =
    var
      nameLine: string
      nucLine: string
      quaLine: string
    while not stream.atEnd():
    # while not input.endOfFile:
      nameLine = stream.readLine()
      #TODO: Why is there an extra empty Fastq when reading gzipped input?
      doAssert(not stream.atEnd(), "stream ended after nameLine: " & nameLine)
      nucLine = stream.readLine()
      doAssert(not stream.atEnd(), "stream ended after nucLine: " & nucLine)
      discard stream.readLine()
      # quaLine has not been read yet at this point
      doAssert(not stream.atEnd(), "stream ended after separator line")
      quaLine = stream.readLine()
      yield makeFastq(nameLine, nucLine, quaLine)
This results in Error: unhandled exception: not atEnd(stream) stream ended after nameLine: [AssertionError], both when reading from stdin and when reading from a file given on the command line.
The Python version doesn't generate an extra empty Fastq, but maybe the gzipped data is still responsible for the problem, and Python's gzip reading is simply more robust.
because I found data corresponding to an extra empty Fastq
Perhaps because of different line endings between Unix and Windows?
You're reading line by line, while a Stream usually handles data byte by byte.
The gzipped data was generated on Linux by extracting lines from a larger gzipped file, using zcat, head and then gzip. When I feed the program with decompressed data, I use a gunzipped version of this, or zcat it to stdin, and everything is fine, while also reading a Stream line by line.
I filed an issue: https://github.com/nim-lang/zip/issues/26
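In the meantime, one possible workaround is to rely on the Boolean result of readLine(stream, line) instead of checking atEnd before each record. This is only a sketch, and it assumes (unconfirmed in this thread) that the extra record comes from atEnd still being false on the gzipped stream after the last real line has been read. The Fastq type and makeFastq below are stand-ins for the ones in the original program, and sepLine is a name introduced here for the discarded "+" line:

```nim
import streams

# Stand-ins for the Fastq type and makeFastq from the original program.
type Fastq = tuple[name, nucs, quals: string]

proc makeFastq(name, nucs, quals: string): Fastq =
  (name: name, nucs: nucs, quals: quals)

proc fastqParser(stream: Stream): iterator(): Fastq =
  result = iterator(): Fastq =
    var nameLine, nucLine, sepLine, quaLine = ""
    # Stop as soon as any of the four lines cannot be read,
    # instead of trusting atEnd before each record.
    while stream.readLine(nameLine) and stream.readLine(nucLine) and
          stream.readLine(sepLine) and stream.readLine(quaLine):
      yield makeFastq(nameLine, nucLine, quaLine)

# Usage with an in-memory stream; a stream obtained from
# newGZFileStream would be passed in the same way.
let parse = fastqParser(newStringStream("@r1\nACGT\n+\nIIII\n"))
for fq in parse():
  echo fq.name, " ", fq.nucs, " ", fq.quals
```

With this loop, a truncated final record is silently dropped rather than reported, so a stricter parser might want to raise an error in that case.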