It would be nice if readLine were faster. Right now it uses fgetc to read character by character, which is why it's so slow:
proc readLine(f: TFile, line: var TaintedString): bool =
  # of course this could be optimized a bit; but IO is slow anyway...
  # and it was difficult to get this CORRECT with Ansi C's methods
  setLen(line.string, 0) # reuse the buffer!
  while true:
    var c = fgetc(f)
    if c < 0'i32:
      if line.len > 0: break
      else: return false
    if c == 10'i32: break # LF
    if c == 13'i32: # CR
      c = fgetc(f) # is the next char LF?
      if c != 10'i32: ungetc(c, f) # no, put the character back
      break
    add line.string, chr(int(c))
  result = true
The simple solution seemed to be using fgets to read an entire line:
proc fgets(c: cstring, n: int, f: TFile): cstring {.importc: "fgets", header: "<stdio.h>", tags: [FReadIO].}

proc myReadLine(f: TFile, line: var TaintedString): bool =
  var buf {.noinit.}: array[8192, char]
  setLen(line.string, 0)
  result = true
  while true:
    if fgets(buf, 8192, f) == nil:
      result = false
      break
    let l = cstring(buf).len-1
    if buf[l] == '\l':
      buf[l] = '\0'
      add(line, cstring(buf))
      break
    add(line, cstring(buf))
This code only works with line feeds; carriage returns are ignored. Unfortunately fgets is platform dependent in how it handles LF/CR, which could probably be fixed. But in Nimrod we want \l, \r and \r\l to count as newlines, so fgets would read past single \r characters, not regarding them as a newline.
I'm not sure how to solve this. My first idea was to scan the buffer fgets filled, to check for a single \r and then cut off at that point. But what do we do with the rest we have already read in? Seeking back in the file would probably not work for stdin.
My only idea now is to have a buffer for each file and do our own reads into that, so that we can roll our own fgets.
Any ideas?
My only idea now is to have a buffer for each file and do our own reads into that, so that we can roll our own fgets.
That is how you should do it. Store the buffer in the TFile type. Take a look at the implementation for sockets, it buffers recv.
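A minimal sketch of that approach in C, for illustration: keep our own buffer in the file object and treat \n, \r and \r\n all as line terminators. The names (BufFile, buf_getc, buf_readline) are made up here; a Nimrod version would hang the buffer off TFile instead, as suggested above.

```c
/* Hand-rolled "fgets" over our own buffer, so that \n, \r and \r\n
 * all terminate a line. Names are illustrative, not Nimrod's. */
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 8192

typedef struct {
    FILE *file;
    unsigned char buffer[BUF_SIZE];
    size_t pos, len;            /* read position and fill level */
} BufFile;

/* refill the buffer when exhausted; return next byte or EOF */
static int buf_getc(BufFile *bf) {
    if (bf->pos >= bf->len) {
        bf->len = fread(bf->buffer, 1, BUF_SIZE, bf->file);
        bf->pos = 0;
        if (bf->len == 0) return EOF;
    }
    return bf->buffer[bf->pos++];
}

/* peek without consuming: easy because the buffer is ours */
static int buf_peekc(BufFile *bf) {
    int c = buf_getc(bf);
    if (c != EOF) bf->pos--;    /* step back inside our own buffer */
    return c;
}

/* Read one line into out (capacity cap); returns 0 at EOF, 1 otherwise.
 * \n, \r and \r\n all end a line and are not stored. */
static int buf_readline(BufFile *bf, char *out, size_t cap) {
    size_t n = 0;
    int c = buf_getc(bf);
    if (c == EOF) return 0;
    while (c != EOF && c != '\n' && c != '\r') {
        if (n + 1 < cap) out[n++] = (char)c;
        c = buf_getc(bf);
    }
    if (c == '\r' && buf_peekc(bf) == '\n')
        buf_getc(bf);           /* swallow the \n of a \r\n pair */
    out[n] = '\0';
    return 1;
}
```

Because the buffer is our own, "putting a character back" is just decrementing pos, with none of ungetc's one-byte limitation.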
This implementation isn't actually any faster for me for files with relatively short lines (such as /usr/share/dict/words).
The underlying problem is that operations on ANSI C file objects lock the file object before the operation begins and after it ends. While this prevents crashes from multiple threads operating on the same file object (important especially for stdout/stderr), it's a waste of time when operating on a file object that only one thread ever accesses. For such files, one can use getc_unlocked() in lieu of fgetc() to obtain a measurable speedup (this won't work on Windows, because this is part of POSIX, not ANSI C).
Note that allocating space for the lines in memory may also cause some additional overhead.
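For the record, the getc_unlocked suggestion looks roughly like this in C (getc_unlocked itself is real POSIX; the function readline_unlocked is an illustrative name). It is only safe when a single thread touches the FILE*, which is exactly the case being discussed:

```c
/* Read one '\n'-terminated line without per-character locking.
 * getc_unlocked() is POSIX, not ANSI C, so not available on plain
 * Windows. Safe only if one thread uses this FILE*.
 * Returns the number of bytes stored, or -1 at EOF. */
#include <stdio.h>

static int readline_unlocked(FILE *f, char *out, int cap) {
    int c, n = 0;
    while ((c = getc_unlocked(f)) != EOF && c != '\n')
        if (n + 1 < cap) out[n++] = (char)c;   /* keep room for NUL */
    out[n] = '\0';
    return (c == EOF && n == 0) ? -1 : n;
}
```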
@Jehan: Unfortunately it's still slower for me in most cases, even with getc_unlocked
I just started a small buffered implementation, comments welcome: https://github.com/def-/nimrod-unsorted/blob/master/bufferedfile.nim
The important part:
while stdin.readLine(line): # 4.31 s
for line in bstdin: # 1.72 s (because the string is copied every time, TODO: how to prevent?)
while bstdin.readLine(line): # 0.74 s
for line in bstdin: # 1.72 s (because the string is copied every time, TODO: how to prevent?)
The compiler doesn't like top level statements, put it into a main proc and the copy disappears. (If not, submit a bug report.)
The compiler doesn't like top level statements, put it into a main proc and the copy disappears. (If not, submit a bug report.)
I heard of that a few times, but have never actually observed it. Thanks, it's fast now.
The compiler doesn't like top level statements, put it into a main proc
The Nimrod compiler itself I guess -- interesting, I should remember this when testing.
The buffered example is interesting...
My first idea for reading strings which may be terminated by CRLF, LF or CR was reading chars until we have a CR or LF, and then always one more. When this additional character is neither CR nor LF, deliver it as the first character of the next read. I may try that just for fun as an exercise some day... (Of course it is not that simple; we can have several empty lines following each other, so some more logic is necessary...)
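That "always one more" idea can be sketched in C with ungetc, which guarantees exactly one byte of pushback (enough here); readline_peek is an illustrative name, and empty lines fall out naturally:

```c
/* Read until CR or LF, then read "always one more" byte and push it
 * back with ungetc() if it does not complete a \r\n pair.
 * Returns 0 at EOF, 1 otherwise. */
#include <stdio.h>

static int readline_peek(FILE *f, char *out, int cap) {
    int c, n = 0;
    c = fgetc(f);
    if (c == EOF) return 0;
    while (c != EOF && c != '\n' && c != '\r') {
        if (n + 1 < cap) out[n++] = (char)c;
        c = fgetc(f);
    }
    if (c == '\r') {                  /* the "one more" character */
        c = fgetc(f);
        if (c != '\n' && c != EOF)
            ungetc(c, f);             /* not a newline: give it back */
    }
    out[n] = '\0';
    return 1;
}
```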
For the locking problem mentioned by Jehan -- may it be possible to lock a file for a longer period, and then use getc_unlocked() for fast reading characters?
PS: Recently I found something called "Nimrod by example" -- is there a good reason that it is not mentioned on nimrod-lang.org?
For the locking problem mentioned by Jehan -- may it be possible to lock a file for a longer period, and then use getc_unlocked() for fast reading characters?
flockfile, ftrylockfile, funlockfile
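The pattern behind those three POSIX calls: take the stream lock once around the whole read loop, use the _unlocked getters inside it, and release it afterwards (ftrylockfile is the non-blocking variant). A tiny sketch, with sum_line_lengths as an illustrative name:

```c
/* Lock the stream once, then read lock-free inside the loop.
 * flockfile/funlockfile and getc_unlocked are POSIX, not ANSI C. */
#include <stdio.h>

static long sum_line_lengths(FILE *f) {
    long total = 0;
    int c;
    flockfile(f);                      /* lock once for the whole loop */
    while ((c = getc_unlocked(f)) != EOF)
        if (c != '\n') total++;        /* count non-newline bytes */
    funlockfile(f);                    /* release when done */
    return total;
}
```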
Stefan Salewski wrote:
My first idea for reading strings which may be terminated by CRFL, LF or CR was reading chars until we have a CR or LF and always one more.
Oh I see, that can never work: when we have old files which use only CR for line termination, fgets will not like that.
So I took a second look at your code:
const
  bufferSize* = 8192 ## size of a buffered file's buffer

type
  BufferedFile* = object
    file*: TFile
    buffer*: array[0..bufferSize, char] # TODO: Make this noinit
Is it really your intent that buffer has an odd size? Or did you mean array[0..bufferSize-1, char]? I would be surprised if we really do it that way in Nimrod.
Is it really your intent that buffer has an odd size? Or did you mean array[0..bufferSize-1, char]? I would be surprised if we really do it that way in Nimrod.
Thanks, I did array[bufferSize, char] now