It would be nice if readLine were faster. Right now it uses fgetc to read character by character, which is why it's so slow:
proc readLine(f: TFile, line: var TaintedString): bool =
  # of course this could be optimized a bit; but IO is slow anyway...
  # and it was difficult to get this CORRECT with Ansi C's methods
  setLen(line.string, 0) # reuse the buffer!
  while true:
    var c = fgetc(f)
    if c < 0'i32:
      if line.len > 0: break
      else: return false
    if c == 10'i32: break # LF
    if c == 13'i32: # CR
      c = fgetc(f) # is the next char LF?
      if c != 10'i32: ungetc(c, f) # no, put the character back
      break
    add line.string, chr(int(c))
  result = true
The simple solution seemed to be using fgets to read an entire line:
proc fgets(c: cstring, n: int, f: TFile): cstring {.importc: "fgets", header: "<stdio.h>", tags: [FReadIO].}

proc myReadLine(f: TFile, line: var TaintedString): bool =
  var buf {.noinit.}: array[8192, char]
  setLen(line.string, 0)
  result = true
  while true:
    if fgets(buf, 8192, f) == nil:
      result = false
      break
    let l = cstring(buf).len-1
    if buf[l] == '\l':
      buf[l] = '\0'
      add(line, cstring(buf))
      break
    add(line, cstring(buf))
This code only works with line feeds; carriage returns are ignored. Unfortunately fgets is platform dependent in how it handles LF/CR, which could probably be fixed. But in Nimrod we want \l, \r and \r\l to count as newlines, so fgets would read past single \r characters, not regarding them as a newline.
I'm not sure how to solve this. My first idea was to scan the buffer fgets filled, to check for a single \r and then cut off at that point. But what do we do with the rest we have already read in? Seeking back in the file would probably not work for stdin.
My only idea now is to have a buffer for each file and do our own reads into that, so that we can roll our own fgets.
Any ideas?
My only idea now is to have a buffer for each file and do our own reads into that, so that we can roll our own fgets.
That is how you should do it. Store the buffer in the TFile type. Take a look at the implementation for sockets, it buffers recv.
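A minimal sketch of that approach in C, for illustration: keep our own buffer in the file object and treat \n, \r and \r\n all as line terminators. The names (BufFile, buf_getc, buf_readline) are made up here; a Nimrod version would hang the buffer off TFile instead, as suggested above.

```c
/* Hand-rolled "fgets" over our own buffer, so that \n, \r and \r\n
 * all terminate a line. Names are illustrative, not Nimrod's. */
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 8192

typedef struct {
    FILE *file;
    unsigned char buffer[BUF_SIZE];
    size_t pos, len;            /* read position and fill level */
} BufFile;

/* refill the buffer when exhausted; return next byte or EOF */
static int buf_getc(BufFile *bf) {
    if (bf->pos >= bf->len) {
        bf->len = fread(bf->buffer, 1, BUF_SIZE, bf->file);
        bf->pos = 0;
        if (bf->len == 0) return EOF;
    }
    return bf->buffer[bf->pos++];
}

/* peek without consuming: easy because the buffer is ours */
static int buf_peekc(BufFile *bf) {
    int c = buf_getc(bf);
    if (c != EOF) bf->pos--;    /* step back inside our own buffer */
    return c;
}

/* Read one line into out (capacity cap); returns 0 at EOF, 1 otherwise.
 * \n, \r and \r\n all end a line and are not stored. */
static int buf_readline(BufFile *bf, char *out, size_t cap) {
    size_t n = 0;
    int c = buf_getc(bf);
    if (c == EOF) return 0;
    while (c != EOF && c != '\n' && c != '\r') {
        if (n + 1 < cap) out[n++] = (char)c;
        c = buf_getc(bf);
    }
    if (c == '\r' && buf_peekc(bf) == '\n')
        buf_getc(bf);           /* swallow the \n of a \r\n pair */
    out[n] = '\0';
    return 1;
}
```

Because the buffer is our own, "putting a character back" is just decrementing pos, with none of ungetc's one-byte limitation.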
This implementation isn't actually any faster for me for files with relatively short lines (such as /usr/share/dict/words).
The underlying problem is that operations on ANSI C file objects lock the file object before the operation begins and after it ends. While this prevents crashes from multiple threads operating on the same file object (important especially for stdout/stderr), it's a waste of time when operating on a file object that only one thread ever accesses. For such files, one can use getc_unlocked() in lieu of fgetc() to obtain a measurable speedup (this won't work on Windows, because this is part of POSIX, not ANSI C).
Note that allocating space for the lines in memory may also cause some additional overhead.
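For the record, the getc_unlocked suggestion looks roughly like this in C (getc_unlocked itself is real POSIX; the function readline_unlocked is an illustrative name). It is only safe when a single thread touches the FILE*, which is exactly the case being discussed:

```c
/* Read one '\n'-terminated line without per-character locking.
 * getc_unlocked() is POSIX, not ANSI C, so not available on plain
 * Windows. Safe only if one thread uses this FILE*.
 * Returns the number of bytes stored, or -1 at EOF. */
#include <stdio.h>

static int readline_unlocked(FILE *f, char *out, int cap) {
    int c, n = 0;
    while ((c = getc_unlocked(f)) != EOF && c != '\n')
        if (n + 1 < cap) out[n++] = (char)c;   /* keep room for NUL */
    out[n] = '\0';
    return (c == EOF && n == 0) ? -1 : n;
}
```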
@Jehan: Unfortunately it's still slower for me in most cases, even with getc_unlocked
I just started a small buffered implementation, comments welcome: https://github.com/def-/nimrod-unsorted/blob/master/bufferedfile.nim
The important part:
while stdin.readLine(line): # 4.31 s
for line in bstdin: # 1.72 s (because the string is copied every time, TODO: how to prevent?)
while bstdin.readLine(line): # 0.74 s
for line in bstdin: # 1.72 s (because the string is copied every time, TODO: how to prevent?)
The compiler doesn't like top level statements, put it into a main proc and the copy disappears. (If not, submit a bug report.)
The compiler doesn't like top level statements, put it into a main proc and the copy disappears. (If not, submit a bug report.)
I heard of that a few times, but have never actually observed it. Thanks, it's fast now.
The compiler doesn't like top level statements, put it into a main proc
The Nimrod compiler itself I guess -- interesting, I should remember this when testing.
The buffered example is interesting...
My first idea for reading strings which may be terminated by CRLF, LF or CR was reading chars until we have a CR or LF, and then always one more. When this additional character is neither CR nor LF, deliver it as the first character of the next read. I may try that just for fun as an exercise some day... (Of course it is not that simple; we can have several empty lines following each other, so some more logic is necessary...)
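That "always one more" idea can be sketched in C with ungetc, which guarantees exactly one byte of pushback (enough here); readline_peek is an illustrative name, and empty lines fall out naturally:

```c
/* Read until CR or LF, then read "always one more" byte and push it
 * back with ungetc() if it does not complete a \r\n pair.
 * Returns 0 at EOF, 1 otherwise. */
#include <stdio.h>

static int readline_peek(FILE *f, char *out, int cap) {
    int c, n = 0;
    c = fgetc(f);
    if (c == EOF) return 0;
    while (c != EOF && c != '\n' && c != '\r') {
        if (n + 1 < cap) out[n++] = (char)c;
        c = fgetc(f);
    }
    if (c == '\r') {                  /* the "one more" character */
        c = fgetc(f);
        if (c != '\n' && c != EOF)
            ungetc(c, f);             /* not a newline: give it back */
    }
    out[n] = '\0';
    return 1;
}
```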
For the locking problem mentioned by Jehan -- may it be possible to lock a file for a longer period, and then use getc_unlocked() for fast reading characters?
PS: Recently I found something called "Nimrod by example" -- is there a good reason that it is not mentioned on nimrod-lang.org?
For the locking problem mentioned by Jehan -- may it be possible to lock a file for a longer period, and then use getc_unlocked() for fast reading characters?
flockfile, ftrylockfile, funlockfile
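The pattern behind those three POSIX calls: take the stream lock once around the whole read loop, use the _unlocked getters inside it, and release it afterwards (ftrylockfile is the non-blocking variant). A tiny sketch, with sum_line_lengths as an illustrative name:

```c
/* Lock the stream once, then read lock-free inside the loop.
 * flockfile/funlockfile and getc_unlocked are POSIX, not ANSI C. */
#include <stdio.h>

static long sum_line_lengths(FILE *f) {
    long total = 0;
    int c;
    flockfile(f);                      /* lock once for the whole loop */
    while ((c = getc_unlocked(f)) != EOF)
        if (c != '\n') total++;        /* count non-newline bytes */
    funlockfile(f);                    /* release when done */
    return total;
}
```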
Stefan Salewski wrote:
My first idea for reading strings which may be terminated by CRFL, LF or CR was reading chars until we have a CR or LF and always one more.
Oh I see, that can never work: when we have old files which use only CR for line termination, fgets will not like that.
So I took a second look at your code:
const
  bufferSize* = 8192 ## size of a buffered file's buffer

type
  BufferedFile* = object
    file*: TFile
    buffer*: array[0..bufferSize, char] # TODO: Make this noinit
Is it really your intent that buffer has an odd size? Or did you mean array[0..bufferSize-1, char]? I would be surprised if we really do it that way in Nimrod.
Is it really your intent that buffer has an odd size? Or did you mean array[0..bufferSize-1, char]? I would be surprised if we really do it that way in Nimrod.
Thanks, I did array[bufferSize, char] now