Hi,
I posted a thread on Reddit but it got no reaction, so I'm posting it (with additions) more directly here.
Here is the original post (post + 2 comments of mine, which are the only comments :-) ): <https://www.reddit.com/r/nim/comments/5sbq93/question_about_standard_library_file_functions/> . I'll repeat it in sections A and B.
readChar() is defined in sysio.nim as:
proc readChar*(f: File): char = result = char(c_fgetc(f))
If I understood correctly, Nim's char is an unsigned 8-bit type, like C's unsigned char. But fgetc() returns an int, which holds the value of an unsigned char if a character could be read, or EOF (== -1 in most implementations) to indicate that either an error occurred or the end of file was reached.
So we lose that information. If the conversion behaves like a C cast, it means we will get an invalid character and believe it is valid; if Nim's conversions are checked differently, I guess the conversion will crash.
I imagine that that function should at least test if the result is EOF, and signal it. Then, better, determine and differentiate if it is an error (and raise an exception) or a real end-of-file.
$ echo -n "a" > file1char.txt
Check that it is really one char long:
$ stat -c "%s" file1char.txt
1
var f = open("file1char.txt")
var c: char
try:
  c = readChar(f)
except:
  echo "First readChar raised an exception"
echo c
try:
  c = readChar(f)
except:
  echo "Second readChar raised an exception"
echo c
$ ./test_readChar
a
ÿ
No exception was raised. It returned a "valid" character on the second read... for some definition of "valid" :-)
readLine() has probably problems with errors and end-of-files too. At some point, it does:
if c_fgets(addr line.string[pos], sp, f) == nil:
  line.string.setLen(0)
  return false
First problem is for the overloaded readLine() function that is defined as:
proc readLine(f: File): TaintedString =
  result = TaintedString(newStringOfCap(80))
  if not readLine(f, result): raiseEIO("EOF reached")
So it raises an exception signalling EOF every time the other readLine() returns false. And the other readLine() returns false every time fgets() returns NULL. Problem: fgets() returns NULL not only when end-of-file is reached, but also when an error happens. Nim's readLine() will hide it and say end-of-file was reached when it was not.
Second problem: c_fgets() can be called several times in a loop, if the first call didn't reach an end-of-line character, in order to fetch lines that are longer than the allocated buffer. What happens when the file ends right after the characters just read? The previous call went fine and no end-of-line was detected in the string, so readLine() tries to read more text by calling fgets() again; fgets() immediately hits EOF and returns NULL, and readLine() exits by clearing the string it had read and returning false. The string is lost.
I don't think that is the expected behaviour.
(There is nothing special about the number 9 that is picked there, it could be 13 or 40 or whatever. The purpose is just to set one fixed value.)
$ echo -n "abcdefghi" > file9char.txt
Check that it is really 9 chars long:
$ stat -c "%s" file9char.txt
9
var f = open("file9char.txt")
var line = newString(10)
try:
  let b = readLine(f, line)
  echo "b=", b
except:
  echo "First readLine raised an exception"
echo line
try:
  line = readLine(f)
  let b = readLine(f, line)
  echo "b=", b
except:
  echo "Second readLine raised an exception"
echo line
$ ./test_readLine
b=false
Second readLine raised an exception
The second readLine raised an exception. OK.
But the first one returned false and the string line is empty. That's not OK.
If you allocate a string of a different length with newString() (for example 9 or 11), you get the expected result:
$ ./test_readLine
b=true
abcdefghi
Second readLine raised an exception
abcdefghi
(Well, I am not sure line shouldn't rather be cleared after the second readLine(), but that's yet another matter.)
Is raising an IOError for an end-of-file a good idea? Is reaching the end of file an error, or a normal and expected behaviour?
We cannot rely on fread()'s return value to differentiate between an end-of-file and an error: a short read can mean both.
The values for the whence argument of fseek() are hardcoded. Unless I am mistaken, neither the C standard nor POSIX fixes these values, so they could be anything. It would be better to define SEEK_CUR and the other two constants in a platform-dependent module, and use these constants in the library instead of hardcoded integer literals.
Sorry for the lengthy post :-)
readChar shouldn't be used for anything as it's too slow (usually needs to acquire a lock).
I believe my original use case was parsing e-mails: the lines can be "folded" (if a line starts with whitespace, it is the continuation of the previous one). So in order to unfold them, I would read a line, normally with readLine(), then readChar() the next character to act accordingly, and then seek back one character with setFilePos(). I imagine performance would not have been a problem for that.
And the other reason that had me look inside the code of those functions is that I was looking for a function with similar functionality as POSIX's getdelim(): being able to specify the new-line character or string, because readLine() & co make their own assumption about it, that may not always suit a specific use case (of course it still suits a large majority of use cases).
That said, "EOF" was considered to be the same error as "error during read" (since in both cases not much can be done about the situation)
Thing is, there is nothing to be done when you reach the end of the file; it is a perfectly normal situation. We just have to know that it is the end, so that we don't needlessly try to read further, and we can say the reading process is over and proceed to the next step.
By contrast, if there is a real reading error, it means that the file has been deleted, is unreachable, was modified by another program, etc.: something serious, unexpected and abnormal. Action must be taken: inform the user, retry, exit, recreate a file, etc.
What do you think about this:
import macros

# some file mocking
const EOF = -1'i32
var i = 0
let data = [int32('H'), int32('i'), EOF]

proc c_fgetc(f: File): int32 =
  result = data[i]
  i += 1

# here the interesting part starts
type ReadCharResult = distinct int32

proc eof(arg: ReadCharResult): bool =
  int32(arg) == EOF

converter char(arg: ReadCharResult): char =
  if arg.eof:
    raise newException(IOError, "end of file reached")
  else:
    result = system.char(int32(arg))

proc readChar*(f: File): ReadCharResult =
  result = ReadCharResult(c_fgetc(f))

macro matchChar(arg: ReadCharResult): bool =
  arg.expectKind nnkSym
  let idt = ident(arg.repr)
  result = quote do:
    (var `idt`: system.char = cast[system.char](`arg`); not `arg`.eof)
  result = result[0]

var file: File
while true:
  let v = file.readChar
  if matchChar(v):
    echo "got a char: ", v
  else:
    echo "end of file :'("
    break

i = 0 # just rewinding the fake file
let v0: char = file.readChar
echo "assigned char works: ", v0
let v1: char = file.readChar
echo "assigned char works: ", v1
let v2: char = file.readChar # fails as expected, because there is no third char in stream
echo "assigned char works: ", v2
It introduces a result type, and a converter, so that old code can still work without adjustments. Not sure if you like it though.
I am not very happy with the names I have chosen, if you have any better alternatives for the following names, I am very open to that.
char (for the converter)
matchChar
ReadCharResult
@cblake
I never use mmap() so that's not a reflex I have, but why not.
This being said, it seems that memSlices only takes a char as delimiter, and not a string. That means that if I want to match only CRLF ("\r\l" in Nim-speak, if I am not mistaken), I cannot do it with that function.
@Krux02
I am not competent with all those converter, macro, quote, etc. so maybe your message was more intended for Araq :-)
Nevertheless, concerning the content, your solution seems to fix the problem of returning invalid characters as valid. But there is still the question that concerns the whole family of file IO functions: the Nim user cannot tell if it is really an end-of-file or if an error happened. The library functions should (IMO), when they get an EOF as return value (or any other error+end-of-file indicator), check what really happened with ferror() or feof(). And they probably need to raise 2 different exceptions depending on that distinction.
As for such an iterator being in the standard lib memfiles, well, "line ending agnosticism" rather than strictness is generally Araq's preference, but it never hurts to ask.
It would be to lines() what split() is to splitLines() :-)
Oh, and when you can use it, memory mapping is typically faster/easier than buffered IO.
In the case where I just want to read the beginning of the file (e.g. read only the headers of an e-mail), if the OS prefetches 128 kB when I just need to read a few kB at most, I doubt it will be faster. (That's a wild guess of mine, I did not do any measurement, and as I said I have no experience of mmap().)
@BossHogg: converter is a function that gets implicitly injected in places where the argument type is available but the result type of the converter is expected. Namely, it is called in this line:
let v0: char = file.readChar
the macro with the quote creates an expression that reintroduces a symbol with the same name as the argument symbol; that shadows the old symbol, and the new symbol will now have the type char. The value of that expression is a boolean, so that it can be used in if statements:
let v = file.readChar
if matchChar(v):
echo "got a char: ", v
you see that matchChar is in an if statement, and that branch is only taken when v actually contains a value. The v that is used as the argument of echo is not the same as the one declared in the let statement; it is a new one that actually has the type char.

I was thinking more of something along these lines that wraps the existing memSlices than full-on string splitting.
import memfiles

iterator linesDOS*(mf: MemFile, buf: var TaintedString): TaintedString =
  var ms: MemSlice
  for msNew in memSlices(mf, eat='\0'): # strict \n-terminated
    if cast[cstring](msNew.data)[msNew.size - 1] == '\r': # \r\n-terminated
      if ms.data != nil:
        ms.size += msNew.size + 1 # +1 for the \n
      else:
        ms = msNew
      buf.setLen(ms.size)
      copyMem(addr(buf[0]), ms.data, ms.size)
      buf[ms.size] = '\0'
      yield buf
      ms.data = nil # Mark ms empty
    else: # Missing \r; accumulate
      if ms.data != nil:
        ms.size += msNew.size + 1
      else: # First accumulant; initialize
        ms = msNew
  if ms.data != nil:
    echo "There is a final unterminated line..Uh..."

iterator linesDOS*(mfile: MemFile): TaintedString {.inline.} =
  var buf = TaintedString(newStringOfCap(80))
  for line in linesDOS(mfile, buf):
    yield buf

when isMainModule:
  for line in linesDOS(memfiles.open("/dev/stdin")):
    echo "line: ", line
Arguably, this might be better factored into a memSlicesDOS iterator returning the aggregated MemSlice and then two other iterators with a provided buffer variable or no provided buffer. And obviously, you want a more general strategy for incomplete final lines.

@cblake
Yes, with this method, I also hit the problem of the last line. Rewriting memSlices is probably not much longer anyway.
There is no isLast or hasNext for iterators in Nim, is there? BTW, does Nim have something to ease processing the first and/or last element in a for loop (they often need a bit different processing than other elements) ?
Here's what I've done with an overloading of memSlices:
iterator memSlices*(mfile: MemFile, delim: string): MemSlice {.inline.} =
  let delim_len = delim.len
  let delim_start: pointer = cast[cstring](delim)
  var ms: MemSlice
  ms.data = mfile.mem
  var remaining = mfile.size
  while remaining > 0:
    var ending_index = -1
    for i in 0..(remaining-delim_len):
      if equalMem(cast[pointer](cast[int](ms.data) +% i), delim_start, delim_len):
        ending_index = i
        break
    if ending_index < 0: # unterminated final slice
      ms.size = remaining # Weird case..check eat?
      yield ms
      break
    ms.size = ending_index # delim is NOT included
    yield ms
    ms.data = cast[pointer](cast[int](ms.data) +% ending_index +% delim_len) # skip delim
    remaining = mfile.size - (cast[int](ms.data) - cast[int](mfile.mem))

iterator lines*(mfile: MemFile, buf: var TaintedString, delim: string): TaintedString {.inline.} =
  for ms in memSlices(mfile, delim):
    buf.setLen(ms.size)
    copyMem(addr(buf[0]), ms.data, ms.size)
    buf[ms.size] = '\0'
    yield buf
Yeah. I haven't tried out your code, but an API idea might be to have the caller instruct the iterator what to do with any improper/undelimited record at the end -- includeUnterminated=False or something. Only the caller really knows what is desired.
One observation, algo performance-wise, is that the fastest substring searches are usually achieved by using a fast libc SSE/AVX vectorized memchr to find the least likely character in the substring. Sometimes one knows this from vague properties of the distribution. For example, framing the email messages in UUCP-style mailbox with '^From ' kinds of delimiting, memchr for the capital 'F' skips over most text most of the time. This idea may not work in a corpus of emails in ALL CAPS ALL THE TIME, but such a corpus should be rare. :-) Anyway, this idea is just sort of the simplest/first step into a wider world of efficient substring search algorithms that you might enjoy reading about some day.
Mmmyeah... or I'll call grep :-D (seriously, I did it for another program because it was faster than my lousy simplistic algorithms.)
Some greps can indeed be quite good. :-)
Really, I think this algorithmic intelligence should be able to be just used from Nim strutils. Nim strings know their length. Many/most of those read-only procs like strutils.find "should" be able to work on types like MemSlice or really some kind of openarray[char] -- a pointer and length pair. Scare quotes for the "should" since the current Nim openarray does not include arrays of dynamic/run-time extent but with non-heap/GC backing memory like a memory map, or some other blob of memory.
There is ongoing work on concepts that could probably profitably replace openarray in many use cases. So, maybe some day in the not too distant future, as they say. Concept support may already be good enough to start an ambitious project along those lines, but yeah, for you/your situation, just re-doing memSlices is a fine approach, esp. if absolute performance is not a priority.
@Araq
So, I read the version from:
https://github.com/nim-lang/Nim/blob/devel/lib/system/sysio.nim
I hope that was the right one.
proc checkErr(f: File) =
  if c_ferror(f) != 0:
    c_clearerr(f)
    raiseEIO("Unknown IO Error")

proc readBuffer(f: File, buffer: pointer, len: Natural): int =
  result = c_fread(buffer, 1, len, f)
  checkErr(f)

proc readChar(f: File): char =
  let x = c_fgetc(f)
  if x == -1: raiseEOF()
  checkErr(f)
  result = char(x)
I did not check readLine(); I think it better to clarify our thoughts first:
The way most stdio functions work is the following: they have a way to say something special happened. But they only have a single way for both end-of-file and IO error. So the way to go is:
I guess 3. could be a function like your checkErr() but then it should always raise something because you call it when you already know something special happened, you just don't know yet which special thing.
NB: I don't know what target Nim aims at (C89? C99? POSIX? the same ones with extensions? some specific platform?), but EOF is not defined as -1 in C99 standard, just as a negative value. So it should be a platform-dependent #define (well, the equivalent for Nim), not a hardcoded value; or if the normal return value is always positive, you can check for negative values.
Won't raise an exception for EOF.
Nor should it, EOF is only raised if the proc interface was so poor it couldn't signal it in any other way. Eventually we will get io.nim that uses a proper sum type for this.
I guess 3. could be a function like your checkErr() but then it should always raise something because you call it when you already know something special happened, you just don't know yet which special thing.
Ok, that's helpful, thanks.
NB: I don't know what target Nim aims at (C89? C99? POSIX? the same ones with extensions? some specific platform?), but EOF is not defined as -1 in C99 standard, just as a negative value. So it should be a platform-dependent #define (well, the equivalent for Nim), not a hardcoded value; or if the normal return value is always positive, you can check for negative values.
We target the supported C compilers (Visual, clang, gcc), not the C "standard", so I don't care much about "-1 in theory could be something else".
Nor should it, EOF is only raised if the proc interface was so poor it couldn't signal it in any other way.
That's not very regular, but as long as the documentation clearly states it, I guess it's OK.
NB: I don't know what target Nim aims at (C89? C99? POSIX? the same ones with extensions? some specific platform?), but EOF is not defined as -1 in C99 standard, just as a negative value. So it should be a platform-dependent #define (well, the equivalent for Nim), not a hardcoded value; or if the normal return value is always positive, you can check for negative values.
We target the supported C compilers (Visual, clang, gcc), not the C "standard", so I don't care much about "-1 in theory could be something else".
(That's not a matter of compiler but of library (glibc, musl, etc.).)
I cannot spot any obvious mistake in the read* functions any more.
What about the write* functions now? Starting with
proc write(f: File, i: int) =
  when sizeof(int) == 8:
    c_fprintf(f, "%lld", i)
  else:
    c_fprintf(f, "%ld", i)
This proc and the following ones ignore the return values of fprintf() and similar functions, and they don't call checkErr() either.