Hi,
I posted a thread on Reddit but it got no reaction, so I'm posting it (with additions) more directly here.
Here is the original post (post + 2 comments of mine, which are the only comments :-) ): <https://www.reddit.com/r/nim/comments/5sbq93/question_about_standard_library_file_functions/> . I'll repeat it in sections A and B.
readChar() is defined in sysio.nim as:
proc readChar*(f: File): char = result = char(c_fgetc(f))
If I understood correctly, Nim's char is an unsigned 8-bit type, like C's unsigned char. But fgetc() returns an int, which holds the value of an unsigned char if a character could be read, or EOF (== -1 in most implementations) to indicate that either an error occurred or the end of file was reached.
So we lose that information. If the conversion behaves like a C cast, it means we will get an invalid character and believe it is valid; if Nim's conversions are checked differently, I guess the conversion will crash.
I imagine that that function should at least test if the result is EOF, and signal it. Then, better, determine and differentiate if it is an error (and raise an exception) or a real end-of-file.
$ echo -n "a" > file1char.txt
Check that it is really one char long:
$ stat -c "%s" file1char.txt
1
var f = open("file1char.txt")
var c: char
try:
  c = readChar(f)
except:
  echo "First readChar raised an exception"
echo c
try:
  c = readChar(f)
except:
  echo "Second readChar raised an exception"
echo c
$ ./test_readChar
a
ÿ
No exception was raised. It returned a "valid" character on the second read... for some definition of "valid" :-)
readLine() has probably problems with errors and end-of-files too. At some point, it does:
if c_fgets(addr line.string[pos], sp, f) == nil:
  line.string.setLen(0)
  return false
First problem is for the overloaded readLine() function that is defined as:
proc readLine(f: File): TaintedString =
  result = TaintedString(newStringOfCap(80))
  if not readLine(f, result): raiseEIO("EOF reached")
So it raises an exception signalling EOF every time the other readLine() returns false. And the other readLine() returns false every time fgets() returns NULL. Problem: fgets() returns NULL not only when end-of-file is reached, but also when an error happens. Nim's readLine() will hide it and say end-of-file was reached when it was not.
Second problem: c_fgets() can be called several times in a loop, if the first call didn't reach an end-of-line character, in order to fetch lines that are longer than the allocated buffer. What happens when the file ends right after the characters just read? The previous call went fine and no end-of-line was detected in the string, so readLine() tries to read more text by calling fgets() again; fgets() immediately hits EOF and returns NULL, and readLine() exits by clearing the string it had read and returning false. The string is lost.
I don't think that is the expected behaviour.
(There is nothing special about the number 9 that is picked there, it could be 13 or 40 or whatever. The purpose is just to set one fixed value.)
$ echo -n "abcdefghi" > file9char.txt
Check that it is really 9 chars long:
$ stat -c "%s" file9char.txt
9
var f = open("file9char.txt")
var line = newString(10)
try:
  let b = readLine(f, line)
  echo "b=", b
except:
  echo "First readLine raised an exception"
echo line
try:
  line = readLine(f)
  let b = readLine(f, line)
  echo "b=", b
except:
  echo "Second readLine raised an exception"
echo line
$ ./test_readLine
b=false
Second readLine raised an exception
The second readLine raised an exception. OK.
But the first one returned false and the string line is empty. That's not OK.
If you allocate a string of a different length with newString() (for example 9 or 11), you get the expected result:
$ ./test_readLine
b=true
abcdefghi
Second readLine raised an exception
abcdefghi
(Well, I am not sure line shouldn't rather be cleared after the second readLine(), but that's yet another matter.)
Is raising an IOError for an end-of-file a good idea? Is reaching the end of file an error, or a normal and expected behaviour?
We cannot rely on fread()'s return value to differentiate between an end-of-file and an error: a short read can mean both.
The values for the whence argument of fseek() are hardcoded. Unless I am mistaken, neither the C standard nor POSIX fixes these values, so they could be anything. It would be better to define SEEK_CUR and the other two constants in a platform-dependent module, and use these constants in the library instead of hardcoded integer literals.
Sorry for the lengthy post :-)
readChar shouldn't be used for anything as it's too slow (usually needs to acquire a lock).
I believe my original use case was parsing e-mails: the lines can be "folded" (if a line starts with whitespace, it is the continuation of the previous one). So in order to unfold them, I would read a line, normally with readLine(), then readChar() the next character to act accordingly, and then seek back one character with setFilePos(). I imagine performance would not have been a problem for that.
And the other reason that had me look inside the code of those functions is that I was looking for a function with similar functionality as POSIX's getdelim(): being able to specify the new-line character or string, because readLine() & co make their own assumption about it, that may not always suit a specific use case (of course it still suits a large majority of use cases).
That said, "EOF" was considered to be the same error as "error during read" (since in both cases not much can be done about the situation)
Thing is, there is nothing to be done when you reach the end of the file; it is a perfectly normal situation. We just have to know that it is the end, so that we don't needlessly try to read further, and we can say the reading process is over and proceed to the next step.
By contrast, if there is a real reading error, it means that the file has been deleted, is unreachable, was modified by another program, etc.: something serious, unexpected and abnormal. Action must be taken: inform the user, retry, exit, recreate a file, etc.
What do you think about this:
import macros

# some file mocking
const EOF = -1'i32
var i = 0
let data = [int32('H'), int32('i'), EOF]

proc c_fgetc(f: File): int32 =
  result = data[i]
  i += 1

# here the interesting part starts
type ReadCharResult = distinct int32

proc eof(arg: ReadCharResult): bool =
  int32(arg) == EOF

converter char(arg: ReadCharResult): char =
  if arg.eof:
    raise newException(IOError, "end of file reached")
  else:
    result = system.char(int32(arg))

proc readChar*(f: File): ReadCharResult =
  result = ReadCharResult(c_fgetc(f))

macro matchChar(arg: ReadCharResult): bool =
  arg.expectKind nnkSym
  let idt = ident(arg.repr)
  result = quote do:
    (var `idt`: system.char = cast[system.char](`arg`); not `arg`.eof)
  result = result[0]

var file: File
while true:
  let v = file.readChar
  if matchChar(v):
    echo "got a char: ", v
  else:
    echo "end of file :'("
    break

i = 0 # just rewinding the fake file
let v0: char = file.readChar
echo "assigned char works: ", v0
let v1: char = file.readChar
echo "assigned char works: ", v1
let v2: char = file.readChar # fails as expected, because there is no third char in stream
echo "assigned char works: ", v2
It introduces a result type, and a converter, so that old code can still work without adjustments. Not sure if you like it though.
I am not very happy with the names I have chosen, if you have any better alternatives for the following names, I am very open to that.
char (for the converter)
matchChar
ReadCharResult
@cblake
I never use mmap() so that's not a reflex I have, but why not.
This being said, it seems that memSlices only takes a char as delimiter, and not a string. That means that if I want to match only CRLF ("\r\l" in Nim-speak, if I am not mistaken), I cannot do it with that function.
@Krux02
I am not competent with all those converter, macro, quote, etc. so maybe your message was more intended for Araq :-)
Nevertheless, concerning the content, your solution seems to fix the problem of returning invalid characters as valid. But there is still the question that concerns the whole family of file IO functions: the Nim user cannot tell if it is really an end-of-file or if an error happened. The library functions should (IMO), when they get an EOF as return value (or any other error+end-of-file indicator), check what really happened with ferror() or feof(). And they probably need to raise 2 different exceptions depending on that distinction.
As for such an iterator being in the standard lib memfiles, well, "line ending agnosticism" rather than strictness is generally Araq's preference, but it never hurts to ask.
It would be to lines() what split() is to splitLines() :-)
Oh, and when you can use it, memory mapping is typically faster/easier than buffered IO.
In the case where I just want to read the beginning of the file (e.g. read only the headers of an e-mail), if the OS prefetches 128 kB when I just need to read a few kB at most, I doubt it will be faster. (That's a wild guess of mine, I did not do any measurement, and as I said I have no experience of mmap().)
@BossHogg: converter is a function that gets implicitly injected in places where the argument type is available but the result type of the converter is expected. Namely, it is called in this line:
let v0: char = file.readChar
the macro with the quote creates an expression that reintroduces a symbol with the same name as the argument symbol; that shadows the old symbol, and the new symbol will now have the type char. The value of that expression is a boolean, so that it can be used in if statements:
let v = file.readChar
if matchChar(v):
echo "got a char: ", v
you see that matchChar is in an if statement, and that branch is only taken when v actually contains a value. The v that is used as the argument of echo is not the same as the one declared in the let statement; it is a new one that actually has the type char.

I was thinking more of something along these lines that wraps the existing memSlices than full-on string splitting.
import memfiles

iterator linesDOS*(mf: MemFile, buf: var TaintedString): TaintedString =
  var ms: MemSlice
  for msNew in memSlices(mf, eat='\0'): # strict \n-terminated
    if cast[cstring](msNew.data)[msNew.size - 1] == '\r': # \r\n-terminated
      if ms.data != nil:
        ms.size += msNew.size + 1 # +1 for the \n
      else:
        ms = msNew
      buf.setLen(ms.size)
      copyMem(addr(buf[0]), ms.data, ms.size)
      buf[ms.size] = '\0'
      yield buf
      ms.data = nil # Mark ms empty
    else: # Missing \r; accumulate
      if ms.data != nil:
        ms.size += msNew.size + 1
      else: # First accumulant; initialize
        ms = msNew
  if ms.data != nil:
    echo "There is a final unterminated line..Uh..."

iterator linesDOS*(mfile: MemFile): TaintedString {.inline.} =
  var buf = TaintedString(newStringOfCap(80))
  for line in linesDOS(mfile, buf):
    yield buf

when isMainModule:
  for line in linesDOS(memfiles.open("/dev/stdin")):
    echo "line: ", line
Arguably, this might be better factored into a memSlicesDOS iterator returning the aggregated MemSlice and then two other iterators with a provided buffer variable or no provided buffer. And obviously, you want a more general strategy for incomplete final lines.

@cblake
Yes, with this method, I also hit the problem of the last line. Rewriting memSlices is probably not much longer anyway.
There is no isLast or hasNext for iterators in Nim, is there? BTW, does Nim have something to ease processing the first and/or last element in a for loop (they often need a bit different processing than other elements) ?
Here's what I've done with an overloading of memSlices:
iterator memSlices*(mfile: MemFile, delim: string): MemSlice {.inline.} =
  let delim_len = delim.len
  let delim_start: pointer = cast[cstring](delim)
  var ms: MemSlice
  ms.data = mfile.mem
  var remaining = mfile.size
  while remaining > 0:
    var ending_index = -1
    for i in 0..(remaining-delim_len):
      if equalMem(cast[pointer](cast[int](ms.data) +% i), delim_start, delim_len):
        ending_index = i
        break
    if ending_index < 0: # unterminated final slice
      ms.size = remaining # Weird case..check eat?
      yield ms
      break
    ms.size = ending_index # delim is NOT included
    yield ms
    ms.data = cast[pointer](cast[int](ms.data) +% ending_index +% delim_len) # skip delim
    remaining = mfile.size - (cast[int](ms.data) - cast[int](mfile.mem))

iterator lines*(mfile: MemFile, buf: var TaintedString, delim: string): TaintedString {.inline.} =
  for ms in memSlices(mfile, delim):
    buf.setLen(ms.size)
    copyMem(addr(buf[0]), ms.data, ms.size)
    buf[ms.size] = '\0'
    yield buf
Yeah. I haven't tried out your code, but an API idea might be to have the caller instruct the iterator what to do with any improper/undelimited record at the end -- includeUnterminated=False or something. Only the caller really knows what is desired.
One observation, algo performance-wise, is that the fastest substring searches are usually achieved by using a fast libc SSE/AVX vectorized memchr to find the least likely character in the substring. Sometimes one knows this from vague properties of the distribution. For example, framing the email messages in UUCP-style mailbox with '^From ' kinds of delimiting, memchr for the capital 'F' skips over most text most of the time. This idea may not work in a corpus of emails in ALL CAPS ALL THE TIME, but such a corpus should be rare. :-) Anyway, this idea is just sort of the simplest/first step into a wider world of efficient substring search algorithms that you might enjoy reading about some day.
Mmmyeah... or I'll call grep :-D (seriously, I did it for another program because it was faster than my lousy simplistic algorithms.)
Some greps can indeed be quite good. :-)
Really, I think this algorithmic intelligence should be able to be just used from Nim strutils. Nim strings know their length. Many/most of those read-only procs like strutils.find "should" be able to work on types like MemSlice or really some kind of openarray[char] -- a pointer and length pair. Scare quotes for the "should" since the current Nim openarray does not include arrays of dynamic/run-time extent but with non-heap/GC backing memory like a memory map, or some other blob of memory.
There is ongoing work on concepts that could probably profitably replace openarray in many use cases. So, maybe some day in the not too distant future, as they say. Concept support may already be good enough to start an ambitious project along those lines, but yeah, for you/your situation, just re-doing memSlices is a fine approach, esp. if absolute performance is not a priority.
@Araq
So, I read the version from:
https://github.com/nim-lang/Nim/blob/devel/lib/system/sysio.nim
I hope that was the right one.
proc checkErr(f: File) =
  if c_ferror(f) != 0:
    c_clearerr(f)
    raiseEIO("Unknown IO Error")

proc readBuffer(f: File, buffer: pointer, len: Natural): int =
  result = c_fread(buffer, 1, len, f)
  checkErr(f)

proc readChar(f: File): char =
  let x = c_fgetc(f)
  if x == -1: raiseEOF()
  checkErr(f)
  result = char(x)
I did not check readLine(); I think it better to clarify our thoughts first:
The way most stdio functions work is the following: they have a way to say something special happened. But they only have a single way for both end-of-file and IO error. So the way to go is:
I guess 3. could be a function like your checkErr() but then it should always raise something because you call it when you already know something special happened, you just don't know yet which special thing.
NB: I don't know what target Nim aims at (C89? C99? POSIX? the same ones with extensions? some specific platform?), but EOF is not defined as -1 in C99 standard, just as a negative value. So it should be a platform-dependent #define (well, the equivalent for Nim), not a hardcoded value; or if the normal return value is always positive, you can check for negative values.
Won't raise an exception for EOF.
Nor should it, EOF is only raised if the proc interface was so poor it couldn't signal it in any other way. Eventually we will get io.nim that uses a proper sum type for this.
I guess 3. could be a function like your checkErr() but then it should always raise something because you call it when you already know something special happened, you just don't know yet which special thing.
Ok, that's helpful, thanks.
NB: I don't know what target Nim aims at (C89? C99? POSIX? the same ones with extensions? some specific platform?), but EOF is not defined as -1 in C99 standard, just as a negative value. So it should be a platform-dependent #define (well, the equivalent for Nim), not a hardcoded value; or if the normal return value is always positive, you can check for negative values.
We target the supported C compilers (Visual, clang, gcc), not the C "standard", so I don't care much about "-1 in theory could be something else".
Nor should it, EOF is only raised if the proc interface was so poor it couldn't signal it in any other way.
That's not very regular, but as long as the documentation clearly states it, I guess it's OK.
NB: I don't know what target Nim aims at (C89? C99? POSIX? the same ones with extensions? some specific platform?), but EOF is not defined as -1 in C99 standard, just as a negative value. So it should be a platform-dependent #define (well, the equivalent for Nim), not a hardcoded value; or if the normal return value is always positive, you can check for negative values.
We target the supported C compilers (Visual, clang, gcc), not the C "standard", so I don't care much about "-1 in theory could be something else".
(That's not a matter of compiler but of library (glibc, musl, etc.).)
I cannot spot any obvious mistake in the read* functions any more.
What about the write* functions now? Starting with
proc write(f: File, i: int) =
  when sizeof(int) == 8:
    c_fprintf(f, "%lld", i)
  else:
    c_fprintf(f, "%ld", i)
This proc and the following ones ignore the return values of fprintf() and similar functions, and they don't call checkErr() either.