Hi All, I installed Nim yesterday and have been writing very small pieces of code to get to know the language and libraries better. I have used Python before, and the fact that Nim uses similar syntax is a big draw for me. Added to that is the fact that it compiles to C, which helps with performance. Once I get better at the language, I would like to contribute to its development in whatever way I can.
One very small (micro) piece of code I wrote reads a CSV file line by line and computes the average line length. The program is pasted below. It takes around 60 seconds to complete on my machine (Windows 7, 64-bit). The file has around 6.8 million lines; I use it at work as part of a larger application. When I read the same file in Scala and timed it, it took only around 5 seconds. I am not trying to compare languages, but I would like to know if there is any way to speed up the code. I compiled with the -d:release flag.
import times
echo("Avg Line Length = ",totallen div count) echo(cpuTime() - t0)
Welcome!
Thanks for your replies. I used the parsecsv module and it helped speed up the execution a lot.
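Roughly what the parsecsv version looks like, in case it helps someone else; the file name and the way I approximate the original line length from the parsed fields are just for illustration:

import parsecsv, streams, times

let t0 = cpuTime()
var s = newFileStream("data.csv", fmRead)   # placeholder path
if s == nil:
  quit("cannot open the file")
var p: CsvParser
open(p, s, "data.csv")
var totallen, count = 0
while readRow(p):
  # approximate the raw line length: sum of field lengths plus one separator between fields
  var linelen = 0
  for field in items(p.row):
    linelen += field.len
  totallen += linelen + max(p.row.len - 1, 0)
  inc count
close(p)
echo("Avg Line Length = ", totallen div count)
echo(cpuTime() - t0)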
Putting the code in a 'main' proc should really help.
By this, I assume that the code should be placed inside a proc. On my system, this did not help in speeding up the program.
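For reference, the change was just wrapping the existing code in a proc, roughly like this (the file name is again a placeholder):

import times

proc main() =
  let t0 = cpuTime()
  var totallen, count = 0
  for line in lines("data.csv"):   # placeholder path
    totallen += line.len
    inc count
  echo("Avg Line Length = ", totallen div count)
  echo(cpuTime() - t0)

main()

As I understand it, this tip usually helps because locals inside a proc can be optimized better than module-level globals; since this program is dominated by file I/O, that would explain why it made no measurable difference here.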
Varriount: Any chance you could help find any bottlenecks?
The bottlenecks are probably still the same as in the original thread (the one that def linked), i.e. the iterator reading the file character by character, which can incur significant per-character overhead. The solution would be to use a buffer to read multiple characters at a time.
Note that file streams by themselves do not solve this problem either (though one could write a BufferedFileStream that does, which is basically what you get in Java/Scala). If anything, the readLine implementation for file streams is even slower, since it also reads the file character by character and uses a less efficient way of doing so. The reason parsecsv is faster is that the underlying lexbase-based scanner does not read line by line, but in 8 KB chunks.
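To make the buffering idea concrete, a rough sketch that reads the file in 8 KB chunks (the same order of magnitude as lexbase) and derives the average line length from the raw bytes could look like the following; the file name and buffer size are arbitrary:

import times

proc main() =
  let t0 = cpuTime()
  var f = open("data.csv")              # placeholder path; raises IOError if it cannot be opened
  var buf = newString(8192)             # read in 8 KB chunks instead of character by character
  var totallen, count = 0
  while true:
    let n = f.readBuffer(addr buf[0], buf.len)
    if n <= 0:
      break
    for i in 0 .. n-1:
      if buf[i] == '\n':
        inc count                       # one line per newline terminator
      elif buf[i] != '\r':
        inc totallen                    # payload characters, ignoring carriage returns
  close(f)
  if count > 0:
    echo("Avg Line Length = ", totallen div count)
  echo(cpuTime() - t0)

main()

Counting bytes directly also avoids building a string per line, which is another per-line cost of readLine-based approaches.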