As a new user, I'd love to know about I/O optimization techniques for Nim.
To that end, I wrote this simple program for https://www.fluentcpp.com/2017/09/25/expressive-cpp17-coding-challenge/.
import os, strutils

proc main() =
  var
    pageSize: int = 4096
    input = open(paramStr(1), fmRead, pageSize)
    output = open(paramStr(4), fmWrite, pageSize)
    changeAt = 0
  let
    toChange = $paramStr(2)
    changeWith = $paramStr(3)
  var i: int = 0
  for column in input.readLine().split(','):
    if column == toChange:
      changeAt = i
    inc i
  let columns = i
  for row in input.lines():
    i = 0
    for entry in row.split(','):
      if i != changeAt:
        output.write entry
      else:
        output.write changeWith
      if i + 1 != columns:
        output.write(',')
      inc i
    output.writeLine("")

main()
While the program is not expressive, I intend to use it to explore possible I/O optimizations. My question is: why does changing pageSize have no apparent effect on the program's performance? And how can I find an optimal pageSize for the system? Any other tips would be appreciated too.
As a side note, the following (somewhat similar) Python program is actually faster than the Nim program above:
import sys

with open(sys.argv[1]) as fin:
    head = fin.readline()
    col = head.split(',').index(sys.argv[2])
    with open(sys.argv[4], 'w') as fout:
        fout.write(head)
        for i in (x.split(',') for x in fin.readlines()):
            i[col] = sys.argv[3]
            fout.write(','.join(i))
It should be clear that your code is not I/O bound.
You are using split(), which creates a seq of substrings on each call. Such an operation is expensive: each call allocates a new seq, and each entry in that seq allocates a string. If performance is really a concern, consider avoiding the split proc -- maybe there are other procs available in strutils, or you may write your own.
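For illustration, here is a minimal sketch of such a hand-rolled proc (fieldBounds is a made-up name, not something from strutils): an iterator that yields index bounds instead of substrings, so nothing is allocated per field.
import strutils

# Hypothetical helper: yields the inclusive (first, last) bounds of each
# sep-separated field, without allocating a string per field.
iterator fieldBounds(s: string; sep = ','): (int, int) =
  var first = 0
  while first <= s.len:
    var last = s.find(sep, start = first)
    if last < 0: last = s.len
    yield (first, last - 1)
    first = last + 1          # skip past the separator
A caller can then compare a field with s[a..b] == toChange and slice out only the one column it actually needs.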
Have you compiled with -d:release? I assume that Python's split() is implemented internally in C, so we should not assume that Nim's split() is much faster. Maybe Python also has some smart internal optimization.
When you want to improve the performance of the Nim code:
The blog entry about profiling etc. might help. Here's the short version:
If you're on Windows: I can't help you. Maybe try the Windows Subsystem for Linux on Windows 10.
If you're on macOS or Linux: install Valgrind.
Then use these shell commands:
nim c -d:release <source file>
valgrind --tool=callgrind -v ./<program file> <arguments>
kcachegrind callgrind.out.<some number> # or maybe qcachegrind
The (k/q)cachegrind GUI applications show all procs with their relative share of the CPU time used. For your code and a fairly big input file (>10 MB), it shows that file operations actually account for only a minority of the CPU time, as @Stefan_Salewski said. The main cost is at the call level of the main proc.

As you may have found out yourself already, when I was talking about the replace() proc, I had in mind this type of replace:
var s = "This is a test"
s[5 .. 6] = "was"
echo s
For finding the start and end positions you may use the find proc, searching for ',' or whatever separator is desired.
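Putting the two together, a minimal sketch (replacing the second field of a hypothetical row):
import strutils

var row = "a,OLD,c"
let a = row.find(',') + 1              # start of the 2nd field
let b = row.find(',', start = a) - 1   # inclusive end of the 2nd field
row[a..b] = "NEW"                      # in-place splice; lengths may differ
echo row                               # -> a,NEW,c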
And that's one of the reasons why I like functional programming... Firstly, you don't really need a new seq for the job you're doing. You should iterate over the sections separated by ',', but you can modify them in place (or, even better, write them directly to the file!).
Here is a little benchmark (100 iterations) on a modest file (23 KB) with very few (9) columns and short lines (50-60 B each):
Here is the code:
import os, strutils

proc main() =
  var
    pageSize: int = 4096
    input = open(paramStr(1), fmRead, pageSize)
    output = open(paramStr(4), fmWrite, pageSize)
    changeAt = 0
  let
    toChange = $paramStr(2)
    changeWith = $paramStr(3)
  var row = input.readLine()
  output.writeLine(row)
  block:
    var
      prev = 0
      last = -1
    for i in 0 .. row.len-1:
      prev = last
      last = row.find(',', start=last+1)
      if row[prev+1..last-1] == toChange:
        changeAt = i
        break
  while input.readLine(row):  # note: the string buffer is reused
    var
      prev = 0
      last = -1
    for i in 0 .. changeAt:
      prev = last
      last = row.find(',', start=last+1)  # note: no slicing, just indexing
    output.writeLine(row[0..prev], changeWith, row[last..row.len])  # note: single writeLine

main()
Of course one can make it simpler. What's more... it doesn't seem any slower. The only thing I miss here is Rust's "pattern-letting", so I could use a structure instead of a tuple... Nevertheless, here it comes:
import os, strutils

iterator sections(input: string, sep: char = ','): (int,int,int) =
  var
    prev = 0
    last = -1
    i = -1
  while true:
    inc i
    prev = last
    last = input.find(sep, start=last+1)
    yield (i, prev+1, last-1)
    if last == -1:
      break

proc section(input: string, nr: int, sep: char = ','): (int,int) =
  var
    prev = 0
    last = -1
  for i in 0..nr:
    prev = last
    last = input.find(sep, start=last+1)
  result = (prev+1, last-1)

proc findSection(input, name: string, sep: char = ','): (int,int,int) =
  for i, lb, hb in input.sections():
    if input[lb..hb] == name:
      return (i, lb, hb)

proc main() =
  var
    pageSize: int = 4096
    input = open(paramStr(1), fmRead, pageSize)
    output = open(paramStr(4), fmWrite, pageSize)
  let
    toChange = $paramStr(2)
    changeWith = $paramStr(3)
  var row = input.readLine()
  output.writeLine(row)
  let changeAt = row.findSection(toChange)[0]
  while input.readLine(row):
    let (lb, hb) = row.section(changeAt)
    output.writeLine(row[0..lb-1], changeWith, row[hb+1..row.len])

main()
Thank you, functional programming!
Too bad Nim isn't more functional, as then we could write:
var lines = input.lines()               # iterator over file yielding...
let header = lines.next().unwrap()      # ...iterators over strings (here `lines` iterator is modified)
let changeAt = header.findSection(toChange).idx  # here iterator over strings gives (int,int,int)
output.writeLine(header)
lines.mapIt:                            # for any line (over iterator)...
  let (lb,hb) = it.sectionize().        # ...iterate over its sections...
    nth(changeAt)                       # ...take the one we want...
  # do stuff
  output.writeLine(it[0..lb-1],         # (it would also be nice to have an iterator slice)
                   changeWith,
                   it[hb+1..it.len])    # and an iterator's `len`
Notice that the only var here would be unnecessary if an iterator could be split in a "return the yielded value & the rest" way. The iterator state is small anyway, so we wouldn't mind it in terms of performance, but let is nicer to reason about than var.
I see that my readLine was messing with my buffer optimizations. Thanks, KevinGolding. Udiknedormin's code is actually the fastest. Still, it's worth mentioning that PyPy is considerably fast. I made a CSV with 10000000 rows and 10 columns with:
import os, strutils

let
  rows = parseInt($paramStr(1))
  columns = parseInt($paramStr(2))
var output = open($paramStr(3), fmWrite, 4096)
for _ in 1 .. rows:
  for column in 1 .. columns - 1:
    output.write("field", column, ",")
  output.write("field", columns, "\n")
Kevin's code is similar to the Python version. Here is another benchmark:
[user0@user0-pc expressive]$ >kevingolding.csv && time ./kevingolding biginput.csv field8 TEST kevingolding.csv
real 0m16.727s
user 0m13.651s
sys 0m0.867s
[user0@user0-pc expressive]$ >nim.csv && time ./nim biginput.csv field8 TEST nim.csv # mine
real 0m8.392s
user 0m6.484s
sys 0m0.803s
[user0@user0-pc expressive]$ >python.csv && time pypy python.py biginput.csv field8 TEST python.csv
real 0m8.020s
user 0m5.048s
sys 0m1.395s
[user0@user0-pc expressive]$ >udiknedromin.csv && time ./udiknedromin biginput.csv field8 TEST udiknedromin.csv
real 0m6.236s
user 0m3.082s
sys 0m0.881s
I had a similar problem here and I ended up using the parsecsv module.
import parsecsv, streams, os, strutils

proc main() =
  var
    pageSize: int = 4096
    input = newFileStream(paramStr(1), bufSize = pageSize)
    output = open(paramStr(4), fmWrite, pageSize)
    changeAt: int
    parser: CsvParser
  let
    toChange = $paramStr(2)
    changeWith = $paramStr(3)
  open(parser, input, paramStr(1), quote = '\0')
  parser.readHeaderRow()
  changeAt = parser.headers.find(toChange)
  output.writeLine parser.headers.join(",")
  while readRow(parser):
    parser.row[changeAt] = changeWith
    output.writeLine parser.row.join(",")

main()
I'm surprised no one mentioned this blog post: Faster command line tools in Nim, which is all about optimizing CSV I/O.
I'm also very interested in benchmarks of the respective solutions :).
@mratsim Thank you. I haven't read this blog post before, actually.
Actually, it seems my version is faster than the one using parsecsv, at least for my benchmark.
However, it should be noted that parsecsv does more, as it handles quoting and escaping etc.
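For instance, a minimal sketch (assuming a hypothetical file quoted.csv containing the single line "a,b",c): parsecsv keeps the quoted comma inside one field, which the find-based scanners above would split in two.
import parsecsv

var p: CsvParser
p.open("quoted.csv")   # hypothetical file containing: "a,b",c
if p.readRow():
  echo p.row           # -> @["a,b", "c"] -- the quoted comma is not a separator
p.close()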
Thanks to this I found a nice thing I'd like to have in parsecsv so I made a pull request.
Here is the code of the parsecsv version (edit: now shorter after the PR was accepted):
import os, strutils, parsecsv

proc main() =
  var
    pageSize: int = 4096
    output = open(paramStr(4), fmWrite, pageSize)
  let
    toChange = $paramStr(2)
    changeWith = $paramStr(3)
  var csv: CsvParser
  open(csv, paramStr(1), separator=',')
  csv.readHeaderRow()
  output.writeLine(csv.headers.join(","))
  while csv.readRow():
    csv.rowEntry(toChange) = changeWith
    output.writeLine(csv.row.join(","))

main()
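For reference, the helper from the PR amounts to something like this (a sketch, not necessarily the exact stdlib source): it returns a var string for the column with the given header name, so it can be assigned through.
import parsecsv

# Assumed shape of rowEntry: look the column up by header name,
# then index the current row; `var` return allows assignment.
proc rowEntry(my: var CsvParser, entry: string): var string =
  my.row[my.headers.find(entry)]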