As a new user, I'd love to know about I/O optimization techniques for Nim.
To that end, I wrote this simple program for https://www.fluentcpp.com/2017/09/25/expressive-cpp17-coding-challenge/.
import os, strutils

proc main() =
  var
    pageSize: int = 4096
    input = open(paramStr(1), fmRead, pageSize)
    output = open(paramStr(4), fmWrite, pageSize)
    changeAt = 0
  let
    toChange = $paramStr(2)
    changeWith = $paramStr(3)
  var i: int = 0
  for column in input.readLine().split(','):
    if column == toChange:
      changeAt = i
    inc i
  let columns = i
  for row in input.lines():
    i = 0
    for entry in row.split(','):
      if i != changeAt:
        output.write entry
      else:
        output.write changeWith
      if i + 1 != columns:
        output.write(',')
      inc i
    output.writeLine("")

main()
While the program is not expressive, I intend to use it to explore possible I/O optimizations. My question is: why does changing pageSize have no apparent effect on the program's performance? And how can I find an optimal pageSize for the system? Any other tips would be appreciated too.
As a side note, the following (somewhat similar) Python program is actually faster than the Nim program above:
import sys

with open(sys.argv[1]) as fin:
    head = fin.readline()
    col = head.split(',').index(sys.argv[2])
    with open(sys.argv[4], 'w') as fout:
        fout.write(head)
        for i in (x.split(',') for x in fin.readlines()):
            i[col] = sys.argv[3]
            fout.write(','.join(i))
It should be clear that your code is not I/O bound.
You are using split(), which creates a seq of substrings on each call. Such an operation is expensive: each call allocates a new seq, and each entry in that seq allocates a string. If performance is really a concern, consider avoiding the split proc -- maybe there are other procs available in strutils, or you may write your own.
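For illustration, here is a minimal sketch of such a hand-rolled proc (fieldBounds is a made-up name, not something from strutils): an iterator that yields index bounds instead of substrings, so nothing is allocated per field.
import strutils

# Hypothetical helper: yields the inclusive (first, last) bounds of each
# sep-separated field, without allocating a string per field.
iterator fieldBounds(s: string; sep = ','): (int, int) =
  var first = 0
  while first <= s.len:
    var last = s.find(sep, start = first)
    if last < 0: last = s.len
    yield (first, last - 1)
    first = last + 1          # skip past the separator
A caller can then compare a field with s[a..b] == toChange and slice out only the one column it actually needs.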
Have you compiled with -d:release? I assume that Python's split() is implemented internally in C, so we should not assume that Nim's split() is much faster. Maybe Python also has some smart internal optimization.
When you want to improve the performance of the Nim code:
The blog entry about profiling etc. might help. Here's the short version:
If you're on Windows: I can't help you. Maybe try the Windows Subsystem for Linux on Windows 10.
If you're on macOS or Linux: install Valgrind.
Then use these shell commands:
nim c -d:release <source file>
valgrind --tool=callgrind -v ./<program file> <arguments>
kcachegrind callgrind.out.<some number> # or maybe qcachegrind
The (k/q)cachegrind GUI applications show all procs with their relative share of the CPU time used. For your code and a fairly big input file (>10 MB), it shows that file operations actually account for only a minority of the CPU time, as @Stefan_Salewski said. The main cost is at the call level of the main proc.

As you may have found out yourself already, when I was talking about the replace() proc, I had in mind this type of replace:
var s = "This is a test"
s[5 .. 6] = "was"
echo s
For finding the start and end positions you may use the find proc, searching for ',' or whatever separator is desired.
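Putting the two together, a minimal sketch (replacing the second field of a hypothetical row):
import strutils

var row = "a,OLD,c"
let a = row.find(',') + 1              # start of the 2nd field
let b = row.find(',', start = a) - 1   # inclusive end of the 2nd field
row[a..b] = "NEW"                      # in-place splice; lengths may differ
echo row                               # -> a,NEW,c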
And that's one of the reasons why I like functional programming... Firstly, you don't really need a new seq for the job you're doing. You should iterate over the sections separated by ',', but you can modify them in place (or, even better, write them directly to the file!).
Here is a little benchmark (100 iterations) on a modest file (23 KB) with very few (9) columns and short lines (50-60 B each):
Here is the code:
import os, strutils

proc main() =
  var
    pageSize: int = 4096
    input = open(paramStr(1), fmRead, pageSize)
    output = open(paramStr(4), fmWrite, pageSize)
    changeAt = 0
  let
    toChange = $paramStr(2)
    changeWith = $paramStr(3)
  var row = input.readLine()
  output.writeLine(row)
  block:
    var
      prev = 0
      last = -1
    for i in 0 .. row.len-1:
      prev = last
      last = row.find(',', start=last+1)
      if row[prev+1..last-1] == toChange:
        changeAt = i
        break
  while input.readLine(row):  # note: the string buffer is reused
    var
      prev = 0
      last = -1
    for i in 0 .. changeAt:
      prev = last
      last = row.find(',', start=last+1)  # note: no slicing, just indexing
    output.writeLine(row[0..prev], changeWith, row[last..row.len])  # note: single writeLine

main()
Of course one can make it simpler. What's more... it doesn't seem any slower. The only thing I miss here is Rust's "pattern-letting", so I could use a structure instead of a tuple... Nevertheless, here it comes:
import os, strutils

iterator sections(input: string, sep: char = ','): (int,int,int) =
  var
    prev = 0
    last = -1
    i = -1
  while true:
    inc i
    prev = last
    last = input.find(sep, start=last+1)
    yield (i, prev+1, last-1)
    if last == -1:
      break

proc section(input: string, nr: int, sep: char = ','): (int,int) =
  var
    prev = 0
    last = -1
  for i in 0..nr:
    prev = last
    last = input.find(sep, start=last+1)
  result = (prev+1, last-1)

proc findSection(input, name: string, sep: char = ','): (int,int,int) =
  for i, lb, hb in input.sections():
    if input[lb..hb] == name:
      return (i, lb, hb)

proc main() =
  var
    pageSize: int = 4096
    input = open(paramStr(1), fmRead, pageSize)
    output = open(paramStr(4), fmWrite, pageSize)
  let
    toChange = $paramStr(2)
    changeWith = $paramStr(3)
  var row = input.readLine()
  output.writeLine(row)
  let changeAt = row.findSection(toChange)[0]
  while input.readLine(row):
    let (lb, hb) = row.section(changeAt)
    output.writeLine(row[0..lb-1], changeWith, row[hb+1..row.len])

main()
Thank you, functional programming!
Too bad Nim isn't more functional, as then we could write:
var lines = input.lines()               # iterator over file yielding...
let header = lines.next().unwrap()      # ...iterators over strings (here `lines` iterator is modified)
let changeAt = header.findSection(toChange).idx  # here iterator over strings gives (int,int,int)
output.writeLine(header)
lines.mapIt:                            # for any line (over iterator)...
  let (lb,hb) = it.sectionize().        # ...iterate over its sections...
    nth(changeAt)                       # ...take the one we want...
  # do stuff
  output.writeLine(it[0..lb-1],         # (it would also be nice to have an iterator slice)
                   changeWith,
                   it[hb+1..it.len])    # and an iterator's `len`
Notice that the only var here would be unnecessary if an iterator could be split in a "return the yielded value & the rest" way. The iterator state is small anyway, so we wouldn't mind it in terms of performance, but let is nicer to reason about than var.
I see that my readLine was messing with my buffer optimizations. Thanks, KevinGolding. Udiknedormin's code is actually the fastest. Still, it's worth mentioning that PyPy is considerably fast. I made a CSV with 10000000 rows and 10 columns with:
import os, strutils

let
  rows = parseInt($paramStr(1))
  columns = parseInt($paramStr(2))
var output = open($paramStr(3), fmWrite, 4096)
for _ in 1 .. rows:
  for column in 1 .. columns - 1:
    output.write("field", column, ",")
  output.write("field", columns, "\n")
Kevin's code is similar to the Python version. Here is another benchmark:
[user0@user0-pc expressive]$ >kevingolding.csv && time ./kevingolding biginput.csv field8 TEST kevingolding.csv
real 0m16.727s
user 0m13.651s
sys 0m0.867s
[user0@user0-pc expressive]$ >nim.csv && time ./nim biginput.csv field8 TEST nim.csv # mine
real 0m8.392s
user 0m6.484s
sys 0m0.803s
[user0@user0-pc expressive]$ >python.csv && time pypy python.py biginput.csv field8 TEST python.csv
real 0m8.020s
user 0m5.048s
sys 0m1.395s
[user0@user0-pc expressive]$ >udiknedromin.csv && time ./udiknedromin biginput.csv field8 TEST udiknedromin.csv
real 0m6.236s
user 0m3.082s
sys 0m0.881s
I had a similar problem here and I ended up using the parsecsv module.
import parsecsv, streams, os, strutils

proc main() =
  var
    pageSize: int = 4096
    input = newFileStream(paramStr(1), bufSize = pageSize)
    output = open(paramStr(4), fmWrite, pageSize)
    changeAt: int
    parser: CsvParser
  let
    toChange = $paramStr(2)
    changeWith = $paramStr(3)
  open(parser, input, paramStr(1), quote = '\0')
  parser.readHeaderRow()
  changeAt = parser.headers.find(toChange)
  output.writeLine parser.headers.join(",")
  while readRow(parser):
    parser.row[changeAt] = changeWith
    output.writeLine parser.row.join(",")

main()
I'm surprised no one mentioned this blog post: Faster command line tools in Nim, which is all about optimizing CSV I/O.
I'm also very interested in benchmarks of the respective solutions :).
@mratsim Thank you. I haven't read this blog post before, actually.
Actually, it seems my version is faster than the one using parsecsv, at least for my benchmark.
However, it should be noted that parsecsv does more, as it handles quoting and escaping etc.
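For instance, a minimal sketch (assuming a hypothetical file quoted.csv containing the single line "a,b",c): parsecsv keeps the quoted comma inside one field, which the find-based scanners above would split in two.
import parsecsv

var p: CsvParser
p.open("quoted.csv")   # hypothetical file containing: "a,b",c
if p.readRow():
  echo p.row           # -> @["a,b", "c"] -- the quoted comma is not a separator
p.close()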
Thanks to this I found a nice thing I'd like to have in parsecsv so I made a pull request.
Here is the code of the parsecsv version (edit: now shorter after the PR was accepted):
import os, strutils, parsecsv

proc main() =
  var
    pageSize: int = 4096
    output = open(paramStr(4), fmWrite, pageSize)
  let
    toChange = $paramStr(2)
    changeWith = $paramStr(3)
  var csv: CsvParser
  open(csv, paramStr(1), separator=',')
  csv.readHeaderRow()
  output.writeLine(csv.headers.join(","))
  while csv.readRow():
    csv.rowEntry(toChange) = changeWith
    output.writeLine(csv.row.join(","))

main()
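For reference, the helper from the PR amounts to something like this (a sketch, not necessarily the exact stdlib source): it returns a var string for the column with the given header name, so it can be assigned through.
import parsecsv

# Assumed shape of rowEntry: look the column up by header name,
# then index the current row; `var` return allows assignment.
proc rowEntry(my: var CsvParser, entry: string): var string =
  my.row[my.headers.find(entry)]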