Hello everyone,
I have recently started learning about Nim, and I really like it so far. The elegance of the syntax plus the flexibility of the template/macro system are really cool.
Just for fun, I did a small exercise where I produced a CSV file of the numbers 0 through 2,999,999 in three columns, and then I read in that CSV file and computed the sum of all the numbers. The code I used is as follows:
# Code used to produce csv file
import std/strutils
import std/sugar
import std/sequtils
const numRows = 1000000
var csvOut = ""
for i in 0..<numRows:
  let a = 3 * i
  let s = @[a, a+1, a+2]
  let row = s.join(",") & "\n"
  csvOut &= row

echo csvOut[0..100]
writeFile("./nim.csv", csvOut)
# Code used to parse CSV file
import std/strutils
import std/sequtils
let csv = readFile("./nim.csv")
proc sum(s: seq[int]): int = s.foldl(a + b)
var totalSum = 0
for l in csv.splitLines():
  if l.len > 1:
    let rowSum = l.split(",").mapIt(it.parseInt).sum()
    totalSum += rowSum
echo "Expected total = ", 2999999 * 3000000 / 2
echo "Actual total = ", totalSum
I also wrote the same example in Rust for comparison:
use std::fs;

fn main() {
    let csv = fs::read_to_string("./nim.csv").unwrap();
    let mut total: i64 = 0;
    for l in csv.lines() {
        let row_sum: i64 = l.split(",").map(|x| x.parse::<i64>().unwrap()).sum();
        total += row_sum;
    }
    println!("Actual result = {}", total);
    let n: i64 = 2999999;
    let expected_total = n * (n + 1) / 2;
    println!("Expected result = {}", expected_total);
}
What I found was that the Rust code runs in about 100ms, whereas the Nim code (compiled with -d:release) runs in about 600ms.
I understand of course that the primary goal of Nim is not to be the fastest language, but I'm just curious, why is the Rust example significantly faster here? Are they able to use certain special optimizations in this case?
For context, most of my programming experience is with Python and Kotlin, and I'm not particularly knowledgeable about low-level languages, optimization, etc.
The very short answer is: performance in these kinds of microbenchmarks depends on all sorts of implementation details. Nim can be just as fast as Rust and vice versa. In this case, the CSV parser in the Nim stdlib is intended to be convenient, not fast. In addition, using split and mapIt in this context is not fast in Nim, due to the way they are implemented in the stdlib: more data copying and more looping.
If you want a faster version, use memory-mapped files, e.g. using cligen (https://github.com/c-blake/cligen) for convenience, as its interface for memory-mapped files is nicer than the stdlib version:
import cligen / [mfile, mslice]

let mf = mopen("./nim.csv")
var totalSum = 0
for row in rows(mf, initSep(",")):
  for el in row:
    totalSum += parseInt(el)
mf.close()
echo "Expected total = ", 2999999 * 3000000 / 2
echo "Actual total = ", totalSum
But note: you can now do the same in Rust and it'll probably also get faster. But it's just a race to a more and more similar and low level solution.
Take this Rust:
for l in csv.lines() {
What is l here? It's a &str, or a string slice: logically, a pointer and a length. At this point your program has nim.csv in a single String and you're operating over slices of that file without further allocation or copying.
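To make this concrete, here is a small standalone sketch (not from the thread; the helper name `all_lines_borrow` is made up for illustration) checking that every `&str` yielded by `lines()` points into the original `String`'s buffer, i.e. no per-line allocation or copy happens:

```rust
fn all_lines_borrow(csv: &str) -> bool {
    let start = csv.as_ptr() as usize;
    let end = start + csv.len();
    csv.lines().all(|l| {
        // `l` is a &str: just a pointer + length into `csv`'s own buffer.
        let p = l.as_ptr() as usize;
        p >= start && p + l.len() <= end
    })
}

fn main() {
    let csv = String::from("0,1,2\n3,4,5\n");
    // No per-line allocation or copying happens while iterating:
    assert!(all_lines_borrow(&csv));
    println!("every line is a slice of the original buffer");
}
```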
Now take this Nim:
for l in csv.splitLines():
What is l here? It's a string. Logically: an independent array of chars and a length. In order to provide you with it Nim must allocate new storage for the string, copy part of nim.csv into that string, and then deallocate it moments later when you're done with it, which is three steps that don't occur in the Rust code at all. Even if they might be faster than you think, even if in a much more complex program you might find your Rust program doing a lot of explicit cloning to get Strings anyway, 'nothing' is generally faster than 'something'.
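For contrast, a hedged Rust sketch of what the Nim loop is effectively doing: forcing an owned `String` per line makes the allocate/copy/free cycle explicit. (`to_string()` here merely stands in for Nim's per-line string copy; `sum_with_owned_lines` is a made-up name, not code from the thread.)

```rust
fn sum_with_owned_lines(csv: &str) -> i64 {
    let mut total = 0;
    for l in csv.lines() {
        // Owned copy: allocate fresh storage and copy the bytes,
        // much like Nim's splitLines does for each line...
        let owned: String = l.to_string();
        // ...into a buffer distinct from `csv`'s:
        assert_ne!(owned.as_ptr(), l.as_ptr());
        total += owned.split(',').map(|x| x.parse::<i64>().unwrap()).sum::<i64>();
        // ...and `owned` is deallocated here, at the end of each iteration.
    }
    total
}

fn main() {
    // 0+1+2+3+4+5 = 15
    assert_eq!(sum_with_owned_lines("0,1,2\n3,4,5\n"), 15);
}
```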
Nim has view types that might be used similarly to Rust's &str, but those are deep in the experimental section of the manual. You might also have a splitLines that returns pairs of indices into nim.csv, but more work would be needed to get your familiar string ops to accept such a thing. From above, cligen's already put in the work: https://github.com/c-blake/cligen/blob/master/cligen/mslice.nim#L28
Also, with support for custom allocators, you could use a 'temp allocator', effectively removing the 'allocate' and 'deallocate' steps by only copying into a ring buffer that is (unsafely) assumed large enough that you don't overwrite old string data while those strings are still in use. For little line-processing microbenchmarks this might be competitive.
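A minimal sketch of that ring-buffer idea (the `RingScratch` type and its methods are hypothetical names invented for this illustration, not a real allocator API): each line is copied into a fixed circular scratch buffer with no allocation or deallocation per line, relying on the buffer being large enough that old data isn't overwritten while still in use:

```rust
use std::ops::Range;

/// A tiny "temp allocator" sketch: hands out ranges of a fixed ring buffer.
/// Unsafe in spirit: old entries are silently clobbered once the ring wraps,
/// so the buffer must be large enough for everything still in use.
struct RingScratch {
    buf: Vec<u8>,
    pos: usize,
}

impl RingScratch {
    fn new(cap: usize) -> Self {
        RingScratch { buf: vec![0; cap], pos: 0 }
    }

    /// Copy `s` into the ring and return the copy's range: a bump pointer
    /// that wraps, with no allocation or deallocation per call.
    fn put(&mut self, s: &[u8]) -> Range<usize> {
        if self.pos + s.len() > self.buf.len() {
            self.pos = 0; // wrap: older entries may now be overwritten
        }
        let start = self.pos;
        self.buf[start..start + s.len()].copy_from_slice(s);
        self.pos += s.len();
        start..self.pos
    }

    fn get(&self, r: Range<usize>) -> &[u8] {
        &self.buf[r]
    }
}

fn sum_via_ring(csv: &str) -> i64 {
    let mut scratch = RingScratch::new(64);
    let mut total = 0;
    for l in csv.lines() {
        let r = scratch.put(l.as_bytes());
        let line = std::str::from_utf8(scratch.get(r)).unwrap();
        total += line.split(',').map(|x| x.parse::<i64>().unwrap()).sum::<i64>();
    }
    total
}

fn main() {
    // 0+1+2+3+4+5 = 15
    assert_eq!(sum_via_ring("0,1,2\n3,4,5\n"), 15);
}
```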
For context, most of my programming experience is with Python and Kotlin, and I'm not particularly knowledgeable about low-level languages, optimization, etc.
The first thing to understand is the types that you're working with.
Also -d:danger would be faster than -d:release but I don't think Rust has an equivalent.
Thanks guys, this is very informative and gives me a lot to think about!
@Araq, at the time I was starting to read the CSV in Nim, I didn't know there was a parsecsv module. Also, I wanted to make the examples as similar as possible. (Of course I see now that my two examples were different in some key ways. Remember, I'm a newbie to low-level programming!)
A perhaps undocumented feature: parsecsv also handles UTF-encoded input remarkably well.
I ran this:
import std/parsecsv
from std/os import paramStr
from std/streams import newFileStream
var s = newFileStream(paramStr(1), fmRead)
if s == nil:
  quit("Cannot open file " & paramStr(1))

var x: CsvParser
var n: int
open(x, s, paramStr(1))
while readRow(x):
  n += 1
  echo "new row: ", x.row, " (", len(x.row), " fields)"
close(x)
echo n
on this Czech dataset:
"Item","Materiál","Objem","Jednotka objemu","Free Inv Pcs"
"1000028","SL 70","1248.000","CM3",21
"1000031","Karibik 12.5 kg","41440.000","CM3",2
"1000036","IH 26","6974.100","CM3",2
"1000062","IL 35","6557.300","CM3",15
"1000078","Projektor Tomáš se zvukem","8742.400","CM3",11
With:
D:\github\nim_tutorials>nim c -d:release "d:\github\nim_tutorials\nim_csv.nim"
D:\github\nim_tutorials>nim_csv.exe "d:\new 6.csv"
And the output is perfect:
new row: @["Item", "Materiál", "Objem", "Jednotka objemu", "Free Inv Pcs"] (5 fields)
new row: @["1000028", "SL 70", "1248.000", "CM3", "21"] (5 fields)
new row: @["1000031", "Karibik 12.5 kg", "41440.000", "CM3", "2"] (5 fields)
new row: @["1000036", "IH 26", "6974.100", "CM3", "2"] (5 fields)
new row: @["1000062", "IL 35", "6557.300", "CM3", "15"] (5 fields)
new row: @["1000078", "Projektor Tomáš se zvukem", "8742.400", "CM3", "11"] (5 fields)
6
Nim does have a sum() in std/math.
Not using sum() or mapIt() cuts runtime in half for me:
Original code:
Expected total = 4499998500000.0
Actual total = 4499998500000
real 0m0.457s
Modified:
for l in csv.splitLines():
  if l.len > 1:
    for col in l.split(","):
      totalSum += parseInt(col)
Expected total = 4499998500000.0
Actual total = 4499998500000
real 0m0.252s
parsecsv runs in ~200 ms and the Rust version in ~150 ms.
It’s a pity that it is relatively easy to do the wrong thing (performance-wise) for some of these simple cases, particularly coming from Python.
Perhaps there could be a macro that you could use to wrap some code and if there were any allocations inside of it it would throw an error?
Perhaps there could be a macro that you could use to wrap some code and if there were any allocations inside of it it would throw an error?
No, just use a profiler, and don't expect to be able to produce the fastest code within 10 minutes of tinkering around. You shouldn't expect splitLines+split to produce correct results for CSV (which indeed it doesn't; quoting is a real thing), let alone fast ones.
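To illustrate why naive splitting is incorrect for CSV: a quoted field may itself contain the separator, which a plain split on ',' happily cuts through. A minimal Rust demonstration (the `naive_field_count` helper is made up for this example):

```rust
fn naive_field_count(line: &str) -> usize {
    // What splitLines+split effectively does: cut on every comma,
    // with no awareness of CSV quoting rules.
    line.split(',').count()
}

fn main() {
    // Two logical fields, but the first contains a comma inside quotes
    // (cf. the "Karibik" rows in the Czech dataset above):
    let line = r#""Karibik, 12.5 kg",41440.000"#;
    // Naive splitting sees three fields, not two:
    assert_eq!(naive_field_count(line), 3);
    // A real CSV parser such as std/parsecsv tracks the quoting state instead.
}
```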
The use of view types is tied to the implementation of split and parseInt, which is why no one has shown it. You have to rewrite them with changed type signatures (which parseutils.parseInt already has on devel, so we just make a shim for it):
import std/parseutils

{.experimental: "views".} # needed for openarray in iterators apparently

iterator splitView(s: openarray[char], chars: set[char]): openarray[char] =
  var lastAfterLine = -1
  var lastBeforeLine = 0
  for i in 0 ..< s.len:
    let c = s[i]
    if c in chars:
      yield s.toOpenArray(lastAfterLine, lastBeforeLine)
      lastAfterLine = -1
    else:
      if lastAfterLine < 0: lastAfterLine = i
      lastBeforeLine = i
  if lastAfterLine >= 0:
    yield s.toOpenArray(lastAfterLine, lastBeforeLine)

# so we don't shadow strutils:
proc parseInt(s: string): int =
  discard parseInt(s, result)

proc `$`(s: openarray[char]): string =
  result = newString(s.len)
  for i in 0 ..< s.len: result[i] = s[i]

proc parseInt(s: openarray[char]): int =
  when compiles(parseInt(s, result)):
    # in the development version, parseInt is implemented for openarray[char]
    discard parseInt(s, result)
  else:
    parseInt($s)

let csv = readFile("./nim.csv")
var totalSum = 0
for l in csv.splitView({'\r', '\n'}):
  if l.len > 1:
    # broken for some reason:
    #let rowSum = mapIt(l.splitView({','}), it.parseInt).sum()
    for it in l.splitView({','}):
      totalSum += it.parseInt

echo "Expected total = ", 2999999 * 3000000 / 2
echo "Actual total = ", totalSum
Method    | WallTimeMean | WallTimeStdErr
----------|--------------|---------------
Vindaar   | 0.044140     | 0.000028
cbMemFls  | 0.047743     | 0.000074
N4vRust   | 0.068667     | 0.000082
Hlaaftana | 0.076238     | 0.000041
gemath    | 0.152973     | 0.000043
splitItr  | 0.18736      | 0.00017
N4v       | 0.26486      | 0.00012
I made only minor edits to the programs so they don't print anything out unless there is a mismatch (as per my last post); splitItr is just the obvious split iterator nested inside of lines. Times are mean/stderr(mean) for the best 3 of 10 runs on N4v's original problem on Linux 6.0.9, i7-6700k at 4.7 GHz. N4vRust was built with rustc -C opt-level=3 -C lto -C target-cpu=native on rust-1.65. Nim was built with nim c --cc:gcc -d:lto --passL:-no-pie --mm:arc -d:danger -d:useMalloc --panics:on --passC:-march=native --passL:-march=native --threads:off, with gcc-12.2.1_20221203 and no user nim.cfg. I got very similar results (same rankings) on an Intel Alder Lake (but needed a taskset there to stay on the P-cores).
These mean(3/10)+-se are run-to-run reproducible within a few sigma of their error estimates. This measurement methodology is just one easy-to-explain/something-is-better-than-nothing idea to estimate t0 in tMeasured = t0 + systemNoise (and gauge the uncertainty of said estimate). In Zsh on Unix, it is as easy as (repeat 10 ru -t prog) |& sort -g | head -n3 | cstats (which in my environment is just tim prog) using ru & cstats. The true t0 is a bit lower, of course. Nothing magic about 3 & 10 other than "patience calibration" & bare minimum sdev estimates. { The Zeitgeist, perhaps following Haskell's popular criterion, seems to neglect filtering out systemNoise. Since systemNoise is usually heavy-tailed, non-stationary & likely very idiosyncratic all around, not filtering is..not great; mean times get dragged higher & variance explodes. } If you like, you can use @Vindaar's Measuremancer to compute ratios w/uncertainties, but they are all about 2.8..3.5 significant figures and widely separated.
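The best-k-of-n estimator described above can also be sketched in code. Here is a hedged Rust version (the `best_k_mean` helper is a made-up name; it times a closure rather than a whole process, so it is only the in-process analogue of the `ru`/`cstats` pipeline):

```rust
use std::time::Instant;

/// Run `f` `reps` times, keep the `keep` fastest samples, and return
/// (mean, stderr of the mean) over those -- a crude one-sided filter
/// for additive system noise in tMeasured = t0 + systemNoise.
fn best_k_mean<F: FnMut()>(mut f: F, reps: usize, keep: usize) -> (f64, f64) {
    let mut samples: Vec<f64> = (0..reps)
        .map(|_| {
            let t = Instant::now();
            f();
            t.elapsed().as_secs_f64()
        })
        .collect();
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let best = &samples[..keep];
    let mean = best.iter().sum::<f64>() / keep as f64;
    // Sample variance over the kept runs, then stderr of their mean.
    let var = best.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (keep - 1) as f64;
    (mean, (var / keep as f64).sqrt())
}

fn main() {
    // Time a trivial workload; absolute numbers are machine-dependent.
    let (mean, se) = best_k_mean(|| { let _ = (0..100_000u64).sum::<u64>(); }, 10, 3);
    println!("t0 ~ {:.9} +- {:.9} s", mean, se);
}
```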
Parenthetically, using rustc without optimizations yields a binary that did 0.42462 +- 0.00032, 1.6X slower than the original Nim with optimization. And rustc has no nice "DEBUG BUILD" messages, either. So, you know, many complaints about Nim can be kind of shallow/hollow - SO much depends on whether users come from C | Python/JS | C++ | Rust | ... Assumptions simplify not just code, but the entire UX & its meta-design.
For reference, Vindaar's solution was:
import cligen / [mfile, mslice]

let mf = mopen("./nim.csv")
var totalSum = 0
for row in rows(mf, initSep(",")):
  for el in row:
    totalSum += parseInt(el)
mf.close()
echo "Expected total = ", 2999999 * 3000000 / 2
echo "Actual total = ", totalSum
No idea about any official positions. Both markAndSweep (gasp! a runtime!) and --mm:refc give the best times on this one, eking out 10..20% better than the memory-mapped versions (maybe recoverable with huge TLBs on the maps). { But this kind of thing does tend to vary from OS to OS and CPU to CPU... and vary by, well, more than 20% :-) I even saw gcc PGO slow down the memfiles variant by 10%, roughly corresponding to "only" different backend C compiler flags, and -fno-pie was another weird 1.25x factor, actually... Did not try clang PGO... It's really a dizzyingly large space of things to try. }
There are at least 3 good easy options here within +- 20%, all faster than the Rust (which, as Vindaar correctly observed, could surely be brought to parity with "enough work", which, sure, has maybe already been done somewhere). How I like to put it is that Nim, like other systems languages (C, C++, Rust, ...), "responds well to optimization effort". How I like to put the dizzying-space bit is that "perf sure can be unstable with so many choices!" :-) People too often leap from an initial coarse assessment to never-again-checked rules of thumb (and not just in tech!).
What was the Nim version? I'm guessing it's devel since otherwise mine would be fairly slower.
Also, this is something people tend to miss, but clang instead of gcc might be faster: it's the same compiler architecture as Rust, which is faster without memfiles.
Thanks for the benchmark; it shows how slow the faststreams-based solution gets with --mm:arc.
I cannot imagine this issue persisting. ;-)