Hi everyone,
I recently discovered Nim and I've been pretty excited by what I've seen of the language so far. I've seen that Nim can get excellent performance on some benchmarks, but I wanted to test its speed myself, so I wrote a simple benchmark program in a few languages. The program counts the pairs of intersecting polygons in a given list. I deliberately chose a naive approach, avoided any low-level optimizations, and kept the structure nearly identical across languages. I was disappointed to see the Nim version dramatically outperformed by the Rust version: on my machine the Rust implementation generally runs the calculation in about 75 ms, while the Nim version takes about 300 ms. The Nim version is actually significantly slower than a version I wrote in Kotlin, which I found rather surprising. Is there anything obvious that I'm doing wrong? I am building the executable with nim c -d:release geometry. My machine is running Windows 10 on an Intel i5 processor.
Source for my program: https://github.com/alexpardes/benchmarks/blob/master/nim/geometry.nim
Thanks
proc `+`(p: Point, v: Vector): Point =
  newPoint(p.x + v.x, p.y + v.y)
Maybe these procs are not inlined automatically.
You may try applying the {.inline.} pragma to the procs, or compiling with --passC:-flto for link-time optimization.
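For illustration, here is a sketch of that suggestion applied to the operator quoted above. The Point/Vector/newPoint names mirror the benchmark code, but the type definitions here are assumptions made to keep the snippet self-contained:

```nim
type
  Point = object
    x, y: float64
  Vector = object
    x, y: float64

proc newPoint(x, y: float64): Point = Point(x: x, y: y)

# Same operator, with an explicit hint so the C compiler inlines it
# even when it would not do so on its own.
proc `+`(p: Point, v: Vector): Point {.inline.} =
  newPoint(p.x + v.x, p.y + v.y)
```

Alternatively, leave the procs as-is and build with something like nim c -d:release --passC:-flto --passL:-flto geometry so the linker can inline across translation units.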
Try -d:danger: -d:release still performs runtime math checks, and this code uses a lot of math operations.
Can you tell us in more detail what checks you assume to be active with -d:release but disabled with -d:danger? (I would assume that there are not many checks involved in float math at all -- plain +, -, * operations should just execute and result in Inf on overflow.)
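A quick way to see this point is that plain float arithmetic follows IEEE 754 with no runtime guards, even in a default -d:release build; the snippet below is just a sanity check, not taken from the benchmark:

```nim
# Float operations carry no runtime checks: overflow and division
# by zero silently produce Inf/NaN rather than raising anything.
let big = 1.0e308
echo big * 10.0   # inf
echo 1.0 / 0.0    # inf
echo 0.0 / 0.0    # nan
```

The checks that -d:danger removes are mostly integer overflow checks and array bounds checks, which is consistent with the flag-by-flag table below.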
========  =======  ==================================================
Run ms    Speed    Flags
========  =======  ==================================================
157.5     2.02x    -d:release
113.7     1.46x    -d:danger
158.0     2.03x    -d:release --floatChecks:off
123.7     1.59x    -d:release --overflowChecks:off
145.1     1.86x    -d:release --boundChecks:off
112.8     1.45x    -d:release --overflowChecks:off --boundChecks:off
77.9      1x       -d:danger --cc:clang
197.8     2.54x    -d:release --gc:arc
177.9     2.28x    -d:danger --gc:arc
86.8      1.11x    -d:danger --cc:clang --gc:arc
========  =======  ==================================================
Conclusion: --gc:arc does this no favors. Integer checks for all the loops are what slows -d:release down. LLVM beats the pants off gcc for this benchmark, which might explain Rust's performance.
I got a 1.76x speed up with PGO on gcc-10.2 Linux 4.7GHz Skylake (default GC):
120 ms -d:release
86 ms -d:danger
49 ms PGO
The full range of perf (120/49=2.45x) is comparable to @jrfondren's 198/78=2.54x. So, I suspect clang PGO would be similar (I do not have a script set up for that, but see here).
@apardes reported a full 4.0x ratio. So, it's possible there is still 1.6x to be explained and/or some nim-level optimization that could be done (also possible diff compilation covers the gap for him). vdivsd showed up at the top of a quick profile for me. Multiplying by the reciprocal may be faster than dividing in proc /(v: Vector, c: float64). Maybe Rust is smart enough to do that here? Or some other small micro-optimization type work?
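To make the reciprocal idea concrete, here is a sketch. The Vector type and the body of `/` are assumptions based on the proc signature above, not copied from the benchmark:

```nim
type Vector = object
  x, y: float64

# Assumed original form: one division (vdivsd) per component.
proc `/`(v: Vector, c: float64): Vector =
  Vector(x: v.x / c, y: v.y / c)

# Reciprocal form: one division plus two multiplications. Division has
# much higher latency than multiplication, so this is usually faster.
proc divByRecip(v: Vector, c: float64): Vector =
  let inv = 1.0 / c
  Vector(x: v.x * inv, y: v.y * inv)
```

Note that the reciprocal form can change the last bit of rounding, which is exactly why compilers only perform this transformation themselves under flags like -ffast-math.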
Also, getting -ffast-math into gcc options made it 1.5x faster for me (31.5 ms with PGO) basically closing the full 4x gap. I don't know Rust very well, but maybe something in its semantics/the way you express your logic there affords similar optimizations to gcc -ffast-math?
All just theories to try to help. Without the @apardes Rust code as @leorize asked for (and probably the Rust & LLVM versions as well), it is fundamentally just guesswork. But if I got a 4x speed-up, chances are good that @apardes can as well.
The cost here is calculated using the formula "Instruction fetch" + 10 × "Mispredicted branch" + 10 × "Level 1 cache miss" + 100 × "Last level cache miss" (the default formula used by KCachegrind).
Thanks so much for all the replies. Using clang and -d:danger does give me close to a 2x speedup, getting it running in about 160 ms. I haven't tried PGO yet. Inline pragmas and -ffast-math don't seem to make any significant difference for me.
Here's my Rust implementation, as requested: https://github.com/alexpardes/benchmarks/blob/master/rust/src/main.rs
In Rust I am building with cargo build --release.
I also wrote a version in D after posting yesterday: https://github.com/alexpardes/benchmarks/blob/master/d/source/app.d
When built with LDC (which is LLVM-based), this runs at essentially the same speed as the Rust version (usually under 75 ms).
Perhaps also worth noting that I have an Ivy Bridge CPU, which I assume is older than what most of you are testing with.
Thanks again for taking a look at this. Hearing that eliminating the performance gap is probably just a matter of playing with compiler flags is enough to make me feel confident about Nim's speed.
Just to close out my take on this: on the same CPU that got 31.5 ms with PGO & -ffast-math, I got 29.0 ms with that Rust version (rustc 1.48.0). So, only about a 1.08x ratio, and one likely to vary from CPU to CPU, which doesn't seem like much of a real problem. (With a gdc-10.2 PGO build I got 57 ms, while ldc is another LLVM backend, IIRC. More evidence that LLVM's default non-PGO choices are better for this benchmark.)
@apardes - PGO is pretty easy to script, and I predict that once you set up such a script you will use it a lot. I routinely get 1.5-2.0x speed-ups with gcc PGO on Nim-generated C code.
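For anyone who hasn't scripted this before, a minimal gcc PGO cycle for a Nim program looks roughly like the following (the file name geometry matches this thread; the gcc flags are the standard profile-generate/profile-use pair):

```shell
# 1. Build an instrumented binary that records profile data.
nim c -d:danger --passC:-fprofile-generate --passL:-fprofile-generate geometry

# 2. Training run: exercises the hot paths and writes *.gcda files.
./geometry

# 3. Rebuild (-f forces recompilation) using the recorded profile.
nim c -f -d:danger --passC:-fprofile-use --passL:-fprofile-use geometry

# 4. The resulting binary is the optimized one.
./geometry
```

The training run should resemble the real workload, since gcc uses the profile to decide inlining, branch layout, and loop unrolling.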