Hi everyone,
I recently discovered Nim and I've been pretty excited by what I've seen of the language so far. I've seen that Nim can get excellent performance on some benchmarks, but I wanted to test its speed myself, so I wrote a simple benchmark program in a few languages. The program counts the pairs of intersecting polygons in a given list. I deliberately chose a naive approach, avoided any low-level optimizations, and kept the structure nearly identical across languages. I was disappointed to see the Nim version dramatically outperformed by the Rust version: on my machine the Rust implementation generally runs the calculation in about 75 ms, while the Nim version takes about 300 ms. The Nim version is actually significantly slower than a version I wrote in Kotlin, which I found rather surprising. Is there anything obvious that I'm doing wrong? I am building the executable with nim c -d:release geometry. My machine is running Windows 10 on an Intel i5 processor.
Source for my program: https://github.com/alexpardes/benchmarks/blob/master/nim/geometry.nim
Thanks
proc `+`(p: Point, v: Vector): Point =
  newPoint(p.x + v.x, p.y + v.y)
Maybe these procs are not inlined automatically.
You may try applying the {.inline.} pragma to the procs, or compiling with --passC:-flto for link-time optimization.
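For illustration, here is a sketch of that suggestion applied to the operator quoted above. The Point/Vector/newPoint names mirror the benchmark code, but the type definitions here are assumptions made to keep the snippet self-contained:

```nim
type
  Point = object
    x, y: float64
  Vector = object
    x, y: float64

proc newPoint(x, y: float64): Point = Point(x: x, y: y)

# Same operator, with an explicit hint so the C compiler inlines it
# even when it would not do so on its own.
proc `+`(p: Point, v: Vector): Point {.inline.} =
  newPoint(p.x + v.x, p.y + v.y)
```

Alternatively, leave the procs as-is and build with something like nim c -d:release --passC:-flto --passL:-flto geometry so the linker can inline across translation units.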
Try -d:danger: -d:release still performs runtime math checks, and this code uses a lot of math operations.
Can you tell us in more detail what checks you assume to be active with -d:release but disabled with -d:danger? (I would assume that there are not many checks involved in float math at all -- plain +, -, * operations should just execute and result in Inf on overflow.)
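A quick way to see this point is that plain float arithmetic follows IEEE 754 with no runtime guards, even in a default -d:release build; the snippet below is just a sanity check, not taken from the benchmark:

```nim
# Float operations carry no runtime checks: overflow and division
# by zero silently produce Inf/NaN rather than raising anything.
let big = 1.0e308
echo big * 10.0   # inf
echo 1.0 / 0.0    # inf
echo 0.0 / 0.0    # nan
```

The checks that -d:danger removes are mostly integer overflow checks and array bounds checks, which is consistent with the flag-by-flag table below.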
========  =======  ==================================================
Run ms    Speed    Flags
========  =======  ==================================================
157.5     2.02x    -d:release
113.7     1.46x    -d:danger
158.0     2.03x    -d:release --floatChecks:off
123.7     1.59x    -d:release --overflowChecks:off
145.1     1.86x    -d:release --boundChecks:off
112.8     1.45x    -d:release --overflowChecks:off --boundChecks:off
77.9      1x       -d:danger --cc:clang
197.8     2.54x    -d:release --gc:arc
177.9     2.28x    -d:danger --gc:arc
86.8      1.11x    -d:danger --cc:clang --gc:arc
========  =======  ==================================================
Conclusion: --gc:arc does this no favors. Integer checks for all the loops are what slows -d:release down. LLVM beats the pants off gcc for this benchmark, which might explain Rust's performance.
I got a 1.76x speed up with PGO on gcc-10.2 Linux 4.7GHz Skylake (default GC):
120 ms -d:release
86 ms -d:danger
49 ms PGO
The full range of perf (120/49=2.45x) is comparable to @jrfondren's 198/78=2.54x. So, I suspect clang PGO would be similar (I do not have a script set up for that, but see here).
@apardes reported a full 4.0x ratio. So, it's possible there is still 1.6x to be explained and/or some nim-level optimization that could be done (also possible diff compilation covers the gap for him). vdivsd showed up at the top of a quick profile for me. Multiplying by the reciprocal may be faster than dividing in proc /(v: Vector, c: float64). Maybe Rust is smart enough to do that here? Or some other small micro-optimization type work?
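To make the reciprocal idea concrete, here is a sketch. The Vector type and the body of `/` are assumptions based on the proc signature above, not copied from the benchmark:

```nim
type Vector = object
  x, y: float64

# Assumed original form: one division (vdivsd) per component.
proc `/`(v: Vector, c: float64): Vector =
  Vector(x: v.x / c, y: v.y / c)

# Reciprocal form: one division plus two multiplications. Division has
# much higher latency than multiplication, so this is usually faster.
proc divByRecip(v: Vector, c: float64): Vector =
  let inv = 1.0 / c
  Vector(x: v.x * inv, y: v.y * inv)
```

Note that the reciprocal form can change the last bit of rounding, which is exactly why compilers only perform this transformation themselves under flags like -ffast-math.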
Also, getting -ffast-math into gcc options made it 1.5x faster for me (31.5 ms with PGO) basically closing the full 4x gap. I don't know Rust very well, but maybe something in its semantics/the way you express your logic there affords similar optimizations to gcc -ffast-math?
All just theories to try to help. Without the @apardes Rust code as @leorize asked for (and probably the Rust & LLVM versions as well), it is fundamentally just guesswork. But if I got a 4x speed-up, chances are good that @apardes can as well.
The cost here is calculated using the formula "Instruction fetch" + 10 × "Mispredicted branch" + 10 × "Level 1 cache miss" + 100 × "Last level cache miss" (the default formula used by KCachegrind).
Thanks so much for all the replies. Using clang and -d:danger does give me close to a 2x speedup, getting it running in about 160 ms. I haven't tried PGO yet. Inline pragmas and -ffast-math don't seem to make any significant difference for me.
Here's my Rust implementation, as requested: https://github.com/alexpardes/benchmarks/blob/master/rust/src/main.rs
In Rust I am building with cargo build --release.
I also wrote a version in D after posting yesterday: https://github.com/alexpardes/benchmarks/blob/master/d/source/app.d
When built with LDC (which is LLVM-based), this runs at essentially the same speed as the Rust version (usually under 75 ms).
Perhaps also worth noting that I have an Ivy Bridge CPU, which I assume is older than what most of you are testing with.
Thanks again for taking a look at this. Hearing that eliminating the performance gap is probably just a matter of playing with compiler flags is enough to make me feel confident about Nim's speed.
Just to close out my take on this: on the same CPU that got 31.5 ms with PGO & -ffast-math, I got 29.0 ms with that Rust version (rustc 1.48.0). So, only about a 1.08x ratio, and one likely to vary from CPU to CPU, which doesn't seem like much of a real problem. (With a gdc-10.2 PGO build I got 57 ms, while ldc is another LLVM backend, IIRC. More evidence that LLVM's default non-PGO choices are better for this benchmark.)
@apardes - PGO is pretty easy to script, and I predict that once you set up such a script you will use it a lot. I routinely get 1.5-2.0x speed-ups with gcc PGO on Nim-generated C code.
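For anyone who hasn't scripted this before, a minimal gcc PGO cycle for a Nim program looks roughly like the following (the file name geometry matches this thread; the gcc flags are the standard profile-generate/profile-use pair):

```shell
# 1. Build an instrumented binary that records profile data.
nim c -d:danger --passC:-fprofile-generate --passL:-fprofile-generate geometry

# 2. Training run: exercises the hot paths and writes *.gcda files.
./geometry

# 3. Rebuild (-f forces recompilation) using the recorded profile.
nim c -f -d:danger --passC:-fprofile-use --passL:-fprofile-use geometry

# 4. The resulting binary is the optimized one.
./geometry
```

The training run should resemble the real workload, since gcc uses the profile to decide inlining, branch layout, and loop unrolling.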