I ran a few benchmarks comparing the C++, Julia, and Nim compilers, and I found that Nim's implementation of the Complex64 type shows sub-optimal performance.
I implemented a simple algorithm to compute a Julia set, and I found that the simplest implementation is quite slow:
import complex

func julia(z: Complex64, c: Complex64, maxiter: int = 256): int =
  var iteridx = 0
  var cur_z = z
  while (abs2(cur_z) < 4) and (iteridx < maxiter):
    cur_z = cur_z * cur_z + c
    iteridx += 1
  result =
    if iteridx == maxiter:
      -1
    else:
      iteridx
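The timings below come from evaluating julia for every pixel of an image and summing the results. A minimal driver of roughly this shape does the job (a sketch only: the constant c, the coordinate mapping, and the helper name sumPixels are placeholders; the actual benchmark code is in the Gist linked below):

import complex

# Sketch of a benchmark driver; assumes the julia func defined above is in scope.
proc sumPixels(width, height: int): int =
  let c = complex64(-0.75, 0.21)   # placeholder constant, not the one from the Gist
  for y in 0 ..< height:
    for x in 0 ..< width:
      # Map the pixel to a (made-up) region of the complex plane.
      let z = complex64(3.0 * (x / width) - 1.5, 3.0 * (y / height) - 1.5)
      let n = julia(z, c)
      if n >= 0:
        result += n

echo sumPixels(800, 800)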
Generating an 800x800 image takes 1.5 s on my laptop, compared with ~0.07 s for almost identical code in Julia (note the pun!). Inspecting the C code produced by Nim shows that the compiler does not try to inline calls to functions like *. So I changed the computation within the while loop in this way:
# cur_z = cur_z * cur_z + c
cur_z *= cur_z
cur_z += c
The elapsed time went down to ~1 s: better, but still not good! My understanding is that this modification helps the compiler avoid creating some temporary Complex64 variables.
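For reference, multiplication of Complex64 values in the complex module is an ordinary func that constructs and returns a new value, roughly along these lines (a paraphrased sketch with a stand-in name, not the exact stdlib source); without inlining, every cur_z * cur_z therefore pays for a function call and a temporary:

import complex

# Paraphrased sketch of how complex multiplication is typically written
# (stand-in name mulSketch; the real operator in the complex module is *).
func mulSketch(x, y: Complex64): Complex64 =
  result.re = x.re * y.re - x.im * y.im
  result.im = x.re * y.im + x.im * y.re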
Then, I decided to manually unroll the computation of the real and imaginary parts of cur_z:
# cur_z = cur_z * cur_z + c
let tmp = cur_z.re * cur_z.re - cur_z.im * cur_z.im
cur_z.im = 2 * cur_z.re * cur_z.im + c.im
cur_z.re = tmp + c.re
The code is much uglier, but now the elapsed time is ~0.2 s!
At this point, I have a few questions:
The full code of the benchmark, as well as the command-line parameters I used to compile it, can be found in this Gist: https://gist.github.com/ziotom78/346fff619dc093d473abfe4cb0b8060c
Yes, your guess that missing inlining is the problem seems to be correct.
We generally use link-time optimization, which is really good with gcc 10. Try:
$ nim c -d:release --passC:-flto t.nim
$ ./t
julia1: 114 ms (sum of pixels: 27677748)
julia2: 115 ms (sum of pixels: 27677748)
julia3: 111 ms (sum of pixels: 27677748)
There are more options to tweak, of course, like ARC or -march=native and such.
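For example, something along these lines (flag spellings vary with the Nim version; newer releases spell --gc:arc as --mm:arc):
$ nim c -d:release --gc:arc --passC:"-flto -march=native" t.nim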
With the gcc backend, doing "profile guided optimization" (PGO) can often help, especially since it gives the compiler real measurements to drive its inlining choices. E.g., using just @Stefan_Salewski's command line I get:
julia1: 84 ms (sum of pixels: 27677748)
julia2: 83 ms (sum of pixels: 27677748)
julia3: 82 ms (sum of pixels: 27677748)
while doing this
nim c -d:danger --panics:on -c t.nim    # compile to C only; do not run the C compiler
gcc -O3 -flto -fprofile-generate -I/usr/lib/nim/lib ~/.cache/nim/r/t/*.c -o pg
./pg                                    # training run: writes the profile data
gcc -O3 -flto -fprofile-use -I/usr/lib/nim/lib ~/.cache/nim/r/t/*.c -o t-final
and then running ./t-final, I get:
julia1: 82 ms (sum of pixels: 27677748)
julia2: 82 ms (sum of pixels: 27677748)
julia3: 82 ms (sum of pixels: 27677748)
So, the PGO "flattened" the performance a bit more. In this example the PGO speed boost was close to zero / within measurement error, but I have seen speed-ups as high as 2x for more complicated programs. It's worth having some little "nim-pgo" wrapper script to automate the above if you are writing programs that have an easy "benchmark run".
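A minimal version of such a wrapper, written in Nim itself, could look like this (a sketch only: it hard-codes the include path and nimcache layout from the commands above, which differ between installations, and the name nim_pgo is made up):

import os, strformat

# nim_pgo.nim: tiny sketch of a PGO build helper.
# Usage: ./nim_pgo t   (builds t.nim with gcc profile-guided optimization)
proc run(cmd: string) =
  echo cmd
  doAssert execShellCmd(cmd) == 0

let name = paramStr(1)
run(&"nim c -d:danger --panics:on -c {name}.nim")
run(&"gcc -O3 -flto -fprofile-generate -I/usr/lib/nim/lib ~/.cache/nim/r/{name}/*.c -o {name}-pg")
run(&"./{name}-pg")   # training run: writes the profile data
run(&"gcc -O3 -flto -fprofile-use -I/usr/lib/nim/lib ~/.cache/nim/r/{name}/*.c -o {name}")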