I ran a few benchmarks comparing the C++, Julia, and Nim compilers, and I found that Nim's implementation of the Complex64 type shows sub-optimal performance.
I implemented a simple algorithm to compute a Julia set, and I found that the simplest implementation is quite slow:
import complex

func julia(z: Complex64, c: Complex64, maxiter: int = 256): int =
  var iteridx = 0
  var cur_z = z
  while (abs2(cur_z) < 4) and (iteridx < maxiter):
    cur_z = cur_z * cur_z + c
    iteridx += 1
  result =
    if iteridx == maxiter:
      -1
    else:
      iteridx
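The timings below come from evaluating julia for every pixel of an image and summing the results. A minimal driver of roughly this shape does the job (a sketch only: the constant c, the coordinate mapping, and the helper name sumPixels are placeholders; the actual benchmark code is in the Gist linked below):

import complex

# Sketch of a benchmark driver; assumes the julia func defined above is in scope.
proc sumPixels(width, height: int): int =
  let c = complex64(-0.75, 0.21)   # placeholder constant, not the one from the Gist
  for y in 0 ..< height:
    for x in 0 ..< width:
      # Map the pixel to a (made-up) region of the complex plane.
      let z = complex64(3.0 * (x / width) - 1.5, 3.0 * (y / height) - 1.5)
      let n = julia(z, c)
      if n >= 0:
        result += n

echo sumPixels(800, 800)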
Generating an 800x800 image takes 1.5 s on my laptop, compared with ~0.07 s for almost identical code in Julia (note the pun!). Inspecting the C code produced by Nim shows that the compiler does not try to inline calls to functions like *. So I changed the computation within the while loop in this way:
# cur_z = cur_z * cur_z + c
cur_z *= cur_z
cur_z += c
The elapsed time went down to ~1 s: better, but still not good! My understanding is that this modification helps the compiler avoid creating some temporary Complex64 variables.
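For reference, multiplication of Complex64 values in the complex module is an ordinary func that constructs and returns a new value, roughly along these lines (a paraphrased sketch with a stand-in name, not the exact stdlib source); without inlining, every cur_z * cur_z therefore pays for a function call and a temporary:

import complex

# Paraphrased sketch of how complex multiplication is typically written
# (stand-in name mulSketch; the real operator in the complex module is *).
func mulSketch(x, y: Complex64): Complex64 =
  result.re = x.re * y.re - x.im * y.im
  result.im = x.re * y.im + x.im * y.re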
Then, I decided to manually unroll the computation of the real and imaginary parts of cur_z:
# cur_z = cur_z * cur_z + c
let tmp = cur_z.re * cur_z.re - cur_z.im * cur_z.im
cur_z.im = 2 * cur_z.re * cur_z.im + c.im
cur_z.re = tmp + c.re
The code is much uglier, but now the elapsed time is ~0.2 s!
At this point, I have a few questions:
The full code of the benchmark, as well as the command-line parameters I used to compile it, can be found in this Gist: https://gist.github.com/ziotom78/346fff619dc093d473abfe4cb0b8060c
Yes, your guess that missing inlining is the problem seems to be correct.
We generally use link-time optimization, which is really good with gcc 10. Try:
$ nim c -d:release --passC:-flto t.nim
$ ./t
julia1: 114 ms (sum of pixels: 27677748)
julia2: 115 ms (sum of pixels: 27677748)
julia3: 111 ms (sum of pixels: 27677748)
There are more options to tweak, of course, like ARC or -march=native and such.
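For example, something along these lines (flag spellings vary with the Nim version; newer releases spell --gc:arc as --mm:arc):
$ nim c -d:release --gc:arc --passC:"-flto -march=native" t.nim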
With the gcc backend, doing "profile guided optimization" (PGO) can often help, especially since it gives the compiler real measurements to drive its inlining choices. E.g., using just @Stefan_Salewski's command line I get:
julia1: 84 ms (sum of pixels: 27677748)
julia2: 83 ms (sum of pixels: 27677748)
julia3: 82 ms (sum of pixels: 27677748)
while doing this
nim c -d:danger --panics:on -c t.nim    # compile to C only; do not run the C compiler
gcc -O3 -flto -fprofile-generate -I/usr/lib/nim/lib ~/.cache/nim/r/t/*.c -o pg
./pg                                    # training run: writes the profile data
gcc -O3 -flto -fprofile-use -I/usr/lib/nim/lib ~/.cache/nim/r/t/*.c -o t-final
and then running ./t-final, I get:
julia1: 82 ms (sum of pixels: 27677748)
julia2: 82 ms (sum of pixels: 27677748)
julia3: 82 ms (sum of pixels: 27677748)
So, the PGO "flattened" the performance a bit more. In this example the PGO speed boost was close to zero / within measurement error, but I have seen speed-ups as high as 2x for more complicated programs. It's worth having some little "nim-pgo" wrapper script to automate the above if you are writing programs that have an easy "benchmark run".
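A minimal version of such a wrapper, written in Nim itself, could look like this (a sketch only: it hard-codes the include path and nimcache layout from the commands above, which differ between installations, and the name nim_pgo is made up):

import os, strformat

# nim_pgo.nim: tiny sketch of a PGO build helper.
# Usage: ./nim_pgo t   (builds t.nim with gcc profile-guided optimization)
proc run(cmd: string) =
  echo cmd
  doAssert execShellCmd(cmd) == 0

let name = paramStr(1)
run(&"nim c -d:danger --panics:on -c {name}.nim")
run(&"gcc -O3 -flto -fprofile-generate -I/usr/lib/nim/lib ~/.cache/nim/r/{name}/*.c -o {name}-pg")
run(&"./{name}-pg")   # training run: writes the profile data
run(&"gcc -O3 -flto -fprofile-use -I/usr/lib/nim/lib ~/.cache/nim/r/{name}/*.c -o {name}")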