I have a hot loop that would benefit from SIMD optimization. In C, I would preface the for loops with this magic:
#pragma GCC ivdep
#pragma clang loop vectorize(enable)
for (int i=0; blady blah)...
In Julia, I can drop an @simd macro in front of a loop for the same effect.
I found it surprisingly hard to get this out of Nim.
I had to resort to this hackery, of which I'm half-proud, half-ashamed:
# abuse the implementation of the existing || operator to inject arbitrary loop pragmas
iterator `|||`[S, T](a: S, b: T): T {.inline, sideEffect.} =
  for i in `||`(a, b, "simd\n#pragma GCC ivdep\n#pragma clang loop vectorize(enable)"):
    yield i
which I use like this:
for i in range.a ||| range.b:
  #...
and which results in C like this:
#pragma omp simd
#pragma GCC ivdep
#pragma clang loop vectorize(enable)
for (i_2 = range.a; i_2 <= range.b; ++i_2) {
  //...
}
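For completeness, here's a minimal self-contained sketch of how the ||| iterator gets used (xs and total are just illustrative, and you'd compile with something like --passC:"-fopenmp-simd" or -fopenmp so the compiler actually honors the simd pragma):

var xs = newSeq[float32](4096)
for i in 0 ..< xs.len:
  xs[i] = float32(i)

var total = 0'f32
for i in 0 ||| xs.high:   # bounds are inclusive, same as ||
  total += xs[i]
echo total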
Mission accomplished! But is there a better way?
Using {.emit.} seems dodgy... Nim's for loops compile to gotos rather than C for loops, so you can't just emit the pragma; you'd have to emit the whole for loop.
You could emit a C for loop like so:
template cfor*(x: untyped; a, b: int; body: untyped) =
  var `x` {.inject, noinit.}: int
  {.emit: ["for (", x, " = ", a, "; ", x, " <= ", b, "; ", x, "++) {"].}
  body
  {.emit: "}".}
Or maybe make a macro that takes a Nim for loop and transforms it into the above:
simd:
  for i in a ..< b:
    # blah
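A rough sketch of that macro (untested; it assumes the cfor template above and only handles a bare `a ..< b` range):

import macros

macro simd(body: untyped): untyped =
  ## Rewrite `for i in a ..< b: ...` into the cfor template above.
  body.expectKind nnkStmtList
  let loop = body[0]
  loop.expectKind nnkForStmt
  let idx   = loop[0]   # the loop variable
  let rng   = loop[1]   # the `a ..< b` infix expression
  let inner = loop[2]   # the loop body
  rng.expectKind nnkInfix
  let a = rng[1]
  let b = rng[2]
  result = quote do:
    cfor(`idx`, `a`, `b` - 1):   # cfor bounds are inclusive, ..< is exclusive
      `inner`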
But a loop implemented using {.emit.} won't understand Nim control flow, so you can't use break, continue, try/except/finally and such. Maybe you wouldn't want those anyway in a high-performance SIMD loop, but it's still worth keeping in mind that this subtly breaks the language. (I haven't checked whether the || iterator supports these either; that would be good to know.)
This is not a hack, this is intended: I purposely changed the || implementation so that it can be used for SIMD: https://github.com/nim-lang/Nim/issues/9490
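So when plain #pragma omp simd is enough, you can pass the annotation to || directly (a sketch; xs and total are illustrative, and you'd compile with e.g. --passC:"-fopenmp-simd" so GCC honors the pragma):

var xs = newSeq[float32](4096)
var total = 0'f32
for i in `||`(0, xs.high, "simd"):   # emits #pragma omp simd
  total += xs[i]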
I wonder if this is also defeating compiler optimizations that only work on standard ‘for’ loops?
No: many benchmarks have shown that when C is translated one-to-one to Nim, speed is unchanged, and sometimes new optimization opportunities even arise (better loop hoisting, constant folding, etc.). In any case, in compilers' intermediate representations all loops are while loops plus jmp/gotos.
For example, here is a non-trivial set of 5 nested for loops from a (code-golfed) raytracer, in 5 lines of C++: https://github.com/mratsim/weave/blob/7682784/demos/raytracing/smallpt.cpp#L81-L86
Those are faithfully translated into 5 idiomatic Nim for loops here: https://github.com/mratsim/weave/blob/7682784/demos/raytracing/smallpt.nim#L189-L197, and the Nim version is faster than the C++ one (by 2.7%, see the README).
The translation is 1-to-1 from C++ to Nim, but unobfuscated, with comments where I needed compatibility shims to reproduce the exact same result (such as where the result depended on right-to-left evaluation of function parameters).
Thanks all. I'm really loving Nim -- both the language and community.
I also get a notable speed INCREASE after porting my C code "line-by-line" to Nim. This blows my mind.
If I had to leave a comment for improvement on this topic, I'd suggest there could be a more elegant and generalized way to inject loop optimizations. For example, I want to experiment with OpenMP's loop-unrolling pragma, but then I'd be injecting \b backspace characters into the custom string I pass to || in order to repurpose it for something other than SIMD.
In my experience, OpenMP loop unrolling doesn't work and unrolling has to be done manually.
Nim also had an {.unroll.} pragma as a placeholder for a future optimization, but it was removed.
You can instead create an unroll template to ensure unrolling instead of leaving it to the compiler:
The basic algorithm is here: https://github.com/mratsim/laser/blob/e23b5d6/benchmarks/fp_reduction_latency/reduction_bench.nim#L118-L130
proc mainBench_2_accum_simple(a: Tensor[float32], nb_samples: int) =
  var accum = 0'f32
  bench("Reduction - 2 accumulators - simple iter", accum):
    let size = a.size
    let unroll_stop = size.round_down_multiple(2)
    var accum1 = 0'f32
    for i in countup(0, unroll_stop - 1, 2):
      # main loop, unrolled by 2 with independent accumulators
      accum  += a.storage.raw_data[i]
      accum1 += a.storage.raw_data[i+1]
    for i in unroll_stop ..< size:
      # remainder loop
      accum += a.storage.raw_data[i]
    accum += accum1
For reductions you might need a macro to generate the extra accumulators, but for maps there is no need.
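For the map case, here's a minimal sketch of such an unroll template (the name and the factor of 4 are just illustrative, not from laser):

template unrollBy4(x: untyped; lo, hi: int; body: untyped) =
  ## Process the half-open range [lo, hi) in chunks of 4,
  ## then handle the remainder one element at a time.
  let start = lo   # evaluate the bounds once
  let stop  = hi
  var `x` {.inject.} = start
  let unrollStop = start + (stop - start) div 4 * 4
  while x < unrollStop:
    body; inc x
    body; inc x
    body; inc x
    body; inc x
  while x < stop:
    body; inc x

# usage: double every element in place
var xs = @[1.0, 2.0, 3.0, 4.0, 5.0]
unrollBy4(i, 0, xs.len):
  xs[i] *= 2
echo xs   # @[2.0, 4.0, 6.0, 8.0, 10.0]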
Thanks guys.
When I started this project, the calculation in question was on track to take 60 days with recursive Python code. I have it down to 2.5 minutes(!) in Nim, and I don't miss Python at all.
It seems like there's always "one more optimization" to try. Probably I should step away from the keyboard, but that asm statement is looking pretty shiny. :)