I have a hot loop that would benefit from SIMD optimization. In C, I would preface the for loops with this magic:
#pragma GCC ivdep
#pragma clang loop vectorize(enable)
for (int i=0; blady blah)...
In Julia, I can drop an @simd macro in front of a loop for the same effect.
I found it surprisingly hard to get this out of Nim.
I had to resort to this hackery, of which I'm half-proud, half-ashamed:
# abuse the implementation of the existing || operator to inject arbitrary loop pragmas
iterator `|||`[S, T](a: S, b: T): T {.inline, sideEffect.} =
  for i in `||`(a, b, "simd\n#pragma GCC ivdep\n#pragma clang loop vectorize(enable)"):
    yield i
which I use like this:
for i in range.a ||| range.b:
  #...
and which results in C like this:
#pragma omp simd
#pragma GCC ivdep
#pragma clang loop vectorize(enable)
for (i_2 = range.a; i_2 <= range.b; ++i_2) {
  //...
}
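For completeness, here's a minimal self-contained sketch of how the ||| iterator gets used (xs and total are just illustrative, and you'd compile with something like --passC:"-fopenmp-simd" or -fopenmp so the compiler actually honors the simd pragma):

var xs = newSeq[float32](4096)
for i in 0 ..< xs.len:
  xs[i] = float32(i)

var total = 0'f32
for i in 0 ||| xs.high:   # bounds are inclusive, same as ||
  total += xs[i]
echo total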
Mission accomplished! But is there a better way?
Using {.emit.} seems dodgy... Nim's for loops compile to gotos rather than C for loops, so you can't just emit the pragma; you'd have to emit the whole for loop.
You could emit a C for loop like so:
template cfor*(x: untyped; a, b: int; body: untyped) =
  var `x` {.inject, noinit.}: int
  {.emit: ["for (", x, " = ", a, "; ", x, " <= ", b, "; ", x, "++) {"].}
  body
  {.emit: "}".}
Or maybe make a macro that takes a Nim for loop and transforms it into the above:
simd:
  for i in a ..< b:
    # blah
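A rough sketch of that macro (untested; it assumes the cfor template above and only handles a bare `a ..< b` range):

import macros

macro simd(body: untyped): untyped =
  ## Rewrite `for i in a ..< b: ...` into the cfor template above.
  body.expectKind nnkStmtList
  let loop = body[0]
  loop.expectKind nnkForStmt
  let idx   = loop[0]   # the loop variable
  let rng   = loop[1]   # the `a ..< b` infix expression
  let inner = loop[2]   # the loop body
  rng.expectKind nnkInfix
  let a = rng[1]
  let b = rng[2]
  result = quote do:
    cfor(`idx`, `a`, `b` - 1):   # cfor bounds are inclusive, ..< is exclusive
      `inner`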
But a loop implemented using {.emit.} won't understand Nim control flow, so you can't use break, continue, try/except/finally and such. Maybe you wouldn't want those anyway in a high-performance SIMD loop, but it's still worth keeping in mind that this subtly breaks the language. (I haven't checked whether the || iterator supports these either; that would be good to know.)
This is not a hack, this is intended: I purposely changed the || implementation so that it can be used for SIMD: https://github.com/nim-lang/Nim/issues/9490
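So when plain #pragma omp simd is enough, you can pass the annotation to || directly (a sketch; xs and total are illustrative, and you'd compile with e.g. --passC:"-fopenmp-simd" so GCC honors the pragma):

var xs = newSeq[float32](4096)
var total = 0'f32
for i in `||`(0, xs.high, "simd"):   # emits #pragma omp simd
  total += xs[i]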
I wonder if this is also defeating compiler optimizations that only work on standard ‘for’ loops?
No: many benchmarks have shown that when C is translated one-to-one to Nim, speed is unchanged, and sometimes new optimization opportunities even arise (better loop hoisting, constant folding, etc.). In any case, in compilers' intermediate representations all loops are while loops plus jmp/gotos.
For example, here is a non-trivial set of 5 nested for loops from a (code-golfed) raytracer, in 5 lines of C++: https://github.com/mratsim/weave/blob/7682784/demos/raytracing/smallpt.cpp#L81-L86
Those are faithfully translated into 5 idiomatic Nim for loops here: https://github.com/mratsim/weave/blob/7682784/demos/raytracing/smallpt.nim#L189-L197, and the Nim version is faster than the C++ one (by 2.7%, see the README).
The translation is 1-to-1 from C++ to Nim, but unobfuscated, with comments where I needed compatibility shims to reproduce the exact same result (such as where the result depended on right-to-left evaluation of function parameters).
Thanks all. I'm really loving Nim -- both the language and community.
I also get a notable speed INCREASE after porting my C code "line-by-line" to Nim. This blows my mind.
If I had to leave a comment for improvement on this topic, I'd suggest there could be a more elegant and generalized way to inject loop optimizations. For example, I want to experiment with OpenMP's loop-unrolling pragma, but then I'd be injecting \b backspace characters into the custom string I pass to || in order to repurpose it for something other than SIMD.
In my experience, OpenMP loop unrolling doesn't work and unrolling has to be done manually.
Nim also had an {.unroll.} pragma as a placeholder for a future optimization, but it was removed.
You can instead create an unroll template to ensure unrolling instead of leaving it to the compiler:
The basic algorithm is here: https://github.com/mratsim/laser/blob/e23b5d6/benchmarks/fp_reduction_latency/reduction_bench.nim#L118-L130
proc mainBench_2_accum_simple(a: Tensor[float32], nb_samples: int) =
  var accum = 0'f32
  bench("Reduction - 2 accumulators - simple iter", accum):
    let size = a.size
    let unroll_stop = size.round_down_multiple(2)
    var accum1 = 0'f32
    for i in countup(0, unroll_stop - 1, 2):
      # main loop, unrolled by 2 with independent accumulators
      accum  += a.storage.raw_data[i]
      accum1 += a.storage.raw_data[i+1]
    for i in unroll_stop ..< size:
      # remainder loop
      accum += a.storage.raw_data[i]
    accum += accum1
For reductions you might need a macro to generate the extra accumulators, but for maps there is no need.
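For the map case, here's a minimal sketch of such an unroll template (the name and the factor of 4 are just illustrative, not from laser):

template unrollBy4(x: untyped; lo, hi: int; body: untyped) =
  ## Process the half-open range [lo, hi) in chunks of 4,
  ## then handle the remainder one element at a time.
  let start = lo   # evaluate the bounds once
  let stop  = hi
  var `x` {.inject.} = start
  let unrollStop = start + (stop - start) div 4 * 4
  while x < unrollStop:
    body; inc x
    body; inc x
    body; inc x
    body; inc x
  while x < stop:
    body; inc x

# usage: double every element in place
var xs = @[1.0, 2.0, 3.0, 4.0, 5.0]
unrollBy4(i, 0, xs.len):
  xs[i] *= 2
echo xs   # @[2.0, 4.0, 6.0, 8.0, 10.0]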
Thanks guys.
When I started this project, the calculation in question was on track to take 60 days with recursive Python code. I have it down to 2.5 minutes(!) in Nim, and I don't miss Python at all.
It seems like there's always "one more optimization" to try. Probably I should step away from the keyboard, but that asm statement is looking pretty shiny. :)