nimforum mirror - feedback on macro

jackmott [2017-11-29T06:44:17+01:00]

I'm working on a macro idea for a SIMD library where you write SIMD code once, and at runtime the correct SIMD functions are used based on feature detection of the CPU.

I have a simple proof of concept here. It works, but I am unsure whether this is the best way to accomplish it:

avx.nim

proc add*(a:int,b:int) : int =
    echo "avx add"
    a+b

sse.nim

proc add*(a:int,b:int) : int =
    echo "sse add"
    a+b

main.nim

import avx
import sse
import macros
import rdstdin

# Recursively walk the AST, replacing every `simd` identifier with simdType.
proc replaceSIMD(node: NimNode, simdType: string) =
    for i in 0 ..< node.len:
        let child = node[i]
        if child.kind == nnkIdent and child.eqIdent("simd"):
            # The old `!` operator and `ident=` setter are deprecated,
            # so swap in a freshly built identifier node instead.
            node[i] = ident(simdType)
        else:
            replaceSIMD(child, simdType)

macro SIMD_AVX(body:untyped): untyped =
    replaceSIMD(body,"avx")
    body

macro SIMD_SSE(body:untyped): untyped =
    replaceSIMD(body,"sse")
    body

# I think macros cannot do anything at runtime so we start with a template
template SIMD(body:untyped) =
    let str = readLineFromStdin "sse or avx? "
    
    # instead of calling different macros, is there a way to pass the string?
    if str == "sse":
        SIMD_SSE(body)
    if str == "avx":
        SIMD_AVX(body)

# inside the SIMD statement, replace each instance of simd. with
# the appropriate SIMD type (avx, sse, etc)
SIMD:
    echo $simd.add(1,2)
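
On the question in the comment above (passing the string instead of calling different macros): one option is a single macro with a static[string] parameter. The following is only a sketch, not the poster's code. To fit in one file it assumes two empty object types named avx and sse as stand-ins for the two modules, and the names simdWith, replaceSimd and backend are made up for the example.

```nim
import macros

# Stand-ins for the avx and sse modules so the sketch fits in one file (assumption).
type
  avx = object
  sse = object

proc backend(T: typedesc[avx]): string = "avx"
proc backend(T: typedesc[sse]): string = "sse"
proc add(T: typedesc[avx], a, b: int): int = a + b   # would use AVX intrinsics
proc add(T: typedesc[sse], a, b: int): int = a + b   # would use SSE intrinsics

# Build a copy of `node` in which every `simd` identifier becomes `target`.
proc replaceSimd(node: NimNode, target: string): NimNode {.compileTime.} =
  if node.kind == nnkIdent and node.eqIdent("simd"):
    return ident(target)
  result = copyNimNode(node)          # copies the node itself, without children
  for child in node:
    result.add replaceSimd(child, target)

# One macro for every backend; the backend name is a compile-time string.
macro simdWith(target: static[string], body: untyped): untyped =
  replaceSimd(body, target)

simdWith("avx"):
  echo simd.backend(), " add: ", simd.add(1, 2)
```

The string still has to be known at compile time, so this removes the duplicated macros but not the runtime branch in the template.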

Varriount [2017-11-29T10:01:24+01:00]

Why can't you do this using the when statement?

proc simdAdd(a, b: int): int =
  when defined(useAvxSimd) and defined(useSseSimd):
    {.fatal: "The symbols useAvxSimd and useSseSimd are mutually exclusive; only one may be defined.".}
  elif defined(useAvxSimd):
    # AVX instructions go here
    discard
  elif defined(useSseSimd):
    # SSE instructions go here
    discard
  else:
    {.fatal: "In order to use SIMD instructions, useAvxSimd or useSseSimd must be defined via -d.".}

jackmott [2017-11-29T14:46:19+01:00]

The idea is to take a statement list and generate SSE and AVX versions of it at compile time, then select the proper version to use at runtime.

boia01 [2017-11-29T17:05:02+01:00]

If you can afford to do the change globally, you could define a global add function (as a function pointer):

var simd_add*: proc (x: int, y: int): int {.nimcall.}

then at startup, you'd seed the function pointer with the appropriate code:

if avx_detected():
  simd_add = avx_add
elif sse_detected():
  simd_add = sse_add
else:
  simd_add = default_add

and everywhere else in your code, you could have:

let c = simd_add(1, 2)

(You could use macros to rewrite regular + additions into simd_add calls for convenience.)

jackmott [2017-11-29T17:13:31+01:00]

That is an interesting idea, but I suppose it would make it impossible to inline each SIMD call, right? There would be a pointer hop on every call, and that would be no good.

boia01 [2017-11-29T17:26:51+01:00]

Yeah, you're right, that's probably less than ideal at a fine-grained level (low-level operations).

However, if you apply the pattern to coarser functions, using your existing macro to specialize the code, you get the best of both worlds. For example, instead of dynamically dispatching on add, you can dispatch on something like matrix_multiply.
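
A rough sketch of that coarse-grained pattern, under some assumptions: avxAdd and sseAdd are hypothetical stand-ins for intrinsic-based operations, a small template (defineSum) plays the role of the specializing macro, and hasAvx would come from real CPU detection. The indirect call happens once per kernel call, while the low-level add inside the loop stays a direct, inlinable call.

```nim
# Hypothetical stand-ins for low-level operations built on intrinsics.
proc avxAdd(a, b: int): int {.inline.} = a + b
proc sseAdd(a, b: int): int {.inline.} = a + b

# Stamp out one specialized kernel per instruction set. addOp is resolved
# at compile time, so the compiler is free to inline it inside the loop.
template defineSum(name, addOp: untyped) =
  proc name(xs: seq[int]): int {.nimcall.} =
    for x in xs:
      result = addOp(result, x)

defineSum(sumAvx, avxAdd)
defineSum(sumSse, sseAdd)

# The only function pointer: one hop per kernel call, not per addition.
var sumAll: proc (xs: seq[int]): int {.nimcall.}

proc selectKernels(hasAvx: bool) =
  # hasAvx is an assumption; a real program would query the CPU here.
  if hasAvx:
    sumAll = sumAvx
  else:
    sumAll = sumSse

selectKernels(hasAvx = true)
echo sumAll(@[1, 2, 3, 4])
```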

andrea [2017-11-29T17:41:04+01:00]

I am not sure you can decide this at runtime (but I know nothing). What should be the generated machine code for, say, AVX instructions if the machine on which you are running does not have AVX instructions available?

jackmott [2017-11-29T17:58:28+01:00]

In cases where there isn't an equivalent function, you would have a fallback that does it in a non-vectorized fashion. So if you used the gather instruction, the SSE fallback would just loop over the elements of the SIMD vector and do them one by one. Or if blendv is not available, it can be converted to: Or(AndNot(mask, a), And(mask, b))

No doubt you would not be able to write code once that optimally uses all the available power of SSE2/3/4, AVX, AVX2, and AVX512 in one go, but I think many cases should be quite good. C# does something like this with JIT intrinsics for a subset of SIMD operations. It is very nice to use, but unfortunately the subset of operations is a bit too small for many use cases.
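
The blendv fallback can be illustrated without any intrinsics. This is a minimal sketch on plain 32-bit lanes, not real SIMD code: Lanes and blendFallback are names invented for the example, and it assumes each mask lane is either all ones or all zeros (as a comparison would produce), since a real blendv only looks at the top bit of each element.

```nim
type Lanes = array[4, uint32]

# Per lane: take the bits of b where the mask is set, otherwise the bits of a.
# This is Or(AndNot(mask, a), And(mask, b)) written with scalar operators.
proc blendFallback(a, b, mask: Lanes): Lanes =
  for i in 0 ..< 4:
    result[i] = (a[i] and not mask[i]) or (b[i] and mask[i])

let a: Lanes = [1'u32, 2, 3, 4]
let b: Lanes = [10'u32, 20, 30, 40]
let mask: Lanes = [high(uint32), 0, high(uint32), 0]
echo blendFallback(a, b, mask)
```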

andrea [2017-11-29T18:32:37+01:00]

I understand what you are saying, I just do not understand how to do this at runtime. Maybe I am just missing something, but could you explain it to me in more detail?

Assume your plan works out. You have a binary executable, which contains a sequence of machine instructions. Do the AVX machine instructions ever appear in this executable? What if you run this on a processor that does not support AVX operations? You seem to say that the code path will never touch this part of the code, but is the generated executable even valid?

You mention that C# does this with a JIT. The JIT can simply avoid generating AVX instructions if they are not supported, but with AOT you have to generate all code in advance.

jackmott [2017-11-29T18:42:58+01:00]

This would have to generate all versions of a given statement list at compile time, then execute the correct statement list at runtime. So for every block that you use the macro on, there would be N variations of that block in the binary, where N is the number of SIMD instruction sets you want to support.
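
As a sketch of what such a macro could look like (one possible shape, not the poster's final design): it rewrites the block once per backend and emits a runtime branch over the copies. The names simdDispatch, SimdLevel and detectedLevel are hypothetical, the avx and sse object types are stand-ins for the two modules, and detectedLevel would be filled in from real CPU detection at startup.

```nim
import macros

type
  SimdLevel = enum slSse, slAvx
  avx = object   # stand-ins for the avx and sse modules (assumption)
  sse = object

proc backend(T: typedesc[avx]): string = "avx"
proc backend(T: typedesc[sse]): string = "sse"

# Hypothetical: a real library would set this once from CPUID at startup.
var detectedLevel = slSse

# Build a copy of `node` in which every `simd` identifier becomes `target`.
proc replaceSimd(node: NimNode, target: string): NimNode {.compileTime.} =
  if node.kind == nnkIdent and node.eqIdent("simd"):
    return ident(target)
  result = copyNimNode(node)
  for child in node:
    result.add replaceSimd(child, target)

# Emit both specialized copies of the block plus the runtime selection.
macro simdDispatch(body: untyped): untyped =
  let avxBody = replaceSimd(body, "avx")
  let sseBody = replaceSimd(body, "sse")
  result = quote do:
    if detectedLevel == slAvx:
      `avxBody`
    else:
      `sseBody`

detectedLevel = slAvx
simdDispatch:
  echo "running the ", simd.backend(), " copy"
```

Both copies end up in the binary, which matches the N-variations cost described above. The branch is evaluated once per block rather than once per SIMD call.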

mratsim [2017-11-29T18:55:56+01:00]

Note: FFmpeg is probably the most prominent library that uses runtime CPU feature detection.

It's written in C so you can port that to Nim (with emit at worst). The only issue is that the code base is quite the behemoth.

Mirror of forum.nim-lang.org

3377 :: feedback on macro