I'm working on a macro idea for a SIMD library where you write SIMD code once and, at runtime, the correct SIMD functions are used based on CPU feature detection.
I have a simple proof of concept here. It works, but I am unsure whether this is the best way to accomplish it:
avx.nim
proc add*(a: int, b: int): int =
  echo "avx add"
  a + b
sse.nim
proc add*(a: int, b: int): int =
  echo "sse add"
  a + b
main.nim
import avx
import sse
import macros
import rdstdin
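# Recursively process the AST, replacing any `simd` identifier with simdType.
# Children are replaced by index, using eqIdent/ident from the macros module.
proc replaceSIMD(node: NimNode, simdType: string) =
  for i in 0 ..< node.len:
    echo $node[i].kind          # compile-time debug output
    if node[i].kind == nnkIdent:
      echo node[i].strVal       # compile-time debug output
      if node[i].eqIdent("simd"):
        node[i] = ident(simdType)
    replaceSIMD(node[i], simdType)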
macro SIMD_AVX(body: untyped): untyped =
  replaceSIMD(body, "avx")
  body

macro SIMD_SSE(body: untyped): untyped =
  replaceSIMD(body, "sse")
  body
# I think macros cannot do anything at runtime so we start with a template
template SIMD(body: untyped) =
  let str = readLineFromStdin "sse or avx? "
  # instead of calling different macros, is there a way to pass the string?
  if str == "sse":
    SIMD_SSE(body)
  if str == "avx":
    SIMD_AVX(body)
# inside the SIMD statement, replace each instance of simd. with
# the appropriate SIMD type (avx, sse, etc)
SIMD:
  echo $simd.add(1,2)
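On the question in the template comment (passing the string instead of calling different macros): one possible approach is a single macro that receives the backend names as compile-time strings and generates the runtime branches itself. The following is only a sketch. The names `replaceIdent`, `simdDispatch` and `run` are made up for illustration, and the `avx` and `sse` types are stand-ins for the two modules so that everything fits in one file.

```nim
import macros

# Stand-ins for the avx and sse modules (assumption: one file for the sketch).
type
  avx = object
  sse = object

proc add(backend: typedesc[avx], a, b: int): string = "avx add: " & $(a + b)
proc add(backend: typedesc[sse], a, b: int): string = "sse add: " & $(a + b)

# Return a copy of the AST with every identifier `name` replaced.
proc replaceIdent(node: NimNode, name, replacement: string): NimNode =
  if node.kind == nnkIdent and node.eqIdent(name):
    return ident(replacement)
  result = copyNimNode(node)
  for child in node:
    result.add replaceIdent(child, name, replacement)

# Build `if choice == "sse": <body with sse> elif choice == "avx": ...`.
# The backend names are known at compile time; only `choice` is a runtime value.
macro simdDispatch(choice: untyped, body: untyped): untyped =
  result = newNimNode(nnkIfStmt)
  for backend in ["sse", "avx"]:
    result.add newTree(nnkElifBranch,
      infix(choice, "==", newLit(backend)),
      replaceIdent(body, "simd", backend))

proc run(choice: string): string =
  simdDispatch(choice):
    result = simd.add(1, 2)

echo run("sse")
echo run("avx")
```

The body is still compiled once per backend, as in the template version. The macro only removes the need for one wrapper macro per instruction set.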
Why can't you do this using the when statement?
proc simdAdd(a, b: int): int =
  when defined(useAvxSimd) and defined(useSseSimd):
    {.fatal: "The symbols useAvxSimd and useSseSimd are mutually exclusive; only one may be defined.".}
  elif defined(useAvxSimd):
    # AVX instructions
    discard
  elif defined(useSseSimd):
    # SSE instructions
    discard
  else:
    {.fatal: "In order to use SIMD instructions, useAvxSimd or useSseSimd must be defined via -d.".}
If you can afford to do the change globally, you could define a global add function (as a function pointer):
var simd_add*: proc (x: int, y: int): int {.nimcall.}
then at startup, you'd seed the function pointer with the appropriate code:
if avx_detected():
  simd_add = avx_add
elif sse_detected():
  simd_add = sse_add
else:
  simd_add = default_add
and everywhere else in your code, you could have:
let c = simd_add(1, 2)
(you could use macros to rewrite regular additions + to simd_add for convenience)
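To make the macro remark concrete, here is a rough sketch of how rewriting + into simd_add calls could look. The detection procs are placeholders that always return false, the three add procs are plain stand-ins, and `rewriteAdd`, `withSimdAdd` and `defaultCalls` are names invented for this example.

```nim
import macros

var simd_add: proc (x: int, y: int): int {.nimcall.}
var defaultCalls = 0   # only here to show which implementation ran

# Placeholder implementations (assumption: real ones would use intrinsics).
proc avx_add(x, y: int): int = x + y
proc sse_add(x, y: int): int = x + y
proc default_add(x, y: int): int =
  inc defaultCalls
  x + y

# Placeholder detection (assumption: real code would query the CPU).
proc avx_detected(): bool = false
proc sse_detected(): bool = false

# Copy the AST, turning every `a + b` into `simd_add(a, b)`.
proc rewriteAdd(node: NimNode): NimNode =
  if node.kind == nnkInfix and node[0].eqIdent("+"):
    return newCall(ident("simd_add"), rewriteAdd(node[1]), rewriteAdd(node[2]))
  result = copyNimNode(node)
  for child in node:
    result.add rewriteAdd(child)

macro withSimdAdd(body: untyped): untyped =
  rewriteAdd(body)

# Seed the function pointer once at startup.
if avx_detected():
  simd_add = avx_add
elif sse_detected():
  simd_add = sse_add
else:
  simd_add = default_add

withSimdAdd:
  let c = 1 + 2 + 3   # becomes simd_add(simd_add(1, 2), 3)

echo c
```

Every rewritten + is an indirect call through the pointer, so this is convenient rather than fast.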
Yeah, you're right, that's probably less than ideal at a fine-grained level (low-level operations).
However, if you apply the pattern to coarser functions, using your existing macro to specialize the code, then you get the best of both worlds. For example, instead of dynamically dispatching on add, you can dispatch on something like matrix_multiply.
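A small sketch of the coarse-grained idea, with some liberties: it uses a dot product instead of matrix_multiply to stay short, and it specializes the kernel with a static generic parameter rather than the macro from the first post. The lane counts only imitate vector widths (no real intrinsics are used), and `dotImpl`, `dot` and `selectDot` are hypothetical names.

```nim
# One kernel, specialized at compile time per "vector width".
proc dotImpl[lanes: static[int]](a, b: seq[float32]): float32 =
  var acc: array[lanes, float32]
  var i = 0
  # Main loop: process `lanes` elements per step, as a vector unit would.
  while i + lanes <= a.len:
    for l in 0 ..< lanes:
      acc[l] += a[i + l] * b[i + l]
    i += lanes
  for l in 0 ..< lanes:
    result += acc[l]
  # Scalar tail for the leftover elements.
  while i < a.len:
    result += a[i] * b[i]
    inc i

proc dotScalar(a, b: seq[float32]): float32 = dotImpl[1](a, b)
proc dotSse(a, b: seq[float32]): float32 = dotImpl[4](a, b)
proc dotAvx(a, b: seq[float32]): float32 = dotImpl[8](a, b)

# The only dynamic dispatch: one pointer call per whole-array operation.
var dot: proc (a, b: seq[float32]): float32 {.nimcall.}

proc selectDot(hasAvx, hasSse: bool) =
  if hasAvx: dot = dotAvx
  elif hasSse: dot = dotSse
  else: dot = dotScalar

var xs = newSeq[float32](9)
var ys = newSeq[float32](9)
for i in 0 ..< 9:
  xs[i] = float32(i + 1)
  ys[i] = 2'f32

selectDot(hasAvx = false, hasSse = true)
echo dot(xs, ys)
```

The cost of the indirect call is paid once per array instead of once per element, which is the point of dispatching at this level.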
In cases where there isn't an equivalent function, you would have a fallback that does it in a non-vectorized fashion. So if you used the gather instruction, the SSE fallback would just loop over the elements of the SIMD vector and process them one by one. Or if blendv is not available, blendv(a, b, mask) can be converted to: Or(AndNot(mask, a), And(mask, b)), assuming each mask lane is either all ones or all zeros.
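Roughly, the two fallbacks could look like this. Plain arrays stand in for SIMD registers here, and `Vec4u`, `Vec4f`, `blendvFallback` and `gatherFallback` are names made up for the sketch. The blend selects b where the mask is set and a elsewhere, and it assumes every mask lane is either all ones or all zeros.

```nim
type
  Vec4u = array[4, uint32]    # stand-in for a 128-bit integer register
  Vec4f = array[4, float32]   # stand-in for a 128-bit float register

# Blend without blendv: (not mask and a) or (mask and b), lane by lane.
proc blendvFallback(a, b, mask: Vec4u): Vec4u =
  for i in 0 ..< 4:
    result[i] = (a[i] and not mask[i]) or (b[i] and mask[i])

# Gather without a gather instruction: load the elements one by one.
proc gatherFallback(base: openArray[float32], idx: array[4, int32]): Vec4f =
  for i in 0 ..< 4:
    result[i] = base[int(idx[i])]

const allOnes = 0xFFFF_FFFF'u32

let a: Vec4u = [1'u32, 2'u32, 3'u32, 4'u32]
let b: Vec4u = [10'u32, 20'u32, 30'u32, 40'u32]
let mask: Vec4u = [allOnes, 0'u32, allOnes, 0'u32]
echo blendvFallback(a, b, mask)

let table = [0.5'f32, 1.5'f32, 2.5'f32, 3.5'f32, 4.5'f32]
echo gatherFallback(table, [4'i32, 0'i32, 2'i32, 2'i32])
```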
No doubt you would not be able to write code once that optimally uses all the available power of SSE2/3/4, AVX, AVX2, and AVX512 all in one go, but I think many cases should be quite good. C# does something like this with JIT intrinsics for a subset of SIMD operations. It is very nice to use, but unfortunately the subset of operations is a bit too small for many use cases.
I understand what you are saying, I just do not understand how to do this at runtime. Maybe I am just missing something, but could you explain it to me in more detail?
Assume your plan works out. You have a binary executable, which contains a sequence of machine instructions. Do the AVX machine instructions ever appear in this executable? What if you run this on a processor that does not support AVX operations? You seem to say that the code path will never touch this part of the code, but is the generated executable even valid?
You mention that C# does this with a JIT. The JIT can simply avoid generating AVX instructions if they are not supported. But with AOT you have to generate all code in advance.
Note: FFmpeg is probably the most prominent library that uses runtime CPU feature detection to select SIMD code paths.
It's written in C so you can port that to Nim (with emit at worst). The only issue is that the code base is quite the behemoth.