I am learning some Nim, and have a hunch that the metaprogramming features of Nim may allow for a user-friendly SIMD library. The primary challenge with SIMD is that different processors support different SIMD features, so to write code that will run as fast as possible on every CPU, you have to write many versions of the same function, detect CPU features at run time, and use the appropriate one.
I would like to approximate the APIs available in Boost.SIMD or .NET, both of which allow you to write the code for your algorithm once and have the appropriate thing happen at runtime.
With C#, the way this works is that if you have an array of, say, 32-bit floats, you can use Vector<float>, which has a property Count that is figured out when the code is JITted and tells you how wide the SIMD lane is for that type on that CPU.
So you can write your algorithm by looping over your array Vector<float>.Count elements at a time and doing whatever SIMD operations you want inside the loop.
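To make that loop shape concrete, here is a scalar Nim sketch of just the pattern (no real intrinsics; doubleAll and the hard-coded width are placeholders for what the library would provide):

# Scalar stand-in showing only the loop shape that Vector<float>.Count
# enables: process `width` elements per iteration, then a scalar tail.
proc doubleAll(s: var seq[float32], width: int) =
  var i = 0
  while i + width <= s.len:
    # this is where the SIMD load / add / store would go
    for j in i ..< i + width:
      s[j] = s[j] + s[j]
    i += width
  # leftover elements that don't fill a whole vector
  while i < s.len:
    s[i] = s[i] + s[i]
    inc i

var data = @[1'f32, 2'f32, 3'f32, 4'f32, 5'f32, 6'f32]
doubleAll(data, 4)  # 4 is the SSE width for float32; AVX would give 8
echo data           # @[2.0, 4.0, 6.0, 8.0, 10.0, 12.0]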
I know in Nim we don't have the benefit of a JIT, but Boost.SIMD is able to do something similar with C++ templates. I'm not entirely sure what the best way to approach this is in Nim, so I was just looking for some high-level guidance/ideas.
Alternatively, read this and port it over to Nim (not trivial but not impossible either):
While modern C compilers can do some nice auto vectorization, there are many cases where you have to do it by hand. For instance, fractal noise: https://github.com/jackmott/FastNoise-SIMD/blob/master/FastNoise/FastNoise3d.cpp#L25
I'm betting that with Nim it is possible to write the code such that you don't have to compile a separate DLL for each architecture, with a nice public API.
jxy - I did look at that, and perhaps I am reading the code wrong, but I think that one makes a compile-time decision about which SIMD feature set is available, not a runtime one. Is that correct?
I'm looking to build one exe, send it to a computer with SSE, AVX, or AVX-512, and have it use the appropriate instructions at runtime.
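As a sketch of the detection side (assuming the C backend is gcc or clang, whose __builtin_cpu_supports builtin does the CPUID query; MSVC would need its __cpuid intrinsic instead), something like this could set those flags once at startup:

# Minimal runtime feature detection sketch; hasAvx/hasSse2 are hypothetical
# helpers built on the gcc/clang builtin, not part of any existing module.
proc hasAvx(): bool =
  {.emit: "`result` = (__builtin_cpu_supports(\"avx\") != 0);".}

proc hasSse2(): bool =
  {.emit: "`result` = (__builtin_cpu_supports(\"sse2\") != 0);".}

echo "SSE2: ", hasSse2(), "  AVX: ", hasAvx()

The flags in the prototype below could then be initialized from these instead of being hard-coded.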
I have a prototype now that works and might illustrate what I am after, but perhaps there are much better ways of going about it. Also, I haven't verified runtime performance at all yet:
import rdstdin, strutils, x86_sse, x86_avx

# Hard-coded for now; the goal is to set these via runtime CPU detection.
var has_sse = true
var has_sse2 = true
var has_avx = false

# Overloaded loads so the same call works for either vector width.
proc load(a: var m128, s: var seq[float32], index: int) {.inline.} =
  a = loadu_ps(addr s[index])

proc load(a: var m256, s: var seq[float32], index: int) {.inline.} =
  a = loadu_ps_256(addr s[index])

# Injects `a` and `count` with the type/width for the selected instruction
# set, then expands `body` under each runtime branch.
template simd_block(s: seq[float32], a: untyped, count: untyped, body: untyped) =
  if has_avx:
    var count = 8
    var a: m256
    body
  elif has_sse:
    var count = 4
    var a: m128
    body

var s = @[1.0'f32, 2.0'f32, 3.0'f32, 4.0'f32,
          1.0'f32, 2.0'f32, 3.0'f32, 4.0'f32]

simd_block(s, a, count):
  for i in countup(0, s.len - 1, count):
    a.load(s, i)
    a = add_ps(a, a)
    storeu_ps(addr s[i], a)

echo s  # result is correct!