Details are in the readme. If anyone is interested in this and would like to provide input or help out please let me know.
I'll follow your development and provide input. The library may prove quite useful for some handcrafted computations in Arraymancer.
My first remark would be to use a ./bin or ./out folder and .gitignore it; you've actually added the produced library to your git repo.
Second, I think instead of asking "sse" or "avx", you should use a compile-time define with when defined(sse), like I do here for OpenMP and CUDA. At compilation you can then use nim c -d:sse -o:out/yourproject yourproject.nim
Lastly, I think the killer feature would be runtime CPU feature detection.
You might want to check the Rust crate faster, and for runtime CPU feature detection, lots of multimedia libraries like FFmpeg, VLC or OpenCV have it.
Yes, runtime detection is the plan; the prompt is just a placeholder, so that I know the decision is happening at runtime.
Thanks on the .gitignore; it was ignoring .exe files, but I am on Linux!
I don't know much about SIMD; it looks like your approach is to figure out how to take Nim code and SIMDify it? Sounds like a hard problem. What are your thoughts on the typed approach? Something like https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SIMD ? You get types for the common SIMD registers and layouts, and you use them.
Would you say the SIMDify approach puts more work on the compiler to do the optimizations? Easy to use but hard to make fast?
While the SIMD-type approach gives the programmer more room to be clever, but is harder to use?
Do c compilers already use SIMD where they can, or do you always have to use special tricks to get that going?
I don't know much about SIMD, it looks like your approach is to figure out how to take nim code and SIMDify it?
No, not quite. You will write explicit SIMD instructions, but it will automatically transform them to use the best possible option given runtime detection. So you can write a loop to iterate over an array of integers and, say, add 1 to each value. If runtime detection sees SSE2 is available, it will add 4 integers at a time; if it finds AVX2 is available, it will do 8 integers at a time. This is much less difficult than auto-vectorization, which, yes, the C compilers can do in some simple cases.
So for instance:
var
  a = newSeq[float32](12)
  b = newSeq[float32](12)
  r = newSeq[float32](12)

for i, v in a:
  a[i] = float32(i)
  b[i] = 2.0'f32

SIMD(width):
  for i in countup(0, <a.len, width div 4):
    let av = simd.loadu_ps(addr a[i])
    let bv = simd.loadu_ps(addr b[i])
    let rv = simd.add_ps(av, bv)
    simd.storeu_ps(addr r[i], rv)
If SSE2 is detected, it will use the SSE2 versions of loadu, add, and storeu, and iterate over the array 4 at a time (16-byte width divided by 4 bytes per float32). If AVX2 is detected, it will use the AVX2 versions of loadu, add, and storeu, and iterate over the array 8 at a time (32-byte width divided by 4 bytes per float32).
.NET/C# has a similar abstraction to this, which they accomplish with the JIT.
AVX2 CPU; I forgot that "i7" doesn't really narrow it down anymore! The code is ported from a friend's C++ library, which should be good, but I could definitely have introduced mistakes with some of the obscure C bindings.
Edit: tested on Windows; had to fix one thing, but now it is good there too.
Hello,
That is pretty cool and I am interested also in that. I have taken a look and I have some remarks.
If I understood your approach correctly, the marked code is selected at runtime according to the CPU's properties. To be able to run on different architectures, the general code must be compiled with a basic instruction set (like plain x86-64). Yet the code adds compiler flags depending on the modules imported (by the way, the 'passL' options are unnecessary here). Which means that if you use AVX2 instructions, the C compiler will use these instructions for the general code too, and the program will likely crash on a machine without AVX2 support (yes, I tested it: I got a SIGILL).
In C, one can annotate a function with an attribute that changes the code generation for that function alone. But as far as I know there is no clean way in Nim to add a specific function attribute to the generated C procedure. An emit pragma just before the proc definition should do the trick. It seems to work... if we don't use the -d:release flag. In release mode, the emitted attributes are not necessarily placed just before the proc definition in the C code. I have no idea why; a Nim dev may have an answer.
I suggest the use of a procedure passed to a macro.
Example (similar to what you posted above):
proc addition(a, b: openArray[float32]): seq[float32] {.simd.} =
  result = newSeq[float32](a.len)
  for i in countup(0, a.len-1, simd.width div 4):
    let av = simd.loadu_ps(unsafeAddr a[i])
    let bv = simd.loadu_ps(unsafeAddr b[i])
    let rv = simd.add_ps(av, bv)
    simd.storeu_ps(addr result[i], rv)
The simd macro creates two procs, additionsse2 and additionavx2, marked with the C attributes (when that works). The real addition proc then calls the correct one at runtime.
Here is the definition of the simd macro I used:
macro simd*(procDef: untyped): untyped =
  result = newStmtList()
  let psse2 = makeSimdProcDef(procDef, "sse2", "128")
  let csse2 = makeCallForDispatch(psse2, procDef)
  result.add newEmitPragma(attrTarget % "sse2")
  result.add psse2
  let pavx2 = makeSimdProcDef(procDef, "avx2", "256")
  let cavx2 = makeCallForDispatch(pavx2, procDef)
  result.add newEmitPragma(attrTarget % "avx2")
  result.add pavx2
  procDef.body = quote do:
    if cpuType == UNINITIALIZED:
      cpuType = getCPUType()
      echo "Detected cpu type:" & $cpuType
    if cpuType == SSE2 or cpuType == SSE41:
      `csse2`
    elif cpuType == AVX2 or cpuType == AVX:
      `cavx2`
  result.add procDef
  #echo repr(result)
I skipped some helper functions for the sake of brevity, but I will give you the entire thing if you are interested.
Sorry for the overly long post; I hope it helps you a bit despite the issues I raised.