I am learning some Nim, and have a hunch that the metaprogramming features of Nim may allow for a user-friendly SIMD library. The primary challenge with SIMD is that different processors support different SIMD features, so to write code that will run as fast as possible on every CPU, you have to write many versions of the same function, detect CPU features at run time, and use the appropriate one.
I would like to approximate the APIs available in Boost.SIMD or .NET, both of which allow you to write the code for your algorithm once and have the appropriate thing happen at runtime.
With C#, the way this works is that if you have an array of, say, 32-bit floats, you can use Vector<float>, which has a property Count that is figured out when the code is JITted and tells you how wide the SIMD lane is for that type on that CPU.
So you can write your algorithm by looping over your array Vector<float>.Count elements at a time and doing whatever SIMD operations you want inside the loop.
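To make that loop shape concrete, here is a scalar Nim sketch of just the pattern (no real intrinsics; doubleAll and the hard-coded width are placeholders for what the library would provide):

# Scalar stand-in showing only the loop shape that Vector<float>.Count
# enables: process `width` elements per iteration, then a scalar tail.
proc doubleAll(s: var seq[float32], width: int) =
  var i = 0
  while i + width <= s.len:
    # this is where the SIMD load / add / store would go
    for j in i ..< i + width:
      s[j] = s[j] + s[j]
    i += width
  # leftover elements that don't fill a whole vector
  while i < s.len:
    s[i] = s[i] + s[i]
    inc i

var data = @[1'f32, 2'f32, 3'f32, 4'f32, 5'f32, 6'f32]
doubleAll(data, 4)  # 4 is the SSE width for float32; AVX would give 8
echo data           # @[2.0, 4.0, 6.0, 8.0, 10.0, 12.0]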
I know in Nim we don't have the benefit of a JIT, but Boost.SIMD is able to do something similar with C++ templates. I'm not entirely sure what the best way to approach this is in Nim, so I was just looking for some high-level guidance/ideas.
Alternatively, read this and port it over to Nim (not trivial but not impossible either):
While modern C compilers can do some nice auto vectorization, there are many cases where you have to do it by hand. For instance, fractal noise: https://github.com/jackmott/FastNoise-SIMD/blob/master/FastNoise/FastNoise3d.cpp#L25
I'm betting that with Nim it is possible to write the code such that you don't have to compile a separate DLL for each architecture, with a nice public API.
jxy - I did look at that, and perhaps I am reading the code wrong, but I think that one makes a compile-time decision about which SIMD feature set is available, not a runtime one. Is that correct?
I'm looking to build one exe, send it to a computer with SSE, AVX, or AVX-512, and have it use the appropriate instructions at runtime.
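As a sketch of the detection side (assuming the C backend is gcc or clang, whose __builtin_cpu_supports builtin does the CPUID query; MSVC would need its __cpuid intrinsic instead), something like this could set those flags once at startup:

# Minimal runtime feature detection sketch; hasAvx/hasSse2 are hypothetical
# helpers built on the gcc/clang builtin, not part of any existing module.
proc hasAvx(): bool =
  {.emit: "`result` = (__builtin_cpu_supports(\"avx\") != 0);".}

proc hasSse2(): bool =
  {.emit: "`result` = (__builtin_cpu_supports(\"sse2\") != 0);".}

echo "SSE2: ", hasSse2(), "  AVX: ", hasAvx()

The flags in the prototype below could then be initialized from these instead of being hard-coded.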
I have a prototype now that works and might illustrate what I am after, but perhaps there are much better ways of going about it. Also, I haven't verified runtime performance at all yet:
import rdstdin, strutils, x86_sse, x86_avx

# Hard-coded for now; the goal is to set these via runtime CPU detection.
var has_sse = true
var has_sse2 = true
var has_avx = false

# Overloaded loads so the same call works for either vector width.
proc load(a: var m128, s: var seq[float32], index: int) {.inline.} =
  a = loadu_ps(addr s[index])

proc load(a: var m256, s: var seq[float32], index: int) {.inline.} =
  a = loadu_ps_256(addr s[index])

# Injects `a` and `count` with the type/width for the selected instruction
# set, then expands `body` under each runtime branch.
template simd_block(s: seq[float32], a: untyped, count: untyped, body: untyped) =
  if has_avx:
    var count = 8
    var a: m256
    body
  elif has_sse:
    var count = 4
    var a: m128
    body

var s = @[1.0'f32, 2.0'f32, 3.0'f32, 4.0'f32,
          1.0'f32, 2.0'f32, 3.0'f32, 4.0'f32]

simd_block(s, a, count):
  for i in countup(0, s.len - 1, count):
    a.load(s, i)
    a = add_ps(a, a)
    storeu_ps(addr s[i], a)

echo s  # result is correct!