For some reason, it seems like no one cares about optimizing matrix multiplication for integers. I started my own implementation so that my tensor library isn't painfully slow and doesn't need to cast to float.
For future optimizations I'd like to know the following:
I've based my code on the pure C ulmBLAS. The big difference is that instead of pointer arithmetic, I'm passing a var array and an offset.
I might switch to pointers, because the current approach is cumbersome: I need to write array[i + offset] and offset += increment everywhere in my code.
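To make the trade-off concrete, here is a minimal C sketch of the two calling styles for a strided integer dot product. The function names dot_offset and dot_ptr are hypothetical and are not taken from ulmBLAS or from the library discussed here; they only illustrate where the offset bookkeeping ends up.

```c
#include <stddef.h>

/* Offset style: the kernel receives the whole array plus a start offset
   and has to carry the offset through every access. */
long dot_offset(const long *x, size_t x_off, size_t inc_x,
                const long *y, size_t y_off, size_t inc_y, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += x[x_off] * y[y_off];
        x_off += inc_x;   /* the "offset += increment" bookkeeping */
        y_off += inc_y;
    }
    return acc;
}

/* Pointer style: the caller applies the offset once (x + x_off),
   so the kernel only deals with the stride. */
long dot_ptr(const long *x, size_t inc_x,
             const long *y, size_t inc_y, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += x[i * inc_x] * y[i * inc_y];
    return acc;
}
```

With pointers, a call such as dot_ptr(a + 1, 2, b, 1, 3) replaces dot_offset(a, 1, 2, b, 0, 1, 3); both read the same elements.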
Last thing: I'm currently using global var arrays. If I declare them directly in the proc instead, the program compiles but I get Segmentation fault 11 at runtime. I will try to produce a small test case.
Is Nim compiled automatically with -march=native?
No. For gcc I am using something like this:
cat nimgir/nim.cfg
path:"$projectdir"
nimcache:"/tmp/$projectdir"
gcc.options.speed = "-march=native -O3 -flto -fstrict-aliasing"
But the performance gain from -march=native is not as big as I once thought: typically only 10% or less.
Last thing, I'm using global var arrays currently
Might your problem just be a stack overflow? Arrays defined inside a proc are generally allocated on the stack.
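As an illustration of the stack-versus-heap point, here is a small C sketch. The buffer size BUF_LEN and the function name are assumptions made for the example, not values from the code under discussion. A local array of this size would have to fit in the thread's stack, which is often limited to a few MiB, whereas a heap allocation is not subject to that limit.

```c
#include <stdlib.h>

/* Assumed size for illustration: 4 Mi elements. As a local array
   (`long buf[BUF_LEN];` inside the function) this could exceed the
   default stack limit and crash at runtime. */
enum { BUF_LEN = 4 * 1024 * 1024 };

/* Heap-allocated work buffer: returns the sum of BUF_LEN ones,
   or -1 if the allocation fails. */
long sum_with_heap_buffer(void) {
    long *buf = calloc(BUF_LEN, sizeof *buf);
    if (buf == NULL)
        return -1;

    long total = 0;
    for (size_t i = 0; i < BUF_LEN; ++i) {
        buf[i] = 1;
        total += buf[i];
    }

    free(buf);
    return total;
}
```

Global arrays avoid the problem for the same reason: they live in static storage rather than on the stack.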
Ah, it may indeed be a stack overflow.
I expect -march=native to yield a lot of gains because my code is basically ax + by in dozens of for loops.
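For reference, the kind of loop meant here could look like the following C sketch (the name axpby_i32 and the choice of int32_t are assumptions for the example). A loop like this is a candidate for auto-vectorization at -O3, and -march=native lets the compiler use whatever SIMD width the build machine supports; how much that helps in practice has to be measured.

```c
#include <stddef.h>
#include <stdint.h>

/* y[i] = a * x[i] + b * y[i] over contiguous integer data.
   `restrict` promises the compiler that x and y do not overlap,
   which makes vectorization easier. */
void axpby_i32(size_t n, int32_t a, const int32_t *restrict x,
               int32_t b, int32_t *restrict y) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + b * y[i];
}
```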
I've been playing with point 4, memory alignment.
This compiles:
{.pragma: align16, codegenDecl: "$# $# __attribute__((aligned(16)))".}
proc newBufferArray[T: SomeNumber](N: static[int], typ: typedesc[T]): ref array[N, T] {.noSideEffect.} =
  new result
  for i in 0 ..< N:
    result[i] = 0.T
let x {.align16.} = newBufferArray(100, int)
echo x[0]
However, when I look at the generated C code, it seems that what is aligned is the ref pointer, not the actual data.
Is there a way to apply the pragma to the actual data and not the pointer?
By the way, I checked and found an align pragma in compiler/ast.nim and compiler/pragmas.nim, but it's not exposed (feature request here). Furthermore, I can currently use a ref object wrapper and specify the alignment as mentioned in #5315:
type Foo = ref object
  bar {.align16.}: array[N,T]
Part of the C code:
...
N_NIMCALL(NI*, newBufferArray_3aJInY9bpU4oUK9cUIQSVHDA)(void);
static N_NIMCALL(void, Marker_TY_9b0xA7syFjRd01ZWg628edQ)(void* p, NI op);
...
NI* x_Js2X9aXg71jf4IrwTQPo9b3A __attribute__((aligned(16)));
extern GcHeap_1TRH1TZMaVZTnLNcIHuNFQ gch_IcYaEuuWivYAS86vFMTS3Q;
static N_NIMCALL(void, Marker_TY_9b0xA7syFjRd01ZWg628edQ)(void* p, NI op) {
    NI* a;
    NI T1_;
    a = (NI*)p;
    T1_ = (NI)0;
    for (T1_ = 0; T1_ < 100; T1_++) {
    }
}
...
NIM_EXTERNC N_NOINLINE(void, NimMainModule)(void) {
    NimStringDesc* T1_;
    nimfr_("play_alignment", "play_alignment.nim")
    nimRegisterGlobalMarker(TM_DmX0x2ZpvnN2eETjy62C0A_3);
    nimln_(10, "play_alignment.nim");
    asgnRefNoCycle((void**) (&x_Js2X9aXg71jf4IrwTQPo9b3A), newBufferArray_3aJInY9bpU4oUK9cUIQSVHDA());
    nimln_(12, "play_alignment.nim");
    T1_ = (NimStringDesc*)0;
    T1_ = nimIntToStr(x_Js2X9aXg71jf4IrwTQPo9b3A[(((NI) 0))- 0]);
    printf("%s\012", T1_? (T1_)->data:"nil");
    fflush(stdout);
    popFrame();
}
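The `NI* x_... __attribute__((aligned(16)))` declaration above is the heart of the problem: the attribute applies to the pointer variable, not to the heap block it points to. The plain C sketch below shows the difference. It assumes a GCC/Clang-style `__attribute__` and a C11 libc that provides aligned_alloc; the names ptr_aligned, new_aligned_buffer and is_aligned16 are hypothetical and not part of Nim's runtime.

```c
#include <stdint.h>
#include <stdlib.h>

/* Mirrors the generated code: only the pointer variable itself is
   placed on a 16-byte boundary. Whatever it later points to is not
   affected by this attribute. */
long *ptr_aligned __attribute__((aligned(16)));

/* Allocates n zeroed longs whose first element is 16-byte aligned.
   The caller releases the buffer with free(). */
long *new_aligned_buffer(size_t n) {
    size_t bytes = n * sizeof(long);
    /* C11 aligned_alloc expects a size that is a multiple of the alignment. */
    size_t rounded = (bytes + 15) / 16 * 16;
    long *buf = aligned_alloc(16, rounded);
    if (buf != NULL)
        for (size_t i = 0; i < n; ++i)
            buf[i] = 0;
    return buf;
}

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}
```

To get the same guarantee on the Nim side, the alignment has to be requested where the data is allocated, which is what the ref object wrapper with an aligned field attempts.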