nimforum mirror - Low-level optimizations in Nim

mratsim [2017-05-28T12:08:41+02:00]

For some reason, it seems like no one cares about optimizing matrix multiplication for integers. I started my own implementation so that my tensor library isn't painfully slow and there is no need to cast to float.

For future optimizations I'd like to know the following:

1. It doesn't seem possible to pass a specific GCC/Clang path to the Nim compiler, right? Clang on OSX doesn't support OpenMP, so I'd like to configure compilation to use the clang-omp installed by Homebrew.

2. I saw the unroll pragma, and the doc says it's seen but ignored. Is that still true? Unrolling would be very helpful.

3. Is Nim compiled automatically with -march=native? (I can use --passC otherwise, I guess.)

4. How can I force a specific memory alignment?

5. Is there a way to get the L1/L2 cache sizes at compile time?

6. Is there a way to check the number of registers and their size at compile time? (Bonus if I can feed that to the unroll pragma.)

I've based my code on the pure C ulmBLAS. The big difference is that instead of pointer arithmetic, I'm passing a var array and an offset.

I might change to pointers, because the current approach is cumbersome: I need to write array[i + offset] and offset += increment everywhere in my code.
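For illustration, the pointer-based variant could look like the following minimal sketch. The names `offsetBy` and `axpy` are hypothetical and not taken from the library discussed in this thread; the sketch only assumes that the caller passes valid pointers and strides.

```nim
# Hypothetical helper: advance a typed pointer by `n` elements (not bytes).
proc offsetBy[T](p: ptr T, n: int): ptr T {.inline.} =
  cast[ptr T](cast[int](p) + n * sizeof(T))

# y <- alpha * x + y over `n` elements, walking both buffers with a stride,
# in the style of a C BLAS kernel. No bounds checking is performed.
proc axpy[T](n: int, alpha: T, x: ptr T, incX: int, y: ptr T, incY: int) =
  var px = x
  var py = y
  for i in 0 ..< n:
    py[] = py[] + alpha * px[]
    px = px.offsetBy(incX)
    py = py.offsetBy(incY)

var x = [1, 2, 3, 4]
var y = [10, 20, 30, 40]
axpy(4, 2, addr x[0], 1, addr y[0], 1)
echo y
```

The trade-off is that the offset bookkeeping disappears from the inner loop, but so does Nim's bounds checking, so an incorrect stride or length corrupts memory silently.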

Last thing, I'm using global var arrays currently. If I declare them directly in the proc instead, the program compiles but I get Segmentation fault 11 at runtime. I will try to get a small test case.

Stefan_Salewski [2017-05-28T12:32:27+02:00]

Is Nim compiled automatically with -march=native?

No. For gcc I am using something like this:


cat nimgir/nim.cfg
path:"$projectdir"
nimcache:"/tmp/$projectdir"
gcc.options.speed = "-march=native -O3 -flto -fstrict-aliasing"

But the performance increase from -march=native is not as big as I once thought, typically only 10% or less.

Last thing, I'm using global var arrays currently

Might your problem just be a stack overflow? Arrays defined inside a proc are generally allocated on the stack.
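A minimal sketch of one way around that, assuming the buffer does not need to be a fixed-size array: a seq keeps its elements on the heap, so only a small handle sits in the proc's stack frame. The name `sumOnHeap` and the size `N` are made up for this example.

```nim
const N = 1_000_000

proc sumOnHeap(): int =
  # `var buf: array[N, int]` declared here would place the whole array in
  # this proc's stack frame, which may exceed the default stack limit.
  # A seq stores its elements on the heap instead.
  var buf = newSeq[int](N)
  for i in 0 ..< N:
    buf[i] = i
  for v in buf:
    result += v

echo sumOnHeap()
```

Global arrays avoid the problem for a different reason: they live in static storage rather than on the stack.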

mratsim [2017-05-28T12:37:28+02:00]

Ah, it may indeed be a stack overflow.

I expect -march=native to yield a lot of gains because my code is basically ax + by in dozens of for loops.
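As a sketch of the kind of loop meant here: Nim's `passC` pragma can request the flag from inside the source file instead of via --passC or nim.cfg. The proc name `axpby` is hypothetical, and the sketch assumes a GCC or Clang backend that accepts -march=native on the target platform. Whether the loop actually gets vectorized depends on the C compiler and optimization level.

```nim
# Ask the C compiler to target the build machine's instruction set.
# Assumption: the backend is GCC or Clang and accepts this flag.
{.passC: "-march=native".}

# z[i] = a * x[i] + b * y[i]: a simple loop the C compiler may auto-vectorize.
# The caller is expected to pass x and y at least as long as z.
proc axpby(a: int, x: openArray[int], b: int, y: openArray[int],
           z: var openArray[int]) =
  for i in 0 ..< z.len:
    z[i] = a * x[i] + b * y[i]

var z: array[3, int]
axpby(2, [1, 2, 3], 3, [10, 20, 30], z)
echo z
```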

mratsim [2017-05-30T12:43:53+02:00]

I've been playing with point 4, memory alignment.

This compiles:

{.pragma: align16, codegenDecl: "$# $# __attribute__((aligned(16)))".}

proc newBufferArray[T: SomeNumber](N: static[int], typ: typedesc[T]): ref array[N, T]  {.noSideEffect.} =
  new result
  for i in 0 ..< N:
    result[i] = 0.T

let x {.align16.} = newBufferArray(100, int)

echo x[0]

However when I look at the generated C code, it seems like what is aligned is the ref pointer, not the actual data.

Is there a way to apply the pragma to the actual data and not the pointer?

By the way, I checked and found an align pragma in compiler/ast.nim and compiler/pragmas.nim, but it's not exposed (there is a feature request for it). Furthermore, I can currently use a ref object wrapper and specify the alignment as mentioned in #5315:

type Foo = ref object
  bar {.align16.}: array[N,T]
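A different workaround, not taken from this thread, is to align the data by hand: over-allocate an untraced block and round the address up to the next multiple of the alignment. The sketch below assumes a power-of-two alignment and a Nim version that provides `UncheckedArray`; `AlignedBuffer`, `newAlignedBuffer` and `free` are hypothetical names.

```nim
type AlignedBuffer = object
  raw: pointer                    # block returned by alloc0; this is what gets freed
  data: ptr UncheckedArray[int]   # first suitably aligned address inside the block

proc newAlignedBuffer(len: int, alignment: int = 16): AlignedBuffer =
  # Assumption: `alignment` is a power of two.
  # Over-allocate so an aligned address always fits inside the block.
  result.raw = alloc0(len * sizeof(int) + alignment - 1)
  let address = cast[int](result.raw)
  # Round up to the next multiple of `alignment`.
  let aligned = (address + alignment - 1) and not(alignment - 1)
  result.data = cast[ptr UncheckedArray[int]](aligned)

proc free(buf: var AlignedBuffer) =
  # The memory is not managed by the GC, so it must be released manually.
  dealloc(buf.raw)
  buf.raw = nil
  buf.data = nil

var buf = newAlignedBuffer(100)
doAssert cast[int](buf.data) mod 16 == 0
buf.data[0] = 42
echo buf.data[0]
buf.free()
```

Here the alignment applies to the data itself, at the cost of manual memory management and no bounds checking on `data`.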

Part of the C code:

...
N_NIMCALL(NI*, newBufferArray_3aJInY9bpU4oUK9cUIQSVHDA)(void);
static N_NIMCALL(void, Marker_TY_9b0xA7syFjRd01ZWg628edQ)(void* p, NI op);
...
NI* x_Js2X9aXg71jf4IrwTQPo9b3A __attribute__((aligned(16)));
extern GcHeap_1TRH1TZMaVZTnLNcIHuNFQ gch_IcYaEuuWivYAS86vFMTS3Q;
static N_NIMCALL(void, Marker_TY_9b0xA7syFjRd01ZWg628edQ)(void* p, NI op) {
	NI* a;
	NI T1_;
	a = (NI*)p;
	T1_ = (NI)0;
	for (T1_ = 0; T1_ < 100; T1_++) {
	}
}
...
NIM_EXTERNC N_NOINLINE(void, NimMainModule)(void) {
	NimStringDesc* T1_;
	nimfr_("play_alignment", "play_alignment.nim")
	nimRegisterGlobalMarker(TM_DmX0x2ZpvnN2eETjy62C0A_3);
	nimln_(10, "play_alignment.nim");
	asgnRefNoCycle((void**) (&x_Js2X9aXg71jf4IrwTQPo9b3A), newBufferArray_3aJInY9bpU4oUK9cUIQSVHDA());
	nimln_(12, "play_alignment.nim");
	T1_ = (NimStringDesc*)0;
	T1_ = nimIntToStr(x_Js2X9aXg71jf4IrwTQPo9b3A[(((NI) 0))- 0]);
	printf("%s\012", T1_? (T1_)->data:"nil");
	fflush(stdout);
	popFrame();
}

Mirror of forum.nim-lang.org
