I have something like the following:
let value:int32 = r0[col0].int32 + r0[col1].int32 * 2 + ...
where r0 has type ptr UncheckedArray[uint8].
How can I avoid adding .int32 each time I use an item from the UncheckedArray?
# rename to whatever, col can also be typed
template get(col): int32 = r0[col].int32
let value:int32 = get(col0) + get(col1) * 2 + ...
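If you also want it to work across several row pointers (r0, r1, r2 below) in the same expression, a small typed accessor is another option. This is only a sketch; the Row alias matches the type you already use, but the at name is just illustrative:

# Hypothetical helper: widen on access so the call sites stay clean.
type Row = ptr UncheckedArray[uint8]

func at(r: Row, col: SomeInteger): int32 {.inline.} = r[col].int32

# let value = r0.at(col0) + r1.at(col1) * 2 + ...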
let value:int32 = r0[col0].int32 + r0[col1].int32 * 2 + r0[col2].int32 +
                  r1[col0].int32 * 2 + r1[col1].int32 * 4 + r1[col2].int32 * 2 +
                  r2[col0].int32 + r2[col1].int32 * 2 + r2[col2].int32

w1[col1] = (value / 16.0).uint8
I am looking at how to improve this performance-wise, without getting into SIMD (which I have never used, by the way). I think all those type conversions are killing the performance I should be achieving.
Type conversion from uint8 to uint32 (or uint64) is technically free because registers are 32 or 64 bits wide.
If anything it's less costly, because at a low level loading an 8-bit value requires movzb (mov to register and zero-extend the byte), which has slightly higher latency than a plain mov for uint32 and uint64.
The issue in your convolution is a memory bottleneck, not compute.
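If you want to convince yourself of that on your own machine (just a suggestion, not something from this thread), you can isolate the conversion and look at what ends up in the binary, e.g. with objdump -d after compiling with -d:danger:

# Tiny probe: the uint8 -> int32 conversion is only a zero-extension
# (a movzx-style instruction), not real arithmetic.
proc widen(x: uint8): int32 {.exportc.} = x.int32

when isMainModule:
  echo widen(200'u8)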
You are probably right. This is the first time I am doing this sort of thing, but my feeling is that even small modifications to the algorithm (avoiding conversions as much as possible, for example) have a significant impact. I think the reason is that I am iterating over 100_000 frames of 640x480 pixels.
For instance, I have just reached 1200fps simply by replacing how the row numbers are calculated. Changing this:
template clamp(val:cint, max_val:cint):untyped =
  min( max(val, 0), max_val).uint
.....
for row1 in 0..<height:
  let row0 = clamp(row1-1, h)
  let row2 = clamp(row1+1, h)
......
into:
iterator `...`[T](ini:T, `end`:T): tuple[a:uint, b:uint, c:uint] =
  let ini = ini.uint
  let `end` = `end`.uint
  yield (a:ini, b:ini, c:ini+1)
  for i in ini+1..`end`-1:
    yield (a:i-1, b:i, c:i+1)
  yield (a:`end`-1, b:`end`, c:`end`)
...
for (row0, row1, row2) in 0...h:
....
The calculation now looks like this:
#import math
import ../vapoursynth
import math

iterator `...`[T](ini:T, `end`:T): tuple[a:uint, b:uint, c:uint] =
  let ini = ini.uint
  let `end` = `end`.uint
  yield (a:ini, b:ini, c:ini+1)
  for i in ini+1..`end`-1:
    yield (a:i-1, b:i, c:i+1)
  yield (a:`end`-1, b:`end`, c:`end`)
proc apply_kernel*(src:ptr VSFrameRef, dst:ptr VSFrameRef, kernel:array[9, int32], mul:int, den:int) =
  #let n = (( math.sqrt(kernel.len.float).int - 1 ) / 2).int
  for pn in 0..<src.numPlanes:  # pn = Plane Number
    # These cost 60fps (if I take them outside of this loop)
    let height = src.height( pn )
    let width = src.width( pn )
    let h = (height-1)
    let w = (width-1)

    for (row0, row1, row2) in 0...h:
      let r0 = src[pn, row0]
      let r1 = src[pn, row1]
      let r2 = src[pn, row2]
      let w1 = dst[pn, row1]
      for (col0, col1, col2) in 0...w:
        let value:int32 = r0[col0].int32 + r0[col1].int32 * 2 + r0[col2].int32 +
                          r1[col0].int32 * 2 + r1[col1].int32 * 4 + r1[col2].int32 * 2 +
                          r2[col0].int32 + r2[col1].int32 * 2 + r2[col2].int32
        w1[col1] = (value / den).uint8
With this I have managed to go from the original 80fps to 1200fps, but I am still very far from what I should be able to reach.
The analogous C++ code (using float instead of int32) for reference is way faster:
auto srcp = reinterpret_cast<const float*>(vsapi->getReadPtr(frame, plane));
auto dstp = reinterpret_cast<float*>(vsapi->getWritePtr(dst, plane));
auto h = vsapi->getFrameHeight(frame, plane);
auto w = vsapi->getFrameWidth(frame, plane);

auto gauss = [&](auto y, auto x) {
    auto clamp = [&](auto val, auto bound) {
        return std::min(std::max(val, 0), bound - 1);
    };
    auto above   = srcp + clamp(y - 1, h) * w;
    auto current = srcp + y * w;
    auto below   = srcp + clamp(y + 1, h) * w;
    auto conv = above[clamp(x - 1, w)]       + above[x] * 2   + above[clamp(x + 1, w)] +
                current[clamp(x - 1, w)] * 2 + current[x] * 4 + current[clamp(x + 1, w)] * 2 +
                below[clamp(x - 1, w)]       + below[x] * 2   + below[clamp(x + 1, w)];
    return conv / 16;
};

for (auto y = 0; y < h; y++)
    for (auto x = 0; x < w; x++)
        (dstp + y * w)[x] = gauss(y, x);
When you mention a memory bottleneck, what do you mean? Where should I look?
Thanks a lot
If you have instructions to reproduce the benchmark in Nim and C++ I can help.
I just need a repo to clone, the scripts to run and the dataset. I already have Vapoursynth working.
Ideally you have a profiler like Intel VTune or Apple Instruments to dive into the assembly.
For example, this is my approach to debugging a performance issue: https://github.com/nim-lang/Nim/issues/9514
For memory bottlenecks it's a bit different, I use the roofline model as mentioned in my convolution optimization resources: https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md#computational-complexity
For example, I know that matrix multiplication and convolution can reach 90% of the peak CPU GFlop/s because their arithmetic intensity is high (i.e. you do over 10 operations (add/mul) per byte), so if you don't reach that performance it's because you are moving bytes around too much instead of computing with them.
The theoretical peak of your CPU is easy to compute:
CpuGHz * VectorWidth * InstrPerCycle * FlopPerInstr
For a CPU that supports AVX2 on float32 (so packing 8 float32) and can issue 2 fused multiply-adds (FMA) per cycle at 3 GHz, we have:
3 (GHz) * 8 (packed float32 in AVX) * 2 (FMA per cycle) * 2 (FMA = 1 add + 1 mul) = 96 GFlop/s
And then the usual way to benchmark a numerical algorithm is: you know the number of operations your algorithm requires, you divide that by the time spent doing them, and you get your actual Flop/s, which you compare with the theoretical peak. If you only reach, say, 20% of the peak, you have a memory bottleneck and probably need to repack your data before processing to optimize cache usage; if not, you need to look into SIMD vectorization, prefetching, ...
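To make that bookkeeping concrete, here is a minimal sketch in Nim. The CPU numbers are the assumed 3 GHz AVX2 example above; the 17 ops/pixel (9 muls + 8 adds for the 3x3 kernel) and the 29.53 s / 100_000 frames figure quoted later in this thread are only placeholders, plug in your own:

import std/strformat

# Assumed machine: 3 GHz, AVX2 (8 x float32), 2 FMA/cycle, 2 flops per FMA.
const
  cpuGHz      = 3.0
  vectorWidth = 8.0
  fmaPerCycle = 2.0
  flopPerFma  = 2.0
  peakGFlops  = cpuGHz * vectorWidth * fmaPerCycle * flopPerFma   # = 96 GFlop/s

proc reportEfficiency(totalFlop, seconds: float) =
  ## Compare measured throughput against the theoretical peak.
  let achieved = totalFlop / seconds / 1e9
  echo &"{achieved:.2f} GFlop/s -> {100.0 * achieved / peakGFlops:.1f}% of peak ({peakGFlops:.0f} GFlop/s)"

# Example: 100_000 frames of 640x480, ~17 ops per pixel, done in 29.53 s.
reportEfficiency(100_000.0 * 640.0 * 480.0 * 17.0, 29.53)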
All of that is quite complex, so what I can do is get you to the performance of the naive C++ implementation.
Going beyond that is something I want to do, but it's time-consuming and I feel it would be better to spend my time on an image processing compiler similar to what's discussed here: https://github.com/mratsim/Arraymancer/issues/347#issuecomment-459351890 and with a proof of concept here:
Nim stuff
I have updated the repository. I have a local package.json with:
[ { "name": "vapoursynth", "url": "https://github.com/mantielero/VapourSynth.nim", "method": "git", "tags": [ "video", "vapoursynth" ], "description": "Video processing based on Vapoursynth", "license": "BSD" } ]
With that you only need:
nimble install vapoursynth
The files to play with are custom_filter.nim, which makes use of DrawFrame.nim. You need these two files in the same folder; compile them like:
nim c -d:release -d:danger custom_filter
And in order to get the performance, just run it:
$ ./custom_filter
I will check how to do the same in C++.
C++ stuff
In order to test the C++ version:
$ g++ -Wall -O3 -shared -fPIC -I. -o libfilter.so GaussBlur.cxx
Then create a test_filter.vpy like:

import vapoursynth as vs
core = vs.get_core()
core.std.LoadPlugin(path='./libfilter.so')
core.std.SetMaxCPU('none')
clip = core.std.BlankClip(format=vs.GRAYS, length=100000, fpsnum=24000, fpsden=1001, keep=True)
clip = core.testc.GaussBlur(clip)
clip.set_output()
$ vspipe test_filter.vpy /dev/null
Output 100000 frames in 29.53 seconds (3386.27 fps)
So I am just getting 3386fps with the pure C++ version while I am getting 1200fps with the Nim version.
I am not comparing apples to apples here:
proc getPtr(frame:ptr VSFrameRef, plane:cint):uint =
  ## Retrieves an `uint` rather than the pointer in order to perform pointer arithmetic with it
  cast[uint]( API.getWritePtr(frame, plane) )

proc stride(frame:ptr VSFrameRef, plane:cint):uint =
  API.getStride( frame, plane ).uint

#type
#  Row = ptr UncheckedArray[uint8]

proc `[]`*(frame:ptr VSFrameRef, plane:cint, row:uint):ptr UncheckedArray[uint8] =
  let ini:uint = frame.getPtr(plane)  # Don't use other thing here
  let stride:uint = frame.getStride( plane )
  result = cast[ptr UncheckedArray[uint8]]( ini + row * stride )
  #assert(result != nil)

proc `[]`*(frame:ptr VSFrameRef, plane:cint, row:cint):ptr UncheckedArray[uint8] =
  let ini:uint = frame.getPtr(plane)  # Don't use other thing here
  let stride:uint = frame.getStride( plane )
  result = cast[ptr UncheckedArray[uint8]]( ini + row.uint * stride )

proc `[]`*(frame:ptr VSFrameRef, plane:cint ):Plane =
  let ini = frame.getPtr(plane)
  let stride = frame.stride(plane)
  return Plane(ini:ini, stride:stride)

proc `[]`*(plane:Plane, row:cint):Row =  #ptr UncheckedArray[uint8] =
  cast[ptr UncheckedArray[uint8]]( plane.ini + row.uint * plane.stride )
Instead, I suggest you use a scheme like the following: get the plane / query the API only once, and use a plane object that holds the metadata. This way the compiler can do inlining and loop hoisting on all those accesses.
The main issue with calling the VapourSynth API repeatedly is that all those function calls pollute your cache and require the compiler to move function parameters and results to and from the stack in an inner loop.
import ../src/vapoursynth

type Plane*[T: SomeNumber] = object
  # T depends on the frame format:
  # - sample_type (uint or float)
  # - bitsPerSample (8, 16, 32, 64)
  # The plane must be instantiated with
  # the proper uint8 ... 64, float32, float64
  data: ptr UncheckedArray[T]
  rows: int
  cols: int
  rowStride: int

func getPlane*(frame: ptr VSFrameRef, plane: int, T: typedesc[SomeInteger]): Plane[T] {.inline.} =
  ## Extract a plane
  result.data = cast[ptr UncheckedArray[T]](API.getWritePtr(frame, plane))
  result.rows = API.getFrameHeight(frame, plane)
  result.cols = API.getFrameWidth(frame, plane)
  result.rowStride = API.getStride(frame, plane)

template `[]`*[T](plane: Plane[T], row: int, col: int): T =
  plane.data[row * plane.rowStride + col]

template `[]=`*[T](plane: Plane[T], row: int, col: int, val: T) =
  plane.data[row * plane.rowStride + col] = val
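With those accessors your kernel would look something like the sketch below. It is only a sketch: it assumes it lives in the same module as Plane (the fields above are not exported), it assumes 8-bit samples as in the rest of the thread, and the borders are simply skipped instead of clamped:

proc gaussBlur(src, dst: ptr VSFrameRef, pn: int) =
  # Query the API once per plane, then index through the Plane object.
  let s = src.getPlane(pn, uint8)
  let d = dst.getPlane(pn, uint8)
  for r in 1 ..< s.rows - 1:
    for c in 1 ..< s.cols - 1:
      let v = s[r-1, c-1].int32     + s[r-1, c].int32 * 2 + s[r-1, c+1].int32 +
              s[r,   c-1].int32 * 2 + s[r,   c].int32 * 4 + s[r,   c+1].int32 * 2 +
              s[r+1, c-1].int32     + s[r+1, c].int32 * 2 + s[r+1, c+1].int32
      d[r, c] = uint8(v div 16)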
Regarding the iteration strategy, your scheme does 3 times more memory accesses than a basic scheme for each row and each column (9x in total). Instead you should iterate once per row/column and accumulate in place with dst[dstIdx] += src[srcIdx] * filter[filterIdx].
A starting point could be my basic convolution scheme in Laser: https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_direct_convolution.nim#L8-L73
It's for batches of images with a potentially different number of planes (called channels) between input and output (for example RGB --> grayscale), but the iteration should be faster than yours.
Note that here I tracked the indices manually instead of using an intermediate Plane object, but both methods should perform the same with an optimizing compiler.
proc conv2d_direct*[T](
      oim: var Tensor[T],   # Output images
      iim: Tensor[T],       # Input images
      ishape: TensorShape,  # Shape of a single image
      kernel: Tensor[T],    # filter kernel
      kshape: KernelShape,  # kernel shape (should be const)
      padding: Padding,     # Padding (should be const)
      strides: Strides      # Strides (should be const)
    ) =
  ## oim must be zero-initialized
  # Reminder: convolution deep learning == cross-correlation signal processing
  assert ishape.c == kshape.c_in
  let out_shape = conv2d_out_shape(ishape, kshape, padding, strides)
  assert oim.len == out_shape.n * out_shape.c * out_shape.h * out_shape.w

  let
    N = ishape.n
    H = ishape.h
    W = ishape.w
    outH = out_shape.h
    outW = out_shape.w
    kdata = cast[ptr UncheckedArray[T]](kernel[0].unsafeaddr)

  let # Should be const but padding.h causes problems and padding[0] indexing
      # doesn't work in a generic proc
    C_out = kshape.c_out
    C_in = kshape.c_in
    kH = kshape.kH
    kW = kshape.kW
    pH = padding.h
    pW = padding.w
    sH = strides.h
    sW = strides.w

  let odata = cast[ptr UncheckedArray[T]](oim[0].addr)
  let idata = cast[ptr UncheckedArray[T]](iim[0].unsafeaddr)

  const arithmetic_intensity = 12 # number of FLOP per byte for the convolution
  const flop_per_elem = arithmetic_intensity div sizeof(T)
  let parallelize = OMP_MEMORY_BOUND_GRAIN_SIZE div flop_per_elem < outH * outW

  for n in 0 ..< N:
    for co in 0 ..< C_out:
      for ci in 0 ..< C_in:
        # We parallelize over the image height to deal with cases
        # where we have a single image or a low number of channels
        # We assume no false sharing with a proper grain size
        omp_parallel_if(parallelize):
          omp_for(oh, outH, use_simd = true, nowait = true): # output height
            let ih = sH * oh                                  # input height
            for ow in 0 ..< outW:                             # output width
              let iw = sW * ow                                # input width
              # The following should be loop hoisted by the compiler
              let oidx = ow + outW * (oh + outH * (co + C_out * n)) # odata[n][co][oh][ow]
              # Entering conv kernel region
              for krow in 0 ..< kH:
                let row = ih + krow - pH
                if row <% H: # Unsigned '<' does 0 <= row < H.
                  for kcol in 0 ..< kW:
                    let col = iw + kcol - pW
                    if col <% W: # Unsigned '<' does 0 <= col < W.
                      let iidx = col + W * (row + H * (ci + C_in * n))      # idata[n][ci][row][col]
                      let kidx = kcol + kW * (krow + kH * (ci + C_in * co)) # kdata[co][ci][krow][kcol]
                      odata[oidx] += idata[iidx] * kdata[kidx]
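For the single-plane, 3x3 case of this thread, the same accumulate-over-the-kernel idea boils down to the reduced sketch below. It is only a sketch, using the Plane[T] accessors proposed above, the unsigned-compare trick for the borders, and a row-major kernel layout (border pixels get a partial sum here rather than clamped samples):

proc convolve3x3*[T](dst: Plane[T], src: Plane[T], kernel: array[9, int32], den: int32) =
  for r in 0 ..< src.rows:
    for c in 0 ..< src.cols:
      var acc: int32 = 0
      for kr in 0 ..< 3:
        let row = r + kr - 1
        if row <% src.rows:               # unsigned compare: 0 <= row < rows
          for kc in 0 ..< 3:
            let col = c + kc - 1
            if col <% src.cols:           # unsigned compare: 0 <= col < cols
              acc += src[row, col].int32 * kernel[kc + 3*kr]
      dst[r, c] = T(acc div den)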
I used both strategies as can be seen here:
proc `[]`*(frame:ptr VSFrameRef, plane:cint ):Plane =
  let ini = frame.getPtr(plane)
  let stride = frame.stride(plane)
  return Plane(ini:ini, stride:stride)
I will give it another shot as you suggest.
Regarding the triple iteration, I had realized that, but I don't want to change it (yet) until I get performance similar to the C++ version with a similar algorithm.
I tried to implement the filter with Arraymancer but I got worse results.
My short-term objective is to reach on the order of 3000fps using float32 and multithreading (staying close to the C++ algorithm).
The next step is to eliminate the triple loop over the rows and to see if I can reduce the number of times I call getStride and getPtr.
In the long term I'd like to try vectorization.
Well, now I am amazed, but in the opposite direction.
I need to check whether I have done something really wrong, but I have managed to get 6074fps using float32 (I was expecting 3000fps) and 10700fps using int32, just by taking advantage of multithreading.
I am aware of this. I am just scaffolding the library, mainly to avoid doing something fundamentally wrong (like reaching 80fps instead of >1000fps ;oP). In the future the library should be more general; in this case I take advantage of the fact that it is always 1 byte per sample. To be honest I don't know why the C++ version is using float: the API states that both the read and write pointers are uint8. Doing a floating point version is something that I should try in the future.
I'm pretty sure it's supposed to be void* and that sampleType and bitsPerSample tell you what to cast it to.
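If/when the wrapper becomes more general, the dispatch could look roughly like the sketch below. getFrameFormat, sampleType, bitsPerSample and stInteger mirror the VapourSynth C API, but the exact identifiers exposed by the Nim wrapper, and the doFilter proc, are assumptions here:

# Pick the element type from the frame format instead of hard-coding uint8.
let fmt = API.getFrameFormat(frame)
if fmt.sampleType == stInteger:
  case fmt.bitsPerSample
  of 8:  doFilter(frame.getPlane(pn, uint8))
  of 16: doFilter(frame.getPlane(pn, uint16))
  else:  discard
else:
  # 32-bit float frames would need getPlane's typedesc constraint
  # relaxed from SomeInteger to SomeNumber.
  discard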
Anyway, very nice to hear your improvements.
Arraymancer indexing via t[i, j] would be slow because:
I.e., hopefully in the future the way to specify a convolution will be similar to Halide (https://halide-lang.org/assets/lectures/Halide_CVPR_intro.pdf), and that would lead to optimal code:
proc convolve(A, F: Function): Function =
  # Generator for the convolution function
  var i, j: Domain
  var convH, convW: Function

  # Definition of the result function
  # Filter F = [1, 2, 1]
  # So the 2D application is:
  #   [[1, 2, 1],
  #    [2, 4, 2],
  #    [1, 2, 1]]
  convH[i, j] = A[i-1, j] * F[0] + A[i, j] * F[1] + A[i+1, j] * F[2]
  convW[i, j] = convH[i, j-1] * F[0] + convH[i, j] * F[1] + convH[i, j+1] * F[2]

  # Optimization of the definition - completely optional
  convH.compute_at(convW, j) # Compute the required region of convH before convW starts (i.e. merged)
  # convH.compute_root()     # Alternatively do the height convolve before the width convolve (i.e. 2 steps)
  convW.tile(i, ii, 64)      # Split i in tiles of size 64, with ii as the tile iterator
  convW.parallel(i)          # Iterating on tiles is done in parallel
  convW.tile(j, jj, 64)      # Split j in tiles of size 64 as well, with jj as the tile iterator
  convW.vectorize(jj, 8)     # Iterating on jj is done with size 8 vectors if possible (i.e. AVX on x86)

  return convW

generate convolve:
  proc foobar(a: Tensor[float32], filter: array[3, float32]): Tensor[float32]
(From Lux: https://github.com/numforge/laser/blob/d1e6ae61/laser/lux_compiler/lux_dsl.nim#L43-L58)
The importance of loop tiling (also called loop blocking) is highlighted in this simple transposition benchmark: https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/transpose/transpose_bench.nim
There is a factor of 3x between the simple parallel transposition (|| is the OpenMP parallel for loop) and the tiled versions below:
proc benchNaive(a: seq[float32], nb_samples: int) =
  var output = newSeq[float32](out_size)
  withCompilerOptimHints()
  let pa{.restrict.} = cast[ptr UncheckedArray[float32]](a[0].unsafeAddr)
  let po{.restrict.} = cast[ptr UncheckedArray[float32]](output[0].addr)

  bench("Naive transpose"):
    discard
  do:
    for i in `||`(0, M-1):
      for j in `||`(0, N-1, "simd"): # This only add "#pragma omp simd"
        po[i+j*M] = pa[j+i*N]
1D tiling
proc benchCacheBlocking(a: seq[float32], nb_samples: int) =
  var output = newSeq[float32](out_size)
  withCompilerOptimHints()
  let pa{.restrict.} = a[0].unsafeAddr
  let po{.restrict.} = output[0].addr
  const blck = 64

  bench("Cache blocking"):
    discard
  do:
    {.emit: """
    // No min function in C ...
    #define min(a,b) (((a)<(b))?(a):(b))

    #pragma omp parallel for
    for (int i = 0; i < `M`; i+=`blck`)
      for (int j = 0; j < `N`; ++j)
        #pragma omp simd
        for (int ii = i; ii < min(i+`blck`,`M`); ++ii)
          `po`[ii+j*`M`] = `pa`[j+ii*`N`];
    """.}
And 2D tiling
proc bench2Dtiling(a: seq[float32], nb_samples: int) =
  var output = newSeq[float32](out_size)
  withCompilerOptimHints()
  let pa{.restrict.} = a[0].unsafeAddr
  let po{.restrict.} = output[0].addr
  const blck = 128

  bench("2D Tiling"):
    discard
  do:
    {.emit: """
    // No min function in C ...
    #define min(a,b) (((a)<(b))?(a):(b))

    #pragma omp parallel for collapse(2)
    for (int i = 0; i < `M`; i+=`blck`)
      for (int j = 0; j < `N`; j+=`blck`)
        for (int ii = i; ii<i+`blck` && ii<`M`; ii++)
          #pragma omp simd
          for (int jj = j; jj<min(j+`blck`,`N`); jj++)
            `po`[ii+jj*`M`] = `pa`[jj+ii*`N`];
    """.}
And you can gain about 20% more performance with prefetching:
proc bench2DtilingExchangedPrefetch(a: seq[float32], nb_samples: int) =
  var output = newSeq[float32](out_size)
  withCompilerOptimHints()
  let pa{.restrict.} = a[0].unsafeAddr
  let po{.restrict.} = output[0].addr
  const blck = 32

  bench("2D Tiling + Prefetch - input row iteration"):
    discard
  do:
    {.emit: """
    #pragma omp parallel for collapse(2)
    for (int j = 0; j < `N`; j+=`blck`)
      for (int i = 0; i < `M`; i+=`blck`)
        for (int jj = j; jj<j+`blck` && jj<`N`; jj++) {
          #pragma omp simd
          for (int ii = i; ii<min(i+`blck`,`M`); ii++)
            `po`[ii+jj*`M`] = `pa`[jj+ii*`N`];
          __builtin_prefetch(&`pa`[(i+1)*`N`], 0, 1); // Prefetch read with low temporal locality
        }
    """.}
Most of the low-hanging fruit for accelerating convolution is similar to accelerating transposition, so I suggest you try to understand those benchmarks and techniques.
Beyond the convolution research I already linked in the previous post, if you really want to go down the rabbit hole, additional reads are:
- 6 layers of tiling, 2 for each of the M, N, K dimensions (since we do M-by-K * K-by-N -> M-by-N matrices).
- repacking the data into temporary storage and changing the data layout so that all tile accesses are contiguous on the M and N dimensions (i.e. transposing so that we access K-by-M and K-by-N).
- extra tiling for vectorization, and careful tracking of low-level register usage.
- all the tiling is carefully tuned to properly use the L1 and L2 caches and to limit TLB misses.
- prefetching
- in actual code: https://github.com/numforge/laser/blob/d1e6ae61/laser/primitives/matrix_multiplication/gemm_tiling.nim#L284-L305