Hi All,
I just finished a brief article on programming in Nim, demonstrating it using Kernel Matrix calculations (https://www.active-analytics.com/blog/kernel-matrix-calculation-in-nim/). It's live, but I wanted to get the opinion of the Nim community on its fairness and accuracy. I will correct anything that is inaccurate and will apply suggested code changes that have technical merit, using only the language and the standard library.
Thank you
Both arrays and sequences in Nim are value rather than reference types.
This is not true, sequences are reference types. I'd suggest you:
1st. Correct this.
Corrected. Thanks
2nd. Use an array instead of a seq, something like this maybe: ...
Using sequences serves two purposes: first, it introduces readers to the sequence data structure. Secondly, matrix data structures are often implemented in such a way that the number of items in the matrix can be changed, for example when appending or removing data. Thanks for the implementation though, that is useful.
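For readers new to Nim, the growability point is just that a seq's length can change at run time while an array's cannot. A tiny illustration (not from the article):

var v = @[1.0, 2.0, 3.0]   # a seq: length chosen and changeable at run time
v.add 4.0                  # seqs can grow ...
v.setLen(2)                # ... and shrink, handy when appending or removing matrix data
echo v                     # @[1.0, 2.0]
let a: array[3, float] = [1.0, 2.0, 3.0]   # an array's length is fixed at compile time
echo a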
Consider a naive implementation of a .* operator to add two sequences together:
You mean "compute the dot product"? That seems to be what that operation actually does in the example beneath it.
This is not true, sequences are reference types
My understanding is that the "top-level" structure of a sequence is a value object (also containing length and capacity), but the structure contains a reference to the actual seq data on the heap.
Has this changed?
You mean "compute the dot product"? That seems to be what that operation actually does in the example beneath it.
Thanks for spotting the mistake; it's not the dot product but just a simple element-by-element multiplication. It shows you can easily create operators similar to Julia's element-by-element operations if you wish. It's corrected now.
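For anyone reading along, the idea is an element-wise product along these lines (a minimal sketch, not the article's exact code):

proc `.*`(x, y: seq[float]): seq[float] =
  # element-by-element multiplication, akin to Julia's broadcasting .* syntax
  assert x.len == y.len
  result = newSeq[float](x.len)
  for i in 0 ..< x.len:
    result[i] = x[i] * y[i]

let z = @[1.0, 2.0, 3.0] .* @[4.0, 5.0, 6.0]
echo z   # @[4.0, 10.0, 18.0]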
No, seqs are not reference types; they copy on assignment. If you need ref semantics you would use a ref seq[T].
When I first wrote that both arrays and sequences were value types in Nim I was fairly sure, but I haven't written Nim in a little while. In my defence, when I made the wrong amendment it was late and I was tired, but I should have double-checked. I have reverted to the original, correct version: both arrays and sequences are value types. Thanks
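A quick snippet makes the copy-on-assignment behaviour easy to verify (illustrative, not from the article):

var a = @[1, 2, 3]
var b = a          # assignment copies the seq's contents
b[0] = 99
echo a             # @[1, 2, 3] -- the original is untouched
echo b             # @[99, 2, 3]

# reference semantics only when asked for explicitly:
var ra: ref seq[int]
new ra
ra[] = @[1, 2, 3]
let rb = ra        # copies the reference, not the data
rb[][0] = 99
echo ra[]          # @[99, 2, 3] -- both names see the change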
Other than that, nice article! Always great to see articles talking about Nim for mathematical / scientific etc. usages!
Thank you very much for your suggestions, they were useful. My amendments, comments or otherwise:
"Consider a naive implementation of a .* operator to add two sequences together:" -> to multiply two sequences
This correction is already done - maybe refresh your browser. I usually check in private mode because then the browser doesn't cache.
in the equation for the polynomial kernel use \text{Offset}
Now changed; I was in two minds about this but it does look better as text. I also found a copy-paste error in the LaTeX which I've now fixed.
personally would not define var ret and use return ret. Instead use the implicit result variable and avoid explicit return
Good point, now changed.
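i.e. roughly this shape (illustrative proc, not the article's code):

proc scaled(xs: seq[float], factor: float): seq[float] =
  # `result` is implicitly declared and returned; no `var ret` / `return ret` needed
  result = newSeq[float](xs.len)
  for i in 0 ..< xs.len:
    result[i] = factor * xs[i]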
semicolons at line ends look weird to me
Removed them for consistency; it's probably more in line with Nim style anyway.
in gaussian I would define the var tmp before the loop and just reassign
This might be more of a taste thing, but I usually declare variables that are only used in a loop within that loop. Unchanged.
you didn't show the implementation of $ for the matrix type. Not a problem, but I would mention that it has to be defined manually (maybe I missed a reference to that)
You're right, I didn't mention anything about the rest of the implementation. I have now noted that those parts have been left out.
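For anyone wondering what was left out, a minimal hand-written `$` might look roughly like this (the Matrix layout here is assumed for illustration; it is not the article's actual definition):

import std/strutils

type Matrix = object
  nrows, ncols: int
  data: seq[float]          # assumed row-major storage

proc `$`(m: Matrix): string =
  # nicely formatted output has to be written by hand for a custom matrix type
  for i in 0 ..< m.nrows:
    var row = newSeq[string](m.ncols)
    for j in 0 ..< m.ncols:
      row[j] = $m.data[i * m.ncols + j]
    result.add(row.join("\t") & "\n")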
given that this is numeric code, which should probably run fast, I wouldn't refer to nim c -r <file> without mentioning -d:danger to readers
Now added with a link to Nim compiler documentation.
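For readers following along, the invocation then looks something like this (file name is just a placeholder):

nim c -r -d:danger kernel_matrix.nim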
regarding no SIMD support from the compiler: thanks to macros one can sugar coat SIMD in principle very well oneself
Yes. I added the comment on concurrency and SIMD because it's important for a programmer to know what they are letting themselves in for, particularly with regard to scientific computing, where SIMD and concurrency are basically a must. On one hand they could have the fun of implementing SIMD macros in a powerful language; on the other hand they might not be up for that.
Many thanks for your valuable review!
Regarding SIMD, you have the same challenges as C or C++ in Nim.
You can use GCC/Clang builtin vector types, for example:
{.emit:"typedef float Float32x8 __attribute__ ((vector_size (32)));".}
type Float32x8 {.importc, bycopy.} = object
  raw: array[8, float32]
Or you can use the SIMD intrinsics.
But in terms of implementation you need the same effort as in C or C++, except that Nim metaprogramming makes it possible to generalize your SIMD implementations to many architectures or SIMD sizes.
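As a rough sketch of the intrinsics route, the standard C intrinsics can be bound directly via importc (the wrapper below is only illustrative; the type and function names come from immintrin.h):

{.passC: "-mavx".}   # enable AVX code generation in the backing C compiler
type m256 {.importc: "__m256", header: "immintrin.h".} = object
proc mm256_setzero_ps(): m256 {.importc: "_mm256_setzero_ps", header: "immintrin.h".}
proc mm256_loadu_ps(p: ptr float32): m256 {.importc: "_mm256_loadu_ps", header: "immintrin.h".}
proc mm256_add_ps(a, b: m256): m256 {.importc: "_mm256_add_ps", header: "immintrin.h".}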
For example, my SIMD definitions for SSE2 and AVX512 matrix multiplication allow me, in thousands of lines of Nim code, to be as fast as 50x more lines of pure assembly in OpenBLAS.
Note that this includes integer support, which no scientific library has in an efficient manner (150x faster than NumPy or Julia on 1500x1500 matrices), and it's also easy to implement a fallback to scalar code.
For example, my SIMD definitions for SSE2 and AVX512 matrix multiplication allow me, in thousands of lines of Nim code, to be as fast as 50x more lines of pure assembly in OpenBLAS.
Can you please elaborate on that? I looked into your code and, since you're using intrinsics, you're dependent on the mercy of the compiler to schedule everything right; is the assembler code you're talking about worse than what the compiler achieves? Or do you use some faster algorithm to perform the matrix multiplication?
To answer all 3 questions:
since you're using intrinsics you're dependent on the mercy of the compiler to schedule everything right
I use intrinsics because the compiler cannot derive the best schedule, and using explicit intrinsics actually avoids being at the mercy of the compiler. At least for SIMD intrinsics, compilers do respect what is written in C/C++/Nim and there is no need to go down to assembly (which is not true for BigInt or crypto intrinsics, however). You do need the restrict/{.noalias.} pragma and prefetch intrinsics as well to reach state-of-the-art performance.
is the assembler code you're talking about worse than what the compiler achieves?
No, the assembler code I'm talking about is the fruit of dozens of PhD theses and thousands of man-hours of research into optimizing matrix multiplication. Compilers are still playing catch-up on automatically generating efficient matrix multiplication kernels, and nowadays they even do pattern matching and swap in manually optimized assembly [1].
References (https://github.com/numforge/laser/blob/d1e6ae6/laser/primitives/matrix_multiplication/gemm_tiling.nim#L12-L30)
# Papers:
# [1] Anatomy of High-Performance Matrix Multiplication (Revised)
# Kazushige Goto, Robert A. Van de Geijn
# - http://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf
#
# [2] Anatomy of High-Performance Many-Threaded Matrix Multiplication
# Smith et al
# - http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf
#
# [3] Automating the Last-Mile for High Performance Dense Linear Algebra
# Veras et al
# - https://arxiv.org/pdf/1611.08035.pdf
#
# [4] GEMM: From Pure C to SSE Optimized Micro Kernels
# Michael Lehn
# - http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html
#
# Laser wiki - GEMM optimization resources
# - https://github.com/numforge/laser/wiki/GEMM-optimization-resources
That file has a lot of details on how to schedule high-performance compute kernels that compilers can't match at the moment. Note that those sizes assume no hyperthreading, because with hyperthreading a physical core will have only half the L1 and L2 cache available per panel, leading to many more flushes/reloads.
# We have to take into account vectorisation
# so that the microkernel can be processed with vectorized intrinsics.
#
# Caux [mr, nr] must stay in register.
# - mr ~= nr is optimal to amortize register load cost
# - some registers must be left to prefetch Ã and ~B (PackedA and PackedB)
# - nr >= (flops/cycle) / (bytes/cycle) * sizeof(element)
#
# For example Haswell is capable of
# - 32 single-precision FLOPs/cycle
# - 32 bytes/cycle store and 64 bytes/cycle load (store C, load A and B)
#
# so nr >= 32/32 * 4
# For that number of FLOP it must schedule
# 2xFMA so consume 16 single-precision float
# so mr*nr >= 16
# Registers constraints and micro-kernel tuning
# - To issue 2xFMAs in parallel we need to use 2x SIMD registers
# - We want to hold C of size MR * NR completely in SIMD registers as well
# as each value is reused k times during accumulation C[i, j] += A[i, k] * B[k, j]
# - We should have enough SIMD registers left to hold
# the corresponding sections of A and B (at least 4, 2xA and 2xB for FMAs)
#
# On x86-64 with X SIMD registers that can issue 2xFMAs per cycle:
# - NbVecs is 2 minimum
# - RegsPerVec = 2 * NbVecs => 4 minimum (for A and for B)
# - NR = NbVecs * NbScalarsPerSIMD
# - C: MR*NR and uses MR*NbVecs SIMD registers
# - MR*NbVecs + RegsPerVec <= X
# -> MR*NbVecs + 2 * NbVecs <= X
# -> (MR+2) * NbVecs <= X
#
# Some solutions:
# - AVX with 16 registers:
# - MR = 6, NbVecs = 2
# FP32: 8xFP32 per SIMD --> NR = 2x8
# ukernel = 6x16
# FP64, ukernel = 6x8
# - MR = 2, NbVecs = 4
# FP32: 8xFP32 per SIMD --> NR = 4x8
# ukernel = 2x32
# FP64, ukernel = 2x16
# - AVX512 with 32 registers
# - MR = 6, NbVecs = 4
# FP32 ukernel = 6x64
# FP64 ukernel = 6x32
# - MR = 2, NbVecs = 8
# FP32 ukernel = 2x128
# FP64 ukernel = 2x64
# - MR = 14, NbVecs = 2
# FP32 ukernel = 14x32
# FP64 ukernel = 14x16
# The inner microkernel loop does:
# AB[m][n] = A[m] * B[n]
# So n should be the vector size
# if most matrices are row-Major.
# This avoids dealing with transpose
# in the inner loop and untranspose in the epilogue
# ## Panel sizes
# - TLB constraint
# TÃ + 2(TBj + TCj) ≤ T
# Note: optimizing the similar problem mckc/(2mc+2kc)
# under the constraint that mckc ≤ K is the problem
# of maximizing the area of a rectangle
# while minimizing the perimeter,
#
# Goto paper [1] section 6.3: choosing kc
# - kc should be as large as possible to amortize the mr*nr updates of Cj
# - Elements from Bj [kc, nr] must remain in L1 cache.
# - kc * nr should occupy less than half the L1 cache
# so that Ã and Caux do not evict element of Bj
# - Ã [kc, mc] should occupy
# a considerable fraction of the L2 cache
# In our experience optimal choice is so that "kc" float64 occupy half a page
# -> a page is 4096 bytes = 512 float64 -> half a page = 256
# Goto paper [1] section 6.3: choosing mc
# - mc*kc should fill a considerable part of (1) the memory addressable
# by the TLB and (2) the L2 cache
# In practice mc is chosen so that A occupies about half the smaller of (1) and (2)
Or do you use some faster algorithm to perform the matrix multiplication?
No, I use the default O(n^3) algorithm. Asymptotically faster algorithms like Strassen may be theoretically faster, but implementation "details" already make a well-tuned default algorithm 100x~1000x faster than a naive implementation.
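For context, the "default algorithm" is just the textbook triple loop; all the tiling and packing above reorganises this same computation for caches and registers without changing the operation count (sketch with illustrative names):

# naive O(n^3) GEMM over flat row-major buffers: C[m x n] = A[m x k] * B[k x n]
proc naiveGemm(m, n, k: int, a, b: seq[float64], c: var seq[float64]) =
  for i in 0 ..< m:
    for j in 0 ..< n:
      var acc = 0.0
      for p in 0 ..< k:
        acc += a[i*k + p] * b[p*n + j]
      c[i*n + j] = acc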
I've collated further research into high performance computing optimization in:
[1] Regarding pattern-matching matrix multiplication: Intel ICC detects nested triple for-loops and swaps them for calls to its assembly library, MKL. LLVM Polly (their polyhedral optimizer for nested loops) will detect nested triple for-loops and apply the BLIS transformation instead of automated polyhedral scheduling (https://reviews.llvm.org/D21491). Laser's code architecture is based on BLIS at a high level as well.