Hi All,
I just finished a brief article on programming in Nim, demonstrating it using Kernel Matrix calculations (https://www.active-analytics.com/blog/kernel-matrix-calculation-in-nim/). It's live, but I wanted to get the opinion of the Nim community on its fairness and accuracy. I will correct anything that is inaccurate and will apply suggested code changes that have technical merit, using only the language and the standard library.
Thank you
Both arrays and sequences in Nim are value rather than reference types.
This is not true, sequences are reference types. I'd suggest you:
1st. Correct this.
Corrected. Thanks
2nd. Use an array instead of a seq, something like this maybe: ...
Using sequences serves two purposes: first, it introduces readers to the sequence data structure. Secondly, matrix data structures are often implemented in such a way that the number of items in the matrix can be changed, for example when appending or removing data. Thanks for the implementation though, that is useful.
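For readers new to Nim, the growability point is just that a seq's length can change at run time while an array's cannot. A tiny illustration (not from the article):

var v = @[1.0, 2.0, 3.0]   # a seq: length chosen and changeable at run time
v.add 4.0                  # seqs can grow ...
v.setLen(2)                # ... and shrink, handy when appending or removing matrix data
echo v                     # @[1.0, 2.0]
let a: array[3, float] = [1.0, 2.0, 3.0]   # an array's length is fixed at compile time
echo a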
Consider a naive implementation of a .* operator to add two sequences together:
You mean "compute the dot product"? That seems to be what that operation actually does in the example beneath it.
This is not true, sequences are reference types
My understanding is that the "top-level" structure of a sequence is a value object (also containing length and capacity), but the structure contains a reference to the actual seq data on the heap.
Has this changed?
You mean "compute the dot product"? That seems to be what that operation actually does in the example beneath it.
Thanks for spotting the mistake; it's not the dot product but just a simple element-by-element multiplication. It shows you can easily create operators similar to Julia's element-by-element operations if you wish. It's corrected now.
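For anyone reading along, the idea is an element-wise product along these lines (a minimal sketch, not the article's exact code):

proc `.*`(x, y: seq[float]): seq[float] =
  # element-by-element multiplication, akin to Julia's broadcasting .* syntax
  assert x.len == y.len
  result = newSeq[float](x.len)
  for i in 0 ..< x.len:
    result[i] = x[i] * y[i]

let z = @[1.0, 2.0, 3.0] .* @[4.0, 5.0, 6.0]
echo z   # @[4.0, 10.0, 18.0]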
No, seqs are not reference types; they copy on assignment. If you need ref semantics you would use a ref seq[T].
When I first wrote that both arrays and sequences were value types in Nim I was fairly sure, but I haven't written Nim in a little while. In my defence, when I made the wrong amendment it was late and I was tired, but I should have double-checked. I have reverted to the original, correct version: both arrays and sequences are value types. Thanks
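A quick snippet makes the copy-on-assignment behaviour easy to verify (illustrative, not from the article):

var a = @[1, 2, 3]
var b = a          # assignment copies the seq's contents
b[0] = 99
echo a             # @[1, 2, 3] -- the original is untouched
echo b             # @[99, 2, 3]

# reference semantics only when asked for explicitly:
var ra: ref seq[int]
new ra
ra[] = @[1, 2, 3]
let rb = ra        # copies the reference, not the data
rb[][0] = 99
echo ra[]          # @[99, 2, 3] -- both names see the change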
Other than that, nice article! Always great to see articles talking about Nim for mathematical / scientific etc. usages!
Thank you very much for your suggestions, they were useful. My amendments, comments or otherwise:
"Consider a naive implementation of a .* operator to add two sequences together:" -> to multiply two sequences
This correction is already done - maybe refresh your browser. I usually check in private mode because then the browser doesn't cache.
in the equation for the polynomial kernel use \text{Offset}
Now changed; I was in two minds about this but it does look better as text. I also found a copy-paste error in the LaTeX which I've now fixed.
personally would not define var ret and use return ret. Instead use the implicit result variable and avoid explicit return
Good point, now changed.
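i.e. roughly this shape (illustrative proc, not the article's code):

proc scaled(xs: seq[float], factor: float): seq[float] =
  # `result` is implicitly declared and returned; no `var ret` / `return ret` needed
  result = newSeq[float](xs.len)
  for i in 0 ..< xs.len:
    result[i] = factor * xs[i]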
semicolons at line ends look weird to me
Removed them for consistency; it's probably more in line with Nim style anyway.
in gaussian I would define the var tmp before the loop and just reassign
This might be more of a taste thing, but I usually declare variables that are only used in a loop within that loop. Unchanged.
you didn't show the implementation of $ for the matrix type. Not a problem, but I would mention that it has to be defined manually (maybe I missed a reference to that)
You're right, I didn't mention anything about the rest of the implementation. I have now noted that those parts have been left out.
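For anyone wondering what was left out, a minimal hand-written `$` might look roughly like this (the Matrix layout here is assumed for illustration; it is not the article's actual definition):

import std/strutils

type Matrix = object
  nrows, ncols: int
  data: seq[float]          # assumed row-major storage

proc `$`(m: Matrix): string =
  # nicely formatted output has to be written by hand for a custom matrix type
  for i in 0 ..< m.nrows:
    var row = newSeq[string](m.ncols)
    for j in 0 ..< m.ncols:
      row[j] = $m.data[i * m.ncols + j]
    result.add(row.join("\t") & "\n")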
given that this is numeric code, which should probably run fast, I wouldn't refer to nim c -r <file> without mentioning -d:danger to readers
Now added with a link to Nim compiler documentation.
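For readers following along, the invocation then looks something like this (file name is just a placeholder):

nim c -r -d:danger kernel_matrix.nim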
regarding no SIMD support from the compiler: thanks to macros one can sugar coat SIMD in principle very well oneself
Yes. I added the comment on concurrency and SIMD because it's important for a programmer to know what they are letting themselves in for, particularly with regard to scientific computing, where SIMD and concurrency are basically a must. On one hand they could have the fun of implementing SIMD macros in a powerful language; on the other hand they might not be up for that.
Many thanks for your valuable review!
Regarding SIMD, you have the same challenges as C or C++ in Nim.
You can use GCC/Clang builtin vector types, for example:
{.emit:"typedef float Float32x8 __attribute__ ((vector_size (32)));".}
type Float32x8 {.importc, bycopy.} = object
  raw: array[8, float32]
Or you can use the SIMD intrinsics.
But in terms of implementation you need the same effort as in C or C++, except that Nim metaprogramming makes it possible to generalize your SIMD implementations to many architectures or SIMD sizes.
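As a rough sketch of the intrinsics route, the standard C intrinsics can be bound directly via importc (the wrapper below is only illustrative; the type and function names come from immintrin.h):

{.passC: "-mavx".}   # enable AVX code generation in the backing C compiler
type m256 {.importc: "__m256", header: "immintrin.h".} = object
proc mm256_setzero_ps(): m256 {.importc: "_mm256_setzero_ps", header: "immintrin.h".}
proc mm256_loadu_ps(p: ptr float32): m256 {.importc: "_mm256_loadu_ps", header: "immintrin.h".}
proc mm256_add_ps(a, b: m256): m256 {.importc: "_mm256_add_ps", header: "immintrin.h".}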
For example, my SIMD definitions for SSE2 and AVX512 matrix multiplication allow me, in thousands of lines of Nim code, to be as fast as 50x more lines of pure assembly in OpenBLAS.
Note that this includes integer support, which no scientific library has in an efficient manner (150x faster than NumPy or Julia on 1500x1500 matrices), and it's also easy to implement a fallback to scalar code.
For example, my SIMD definitions for SSE2 and AVX512 matrix multiplication allow me, in thousands of lines of Nim code, to be as fast as 50x more lines of pure assembly in OpenBLAS.
Can you please elaborate on that? I looked into your code and, since you're using intrinsics, you're dependent on the mercy of the compiler to schedule everything right; is the assembler code you're talking about worse than what the compiler achieves? Or do you use some faster algorithm to perform the matrix multiplication?
To answer all 3 questions:
since you're using intrinsics you're dependent on the mercy of the compiler to schedule everything right
I use intrinsics because the compiler cannot derive the best schedule, and using explicit intrinsics actually avoids being at the mercy of the compiler. At least for SIMD intrinsics, compilers do respect what is written in C/C++/Nim and there is no need to go down to assembly (which is not true for BigInt or crypto intrinsics, however). You do need the restrict/{.noalias.} pragma and prefetch intrinsics as well to reach state-of-the-art performance.
is the assembler code you're talking about worse than what the compiler achieves?
No, the assembler code I'm talking about is the fruit of dozens of PhD theses and thousands of man-hours of research into optimizing matrix multiplication. Compilers are still playing catch-up on automatically generating efficient matrix multiplication kernels, and nowadays they even do pattern matching and swap in manually optimized assembly [1].
References (https://github.com/numforge/laser/blob/d1e6ae6/laser/primitives/matrix_multiplication/gemm_tiling.nim#L12-L30)
# Papers:
# [1] Anatomy of High-Performance Matrix Multiplication (Revised)
# Kazushige Goto, Robert A. Van de Geijn
# - http://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf
#
# [2] Anatomy of High-Performance Many-Threaded Matrix Multiplication
# Smith et al
# - http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf
#
# [3] Automating the Last-Mile for High Performance Dense Linear Algebra
# Veras et al
# - https://arxiv.org/pdf/1611.08035.pdf
#
# [4] GEMM: From Pure C to SSE Optimized Micro Kernels
# Michael Lehn
# - http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html
#
# Laser wiki - GEMM optimization resources
# - https://github.com/numforge/laser/wiki/GEMM-optimization-resources
That file has a lot of details on how to schedule high-performance compute kernels that compilers can't match at the moment. Note that those sizes assume no hyperthreading, because with hyperthreading a physical core will have only half the L1 and L2 cache available per panel, leading to many more flushes/reloads.
# We have to take into account vectorisation
# so that the microkernel can be processed with vectorized intrinsics.
#
# Caux [mr, nr] must stay in register.
# - mr ~= nr is optimal to amortize register load cost
# - some registers must be left to prefetch Ã and ~B (PackedA and PackedB)
# - nr >= (flops/cycle) / (bytes/cycle) * sizeof(element)
#
# For example Haswell is capable of
# - 32 single-precision FLOPs/cycle
# - 32 bytes/cycle store and 64 bytes/cycle load (store C, load A and B)
#
# so nr >= 32/32 * 4
# For that number of FLOP it must schedule
# 2xFMA so consume 16 single-precision float
# so mr*nr >= 16
# Registers constraints and micro-kernel tuning
# - To issue 2xFMAs in parallel we need to use 2x SIMD registers
# - We want to hold C of size MR * NR completely in SIMD registers as well
# as each value is reused k times during accumulation C[i, j] += A[i, k] * B[k, j]
# - We should have enough SIMD registers left to hold
# the corresponding sections of A and B (at least 4, 2xA and 2xB for FMAs)
#
# On x86-64 with X SIMD registers that can issue 2xFMAs per cycle:
# - NbVecs is 2 minimum
# - RegsPerVec = 2 * NbVecs => 4 minimum (for A and for B)
# - NR = NbVecs * NbScalarsPerSIMD
# - C: MR*NR and uses MR*NbVecs SIMD registers
# - MR*NbVecs + RegsPerVec <= X
# -> MR*NbVecs + 2 * NbVecs <= X
# -> (MR+2) * NbVecs <= X
#
# Some solutions:
# - AVX with 16 registers:
# - MR = 6, NbVecs = 2
# FP32: 8xFP32 per SIMD --> NR = 2x8
# ukernel = 6x16
# FP64, ukernel = 6x8
# - MR = 2, NbVecs = 4
# FP32: 8xFP32 per SIMD --> NR = 4x8
# ukernel = 2x32
# FP64, ukernel = 2x16
# - AVX512 with 32 registers
# - MR = 6, NbVecs = 4
# FP32 ukernel = 6x64
# FP64 ukernel = 6x32
# - MR = 2, NbVecs = 8
# FP32 ukernel = 2x128
# FP64 ukernel = 2x64
# - MR = 14, NbVecs = 2
# FP32 ukernel = 14x32
# FP64 ukernel = 14x16
# The inner microkernel loop does:
# AB[m][n] = A[m] * B[n]
# So n should be the vector size
# if most matrices are row-Major.
# This avoids dealing with transpose
# in the inner loop and untranspose in the epilogue
# ## Panel sizes
# - TLB constraint
# TÃ + 2(TBj + TCj) ≤ T
# Note: optimizing the similar problem mckc/(2mc+2kc)
# under the constraint that mckc ≤ K is the problem
# of maximizing the area of a rectangle
# while minimizing the perimeter,
#
# Goto paper [1] section 6.3: choosing kc
# - kc should be as large as possible to amortize the mr*nr updates of Cj
# - Elements from Bj [kc, nr] must remain in L1 cache.
# - kc * nr should occupy less than half the L1 cache
# so that Ã and Caux do not evict element of Bj
# - Ã [kc, mc] should occupy
# a considerable fraction of the L2 cache
# In our experience optimal choice is so that "kc" float64 occupy half a page
# -> a page is 4096 bytes = 512 float64 -> half a page = 256
# Goto paper [1] section 6.3: choosing mc
# - mc*kc should fill a considerable part of (1) the memory addressable
# by the TLB and (2) the L2 cache
# In practice mc is chosen so that A occupies about half the smaller of (1) and (2)
Or do you use some faster algorithm to perform the matrix multiplication?
No, I use the default O(n^3) algorithm. Asymptotically faster algorithms like Strassen may be theoretically faster, but implementation "details" already make a well-tuned default algorithm 100x~1000x faster than a naive implementation.
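For context, the "default algorithm" is just the textbook triple loop; all the tiling and packing above reorganises this same computation for caches and registers without changing the operation count (sketch with illustrative names):

# naive O(n^3) GEMM over flat row-major buffers: C[m x n] = A[m x k] * B[k x n]
proc naiveGemm(m, n, k: int, a, b: seq[float64], c: var seq[float64]) =
  for i in 0 ..< m:
    for j in 0 ..< n:
      var acc = 0.0
      for p in 0 ..< k:
        acc += a[i*k + p] * b[p*n + j]
      c[i*n + j] = acc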
I've collated further research into high performance computing optimization in:
[1] Regarding pattern-matching matrix multiplication: Intel ICC detects nested triple for-loops and swaps them for calls to its assembly library, MKL. LLVM Polly (their polyhedral optimizer for nested loops) will detect nested triple for-loops and apply the BLIS transformation instead of automated polyhedral scheduling (https://reviews.llvm.org/D21491). Laser's code architecture is based on BLIS at a high level as well.