I am trying to call some BLAS functions for linear algebra; take matrix multiplication as an example.
This is what I am doing:
type
  Matrix[M, N: static[int]] = array[N, array[M, float64]]
  TransposeType = enum
    noTranspose = 111, transpose = 112, conjTranspose = 113
  OrderType = enum
    rowMajor = 101, colMajor = 102

proc sgemm(ORDER: OrderType, TRANSA, TRANSB: TransposeType, M, N, K: int, ALPHA: float64,
  A: ptr float64, LDA: int, B: ptr float64, LDB: int, BETA: float64, C: ptr float64, LDC: int)
  {.header: "cblas.h", importc: "cblas_sgemm".}

template asPtr[M, N: static[int]](a: Matrix[M, N]): ptr float64 = cast[ptr float64](a.addr)

proc `*`*[M, N, K: static[int]](a: var Matrix[M, K], b: var Matrix[K, N]): Matrix[M, N] {.inline.} =
  sgemm(colMajor, noTranspose, noTranspose, M, N, K, 1, a.asPtr, M, b.asPtr, K, 0, result.asPtr, M)
I am using column-major order (notice that M and N in the definition of Matrix are swapped), since this is the native Fortran order and, as far as I understand, should give better performance (a short layout sketch follows the timing code below). To use this, I do something like
import math, times

proc makeMatrix(M, N: static[int], f: proc (i, j: int): float64): Matrix[M, N] =
  for i in 0 ..< N:
    for j in 0 ..< M:
      result[i][j] = f(i, j)

var
  mat1 = makeMatrix(1000, 987, proc(i, j: int): float64 = random(1.0))
  mat2 = makeMatrix(987, 876, proc(i, j: int): float64 = random(1.0))

let startTime1 = epochTime()
for i in 0 ..< 10:
  discard mat1 * mat2
let endTime1 = epochTime()
echo "We have required ", endTime1 - startTime1, " seconds to multiply matrices 10 times."
Now, the problem is that this is really quite slow. I have tried this with both ATLAS and Intel MKL (multithreaded), and I have also tried the same multiplication with Numpy and Breeze (a Scala library). A summary of the times I obtain appears in the table further down.
What can I do to figure out why BLAS libraries appear to be slower when called from Nim? Am I doing something wrong in the way I represent data?
Sorry, but this benchmark is quite flawed. That said, I have no idea if the results change when you address my concerns. But hey, beating Python+Numpy is not that bad, is it?
@Varriount I am compiling with nim c -d:release blas.nim. Inside the file I have {.passl: "-lmkl_intel_lp64", passl: "-lmkl_core", passl: "-lmkl_gnu_thread", passl: "-lgomp".} (for MKL + threading) or {.passl: "-lcblas".} (for ATLAS).
@Araq Yeah, I know this is not a serious benchmark. I will try to make one, but I was surprised at the big difference between Nim and Python/Scala. Consider that neither Python nor Scala is using the Intel MKL libraries, which are much more heavily optimized than ATLAS or Netlib, and I also think that they are running single-threaded (will have to check). So I believe the meaningful comparison should be 8 seconds vs 1.5 seconds.
About the type Matrix[M, N]: my intention would be to have a Nim wrapper which is type-safe and knows dimensions at compile time. I have found Nim static[T] types very convenient for this. For instance, Nim is able to infer dimensions, and keeps track of those numbers while I apply operations. It also allows me to allocate everything on the stack, unless dimensions get really big. The fact that behind the scenes I just pass a pointer is due to the BLAS interface, which I am in fact trying to hide.
Do you have in mind any particular downside in doing this?
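To illustrate the dimension tracking I mean, here is a small sketch using the * operation defined earlier (the commented line is a deliberate error):

var
  a: Matrix[2, 3]
  b: Matrix[3, 4]
  c: Matrix[4, 4]
let ab = a * b        # the result type Matrix[2, 4] is inferred at compile time
# let ac = a * c      # rejected at compile time: inner dimensions 3 and 4 do not match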
@Araq: the reason why I am using a var there is that otherwise I get Error: expression has no address. This is something I wanted to fix, but I postponed this after having found what is causing the slowdown. Ideally I would like this to work for immutable matrices allocated either on the stack or on the heap.
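One possible way around the var requirement is Nim's unsafeAddr, which takes the address of a non-var expression; a minimal, untested sketch (safe here insofar as BLAS only reads A and B):

template asPtr[M, N: static[int]](a: Matrix[M, N]): ptr float64 =
  cast[ptr float64](a.unsafeAddr)

proc `*`*[M, N, K: static[int]](a: Matrix[M, K], b: Matrix[K, N]): Matrix[M, N] {.inline.} =
  sgemm(colMajor, noTranspose, noTranspose, M, N, K, 1, a.asPtr, M, b.asPtr, K, 0, result.asPtr, M)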
@Varriount: Yeah, I just came back to work and I plan to try that, as well as various combinations of matrix size and backend libraries for all languages.
Implementation   | 10 multiplications | 100 multiplications
-----------------|--------------------|--------------------
Numpy ATLAS      | 1350ms             | 13322ms
Breeze ATLAS     | 1359ms             | 13536ms
Breeze Netlib    | 4222ms             | 41128ms
Breeze Java (?)  | 6596ms             | 69687ms
Nim ATLAS        | 7182ms             | 71451ms
Nim MKL threaded | 1686ms             | 15261ms
Nim MKL single   | 6324ms             | 58766ms
C MKL threaded   | 271ms              | 1827ms
C MKL single     | 794ms              | 6438ms
The fact that in all implementations the time for 100 multiplications is about 10x the time for 10 multiplications should confirm that the work is not being optimized away.
The only guess I am left with is that the BLAS libraries may be slower because in Nim I am allocating matrices on the stack. I will add more details as soon as I try other things.
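If someone wants to test the stack hypothesis before I do, a hypothetical variant (names assumed) would keep the same column-major layout but move the storage to the heap:

type MatrixRef[M, N: static[int]] = ref array[N, array[M, float64]]

proc newMatrix(M, N: static[int]): MatrixRef[M, N] =
  new result                  # zero-initialized heap allocation instead of a stack array

template asPtr[M, N: static[int]](a: MatrixRef[M, N]): ptr float64 =
  cast[ptr float64](a[].addr)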
Profiling with nimprof gives the following output:

total executions of each stack trace:
Entry: 1/4 Calls: 8/22 = 36.% [sum: 8; 8/22 = 36.%]
:anonymous 8/22 = 36.%
makeMatrix 10/22 = 45.%
blas 22/22 = 1.0e+02%
Entry: 2/4 Calls: 8/22 = 36.% [sum: 16; 16/22 = 73.%]
genericReset 12/22 = 55.%
genericReset 12/22 = 55.%
genericReset 12/22 = 55.%
blas 22/22 = 1.0e+02%
Entry: 3/4 Calls: 4/22 = 18.% [sum: 20; 20/22 = 91.%]
genericReset 12/22 = 55.%
genericReset 12/22 = 55.%
blas 22/22 = 1.0e+02%
Entry: 4/4 Calls: 2/22 = 9.1% [sum: 22; 22/22 = 1.0e+02%]
makeMatrix 10/22 = 45.%
blas 22/22 = 1.0e+02%
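If I read this correctly, the genericReset frames are the Nim runtime zero-initializing the big value-type matrices (result is zeroed on every call). As an aside, a plain variable can opt out of default zero-initialization with the noinit pragma (sketch only, not a fix for result itself):

var scratch {.noinit.}: Matrix[1000, 987]   # deliberately left uninitialized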
For reference, this is the full program I am running:
when defined(mkl):
  const header = "mkl.h"
  when defined(threaded):
    {.passl: "-lmkl_intel_lp64", passl: "-lmkl_core", passl: "-lmkl_gnu_thread", passl: "-lgomp".}
    static: echo "--USING MKL THREADED--"
  else:
    {.passl: "-lmkl_intel_lp64", passl: "-lmkl_core", passl: "-lmkl_sequential", passl: "-lpthread".}
    static: echo "--USING MKL SEQUENTIAL--"
else:
  when defined(atlas):
    {.passl: "-lcblas".}
    const header = "atlas/cblas.h"
    static: echo "--USING ATLAS--"
  else:
    {.passl: "-lblas".}
    const header = "cblas.h"
    static: echo "--USING DEFAULT BLAS--"

type
  Matrix32*[M, N: static[int]] = array[N, array[M, float32]]
  Matrix64*[M, N: static[int]] = array[N, array[M, float64]]
  Matrix*[M, N: static[int]] = Matrix64[M, N]
  TransposeType = enum
    noTranspose = 111, transpose = 112, conjTranspose = 113
  OrderType = enum
    rowMajor = 101, colMajor = 102

proc sgemm(ORDER: OrderType, TRANSA, TRANSB: TransposeType, M, N, K: int, ALPHA: float64,
  A: ptr float64, LDA: int, B: ptr float64, LDB: int, BETA: float64, C: ptr float64, LDC: int)
  {.header: header, importc: "cblas_sgemm".}

template asPtr[M, N: static[int]](a: Matrix64[M, N]): ptr float64 = cast[ptr float64](a.addr)

proc makeMatrix(M, N: static[int], f: proc (i, j: int): float64): Matrix64[M, N] =
  for i in 0 ..< N:
    for j in 0 ..< M:
      result[i][j] = f(i, j)

proc `*`*[M, N, K: static[int]](a: var Matrix64[M, K], b: var Matrix64[K, N]): Matrix64[M, N] {.inline.} =
  sgemm(colMajor, noTranspose, noTranspose, M, N, K, 1, a.asPtr, M, b.asPtr, K, 0, result.asPtr, M)

when isMainModule:
  import math, times, nimprof

  var
    mat1 = makeMatrix(1000, 987, proc(i, j: int): float64 = random(1.0))
    mat2 = makeMatrix(987, 876, proc(i, j: int): float64 = random(1.0))

  let startTime1 = epochTime()
  for i in 0 ..< 10:
    discard mat1 * mat2
  let endTime1 = epochTime()
  echo "We have required ", endTime1 - startTime1, " seconds to multiply matrices 10 times."
After some more experiments, I finally found the issue: I was calling the BLAS function sgemm, which is the single-precision routine, instead of dgemm, the double-precision one.
Now, I am not sure why feeding it the wrong precision should slow things down that much. Possibly the reason is that this area of memory was initialized with float64 values, so who knows what the contents looked like when reinterpreted as float32; one plausible culprit is subnormal floats, which many CPUs process far more slowly than normal values.
In any case, using dgemm I get performance comparable with the other languages. Thanks to everyone!
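For the record, the fix amounts to importing the double-precision routine instead; a sketch, with the rest of the code unchanged:

proc dgemm(ORDER: OrderType, TRANSA, TRANSB: TransposeType, M, N, K: int, ALPHA: float64,
  A: ptr float64, LDA: int, B: ptr float64, LDB: int, BETA: float64, C: ptr float64, LDC: int)
  {.header: header, importc: "cblas_dgemm".}  # the double-precision gemm

proc `*`*[M, N, K: static[int]](a: var Matrix64[M, K], b: var Matrix64[K, N]): Matrix64[M, N] {.inline.} =
  dgemm(colMajor, noTranspose, noTranspose, M, N, K, 1, a.asPtr, M, b.asPtr, K, 0, result.asPtr, M)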
Sure, I plan to give BLAS and LAPACK a reasonable Nim interface and publish it on Nimble.
I will make appropriate benchmarks, but I expect the timing to be dependent on the underlying Fortran library. In fact, I do not see much variation, say, in calling ATLAS from Python, Scala or C.
I do not have them readily available right now, but they are essentially the same; the difference is by far due to the BLAS implementation.
If you want to try it yourself, you can use the wrapper I have written around BLAS (it is also going to support GPUs; see the cublas branch).
For completeness, here are the updated timings with Nim using dgemm:

Implementation   | 10 multiplications | 100 multiplications
-----------------|--------------------|--------------------
Numpy ATLAS      | 1350ms             | 13322ms
Breeze ATLAS     | 1359ms             | 13536ms
Breeze Netlib    | 4222ms             | 41128ms
Breeze Java (?)  | 6596ms             | 69687ms
Nim ATLAS        | 1434ms             | 14111ms
Nim MKL threaded | 327ms              | 1811ms
Nim MKL single   | 694ms              | 6344ms
C MKL threaded   | 271ms              | 1827ms
C MKL single     | 794ms              | 6438ms
Keep in mind that this is just a very naive benchmark that multiplies a 1000×987 matrix by a 987×876 matrix 10 or 100 times.