I am trying to call some BLAS functions for linear algebra; take matrix multiplication as an example.
This is what I am doing:
type
  Matrix[M, N: static[int]] = array[N, array[M, float64]]
  TransposeType = enum
    noTranspose = 111, transpose = 112, conjTranspose = 113
  OrderType = enum
    rowMajor = 101, colMajor = 102

proc sgemm(ORDER: OrderType, TRANSA, TRANSB: TransposeType, M, N, K: int, ALPHA: float64,
  A: ptr float64, LDA: int, B: ptr float64, LDB: int, BETA: float64, C: ptr float64, LDC: int)
  {.header: "cblas.h", importc: "cblas_sgemm".}

template asPtr[M, N: static[int]](a: Matrix[M, N]): ptr float64 = cast[ptr float64](a.addr)

proc `*`*[M, N, K: static[int]](a: var Matrix[M, K], b: var Matrix[K, N]): Matrix[M, N] {.inline.} =
  sgemm(colMajor, noTranspose, noTranspose, M, N, K, 1, a.asPtr, M, b.asPtr, K, 0, result.asPtr, M)
I am using column-major order (notice that M and N in the definition of Matrix are swapped), since this is the native Fortran order and, as far as I understand, should give better performance (a short layout sketch follows the timing code below). To use this, I do something like
import math, times

proc makeMatrix(M, N: static[int], f: proc (i, j: int): float64): Matrix[M, N] =
  for i in 0 ..< N:
    for j in 0 ..< M:
      result[i][j] = f(i, j)

var
  mat1 = makeMatrix(1000, 987, proc(i, j: int): float64 = random(1.0))
  mat2 = makeMatrix(987, 876, proc(i, j: int): float64 = random(1.0))

let startTime1 = epochTime()
for i in 0 ..< 10:
  discard mat1 * mat2
let endTime1 = epochTime()
echo "We have required ", endTime1 - startTime1, " seconds to multiply matrices 10 times."
Now, the problem is that this is really quite slow. I have tried this with both ATLAS and Intel MKL (multithreaded), and I have also tried the same multiplication with Numpy and Breeze (a Scala library). A summary of the times I obtain appears in the table further down.
What can I do to figure out why BLAS libraries appear to be slower when called from Nim? Am I doing something wrong in the way I represent data?
Sorry, but this benchmark is quite flawed. That said, I have no idea if the results change when you address my concerns. But hey, beating Python+Numpy is not that bad, is it?
@Varriount I am compiling with nim c -d:release blas.nim. Inside the file I have {.passl: "-lmkl_intel_lp64", passl: "-lmkl_core", passl: "-lmkl_gnu_thread", passl: "-lgomp".} (for MKL + threading) or {.passl: "-lcblas".} (for ATLAS).
@Araq Yeah, I know this is not a serious benchmark. I will try to make one, but I was surprised at the big difference between Nim and Python/Scala. Consider that neither Python nor Scala is using the Intel MKL libraries, which are much more heavily optimized than ATLAS or Netlib, and I also think that they are running single-threaded (will have to check). So I believe the meaningful comparison should be 8 seconds vs 1.5 seconds.
About the type Matrix[M, N]: my intention would be to have a Nim wrapper which is type-safe and knows dimensions at compile time. I have found Nim static[T] types very convenient for this. For instance, Nim is able to infer dimensions, and keeps track of those numbers while I apply operations. It also allows me to allocate everything on the stack, unless dimensions get really big. The fact that behind the scenes I just pass a pointer is due to the BLAS interface, which I am in fact trying to hide.
Do you have in mind any particular downside in doing this?
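To illustrate the dimension tracking I mean, here is a small sketch using the * operation defined earlier (the commented line is a deliberate error):

var
  a: Matrix[2, 3]
  b: Matrix[3, 4]
  c: Matrix[4, 4]
let ab = a * b        # the result type Matrix[2, 4] is inferred at compile time
# let ac = a * c      # rejected at compile time: inner dimensions 3 and 4 do not match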
@Araq: the reason why I am using a var there is that otherwise I get Error: expression has no address. This is something I wanted to fix, but I postponed this after having found what is causing the slowdown. Ideally I would like this to work for immutable matrices allocated either on the stack or on the heap.
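One possible way around the var requirement is Nim's unsafeAddr, which takes the address of a non-var expression; a minimal, untested sketch (safe here insofar as BLAS only reads A and B):

template asPtr[M, N: static[int]](a: Matrix[M, N]): ptr float64 =
  cast[ptr float64](a.unsafeAddr)

proc `*`*[M, N, K: static[int]](a: Matrix[M, K], b: Matrix[K, N]): Matrix[M, N] {.inline.} =
  sgemm(colMajor, noTranspose, noTranspose, M, N, K, 1, a.asPtr, M, b.asPtr, K, 0, result.asPtr, M)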
@Varriount: Yeah, I just came back to work and I plan to try that, as well as various combinations of matrix size and backend libraries for all languages.
Implementation   | 10 multiplications | 100 multiplications
-----------------|--------------------|--------------------
Numpy ATLAS      | 1350ms             | 13322ms
Breeze ATLAS     | 1359ms             | 13536ms
Breeze Netlib    | 4222ms             | 41128ms
Breeze Java (?)  | 6596ms             | 69687ms
Nim ATLAS        | 7182ms             | 71451ms
Nim MKL threaded | 1686ms             | 15261ms
Nim MKL single   | 6324ms             | 58766ms
C MKL threaded   | 271ms              | 1827ms
C MKL single     | 794ms              | 6438ms
The fact that in all implementations the time for 100 multiplications is about 10x the time for 10 multiplications should confirm that the work is not being optimized away.
The only guess I am left with is that the BLAS libraries may be slower because in Nim I am allocating matrices on the stack. I will add more details as soon as I try other things.
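If someone wants to test the stack hypothesis before I do, a hypothetical variant (names assumed) would keep the same column-major layout but move the storage to the heap:

type MatrixRef[M, N: static[int]] = ref array[N, array[M, float64]]

proc newMatrix(M, N: static[int]): MatrixRef[M, N] =
  new result                  # zero-initialized heap allocation instead of a stack array

template asPtr[M, N: static[int]](a: MatrixRef[M, N]): ptr float64 =
  cast[ptr float64](a[].addr)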
Profiling with nimprof gives the following output:

total executions of each stack trace:
Entry: 1/4 Calls: 8/22 = 36.% [sum: 8; 8/22 = 36.%]
:anonymous 8/22 = 36.%
makeMatrix 10/22 = 45.%
blas 22/22 = 1.0e+02%
Entry: 2/4 Calls: 8/22 = 36.% [sum: 16; 16/22 = 73.%]
genericReset 12/22 = 55.%
genericReset 12/22 = 55.%
genericReset 12/22 = 55.%
blas 22/22 = 1.0e+02%
Entry: 3/4 Calls: 4/22 = 18.% [sum: 20; 20/22 = 91.%]
genericReset 12/22 = 55.%
genericReset 12/22 = 55.%
blas 22/22 = 1.0e+02%
Entry: 4/4 Calls: 2/22 = 9.1% [sum: 22; 22/22 = 1.0e+02%]
makeMatrix 10/22 = 45.%
blas 22/22 = 1.0e+02%
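If I read this correctly, the genericReset frames are the Nim runtime zero-initializing the big value-type matrices (result is zeroed on every call). As an aside, a plain variable can opt out of default zero-initialization with the noinit pragma (sketch only, not a fix for result itself):

var scratch {.noinit.}: Matrix[1000, 987]   # deliberately left uninitialized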
For reference, this is the full program I am running:
when defined(mkl):
  const header = "mkl.h"
  when defined(threaded):
    {.passl: "-lmkl_intel_lp64", passl: "-lmkl_core", passl: "-lmkl_gnu_thread", passl: "-lgomp".}
    static: echo "--USING MKL THREADED--"
  else:
    {.passl: "-lmkl_intel_lp64", passl: "-lmkl_core", passl: "-lmkl_sequential", passl: "-lpthread".}
    static: echo "--USING MKL SEQUENTIAL--"
else:
  when defined(atlas):
    {.passl: "-lcblas".}
    const header = "atlas/cblas.h"
    static: echo "--USING ATLAS--"
  else:
    {.passl: "-lblas".}
    const header = "cblas.h"
    static: echo "--USING DEFAULT BLAS--"

type
  Matrix32*[M, N: static[int]] = array[N, array[M, float32]]
  Matrix64*[M, N: static[int]] = array[N, array[M, float64]]
  Matrix*[M, N: static[int]] = Matrix64[M, N]
  TransposeType = enum
    noTranspose = 111, transpose = 112, conjTranspose = 113
  OrderType = enum
    rowMajor = 101, colMajor = 102

proc sgemm(ORDER: OrderType, TRANSA, TRANSB: TransposeType, M, N, K: int, ALPHA: float64,
  A: ptr float64, LDA: int, B: ptr float64, LDB: int, BETA: float64, C: ptr float64, LDC: int)
  {.header: header, importc: "cblas_sgemm".}

template asPtr[M, N: static[int]](a: Matrix64[M, N]): ptr float64 = cast[ptr float64](a.addr)

proc makeMatrix(M, N: static[int], f: proc (i, j: int): float64): Matrix64[M, N] =
  for i in 0 ..< N:
    for j in 0 ..< M:
      result[i][j] = f(i, j)

proc `*`*[M, N, K: static[int]](a: var Matrix64[M, K], b: var Matrix64[K, N]): Matrix64[M, N] {.inline.} =
  sgemm(colMajor, noTranspose, noTranspose, M, N, K, 1, a.asPtr, M, b.asPtr, K, 0, result.asPtr, M)

when isMainModule:
  import math, times, nimprof

  var
    mat1 = makeMatrix(1000, 987, proc(i, j: int): float64 = random(1.0))
    mat2 = makeMatrix(987, 876, proc(i, j: int): float64 = random(1.0))

  let startTime1 = epochTime()
  for i in 0 ..< 10:
    discard mat1 * mat2
  let endTime1 = epochTime()
  echo "We have required ", endTime1 - startTime1, " seconds to multiply matrices 10 times."
After some more experiments, I finally found the issue: I was calling the BLAS function sgemm, which is the single-precision routine, instead of dgemm, the double-precision one.
Now, I am not sure why feeding it the wrong precision should slow things down that much. Possibly the reason is that this area of memory was initialized with float64 values, so who knows what the contents looked like when reinterpreted as float32; one plausible culprit is subnormal floats, which many CPUs process far more slowly than normal values.
In any case, using dgemm I get performance comparable with the other languages. Thanks to everyone!
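For the record, the fix amounts to importing the double-precision routine instead; a sketch, with the rest of the code unchanged:

proc dgemm(ORDER: OrderType, TRANSA, TRANSB: TransposeType, M, N, K: int, ALPHA: float64,
  A: ptr float64, LDA: int, B: ptr float64, LDB: int, BETA: float64, C: ptr float64, LDC: int)
  {.header: header, importc: "cblas_dgemm".}  # the double-precision gemm

proc `*`*[M, N, K: static[int]](a: var Matrix64[M, K], b: var Matrix64[K, N]): Matrix64[M, N] {.inline.} =
  dgemm(colMajor, noTranspose, noTranspose, M, N, K, 1, a.asPtr, M, b.asPtr, K, 0, result.asPtr, M)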
Sure, I plan to give BLAS and LAPACK a reasonable Nim interface and publish it on Nimble.
I will make appropriate benchmarks, but I expect the timing to be dependent on the underlying Fortran library. In fact, I do not see much variation, say, in calling ATLAS from Python, Scala or C.
I do not have them readily available right now, but they are essentially the same; the difference is by far due to the BLAS implementation.
If you want to try it yourself, you can use the wrapper I have written around BLAS (it is also going to support GPUs; see the cublas branch).
For completeness, here are the updated timings with Nim using dgemm:

Implementation   | 10 multiplications | 100 multiplications
-----------------|--------------------|--------------------
Numpy ATLAS      | 1350ms             | 13322ms
Breeze ATLAS     | 1359ms             | 13536ms
Breeze Netlib    | 4222ms             | 41128ms
Breeze Java (?)  | 6596ms             | 69687ms
Nim ATLAS        | 1434ms             | 14111ms
Nim MKL threaded | 327ms              | 1811ms
Nim MKL single   | 694ms              | 6344ms
C MKL threaded   | 271ms              | 1827ms
C MKL single     | 794ms              | 6438ms
Keep in mind that this is just a very naive benchmark that multiplies a 1000×987 matrix by a 987×876 matrix 10 or 100 times.