I've recently started to learn Nim (and also nimpy for Python interoperability) and the experience so far has been very good, including fast compilation. I've also been learning Arraymancer, which seems very nice for linear algebra. My interest is mainly numerical calculations, so I've been reading various articles that compare different languages for that purpose. For example, one of the articles that I found recently is:
Benchmarking Kernel Matrix Calculations in D, Julia, and Chapel https://github.com/dataPulverizer/KernelMatrixBenchmark
Here the author compares D, Julia, and Chapel for some calculations and summarizes his/her current impressions (based on coding experience and performance). Though I expect Nim would be in a good position in such a benchmark, I am wondering about the difficulty of parallel coding in Nim. For example, in the above benchmark programs,
D uses a syntax like "foreach( ... ; ....parallel(...) )", https://github.com/dataPulverizer/KernelMatrixBenchmark/blob/master/d/kernel.d#L219
Julia uses "threads", https://github.com/dataPulverizer/KernelMatrixBenchmark/blob/master/julia/KernelMatrix.jl#L129
and Chapel uses "forall" statements https://github.com/dataPulverizer/KernelMatrixBenchmark/blob/master/chapel/kernel.chpl#L139
And looking at recent threads (e.g., https://forum.nim-lang.org/t/6434 and https://forum.nim-lang.org/t/6426), I am wondering to what degree it is difficult (or easy) to learn parallel coding in Nim, particularly compared to OpenMP/MPI (of which I have basic knowledge and experience). Do you think it might be rather hard to learn parallelism in Nim at the moment, e.g., requiring more experience and understanding of technical aspects (like "channels")?
I am sorry for a very vague question but I appreciate it if you share any ideas/experiences. Thanks very much :)
Nice username, seems like you love tea ;).
Since you're interested in scientific computing, let's pick some examples from there, I will use matrix transposition.
You can use OpenMP in Nim; the OpenMP operator is ||. It's a bit rough because in the OpenMP section you cannot use seq/string without precautions, and you need to disable stack traces so that the compiler doesn't use them there either:
proc ompNaiveTranspose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Transpose a MxN matrix into a NxM matrix
  # Writes are more expensive than reads, so we keep the i accesses linear for writes
  {.push stacktrace:off.}
  for j in 0||(N-1):
    for i in 0 ..< M:
      bufOut[j*M+i] = bufIn[i*N+j]
  {.pop.}
You can also emit raw C code as a fallback:
proc ompCollapsedTranspose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Transpose a MxN matrix into a NxM matrix
  # We need to go down to the C level for the collapse clause.
  # This relies on the M, N, bufIn, bufOut symbols being the same in C and Nim;
  # the proper interpolation syntax is a bit busy otherwise.
  {.emit: """
  #pragma omp parallel for collapse(2)
  for (int i = 0; i < `M`; ++i)
    for (int j = 0; j < `N`; ++j)
      `bufOut`[j*M+i] = `bufIn`[i*N+j];
  """.}
Otherwise, for a pure Nim solution, I recommend that you use Weave. Disclaimer: I wrote it, so I'm obviously biased.
Matrix transpose:
proc weaveNestedTranspose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Transpose a MxN matrix into a NxM matrix with nested for loops
  parallelFor j in 0 ..< N:
    captures: {M, N, bufIn, bufOut}
    parallelFor i in 0 ..< M:
      captures: {j, M, N, bufIn, bufOut}
      bufOut[j*M+i] = bufIn[i*N+j]
Note that, contrary to OpenMP, Weave deals perfectly with nested loops, which is quite useful for, say, batching already-parallel matrix multiplications.
Also, the load balancer is very different from every implementation of OpenMP, TBB, or any other runtime I'm aware of, and it does not suffer from OpenMP's and TBB's grain-size woes:
Assume you need to work on a matrix of 2000 elements. Some operations, like additions or copies, should not be parallelized on most CPUs, because the bottleneck is memory, not compute, and throwing more threads at a memory-bound problem just gives you all the disadvantages of multithreading without the advantages.
Now assume instead that you need to compute cos/sin/exponentiation; there the work is significant, so multithreading is very valuable.
The runtime woes appear if you write a generic parallel "apply kernel" function that could apply any of these kernels (copy/addition/cos/exp):
proc apply[T](dst: var Matrix[T], src: Matrix[T], kernel: proc(x: T): T) =
  # pseudo-code
  myParallelFor i in 0 ..< src.rows:
    myParallelFor j in 0 ..< src.cols:
      dst[i, j] = kernel(src[i, j])
Most runtimes split the work before entering the loops, but they have no way to know whether the kernel is compute-heavy or trivial, resulting in many worst-case behaviors. Their workaround, when available, is letting you supply a "grain size", but a library author would have trouble explaining and exposing all of that to users (i.e. use 1000 for addition on a Xeon Platinum, use 200 on a laptop, but also make sure that no other application like a web browser is running at the same time ...).
This was investigated in-depth for PyTorch here: https://github.com/zy97140/omp-benchmark-for-pytorch and Weave is the only runtime that addresses this in a clean and always efficient way for library authors and end-users.
You have plenty of examples on how to use Weave (and Nim OpenMP) in the benchmarks folder: https://github.com/mratsim/weave/tree/9f0c384f/benchmarks
And the raytracing demo: https://github.com/mratsim/weave/blob/9f0c384f/demos/raytracing/smallpt.nim#L228
The benchmarks folder includes a matrix multiplication algorithm written from scratch in Nim; its multithreaded performance is higher than OpenBLAS on my machine (about 2.8 TFlops vs 2.7 TFlops) even though the single-threaded kernel is slower (160 GFlops vs 190 GFlops), meaning I achieve a 17.5x speedup on my 18-core machine while OpenBLAS/OpenMP is in the 14.2x range. And this is using large matrices (1920x1920) that are easy to distribute; matrices like 200x1920 would be hard to split with OpenMP, as it doesn't support nested parallelism.
Thanks very much for the detailed info and various links! I did not even know that OpenMP is supported via || (though with some restrictions), so I will play around with it and also learn Weave (which seems more flexible and efficient).
And, I have several more questions...
1) Based on the above info, my understanding is that there are (roughly speaking) three different parallel approaches in Nim; is this correct?
- OpenMP (via ||)
- Weave
- Threads, channels, etc., as described in the Nim manual and "Nim in Action", Chap. 6 (and also the experimental parallel statement: https://nim-lang.org/docs/manual_experimental.html#parallel-amp-spawn-parallel-statement)
2) In both OpenMP and Weave, is it essential to first cast seq to UncheckedArray and use the latter in a parallel region (e.g., inside for-loop with ||, or a block within init(Weave) ... exit(Weave))? In that case, is it also possible to cast Tensor (in Arraymancer) to UncheckedArray by getting a raw (unsafe) data pointer somehow?
3) In the example code of Weave for matrix transpose, captures are used for some variables:
parallelFor j in 0 ..< N:
  captures: {M, N, bufIn, bufOut}
  parallelFor i in 0 ..< M:
    captures: {j, M, N, bufIn, bufOut}
Here, is the meaning of captures similar to shared in OpenMP (roughly speaking)...?
4) I have installed Weave-0.4.0 (+ Nim-1.2.2) and tried the matrix transpose code shown in the GitHub page. Here, I also added the below code at the end of main() to print some array elements:
# In proc main():
  ...
  init(Weave)
  transpose(M, N, bufIn, bufOut)
  exit(Weave)

  # Show some elements.
  echo("input [2 * N + 5] = ", input[2 * N + 5], " (= ", bufIn[2 * N + 5], ")")
  echo("input [5 * N + 2] = ", input[5 * N + 2], " (= ", bufIn[5 * N + 2], ")")
  echo()
  echo("output [2 * M + 5] = ", output[2 * M + 5], " (= ", bufOut[2 * M + 5], ")")
  echo("output [5 * M + 2] = ", output[5 * M + 2], " (= ", bufOut[5 * M + 2], ")")
Compiling with nim c --threads:on test.nim gives the expected result:
input [2 * N + 5] = 4005.0 (= 4005.0)
input [5 * N + 2] = 10002.0 (= 10002.0)
output [2 * M + 5] = 10002.0 (= 10002.0)
output [5 * M + 2] = 4005.0 (= 4005.0)
On the other hand, if I move exit(Weave) to after all the above echo statements, the result changes to
input [2 * N + 5] = 4005.0 (= 4005.0)
input [5 * N + 2] = 10002.0 (= 10002.0)
output [2 * M + 5] = 0.0 (= 0.0)
output [5 * M + 2] = 0.0 (= 0.0)
Does this mean that exit(Weave) plays the role of some "synchronization"(?) for the parallel calculations, and so is mandatory before accessing any UncheckedArray used in the parallelFor regions?
5) Again, in the matrix transpose code above, the input (of type seq) is cast to bufIn (of type UncheckedArray) by using the address obtained from input[0].unsafeAddr. On the other hand, .addr is used to cast output to bufOut. Is this difference important, or is it actually OK whichever of .unsafeAddr or .addr is used?
let input = newSeq[float32](M * N)
let bufIn = cast[ptr UncheckedArray[float32]]( input[0].unsafeAddr )
...
var output = newSeq[float32](N * M)
let bufOut = cast[ptr UncheckedArray[float32]]( output[0].addr )
I am sorry again for many questions! These are not urgent at all (I'm still learning more basic syntax...), but I would appreciate any hints and inputs again. Thanks very much :)
PS. "seems like you love tea ;)" Yes, I recently like to drink Rooibos tea (particularly at night), though I like coffee in the morning :)
I can try to answer 5), that's really the only one I'm qualified to answer XD I'm also learning Weave now and it's a really nice experience, and your other questions intrigue me as well.
The difference between addr and unsafeAddr is that addr only works on mutable things, like var output in the example, but it doesn't work on immutable things like let input. I guess this is to make sure that we consciously have to make the decision to use the unsafe variant if we want a pointer to something that we have declared to be immutable. Hope this helped clear it up a bit :)
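A quick toy example of that rule (my own, not from the thread; the variable names are made up):

let immutable = @[1.0, 2.0, 3.0]
var mutable = @[1.0, 2.0, 3.0]

# let p = immutable[0].addr        # compile error: addr needs a mutable location
let p1 = immutable[0].unsafeAddr   # OK, but we explicitly opt in to "unsafe"
let p2 = mutable[0].addr           # OK: mutable location, plain addr works
echo p1[], " ", p2[]               # dereference with []; prints "1.0 1.0"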
@jasonfi, Thanks much for pointing to your article with various links! Now I'm reading chap.6 of "Nim in Action", which I found very readable and useful to get the key concepts of parallelism used...
@hugogranstrom, Thanks much for your comment, and I didn't know the difference between addr and unsafeAddr that you mentioned (and I didn't even notice the difference of let and var in the sample code... XD) I will check the manual more closely.
@mratsim, Thanks very much again, but please do take your time (no rush at all!) because I still need to read the sample directories (of Weave). Also, there seems to be a video conference (https://conf.nim-lang.org/) very soon (June 20), so I believe many people must be busy with it... (I will watch the "Multithreaded programs" section :)
- Based on the above info, my understanding is that there are (roughly speaking) three different parallel approaches in Nim; is this correct?
- OpenMP (via ||)
- Weave
- Threads, channels, etc., as described in the Nim manual and "Nim in Action", Chap. 6 (and also the experimental parallel statement: https://nim-lang.org/docs/manual_experimental.html#parallel-amp-spawn-parallel-statement)
Yes. You have OpenMP and raw threads (via createThread/joinThread); the Nim threadpool, @yglukhov's threadpool, and Weave are built on top of the raw threads.
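For completeness, a minimal sketch of that raw-threads layer (my own example, not from the thread; compile with --threads:on):

# Each thread writes only its own slot of a plain (non-GC'd) global,
# so the worker stays GC-safe and there is no data race.
var results: array[4, int]

proc worker(id: int) {.thread.} =
  results[id] = id * id

proc main() =
  var threads: array[4, Thread[int]]
  for i in 0 ..< 4:
    createThread(threads[i], worker, i)
  joinThreads(threads)  # wait for all workers to finish
  echo results          # [0, 1, 4, 9]

main()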
The difference between a threadpool and Weave (a multithreading runtime) is that a runtime is a threadpool + a scheduler to deal with load imbalance.
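To make the scheduler's job concrete, here is the classic fib example (adapted from the Weave README): the tasks are wildly imbalanced, which is exactly the load-imbalance situation a runtime handles and a plain threadpool doesn't:

import weave

proc fib(n: int): int =
  if n < 2:
    return n
  let x = spawn fib(n-1)  # may be stolen and run by an idle worker
  let y = fib(n-2)        # computed locally in the meantime
  result = sync(x) + y    # block until the Flowvar result is ready

init(Weave)
echo fib(20)  # 6765
exit(Weave)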
- In both OpenMP and Weave, is it essential to first cast seq to UncheckedArray and use the latter in a parallel region (e.g., inside for-loop with ||, or a block within init(Weave) ... exit(Weave))? In that case, if we want to use Tensor (in Arraymancer) in parallel regions, is it necessary to first cast it to UncheckedArray somehow by getting a raw (unsafe) data pointer?
No. Internally, tensors already use OpenMP (at the moment, if compiled with -d:openmp) and will use Weave in the future. At the moment you can't parallelize on top of tensors, as OpenMP doesn't really support nesting.
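In other words, on the Arraymancer side you write sequential-looking tensor code and the library parallelizes internally; a minimal sketch of what I mean (my own example; compiling with -d:openmp enables the internal OpenMP paths for large tensors):

import arraymancer

let a = @[@[1.0'f32, 2.0],
          @[3.0'f32, 4.0]].toTensor
let b = a.transpose
echo a * b  # matrix multiplication; parallelized internally on big tensors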
3) In the example code of Weave for matrix transpose, captures are used for some variables:
parallelFor j in 0 ..< N:
  captures: {M, N, bufIn, bufOut}
  parallelFor i in 0 ..< M:
    captures: {j, M, N, bufIn, bufOut}
Here, is the meaning of captures similar to shared in OpenMP (roughly speaking)...?
No, the captures are copied (but bufIn/bufOut are pointers to memory locations, so multiple threads access the same memory, each through its own copy of the pointer). This requires the algorithm to be free of data races, or to use synchronization.
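To illustrate those copy semantics, a minimal sketch (my own toy example, not from the thread; the names are made up): offset is captured by copy into each task, and buf is also copied, but since it is a pointer every copy aliases the same buffer. The writes are race-free because each iteration touches a distinct index.

import weave

proc fillShifted(n: int, buf: ptr UncheckedArray[int]) =
  let offset = 100
  parallelFor i in 0 ..< n:
    captures: {offset, buf}  # both captured by copy
    buf[i] = i + offset      # disjoint indices, so no data race

proc main() =
  var data = newSeq[int](10)
  let buf = cast[ptr UncheckedArray[int]](data[0].addr)
  init(Weave)
  fillShifted(data.len, buf)
  syncRoot(Weave)  # global barrier before reading the results (see below)
  exit(Weave)
  echo data        # @[100, 101, ..., 109]

main()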
- I have installed Weave-0.4.0 (+ Nim-1.2.2) and tried the matrix transpose code shown in the Github page. Here, I also added the below code at the end of main() to print some array elements:
# In proc main():
  ...
  init(Weave)
  transpose(M, N, bufIn, bufOut)
  exit(Weave)

  # Show some elements.
  echo("input [2 * N + 5] = ", input[2 * N + 5], " (= ", bufIn[2 * N + 5], ")")
  echo("input [5 * N + 2] = ", input[5 * N + 2], " (= ", bufIn[5 * N + 2], ")")
  echo()
  echo("output [2 * M + 5] = ", output[2 * M + 5], " (= ", bufOut[2 * M + 5], ")")
  echo("output [5 * M + 2] = ", output[5 * M + 2], " (= ", bufOut[5 * M + 2], ")")
Compiling with nim c --threads:on test.nim gives the expected result:
input [2 * N + 5] = 4005.0 (= 4005.0)
input [5 * N + 2] = 10002.0 (= 10002.0)
output [2 * M + 5] = 10002.0 (= 10002.0)
output [5 * M + 2] = 4005.0 (= 4005.0)
On the other hand, if I move exit(Weave) to after all the above echo statements, the result changes to
input [2 * N + 5] = 4005.0 (= 4005.0)
input [5 * N + 2] = 10002.0 (= 10002.0)
output [2 * M + 5] = 0.0 (= 0.0)
output [5 * M + 2] = 0.0 (= 0.0)
Does this mean that exit(Weave) plays the role of some "synchronization"(?) for the parallel calculations, and so is mandatory before accessing any UncheckedArray used in the parallelFor regions?
It's not intended as such; if you want to wait until all submitted parallel work is finished, the intended way is an explicit barrier.
The synchronizations that Weave supports are:
- syncRoot(Weave): a global barrier on the root thread that waits for all spawned tasks and parallel loops,
- sync(someFlowvar): blocks until the result of a spawned task is available,
- dataflow parallelism (tasks that only fire once the events they depend on are triggered), useful for fine-grained synchronization such as stencil computations, where after p[i][j] is done you can do iteration t+1 without waiting for p[i+2][j+2].
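So for the experiment in question 4), the intended pattern would presumably be (a sketch reusing the main() from above):

  init(Weave)
  transpose(M, N, bufIn, bufOut)
  syncRoot(Weave)  # explicit barrier: all parallelFor tasks are done here
  # now it is safe to read bufOut, even before exit(Weave)
  exit(Weave)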
- Again, in the matrix transpose code above, the input (of type seq) is cast to bufIn (of type UncheckedArray) by using the address obtained from input[0].unsafeAddr. On the other hand, .addr is used to cast output to bufOut. Is this difference important, or is it actually OK whichever of .unsafeAddr or .addr is used?
let input = newSeq[float32](M * N)
let bufIn = cast[ptr UncheckedArray[float32]]( input[0].unsafeAddr )
...
var output = newSeq[float32](N * M)
let bufOut = cast[ptr UncheckedArray[float32]]( output[0].addr )
unsafeAddr is for when the location you take the address of is immutable. When you use let, or a plain parameter in a function signature, you promise that it's not mutated, and the compiler enforces that; but once you have a pointer, the compiler cannot help you anymore, hence you need to "opt in" to this unsafe mode.