nimforum mirror - A little guidance on threading needed.

xioren [2024-01-19T08:08:40+01:00]

I am working on a project that spawns threads that operate on a global memory array (seq[seq[uint64]]), and I am trying to wrap my head around the locks module and passing variables to a threaded function. Passing a single argument works, but passing more than one fails.

Slightly modifying the example from the docs:

(this doesn't work)

import std/locks

var
  L: Lock

var x = 0

proc threadFunc(a, b: int) {.thread.} =
  acquire(L) # lock stdout
  x.inc(a)
  echo b
  release(L)


proc doIt() =
  initLock(L)
  var thr: array[0..4, Thread[int]]
  for i in 0..high(thr):
    createThread(thr[i], threadFunc, (i, i*i))
  joinThreads(thr)
  
  deinitLock(L)

doIt()
echo x

I have tried variations of Thread[int] as well

Thread[x, y: int]

Thread[int, int]

So I am at an impasse as the function in my actual code takes a handful of varied arguments. Any pointers?

xioren [2024-01-19T08:55:50+01:00]

Beyond that, while I can access the global var x in a modified version of the docs example, my own code, which also uses a global var, gives this error when I try to create a thread:


Error: 'processSegment' is not GC-safe as it accesses 'globalMem' which is a global using GC'ed memory

I tried changing it from a var to a ref object type but no luck.

PMunch [2024-01-19T09:18:57+01:00]

To pass multiple arguments to a thread you need to wrap them in a tuple:

import std/locks

var
  L: Lock

var x = 0

proc threadFunc(a, b: int) {.thread.} =
  acquire(L) # lock stdout
  x.inc(a)
  echo b
  release(L)


proc doIt() =
  initLock(L)
  var thr: array[0..4, Thread[int]]
  for i in 0..high(thr):
    createThread(thr[i], threadFunc, (i, i*i))
  joinThreads(thr)
  
  deinitLock(L)

doIt()
echo x

That being said, working on a global seq[seq[T]] sounds like it could be trouble. At least it used to be trouble pre-ARC; I'm not sure if it still is.

Araq [2024-01-19T10:18:23+01:00]

That should be:

import std/locks

var
  L: Lock

var x = 0

proc threadFunc(a: (int, int)) {.thread.} =
  acquire(L) # lock stdout
  x.inc(a[0])
  echo a[1]
  release(L)


proc doIt() =
  initLock(L)
  var thr: array[0..4, Thread[(int, int)]]
  for i in 0..high(thr):
    createThread(thr[i], threadFunc, (i, i*i))
  joinThreads(thr)
  
  deinitLock(L)

doIt()
echo x

But you should just use Weave or Malebolgia for threading; these also include examples of how to process arrays in parallel, etc.

mratsim [2024-01-19T10:46:11+01:00]

You'll have to tell us the algorithm you're trying to implement or context because:

If your threads only rarely need concurrent access to the array, you'll be fine, but you might as well use std/atomics fetchAdd and skip locks.

If you have constant concurrent access to the array, the lock will be a bottleneck and your code might be slower than serial code. fetchAdd may help, but I would expect it to still be slower, because each update invalidates the cache of the other cores, leading to "cache thrashing".

If threads access distinct cells in the array, you can parallelize without locks or atomics and enjoy a good speedup.
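A minimal sketch of the first case, assuming a single shared counter rather than the actual array: std/atomics fetchAdd replaces the acquire/release pair around the increment. The names counter, bump and run are made up for illustration, and on Nim versions before 2.0 this needs --threads:on.

```nim
import std/atomics

# Hypothetical shared counter; a zero-initialized global Atomic[int].
var counter: Atomic[int]

proc bump(n: int) {.thread.} =
  # Each increment is a single atomic read-modify-write, so no Lock is needed.
  for _ in 1 .. n:
    discard counter.fetchAdd(1)

proc run() =
  var thr: array[4, Thread[int]]
  for i in 0 .. high(thr):
    createThread(thr[i], bump, 1000)
  joinThreads(thr)

run()
echo counter.load
```

All four threads still hit the same memory location here, so this illustrates correctness only; the cache-contention cost described above still applies.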

PMunch [2024-01-19T13:47:58+01:00]

I see I managed to copy-paste the original snippet and not my edited version. This is what I meant to send:

import std/locks

var
  L: Lock

var x = 0

proc threadFunc(y: tuple[a, b: int]) {.thread.} =
  acquire(L) # lock stdout
  x.inc(y.a)
  echo y.b
  release(L)


proc doIt() =
  initLock(L)
  var thr: array[0..4, Thread[tuple[a, b: int]]]
  for i in 0..high(thr):
    createThread(thr[i], threadFunc, (i, i*i))
  joinThreads(thr)
  
  deinitLock(L)

doIt()
echo x
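For a handful of varied arguments, an object can play the same role as the tuple, since Thread[T] and createThread only need one argument type. The following is a sketch under that assumption; SegmentJob, its fields, worker and total are hypothetical names, not taken from the original code. It sticks to plain value fields, because GC'ed fields such as strings or seqs raise their own questions under the older memory managers. On Nim versions before 2.0, compile with --threads:on.

```nim
import std/locks

type
  SegmentJob = object      # hypothetical bundle of per-thread arguments
    lane: int
    first, last: int
    scale: float

var
  L: Lock
  total = 0

proc worker(job: SegmentJob) {.thread.} =
  # withLock acquires L, runs the body, and releases L afterwards.
  withLock L:
    total.inc(job.last - job.first)

proc run() =
  initLock(L)
  var thr: array[4, Thread[SegmentJob]]
  for i in 0 .. high(thr):
    let job = SegmentJob(lane: i, first: i * 10, last: i * 10 + 10, scale: 1.5)
    createThread(thr[i], worker, job)
  joinThreads(thr)
  deinitLock(L)

run()
echo total
```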

xioren [2024-01-19T18:23:21+01:00]

Thank you for the responses.

@mratsim Argon2. I am 90% done implementing it and just trying to get the threading working. And you are right: the threads access the "memory" in parallel, but they are all assigned different regions, which simplifies things. For better or worse, I seem to be getting away with using a pointer to the memory, so I guess I will stick with that.
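A rough sketch of what the pointer approach could look like, not the actual code from this project: each thread receives a raw pointer to the shared buffer plus the range of blocks it owns, so the thread proc touches no global and needs no lock. It assumes a flat seq of 1 KiB blocks instead of seq[seq[uint64]], and the names Block, LaneJob, fillLane and run are hypothetical. The main thread must keep the seq alive and unresized until joinThreads returns.

```nim
type
  Block = array[128, uint64]             # 128 * 8 bytes = 1 KiB
  LaneJob = object                       # hypothetical per-thread arguments
    mem: ptr UncheckedArray[Block]       # raw view of the shared buffer
    first, count: int                    # region owned by this thread
    lane: int

proc fillLane(job: LaneJob) {.thread.} =
  # Writes only to blocks first ..< first + count, so regions never overlap.
  for j in job.first ..< job.first + job.count:
    for k in 0 ..< 128:
      job.mem[j][k] = uint64(job.lane)

proc run(): seq[Block] =
  const lanes = 4
  const columns = 8
  result = newSeq[Block](lanes * columns)
  let mem = cast[ptr UncheckedArray[Block]](addr result[0])
  var thr: array[lanes, Thread[LaneJob]]
  for i in 0 ..< lanes:
    createThread(thr[i], fillLane,
                 LaneJob(mem: mem, first: i * columns, count: columns, lane: i))
  joinThreads(thr)

let filled = run()
echo filled[8][0]
```

The raw pointer bypasses bounds checking and GC-safety analysis, so keeping the regions disjoint is entirely the caller's responsibility.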

Araq [2024-01-19T21:17:34+01:00]

It also says:

{.deprecated: "use the nimble packages `malebolgia`, `taskpools` or `weave` instead".}

mratsim [2024-01-19T21:39:43+01:00]

Looking at the algorithm on Wikipedia: https://en.wikipedia.org/wiki/Argon2

Function Argon2
   Inputs:
      password (P):       Bytes (0..2^32-1)     Password (or message) to be hashed
      salt (S):           Bytes (8..2^32-1)     Salt (16 bytes recommended for password hashing)
      parallelism (p):    Number (1..2^24-1)    Degree of parallelism (i.e. number of threads)
      tagLength (T):      Number (4..2^32-1)    Desired number of returned bytes
      memorySizeKB (m):   Number (8p..2^32-1)   Amount of memory (in kibibytes) to use
      iterations (t):     Number (1..2^32-1)    Number of iterations to perform
      version (v):        Number (0x13)         The current version is 0x13 (19 decimal)
      key (K):            Bytes (0..2^32-1)     Optional key (Errata: PDF says 0..32 bytes, RFC says 0..2^32 bytes)
      associatedData (X): Bytes (0..2^32-1)     Optional arbitrary extra data
      hashType (y):       Number (0=Argon2d, 1=Argon2i, 2=Argon2id)
   Output:
      tag:                Bytes (tagLength)   The resulting generated bytes, tagLength bytes long
   
   Generate initial 64-byte block H0.
    All the input parameters are concatenated and input as a source of additional entropy.
    Errata: RFC says H0 is 64-bits; PDF says H0 is 64-bytes.
    Errata: RFC says the Hash is H^, the PDF says it's ℋ (but doesn't document what ℋ is). It's actually Blake2b.
    Variable length items are prepended with their length as 32-bit little-endian integers.
   buffer ← parallelism ∥ tagLength ∥ memorySizeKB ∥ iterations ∥ version ∥ hashType
         ∥ Length(password)       ∥ Password
         ∥ Length(salt)           ∥ salt
         ∥ Length(key)            ∥ key
         ∥ Length(associatedData) ∥ associatedData
   H0 ← Blake2b(buffer, 64) //default hash size of Blake2b is 64-bytes
   
   Calculate number of 1 KB blocks by rounding down memorySizeKB to the nearest multiple of 4*parallelism kibibytes
   blockCount ← Floor(memorySizeKB, 4*parallelism)
   
   Allocate two-dimensional array of 1 KiB blocks (parallelism rows x columnCount columns)
   columnCount ← blockCount / parallelism;   //In the RFC, columnCount is referred to as q
   
   Compute the first and second block (i.e. column zero and one ) of each lane (i.e. row)
   for i ← 0 to parallelism-1 do //for each row
      Bi[0] ← Hash(H0 ∥ 0 ∥ i, 1024) //Generate a 1024-byte digest
      Bi[1] ← Hash(H0 ∥ 1 ∥ i, 1024) //Generate a 1024-byte digest
   
   Compute remaining columns of each lane
   for i ← 0 to parallelism-1 do //for each row
      for j ← 2 to columnCount-1 do //for each subsequent column
         //i' and j' indexes depend if it's Argon2i, Argon2d, or Argon2id (See section 3.4)
         i′, j′ ← GetBlockIndexes(i, j)  //the GetBlockIndexes function is not defined
         Bi[j] = G(Bi[j-1], Bi′[j′]) //the G hash function is not defined
   
   Further passes when iterations > 1
   for nIteration ← 2 to iterations do
      for i ← 0 to parallelism-1 do //for each row
        for j ← 0 to columnCount-1 do //for each subsequent column
           //i' and j' indexes depend if it's Argon2i, Argon2d, or Argon2id (See section 3.4)
           i′, j′ ← GetBlockIndexes(i, j)
           if j == 0 then
             Bi[0] = Bi[0] xor G(Bi[columnCount-1], Bi′[j′])
           else
             Bi[j] = Bi[j] xor G(Bi[j-1], Bi′[j′])
   
   Compute final block C as the XOR of the last column of each row
   C ← B0[columnCount-1]
   for i ← 1 to parallelism-1 do
      C ← C xor Bi[columnCount-1]
   
   Compute output tag
   return Hash(C, tagLength)
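
The block-count arithmetic in the pseudocode above can be sketched in Nim as follows. This only covers the rounding step and the per-lane column count; laneLayout is a hypothetical helper name, and it assumes memorySizeKB is at least 8 * parallelism, as the input table requires.

```nim
proc laneLayout(memorySizeKB, parallelism: int): tuple[blockCount, columnCount: int] =
  # Round memorySizeKB down to the nearest multiple of 4 * parallelism.
  let unit = 4 * parallelism
  result.blockCount = (memorySizeKB div unit) * unit
  # Each lane (row) gets an equal share of the blocks.
  result.columnCount = result.blockCount div parallelism

let layout = laneLayout(100, 4)
echo layout.blockCount, " ", layout.columnCount
```

With memorySizeKB = 100 and parallelism = 4, the unit is 16, so blockCount is 96 and each of the 4 lanes gets 24 columns.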

You need a threadpool that supports data parallelism / parallel for.

You can get away with OpenMP by using the || operator https://nim-lang.org/docs/system.html#%7C%7C.i%2CS%2CT%2Cstaticstring

Using https://github.com/mratsim/laser/blob/master/laser/openmp.nim for extra syntax sugar, you can use

for i in 0 || (omp_get_num_threads() - 1):
  myArray[i] = <...>

For OpenMP to work, you normally need to pass --passC:-fopenmp and --passL:-fopenmp, either on the command line or via the passC and passL pragmas. With the openmp.nim utility you just need to pass -d:openmp on the command line.

(On a Mac, the default Clang does not support OpenMP; you have to install GCC or Clang from Homebrew.)

xioren [2024-01-20T03:42:40+01:00]

@mratsim Thanks for that in-depth response. It will take me a bit to parse through it.

@Araq Yes, I saw that, but I am one of those people who hates having their code depend on someone else's code. As I am not writing commercial/production code, I use only the standard library of a language whenever possible. That is one of many reasons I like Nim: it has a strong and useful standard library.

Mirror of forum.nim-lang.org

10886 :: A little guidance on threading needed.