Just wrapped up implementing Argon2 in Nim. This was mostly a learning experience, and this is not production-ready code: it lacks optimizations and security hardening. As far as I can tell, this is the only Nim implementation. Hopefully someone smarter and more proficient at Nim than me can use it as a basis for a production-ready version. Comments and suggestions welcome.
proc processBlocks(ctx: Argon2Ctx) =
  ## spawn threads for parallel computation across different lanes and slices
  let lanes = ctx.params.memoryCost div ctx.params.parallelism
  let segments = lanes div syncPoints
  var threads = newSeq[Thread[tuple[memPtr: ptr MemoryArray, n, slice, lane, lanes, segments, threads, memory, time: uint32, mode: Mode]]](ctx.params.parallelism)
  for n in 0 ..< ctx.params.timeCost:
    for slice in 0 ..< syncPoints:
      for lane in 0 ..< ctx.params.parallelism:
        createThread(threads[lane], processSegment, (ctx.memory, n, slice, lane, lanes, segments, ctx.params.parallelism, ctx.params.memoryCost, ctx.params.timeCost, ctx.params.mode))
      joinThreads(threads)
Replace with
{.passc: "-fopenmp".}
{.passl: "-fopenmp".}

proc processBlocks(ctx: Argon2Ctx) =
  ## OpenMP-parallel computation across different lanes and slices
  let lanes = ctx.params.memoryCost div ctx.params.parallelism
  let segments = lanes div syncPoints
  for n in 0 ..< ctx.params.timeCost:
    for slice in 0 ..< syncPoints:
      for lane in 0 || (ctx.params.parallelism - 1):
        processSegment((ctx.memory, n, slice, lane, lanes, segments, ctx.params.parallelism, ctx.params.memoryCost, ctx.params.timeCost, ctx.params.mode))
for easy OpenMP
@arnetheduck Ah, I should have known cheatfate would have already implemented this. Thanks.
@mratsim Thanks for the input. I started with threadpool but moved away from it as it's listed as deprecated. My logic here is basically straight from Google's Go implementation:
for n := uint32(0); n < time; n++ {
	for slice := uint32(0); slice < syncPoints; slice++ {
		var wg sync.WaitGroup
		for lane := uint32(0); lane < threads; lane++ {
			wg.Add(1)
			go processSegment(n, slice, lane, &wg)
		}
		wg.Wait()
	}
}
I don't know Go, but are you saying that spawning a thread here works differently in Nim? If not, then I don't see how I could end up with more or fewer threads than the Go implementation, and having the correct number of threads is (obviously) crucial to the algorithm. Granted, this is my first time working with threading, so it's all new to me.
Does the difference in behavior mean I am ending up with a different number of threads (as mratsim suggested) given the analogous code?
Yes, it does mean that.
Sorry to belabor the point, but getting this right is important to me and I am unable to verify what you guys are saying. I really can't see how using createThread is giving me the WRONG number of threads. From the documentation:
import std/locks

var
  thr: array[0..4, Thread[tuple[a,b: int]]]
  L: Lock

proc threadFunc(interval: tuple[a,b: int]) {.thread.} =
  for i in interval.a..interval.b:
    acquire(L) # lock stdout
    echo i
    release(L)

initLock(L)

for i in 0..high(thr):
  createThread(thr[i], threadFunc, (i*10, i*10+5))
joinThreads(thr)

deinitLock(L)
The code explicitly creates 5 threads. Running the code results in 5 threads being created. So this seems to be working exactly as expected?
Just found a benchmark from Lemire: https://lemire.me/blog/2020/01/30/cost-of-a-thread-in-c-under-linux/
On Linux x86, it costs about 9000ns to create a thread.
A 1GHz PC completes a cycle every 1ns. A modern Intel or AMD CPU can process 4 xor/shift/add instructions per cycle; see https://uops.info/table.html and https://www.agner.org/optimize/instruction_tables.pdf. In reality, today's CPUs run closer to 3GHz, so, conservatively, 12 xor/shift/add instructions per nanosecond.
Hence over those 9000ns, they could have issued 108k basic instructions (9000 × 12 = 108,000).