nimforum mirror - Argon2 in Pure Nim.

xioren (orginal) [2024-01-22T22:00:06+01:00] view original

Just wrapped up implementing Argon2 in Nim. This was mostly as a learning experience and this is not production ready code. It lacks optimizations and security considerations. As far as I can tell this is the only Nim implementation. Hopefully someone smarter and more proficient at Nim than me can use it as a basis for implementing a production ready version. Comments and suggestions welcome.

https://github.com/xioren/argon2-nim

arnetheduck (orginal) [2024-01-23T08:47:11+01:00] view original

Here's one with you can compare with: https://github.com/cheatfate/nimcrypto/pull/74

mratsim (orginal) [2024-01-23T09:58:23+01:00] view original

Quick comments after skimming.

First of all, good job debugging cryptography is quite annoying.

You will likely want to switch to a threadpool, you are creating way too many threads here, it's slow and will oversubscribed your system: https://github.com/xioren/argon2-nim/blob/e141a020048e27540ca3dc638c4bcb11c8e9822f/argon2.nim#L427-L438

proc processBlocks(ctx: Argon2Ctx) =
  ## spawn threads for parallel computation across different lanes and slices
  let lanes = ctx.params.memoryCost div ctx.params.parallelism
  let segments = lanes div syncPoints
  
  var threads = newSeq[Thread[tuple[memPtr: ptr MemoryArray, n, slice, lane, lanes, segments, threads, memory, time: uint32, mode: Mode]]](ctx.params.parallelism)
  
  for n in 0 ..< ctx.params.timeCost:
    for slice in 0 ..< syncPoints:
      for lane in 0 ..< ctx.params.parallelism:
        createThread(threads[lane], processSegment, (ctx.memory, n, slice, lane, lanes, segments, ctx.params.parallelism, ctx.params.memoryCost, ctx.params.timeCost, ctx.params.mode))
      joinThreads(threads)

Replace with

{.passc: "-fopenmp".}
{.passl: "-fopenmp".}
proc processBlocks(ctx: Argon2Ctx) =
  ## spawn threads for parallel computation across different lanes and slices
  let lanes = ctx.params.memoryCost div ctx.params.parallelism
  let segments = lanes div syncPoints
  
  var threads = newSeq[Thread[tuple[memPtr: ptr MemoryArray, n, slice, lane, lanes, segments, threads, memory, time: uint32, mode: Mode]]](ctx.params.parallelism)
  
  for n in 0 ..< ctx.params.timeCost:
    for slice in 0 ..< syncPoints:
      for lane in 0 || ctx.params.parallelism - 1:
        createThread(threads[lane], processSegment, (ctx.memory, n, slice, lane, lanes, segments, ctx.params.parallelism, ctx.params.memoryCost, ctx.params.timeCost, ctx.params.mode))
      joinThreads(threads)

for easy OpenMP

Your inputs should not be string but proc foo[T: char|byte](ctx: var Context, msg: openArray[T]): bool. This will be allow using strings of seq[byte] seamlessly as inputs.

xioren (orginal) [2024-01-23T18:48:47+01:00] view original

@arnetheduck Ah I should have know cheatfate would have already implemented this. Thanks.

@mratsim Thanks for the input. I started with threadpool but moved away as its listed as deprecated. My logic here is basically strait from Google's Go implementation:


for n := uint32(0); n < time; n++ {
                for slice := uint32(0); slice < syncPoints; slice++ {
                        var wg sync.WaitGroup
                        for lane := uint32(0); lane < threads; lane++ {
                                wg.Add(1)
                                go processSegment(n, slice, lane, &wg)
                        }
                        wg.Wait()
                }
        }

I don't know Go but are you saying the spawning of a thread here works Differently in Nim ? If not then I don't see how I could end up with more or less threads than the Go implementation and having the correct number of threads (obviously) is crucial to the algorithm. Granted this is my first working with threading so it's all new to me.

Araq (orginal) [2024-01-23T20:29:44+01:00] view original

Yes, it works very differently. Golang's go keyword is roughly Nim's spawn as provided by a thread pool library (Weave, Malebolgia) and not createThread.

xioren (orginal) [2024-01-24T20:56:26+01:00] view original

@araq So if the behavior of createThread is not 1:1 analogous to Go's go routine, that doesn't bother me so long as it doesn't impact the implementation of the algorithm. Does the difference in behavior mean I am ending up with a different number of threads (as mratsim suggested) given the analogous code?

Araq (orginal) [2024-01-24T21:28:29+01:00] view original

Does the difference in behavior mean I am ending up with a different number of threads (as mratsim suggested) given the analogous code?

Yes, it does mean that.

xioren (orginal) [2024-01-25T17:43:01+01:00] view original

Sorry to belabor the point but getting this right is important to me and I am unable to verify what you guys are saying. I really can't see how using createThread is giving me the WRONG number of threads. From the documentation:


import std/locks

var
  thr: array[0..4, Thread[tuple[a,b: int]]]
  L: Lock

proc threadFunc(interval: tuple[a,b: int]) {.thread.} =
  for i in interval.a..interval.b:
    acquire(L) # lock stdout
    echo i
    release(L)

initLock(L)

for i in 0..high(thr):
  createThread(thr[i], threadFunc, (i*10, i*10+5))
joinThreads(thr)

deinitLock(L)

The code explicitly creates 5 threads. Running the code results in 5 threads being created. So this seems to be working exactly as expected?

mratsim (orginal) [2024-01-25T18:15:54+01:00] view original

You're creating timeCost *syncPoint*parallelism threads. Creating a thread is very costly and need to be amortized, a threadpool reuses them. You can do at least 50000 xor-shift-rotate-add while waiting for the OS to create a _single thread.

Just found a bench from Lemire: https://lemire.me/blog/2020/01/30/cost-of-a-thread-in-c-under-linux/

On Linux x86, 9000ns to create a thread.

A 1GHz PC has a cycle every 1ns. A modern Intel or AMD CPU can process 4 xor/shift/add per cycle, see https://uops.info/table.html and https://www.agner.org/optimize/instruction_tables.pdf. In reality the CPUs today are more like 3GHz, conservative, so 12 xor/shift/add per nanosecond.

Hence over 9000ns, they can issue 108k basic instructions.

xioren (orginal) [2024-01-25T21:23:01+01:00] view original

Okay that makes sense thank you. So it's a question of efficiency rather than correctness. I will look into threadpool.

xTrayambak (orginal) [2024-01-28T16:16:09+01:00] view original

Awesome! I've been trying to come up with my own Argon implementation, but I've never been able to finish it off. I'd like to ask though, why not make this a proper Nimble package instead of just Nim files scattered in a repository?

xioren (orginal) [2024-01-29T08:21:18+01:00] view original

This is a project I did mostly for fun and to be able to implement Argon2 into some other projects I have, like a file encryption script. If I made a Nimble package people might think this is production ready which it is not. I lack to knowledge of Nim to truly optimize the code and I lack the experience in cryptography to be confident all appropriate security measures are in place. As mentioned already nimcrypto is a Nimble package and provides Argon2.

Mirror of forum.nim-lang.org

10900 :: Argon2 in Pure Nim.