Hey all, today I wanted to do a quick post about three repos I've recently published and explain why you may find them useful.
JWTea
https://github.com/guzba/jwtea
This repo creates JSON Web Tokens and does the RSA signing in Nim (no OpenSSL or other external deps).
If you don't need JWT, you don't need this repo. If you do, such as for Google APIs, this repo is wonderful.
The RSA private key decoding and signing are done by Crunchy. More about Crunchy below. Doing all RSA in Nim means one less reason to even think about OpenSSL or other things. OpenSSL is pain. Others have gone to great effort to use BearSSL, perhaps to avoid OpenSSL pain. That is very impressive work. For me though, just using Nim is a more in line with my goals.
Crunchy
https://github.com/guzba/crunchy
Crunchy has SIMD / intrinsic accelerated procs for CRC-32, Adler-32, SHA-256 etc with more to be added over time. The goal is simple-signature one-line calls, eg crc32(s: string): uint32.
Crunchy is also where the RSA code used by JWTea and other experimental cryptography code of mine lives, at least for now.
Getting RSA signing working introduced me to nim-lang/bigints. I was able to get RSA powmod working right away with that, which was great, however the performance was not ok for RSA out of the box (5+ seconds for one RSA 2048 signature with -d:release). I have a little fork of nim-lang/bigints for now and was able to make powmod 10x faster (0.5s) but there is still more work to do here. It should take less than 0.05s.
Depot
https://github.com/guzba/depot
Depot is a lightweight lib for working with S3-compatible storage provider APIs including Amazon S3, Google Cloud Storage, Cloudflare R2 and Backblaze B2. The fact all these providers have compatible APIs is a wonderful accident of history.
The S3-compatible API these providers give you is just an HTTP API underneath the common wrappers like boto3. Depot provides the URL signing which is all that is needed to do any operation.
In the future, a full Nim wrapper around common operations will be nice but that has not been done. I have an allergic reaction to auto-generated wrappers so it won't be that.
Why should you care about object stores like S3 etc? In brief: behind to your VM provider, these object stores are the next most useful provider when building a web service.
Thanks for giving this a read!
For modular exponentiation: https://forum.nim-lang.org/t/7276#46102, the technique is to convert the BigInt to "Montgomery form" (see: https://eprint.iacr.org/2017/1057).
In short that allows you to do modular arithmetic, including modular multiplications, without modulo. Modulo/division takes about 55 cycles while an addition takes 1 (and you can schedule 2 to 4 per cycle). So this would naively makes your code 55x faster.
The conversion requires the modulus to be odd which is the case for all primes of interest (RSA modulo 2 is not very interesting).
My whole code for modular exponentiations is here: https://github.com/mratsim/constantine/blob/2931913/constantine/math/arithmetic/limbs_montgomery.nim#L726-L835. The file has few dependencies (Limbs[N] = array[N, Word]) and some architecture/compile agnostic add-with-carry multiply-and-add.
For SHA256, you might be interested in a SIMD implementation for non-sha enabled CPUs: https://github.com/mratsim/constantine/blob/2931913/constantine/hashes/sha256/sha256_x86_ssse3.nim
Thanks for the info about Montgomery form. I did see mention of it when doing some RSA reading but nothing more than that so far. I'll give this a look.
I see mention in your linked comment that the modulus is required to be constant / static by Constantine. Is that still true? I can find out for myself but there is no harm in asking. If it is required to be constant, is that critical or just how it is now?
I was also planning to look into https://en.wikipedia.org/wiki/RSA_(cryptosystem)#Using_the_Chinese_remainder_algorithm for signing. When contrasting CRT with Montgomery, are they both viable ways to powmod faster or is CRT the wrong area to investigate?
I have been doing a bit of cryptography stuff lately (did RC6-CBC and AES-256-GCM recently) and it has become clear there is an enormous amount I do not know. Go figure. It also kind of doesn't play to my strengths either, this work has been a bit like rubbing my brain against sandpaper. I got through it though!
I see mention in your linked comment that the modulus is required to be constant / static by Constantine. Is that still true? I can find out for myself but there is no harm in asking. If it is required to be constant, is that critical or just how it is now?
AFAIK in my link there https://github.com/mratsim/constantine/blob/2931913/constantine/math/arithmetic/limbs_montgomery.nim#L726-L835 the only thing static is spareBits. For some elliptic curves, for example Curve25519, you use 255-bit but you store it in 256-bit (4x64-bit words). Having spare bits in the physical storage enables carry-less optimizations similar to https://cryptojedi.org/peter/data/croatia-20150604.pdf / https://cryptojedi.org/peter/data/pairing-20131122.pdf
I have been doing a bit of cryptography stuff lately (did RC6-CBC and AES-256-GCM recently) and it has become clear there is an enormous amount I do not know. Go figure. It also kind of doesn't play to my strengths either, this work has been a bit like rubbing my brain against sandpaper. I got through it though!
For production-grade cryptography, it is very important that your code has no branches that depend on secret data, besides the size of it.
For example, this is absolutely forbidden: https://github.com/guzba/crunchy/blob/a282a89/src/crunchy/bigints.nim#L1353-L1354 That if depends on whether the secret key bits is 0 or 1 and can be retrieved by analysing
Similarly, the whole bigint backend is unsuitable for implementing cryptography because it has "if branches" everywhere that may expose secret data: https://github.com/guzba/crunchy/blob/a282a89/src/crunchy/bigints.nim#L117-L138
Furthermore, exposing one bit of secret data is equivalent to exposing everything, see "padding oracle" attacks: https://joyofcryptography.com/pdf/book.pdf
And attacks get more and more powerful with machine learning: https://eprint.iacr.org/2019/358
One trace is all it takes: Machine Learning-based Side-channel Attack on EdDSA
Profiling attacks, especially those based on machine learning proved as very successful techniques in recent years when considering side-channel analysis of block ciphers implementations. At the same time, the results for implementations public-key cryptosystems are very sparse. In this paper, we consider several machine learning techniques in order to mount a power analysis attack on EdDSA using the curve Curve25519 as implemented in WolfSSL. The results show all considered techniques to be viable and powerful options. The results with convolutional neural networks (CNNs) are especially impressive as we are able to break the implementation with only a single measurement in the attack phase while requiring less than 500 measurements in the training phase. Interestingly, that same convolutional neural network was recently shown to perform extremely well for attacking the AES cipher. Our results show that some common grounds can be established when using deep learning for profiling attacks on distinct cryptographic algorithms and their corresponding implementations.
See also: https://github.com/mratsim/constantine/wiki/Constant-time-arithmetics
Besides, secure cryptography often requires assembly, because even trying to be branchless with bit manipulation, for example to do a select let a = if a < 0: x else: y the compiler might rewrite it into a if branch instead of conditional move: https://www.cl.cam.ac.uk/~rja14/Papers/whatyouc.pdf
Lastly, for example your AES implementation has cache-timing issues due to indexing arrays depending on secret data: https://github.com/guzba/crunchy/blob/a282a89d/src/crunchy/aes256.nim#L76 Unfortunately this is one big flaw of AES (see https://www.bearssl.org/constanttime.html#aes), it is very hard to implement it in constant-time without hardware support, though attackers need to run some program in a VM colocated with the CPU that runs the AES encrypt/decrypt.
Re BigInt design, for RSA, this is a portable constant-time bigint design: https://www.bearssl.org/bigint.html#big-integer-design Constantine uses a different one, because I use intrinsics or assembly for carries in particular to guarantee the absence of branches.
It can be made faster using Karatsuba. Multiplication complexity is O(n²) with n the number of words in the bigint, RSA2048 requires 32 words hence a complexity of "1024". Karatsuba is ~O(n^1.58) hence about 234. So we can naively estimate a 4x increase (actually probably 2x because you trade multiplications for additions)
Regarding the CRT / Chinese Reminder Theorem, yes it will accelerate signing because you split the number into half and compute those costly O(n²) multiplications on something that is twice smaller leading to a 4x speedup as well, but you do those 4x faster compute on each half so total speedup is 2x.
Yeah I definitely don't encourage use of the RC6 or AES code in any real setting, I just wanted to learn the general mechanics being used and RC6 was a simpler place to start before moving to AES. I would use the x64 / arm64 intrinsics for "real" AES if I ever have that need.
My true goals for cryptography work are two-fold: