To protect cryptographic secrets you need to ensure that the generated code does not use secret-dependent branches: by timing the procedure, an attacker can recover properties of the secret (or, in the worst case, each individual bit that composes it).
Yes! So true. Programming languages were not designed with "constant time" algorithms in mind, so you need to use assembler. (I'm serious!)
However, I hope we'll also get ARM support. :-)
A little reminder for other readers: timing attacks pop up often when people implement their own cryptography or other security functions that require string equality checks.
Unless you have very strict performance requirements, there are well-known cryptographic libraries that have Nim wrappers and provide constant-time comparisons.
You don't have to fight the Nim compiler and GCC all by yourself if you are implementing some webapp.
@federico3, for a web-app there may also be "autonomy"/non-dependency requirements. A Nim lib for assembly may be no better than a C lib for that, though.
But this is not a criticism of @mratsim. The idea is good and can also help performance in cases where it is not easy to guide compilers in auto-vectorization.
I hesitate to say this because I never like perfect to be the enemy of the good, but there are a gajillion "backward compatible" CPUs now with almost a 16-member bitmask of avx512 versions down to avx2/avx/sse/... just on x64. Supporting a "biggest set" with a library of little assembly impls to provide what is missing (at run-time in a "fat binary" that detects CPU features) for lesser CPUs would also be nice. That's obviously at least a little harder to do under a "constant time" constraint. Then we could all just write assembly-like code against the "most complete avx512 F,CD,ER,PF,VL,DQ,BW,IFMA,VBMI,..." and have it compile and run on any Intel or AMD chip. The "xmmintrin.h"-style interfaces in C/C++ compilers sort of support this in theory, but tend to be incomplete, not realizing this vision in practice. There are also definitely some instructions that are hard to make efficient on older CPUs.
Anyway, CPUs these days have a lot more versions and instruction variety than many people I know are aware of. This idea I mention is probably more ambitious than the scope/dreams of mratsim's work here, though I would be delighted to be wrong on that. :-)
This is a bit of a distraction from your main point, but from your GitHub:
As cryptography relies on big-integer computation, a constant-time bigint implementation is needed.
Out of curiosity: that's not really possible, is it, unless you bound "big"? My definition of "big" is arbitrary-size (you mention GMP), so I want to be sure.
@federico
A little reminder for other readers: timing attacks pop up often when people implement their own cryptography or other security functions that require string equality checks.
Unless you have very strict performance requirements, there are well-known cryptographic libraries that have Nim wrappers and provide constant-time comparisons.
You don't have to fight the Nim compiler and GCC all by yourself if you are implementing some webapp. :)
Yes to using audited and battle-tested crypto code. Unfortunately even OpenSSL is not fully constant-time and relies on "blinding" to hide secrets, and a significant number of libraries also rely on exponent blinding (for RSA) or scalar blinding (for elliptic curves). A very recent paper shows how to undo blinding (https://eprint.iacr.org/2019/1220.pdf). Unfortunately that exponent/scalar is usually the user's secret key, which must be protected at all costs.
Second, if you have very strict perf requirements for common use cases (RSA, ECDSA, ...), it's likely that OpenSSL or libsodium has you covered. However, my use case is cryptographic pairings, in particular for zero-knowledge proofs. There is currently no widespread secure and audited implementation, and the main ones (in Rust, pioneered by the Zcash project) are both non-constant-time and slow. Regarding perf requirements, this article will give you a taste of the optimizations companies are willing to go for: https://medium.com/loopring-protocol/zksnark-prover-optimizations-3e9a3e5578c0. The idea behind it is to prove/verify batches of financial transactions on a blockchain.
@cblake
@federico3, for a web-app there may also be "autonomy"/non-dependency requirements. A Nim lib for assembly may be no better than a C lib for that, though.
But this is not a criticism of @mratsim. The idea is good and can also help performance in cases where it is not easy to guide compilers in auto-vectorization.
I hesitate to say this because I never like perfect to be the enemy of the good, but there are a gajillion "backward compatible" CPUs now with almost a 16-member bitmask of avx512 versions down to avx2/avx/sse/... just on x64. Supporting a "biggest set" with a library of little assembly impls to provide what is missing (at run-time in a "fat binary" that detects CPU features) for lesser CPUs would also be nice. That's obviously at least a little harder to do under a "constant time" constraint. Then we could all just write assembly-like code against the "most complete avx512 F,CD,ER,PF,VL,DQ,BW,IFMA,VBMI,..." and have it compile and run on any Intel or AMD chip. The "xmmintrin.h"-style interfaces in C/C++ compilers sort of support this in theory, but tend to be incomplete, not realizing this vision in practice. There are also definitely some instructions that are hard to make efficient on older CPUs.
Anyway, CPUs these days have a lot more versions and instruction variety than many people I know are aware of. This idea I mention is probably more ambitious than the scope/dreams of mratsim's work here, though I would be delighted to be wrong on that. :-)
I only care about multiprecision arithmetic which compilers cannot optimize properly even with intrinsics (_addcarry_u64, _addcarryx_u64, _mulx_u64) or uint128.
For floating point (SSE, AVX, AVX512, ...) fortunately the compilers can generate very good code with the intrinsics and there is no need to resort to assembly in my experience of implementing various linear algebra, computer vision and machine learning kernels.
Regarding CPU features autodetection, @awr1 cpuwhat is good and what I use internally: https://github.com/awr1/cpuwhat
@cantanima
This is a bit of a distraction from your main point, but from your GitHub:
As cryptography relies on big-integer computation, a constant-time bigint implementation is needed.
Out of curiosity: that's not really possible, is it, unless you bound "big"? My definition of "big" is arbitrary-size (you mention GMP), so I want to be sure.
Indeed, in cryptography you always have an upper bound that is known at compile time. For example, if you use RSA-2048, all your numbers will be 2048 bits (or a related size). If you use a 256-bit elliptic curve, all your numbers will be 256-bit.
In practice this means that my BigInts are represented by BigInt[128], BigInt[256], BigInt[2048], or fancier bitwidths like BigInt[127], BigInt[254], BigInt[381].
And constant-time means that you should always do the same operations and memory accesses, even if one of the inputs is all zeros and you could technically return early, for example.
Regarding big integers, you have two kinds of libraries:
Indeed, in cryptography you always have an upper bound that is known at compile time.
Thank you! It's nice that modern machines' vector instructions make O(1) complexity possible. I'm a bit surprised that programming languages haven't generally made this available even in standard libraries (not even C & C++). I haven't kept up with those things because my mathematics deals with polynomials which can grow very long (tens of thousands of terms or even more), and the upper bound isn't known in advance, so O(1) is essentially a fantasy for my code. :-(
Just to clarify: the usual understandings of cryptographic constant-time and of O(1) are different. Both relate to timing, but O(1) is about how timing changes with data size, while constant-time is about how, for a fixed data size, there is no timing difference whatever the data is.
As an example, suppose you have 2 arrays:
[0, 1, 0, 0, 0]
[0, 0, 0, 0, 0]
A constant-time == would check all fields. A fast implementation may return early, exiting as soon as it compares 1 with 0. However, the algorithm is usually understood as being O(n), with n the length of the arrays (n bounded to 5 here).
I.e., a constant-time algorithm does not mean O(1): the time to compute may depend on the array length (bits in a bigint), but for a given fixed size it always takes the same time; there is no best/worst case.
As a concrete example, modular subtraction would be:
proc submod_variable_time(a: var BigInt, b, modulus: BigInt) =
  a -= b
  if a < 0:
    a += modulus
proc submod_constanttime(a: var BigInt, b, modulus: BigInt) =
  let underflowed = a.sub(b)               # Subtraction that returns the last borrow, if any
  a.conditional_add(modulus, underflowed)  # Does "a[i] = if underflowed: a[i] + modulus[i] else: a[i]"