I am working on an MTProto client in Nim, and the key exchange process requires handling big integers (including primes) with modular exponentiation. Currently I am using the stint library's powmod, but I noticed that it is really slow (5 seconds even with -d:release). The numbers look like this:
powmod(3, 5353298235368615715086015989500386287229041352472878638071518666228109600330630011721412968955794129220900515437949650083539753416976065858168797034438041953432098671933385654459949759916692249948136470581833023646089705937609998186267612873792624931572076651507178185195496822727062814431558558200731675762237505627859229808348408417805267472578881525847798784260164598636480829749664441945255164579524883980750891720882036701554976549860300202114749723141961486427137188380250834468000036033392412994077172956841103525797390019597180655219915585699771458390767411239503401306475092402193456770859145072367454040329, 25135566567101483196994790440833279750474660393232382279277736257066266618532493517139001963526957179514521981877335815379755618191324858392834843718048308951653115284529736874534289456833723962912807104017411854314007953484461899139734367756070456068592886771130491355511301923675421649355211882120329692353507392677087555292357140606251171702417804959957862991259464749806480821163999054978911727901705780417863120490095024926067731615229486812312187386108568833026386220686253160504779704721744600638258183939573405528962511242337923530869616215532193967628076922234051908977996352800560160181197923404454023908443)
Now, it seems like a complex operation, but Python's equivalent of powmod, for example, takes only 59 milliseconds. Is this an issue with the stint library (maybe it is not optimized?), and is there any faster alternative in Nim?
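For reference, the Python comparison is the built-in three-argument pow(), which CPython implements in C with a windowed exponentiation. A minimal sketch of that measurement (with small stand-in values rather than the full 2048-bit integers from the post, which you would substitute in):

```python
# Python's three-argument pow() is the "powmod" being compared against.
# Small stand-in values here; substitute the full 2048-bit numbers from
# the post to reproduce the ~59 ms figure.
import time

assert pow(3, 5, 7) == 5  # 3**5 = 243, 243 mod 7 = 5

start = time.perf_counter()
pow(3, 2**2047 + 1, 2**2048 - 159)  # a 2048-bit-sized workload
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{elapsed_ms:.1f} ms")
```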
My guess is that this difference in implementation actually explains the different timings in your case, but I have not looked into the details and look forward to what others more knowledgeable than me will say.
For alternative implementations you might want to try GMP, of which the most up-to-date wrapper is this one. Note that this, being based on GMP, is ultimately bound by the GPL license.
You should use a cryptographic library for securing keys, not a general-purpose library that does not optimize for security.
In particular, since you use powmod, I assume you use it for RSA? In that case powmod is a critical function that can easily be attacked.
Assuming you need cryptographic security, you have a few options.
Now, regarding Stint vs Python: the slowness has several causes.
I can post an example with Constantine and benchmarks later in the day but as mentioned before, while I need very optimized bigint for cryptographic size in Constantine, I don't provide an API to the raw internals.
I'm currently also using stint for RSA; I initially used OpenSSL, but since it was adding PKCS#1 padding (and MTProto doesn't want that), I opted for this library.
But the main issue is not related to that problem; what I am doing is explained here: https://core.telegram.org/mtproto/auth_key#presenting-proof-of-work-server-authentication
Specifically, it is when g_b is calculated, and the auth_key (powmod(gA, b, dhPrime)). This is the part where everything is "slow".
Anyway, I will try to use an external library if there is no native alternative. Thank you.
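For readers unfamiliar with the linked page, the hot spot is an ordinary Diffie-Hellman step: two modular exponentiations over a 2048-bit prime. A rough sketch with illustrative names and toy parameters (Python's built-in pow stands in for powmod):

```python
# Sketch of the DH step from the linked MTProto docs (names illustrative):
# the client picks a random 2048-bit b, sends g_b = g^b mod dh_prime, and
# derives auth_key = g_a^b mod dh_prime. Both are single modular
# exponentiations - exactly the operation that is slow in stint.
import secrets

def dh_step(g: int, g_a: int, dh_prime: int) -> tuple[int, int]:
    b = secrets.randbits(2048)
    g_b = pow(g, b, dh_prime)         # sent to the server
    auth_key = pow(g_a, b, dh_prime)  # the shared secret
    return g_b, auth_key

# Tiny toy parameters just to exercise the flow (the real dh_prime is 2048-bit):
g, dh_prime = 3, 23
a = 6                                 # the server's secret in this toy run
g_a = pow(g, a, dh_prime)
g_b, auth_key = dh_step(g, g_a, dh_prime)
assert auth_key == pow(g_b, a, dh_prime)  # both sides derive the same key
```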
Here is an implementation of modular exponentiation in Constantine. It takes 8 ms on my machine:
# nim c -d:danger -r --outdir:build build/pow_bench.nim

import
  ../constantine/[
    arithmetic,
    io/io_bigints,
    config/precompute,
    arithmetic/limbs_modular,
    arithmetic/limbs
  ],
  std/[monotimes, times]

proc main() =
  # a = 3
  let a = BigInt[2048].fromHex"3"

  # e = 5353298235368615715086015989500386287229041352472878638071518666228109600330630011721412968955794129220900515437949650083539753416976065858168797034438041953432098671933385654459949759916692249948136470581833023646089705937609998186267612873792624931572076651507178185195496822727062814431558558200731675762237505627859229808348408417805267472578881525847798784260164598636480829749664441945255164579524883980750891720882036701554976549860300202114749723141961486427137188380250834468000036033392412994077172956841103525797390019597180655219915585699771458390767411239503401306475092402193456770859145072367454040329
  let exponent = BigInt[2048].fromHex"2a6802a7d7bfdf88b3790a49e0f8244ba89110ef49d96515b3064490f22097aeb2651ad2957c7fbc7d224387f606649b2e7364f0b61919ac71dbd7350503385523342bfca2067188858407dc68c84c15b52690f26572dab6d2bafb14cbe049e9c9b7b6ef90d1bef58a38d2687c207d0545c1cd04970587ef531e4c9788f6bdcf38c3011c4df69c904d761e505c5d4dcc4fa17ccbfa7a8d56750aae38984be4760c637cb2b6f8c68764d985047400e17fac9950701f194cb5c06500ab41ad98282a0b0af6204f15e053a34df72b0b5bf6cdc9afad4bcacd2348d65d1dfa144380262f23be8edad05948472691f6a2131cab76bb1c408f44b5a9fc32d92a9eb109"

  # For now the modulus is required to be a constant in Constantine,
  # as the Montgomery magic constants are expected to be const;
  # this can easily be lifted for RSA.
  # M = c71caeb9c6b1c9048e6c522f70f13f73980d40238e3e21c14934d037563d930f48198a0aa7c14058229493d22530f4dbfa336f6e0ac925139543aed44cce7c3720fd51f69458705ac68cd4fe6b6b13abdc9746512969328454f18faf8c595f642477fe96bb2a941d5bcd1d4ac8cc49880708fa9b378e3c4f3a9060bee67cf9a4a4a695811051907e162753b56b0f6b410dba74d8a84b2a14b3144e0ef1284754fd17ed950d5965b4b9dd46582db1178d169c6bc465b0d6ff9ca3928fef5b9ae4e418fc15e83ebea0f87fa9ff5eed70050ded2849f47bf959d956850ce929851f0d8115f635b105ee2e4e15d04b2454bf6f4fadf034b10403119cd8e3b92fcc5b
  const modulus = BigInt[2048].fromHex"c71caeb9c6b1c9048e6c522f70f13f73980d40238e3e21c14934d037563d930f48198a0aa7c14058229493d22530f4dbfa336f6e0ac925139543aed44cce7c3720fd51f69458705ac68cd4fe6b6b13abdc9746512969328454f18faf8c595f642477fe96bb2a941d5bcd1d4ac8cc49880708fa9b378e3c4f3a9060bee67cf9a4a4a695811051907e162753b56b0f6b410dba74d8a84b2a14b3144e0ef1284754fd17ed950d5965b4b9dd46582db1178d169c6bc465b0d6ff9ca3928fef5b9ae4e418fc15e83ebea0f87fa9ff5eed70050ded2849f47bf959d956850ce929851f0d8115f635b105ee2e4e15d04b2454bf6f4fadf034b10403119cd8e3b92fcc5b"

  # Bench notes: we do not measure hex conversion, because in practice
  # we use raw bytes, which are significantly less costly.
  # We do measure constant computation, but "montyOne" is an
  # intermediate result of r2ModM.
  let start = getMonotime()

  # Montgomery constants
  const m0ninv = negInvModWord(modulus)
  let r2ModM = r2Mod(modulus)
  let montyOne = montyOne(modulus)

  # Can use the "no-carry" fast path
  const hasNoCarryMul = useNoCarryMontyMul(modulus)
  const hasNoCarrySquare = useNoCarryMontySquare(modulus)

  # Montgomery conversion
  var aMont{.noInit.}: BigInt[2048]
  aMont.montyResidue(a, modulus, r2ModM, m0ninv, hasNoCarryMul)

  # Compute the Montgomery exponentiation
  # using a constant-time algorithm to protect
  # against side-channel and timing attacks
  # (similar to Spectre & Meltdown)
  aMont.montyPow( # Alternatively montyPowUnsafeExponent if the exponent is not a secret key
    exponent,     # the exponent can also be raw big-endian bytes, the usual crypto serialization of bigints
    modulus,
    # Montgomery constants
    montyOne, m0ninv,
    # Optimizations
    windowSize = 5, # Windowed exponentiation
    canUseNoCarryMontyMul = hasNoCarryMul,
    canUseNoCarryMontySquare = hasNoCarrySquare
  )
  # TODO: accelerate squarings; currently they use the same codepath as
  # multiplication. This would give a ~15% perf increase for modular
  # exponentiation.

  # Return to the canonical domain
  var r {.noInit.}: BigInt[2048]
  r.redc(aMont, modulus, m0ninv, hasNoCarryMul)

  let stop = getMonotime()

  # Print
  echo r.toHex()
  echo "Time: ", inMilliseconds(stop-start), " ms"

main()
0x730975c8d06d8e84fb96799b65b0b65db62a78d02bce03d1175ec9a55310cce50df956a2f937436d475d86979e9e4f75b931916b4203c68545a023e2faf0f426d6ec5940ddab15148d04ba7cc2fa16a02721a64451d6898202e87a849fcc5832deaa2077dd0a121d4db8083974d4b9aa84c871d4a82084949fa2e69f44a84af44054c24d33b9a7d7ed0571a39bd387cea0e918f28dc6c2b0c97b5e30fd5fea57948113c66d78d3b007f7150ae231c74c68a38220b8a64a3bde9f656a85f9e93043bdeae7dc5486101c1edfec7820b68caa8854ac7a7241da66464a190192644fab08610864ef42752c32f62f4ab4fb162b304ce11644e1ce0e553bca32a0aed0
Regarding RSA, don't use Stint; use nim-bearssl: https://github.com/status-im/nim-bearssl/blob/ba5f468/bearssl/decls.nim#L3672-L3688
I have read your post, but I do not have much time to answer.
Barrett and Montgomery techniques improve the reduction, which is crucial indeed. Unsigned integer arithmetic is slow, but 256-bit primes do not fit in floating-point types, so it is basically our only option. I am surprised that OpenSSL did not, at that time, provide OAEP and PSS (for encryption and signatures respectively) as alternatives to PKCS#1 v1.5.
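To make the Montgomery technique mentioned here concrete: it trades the expensive per-step division by the modulus for a mask and a shift. A textbook (non-constant-time, word-at-once) Python sketch of Montgomery reduction, REDC:

```python
# Textbook Montgomery reduction (REDC): computes t * R^-1 mod n without
# dividing by n. With R = 2^k, "mod R" is a mask and "div R" is a shift.
# Illustrative sketch only - real implementations work limb by limb.
def montgomery_redc(t: int, n: int, k: int) -> int:
    r = 1 << k
    n_prime = pow(-n, -1, r)   # -n^-1 mod R (n must be odd); Python 3.8+
    m = (t * n_prime) % r      # cheap: mod R is a mask
    u = (t + m * n) >> k       # cheap: t + m*n is divisible by R
    return u - n if u >= n else u

n, k = 97, 8                   # R = 256 > n, gcd(R, n) = 1
R = 1 << k
a, b = 45, 67
# Multiply in the Montgomery domain: convert in, multiply + REDC, convert out.
a_mont, b_mont = (a * R) % n, (b * R) % n
prod_mont = montgomery_redc(a_mont * b_mont, n, k)   # = a*b*R mod n
result = montgomery_redc(prod_mont, n, k)            # back to canonical form
assert result == (a * b) % n
```

The conversions in and out cost something, which is why it pays off for exponentiation (many multiplications between one conversion in and one conversion out), as in the Constantine benchmark above.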
The fundamental problem with stint currently is a consequence of its unfortunate recursive design which renders every single operation exponentially slower than it has to be in the number of bytes used for the size of the integer.
The first step for any stint work is to replace the implementation with a simple array-based backend. Only then does it make sense to start thinking about anything beyond the most trivial implementations. Getting to this point is something of a priority that we might look into soon, and this alone would likely make the library "good enough" for basic use, including the one pointed out in this thread. That is, this simple change would get it to a point where the OP likely wouldn't have bothered to start a thread about it, even without any compiler intrinsics, assembly code and so on ;)
> The fundamental problem with stint currently is a consequence of its unfortunate recursive design which renders every single operation exponentially slower than it has to be in the number of bytes used for the size of the integer.
Actually, the main slowness issue is currently that the carry propagation needs to recompute the carry. However, this can be fixed even with a recursive design, following the design here, by adding an add-with-carry primitive: https://github.com/linbox-team/givaro/blob/de943160/src/kernel/recint/radd.h#L65-L70 (which I didn't do because at the time it seemed unnecessarily complex, alas).
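The idea of the add-with-carry primitive in the linked givaro code can be sketched in Python over 64-bit limbs: each limb addition returns (sum, carry out), and the carry is threaded through in a single pass instead of being recomputed afterwards.

```python
# Multi-limb addition with an explicit add-with-carry primitive, the
# approach the linked givaro radd.h takes at the C++ level.
MASK64 = (1 << 64) - 1

def addc(x: int, y: int, carry: int) -> tuple[int, int]:
    s = x + y + carry
    return s & MASK64, s >> 64        # (low 64 bits, carry out)

def add_limbs(a: list[int], b: list[int]) -> tuple[list[int], int]:
    carry, out = 0, []
    for x, y in zip(a, b):            # little-endian limb order
        s, carry = addc(x, y, carry)
        out.append(s)
    return out, carry

# (2^128 - 1) + 1 ripples a carry through every limb:
limbs, carry = add_limbs([MASK64, MASK64], [1, 0])
assert limbs == [0, 0] and carry == 1   # i.e. the result is 2^128
```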
I have finished implementing a new arbitrary-precision backend in Constantine for powmod, i.e. unlike the fixed-precision BigInt[256] or BigInt[1024], it can deal with runtime sizes.
PR: https://github.com/mratsim/constantine/pull/242
Benchmarks show that performance is within 85% to 140% of GMP for 256-bit inputs, exponents and moduli.
When Stint was released, the goal was to quickly have a replacement for ttmath (https://github.com/status-im/nim-ttmath), a C++ header-only library. Because of the C++, it caused many issues for new joiners at Nimbus, with whack-a-mole errors like "flexible array member" preventing usage in a seq (https://github.com/status-im/nim-ttmath/issues/10).
At the time (April 2018) the goal was to have uint256 and modular operations were not a target at all.
> The fundamental problem with stint currently is a consequence of its unfortunate recursive design which renders every single operation exponentially slower than it has to be in the number of bytes used for the size of the integer.
To be more precise, the issue is in addition/subtraction: Stint works by half sizes. With a layout like [a.lo, a.hi] and [b.lo, b.hi], it computes r.lo = a.lo + b.lo and r.hi = a.hi + b.hi + carry. The issue is determining the carry; currently it checks r.lo < a.lo. For uint128 this is fine, it's a u64 comparison; unfortunately, for uint256 it becomes a uint128 comparison, and for uint512 a uint256 comparison.
So the extra cost on uint256 is 3 passes over uint128 instead of 2, a factor of 1.5x. This does not explain an 80x performance difference with the Constantine implementation.
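The split-into-halves scheme described here can be sketched in Python with a full-width integer standing in for the type; the comparison that detects the carry is the costly part, since at each recursion level it is itself a half-width operation.

```python
# Stint-style addition as described above: split the operands into halves,
# add each half, and detect the carry via the comparison r.lo < a.lo.
def add_split(a: int, b: int, bits: int) -> int:
    half = bits // 2
    mask = (1 << half) - 1
    lo = (a & mask) + (b & mask)
    carry = 1 if (lo & mask) < (a & mask) else 0   # the costly comparison
    hi = (a >> half) + (b >> half) + carry
    return ((hi << half) | (lo & mask)) & ((1 << bits) - 1)

# Matches plain wrap-around addition modulo 2^256:
for a, b in [(2**255 - 19, 2**254 + 7), (123, 456), (2**256 - 1, 1)]:
    assert add_split(a, b, 256) == (a + b) % 2**256
```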
As mentioned in the guide, DIV is a very slow operation: 36 cycles, and up to 95 cycles of latency if you want to reuse the result immediately.
In comparison, virtually everything else takes 1 cycle.
Unfortunately for modular arithmetic, DIV is used for both division and modulo/remainder.
Because it's so, so slow, even naively converting a multiplication into a series of modular additions is much faster: https://github.com/status-im/nim-stint/blob/94fc521/stint/modular_arithmetic.nim#L50-L66
func mulmod_internal(a, b, m: StUint): StUint {.inline.} =
  ## Does (a * b) mod m. Assume a < m and b < m
  ## Internal proc - used in powmod
  doAssert a < m
  doAssert b < m

  var (a, b) = (a, b)
  if b > a:
    swap(a, b)

  while not b.isZero:
    if b.isOdd:
      result = result.addmod_internal(a, m)
    a = doublemod_internal(a, m)
    b = b shr 1
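A direct Python transcription of this routine, with addmod written out as an add plus a conditional subtract (no division anywhere):

```python
# Double-and-add mulmod, a Python transcription of the Nim routine above:
# (a * b) mod m using only modular additions and doublings, never DIV.
def addmod(x: int, y: int, m: int) -> int:
    s = x + y                 # x, y < m, so s < 2m
    return s - m if s >= m else s

def mulmod(a: int, b: int, m: int) -> int:
    assert a < m and b < m
    if b > a:
        a, b = b, a           # iterate over the smaller operand's bits
    result = 0
    while b:
        if b & 1:
            result = addmod(result, a, m)
        a = addmod(a, a, m)   # doublemod
        b >>= 1
    return result

assert mulmod(45, 67, 97) == (45 * 67) % 97
```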
For a random 256-bit integer, i.e. 50% of the bits (128) are 1 and the rest are 0, this does 256 modular doublings and 128 additions, and it is still 3x faster than the 4 divisions that nim-bigints does (it uses uint32 words instead of uint64).
Note: this algorithm (double-and-add) is the same algorithm used to build exponentiation from multiplication (and both stint and nim-bigints implement exponentiation with the same algorithm).
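So the same double-and-add pattern appears twice, stacked: exponentiation via square-and-multiply over a mulmod that is itself double-and-add, meaning the whole powmod never executes a DIV. A self-contained sketch (fine for toy sizes; for 2048-bit inputs this pure-Python version would of course be very slow):

```python
# Double-and-add stacked twice: powmod built from a division-free mulmod.
def addmod(x: int, y: int, m: int) -> int:
    s = x + y
    return s - m if s >= m else s

def mulmod(a: int, b: int, m: int) -> int:
    result = 0
    while b:
        if b & 1:
            result = addmod(result, a, m)
        a = addmod(a, a, m)
        b >>= 1
    return result

def powmod(base: int, exp: int, m: int) -> int:
    result = 1 % m
    base %= m
    while exp:
        if exp & 1:
            result = mulmod(result, base, m)   # multiply step
        base = mulmod(base, base, m)           # square step
        exp >>= 1
    return result

assert powmod(3, 5, 7) == pow(3, 5, 7)
```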
If we could use multiplication, which is quadratic in the number of words, i.e. a cost of 16 = 4² (since 256 bits = 4 × 64-bit words), we can account for an asymptotic 24x speedup.
nim-bigints uses explicit modulo operations, which are very slow.
Besides, it uses uint32 instead of uint64, and unfortunately the cost is not a mere 2x. Multiplication costs grow with the square of the number of limbs, so we're looking at 4 uint64 limbs vs 8 uint32 limbs, with a cost of 16 asymptotic operations vs 64, a factor of 4x.
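The quadratic limb cost is easy to see in a schoolbook multiplication sketch that counts word multiplications: 4 × 64-bit limbs need 16 of them, 8 × 32-bit limbs need 64.

```python
# Schoolbook multiplication is quadratic in the limb count; counting the
# word multiplications makes the 16-vs-64 (4x) factor above concrete.
def schoolbook_mul(a: list[int], b: list[int], word_bits: int):
    mask = (1 << word_bits) - 1
    out = [0] * (len(a) + len(b))
    mul_count = 0
    for i, x in enumerate(a):            # little-endian limb order
        carry = 0
        for j, y in enumerate(b):
            t = out[i + j] + x * y + carry
            mul_count += 1
            out[i + j] = t & mask
            carry = t >> word_bits
        out[i + len(b)] += carry
    return out, mul_count

def to_limbs(n: int, word_bits: int, count: int) -> list[int]:
    return [(n >> (word_bits * i)) & ((1 << word_bits) - 1) for i in range(count)]

def from_limbs(limbs: list[int], word_bits: int) -> int:
    return sum(l << (word_bits * i) for i, l in enumerate(limbs))

n = 2**200 + 12345                       # fits in 256 bits
prod64, muls64 = schoolbook_mul(to_limbs(n, 64, 4), to_limbs(n, 64, 4), 64)
prod32, muls32 = schoolbook_mul(to_limbs(n, 32, 8), to_limbs(n, 32, 8), 32)
assert from_limbs(prod64, 64) == n * n   # the product itself is correct
assert (muls64, muls32) == (16, 64)      # 4^2 vs 8^2 word multiplications
```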
Furthermore, nim-bigints does not reuse buffers and always allocates new ones, which is incredibly stressful for allocators. In particular, with CPU speeds nowadays, cache and memory bandwidth are the bottleneck; keeping data hot in cache is extremely important.
See:
All of this together makes nim-bigints 230x slower than Constantine/GMP (with Clang; GCC codegen for bigints is meh).
Thanks :)
On that front, Constantine is being integrated into Google's continuous fuzzing framework, OSS-Fuzz: https://github.com/google/oss-fuzz/pull/10710
The coverage is quite complete, as it is fuzzed on both 32-bit and 64-bit.
> Because it's so so slow, even naively converting multiplication in a series of modular additions is much faster:
Dunno about this one. After fixing the insanely slow recursion, changing the implementation to a simple, natural mul + divrem 100x'd the performance, and that is the textbook implementation that doesn't even try to be efficient (i.e. no special cases for different moduli, small numbers, Montgomery and whatnot). I'm sure this is trivial to both benchmark and implement better, but the iteration approach sure doesn't look like a winner from this sample of one :)
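For clarity, the "simple natural mul + divrem" approach described here is plain square-and-multiply where every step does a full multiplication followed by a divrem for the reduction, one DIV-based reduction per step, with no Montgomery or Barrett machinery:

```python
# Textbook square-and-multiply powmod: each step is a full multiply
# followed by divmod (mul + divrem), the approach described above.
def powmod_divrem(base: int, exp: int, m: int) -> int:
    result = 1 % m
    base %= m
    while exp:
        if exp & 1:
            _, result = divmod(result * base, m)   # mul + divrem
        _, base = divmod(base * base, m)           # square + divrem
        exp >>= 1
    return result

assert powmod_divrem(3, 5, 7) == pow(3, 5, 7)
```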