I am working on an MTProto client in Nim, and the key exchange process requires handling big integers (including primes) with modular exponentiation. Currently I am using the stint library's powmod, but I noticed that it is really slow (5 seconds even with -d:release). The numbers look like this:
powmod(3, 5353298235368615715086015989500386287229041352472878638071518666228109600330630011721412968955794129220900515437949650083539753416976065858168797034438041953432098671933385654459949759916692249948136470581833023646089705937609998186267612873792624931572076651507178185195496822727062814431558558200731675762237505627859229808348408417805267472578881525847798784260164598636480829749664441945255164579524883980750891720882036701554976549860300202114749723141961486427137188380250834468000036033392412994077172956841103525797390019597180655219915585699771458390767411239503401306475092402193456770859145072367454040329, 25135566567101483196994790440833279750474660393232382279277736257066266618532493517139001963526957179514521981877335815379755618191324858392834843718048308951653115284529736874534289456833723962912807104017411854314007953484461899139734367756070456068592886771130491355511301923675421649355211882120329692353507392677087555292357140606251171702417804959957862991259464749806480821163999054978911727901705780417863120490095024926067731615229486812312187386108568833026386220686253160504779704721744600638258183939573405528962511242337923530869616215532193967628076922234051908977996352800560160181197923404454023908443)
Now, it seems like a complex operation, but Python's equivalent of powmod, for example, takes only 59 milliseconds. Is this an issue with the stint library (maybe it is not optimized?), and is there any faster alternative in Nim?
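For reference, the Python comparison is the built-in three-argument pow(), which CPython implements in C with a windowed exponentiation. A minimal sketch of that measurement (with small stand-in values rather than the full 2048-bit integers from the post, which you would substitute in):

```python
# Python's three-argument pow() is the "powmod" being compared against.
# Small stand-in values here; substitute the full 2048-bit numbers from
# the post to reproduce the ~59 ms figure.
import time

assert pow(3, 5, 7) == 5  # 3**5 = 243, 243 mod 7 = 5

start = time.perf_counter()
pow(3, 2**2047 + 1, 2**2048 - 159)  # a 2048-bit-sized workload
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{elapsed_ms:.1f} ms")
```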
My guess is that this difference in implementation actually explains the different timings in your case, but I have not looked into the details and look forward to what others more knowledgeable than me will say.
For alternative implementations you might want to try GMP, of which the most up-to-date wrapper is this one. Note that this, being based on GMP, is ultimately bound by the GPL license.
You should use a cryptographic library for securing keys, not a general-purpose library that does not optimize for security.
In particular, since you use powmod, I assume you use it for RSA? In that case powmod is a critical function that can easily be attacked.
Assuming you need cryptographic security, you have a few options.
Now, regarding Stint vs Python: the slowness has several causes.
I can post an example with Constantine and benchmarks later in the day but as mentioned before, while I need very optimized bigint for cryptographic size in Constantine, I don't provide an API to the raw internals.
I'm currently also using stint for RSA; I initially used OpenSSL, but since it was adding PKCS#1 padding (and MTProto doesn't want that), I opted for this library.
But the main issue is not related to that problem; what I am doing is explained here: https://core.telegram.org/mtproto/auth_key#presenting-proof-of-work-server-authentication
Specifically, it is when g_b is calculated, and the auth_key (powmod(gA, b, dhPrime)). This is the part where everything is "slow".
Anyway, I will try to use an external library if there is no native alternative. Thank you.
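For readers unfamiliar with the linked page, the hot spot is an ordinary Diffie-Hellman step: two modular exponentiations over a 2048-bit prime. A rough sketch with illustrative names and toy parameters (Python's built-in pow stands in for powmod):

```python
# Sketch of the DH step from the linked MTProto docs (names illustrative):
# the client picks a random 2048-bit b, sends g_b = g^b mod dh_prime, and
# derives auth_key = g_a^b mod dh_prime. Both are single modular
# exponentiations - exactly the operation that is slow in stint.
import secrets

def dh_step(g: int, g_a: int, dh_prime: int) -> tuple[int, int]:
    b = secrets.randbits(2048)
    g_b = pow(g, b, dh_prime)         # sent to the server
    auth_key = pow(g_a, b, dh_prime)  # the shared secret
    return g_b, auth_key

# Tiny toy parameters just to exercise the flow (the real dh_prime is 2048-bit):
g, dh_prime = 3, 23
a = 6                                 # the server's secret in this toy run
g_a = pow(g, a, dh_prime)
g_b, auth_key = dh_step(g, g_a, dh_prime)
assert auth_key == pow(g_b, a, dh_prime)  # both sides derive the same key
```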
Here is an implementation of modular exponentiation in Constantine. It takes 8 ms on my machine:
# nim c -d:danger -r --outdir:build build/pow_bench.nim

import
  ../constantine/[
    arithmetic,
    io/io_bigints,
    config/precompute,
    arithmetic/limbs_modular,
    arithmetic/limbs
  ],
  std/[monotimes, times]

proc main() =
  # a = 3
  let a = BigInt[2048].fromHex"3"

  # e = 5353298235368615715086015989500386287229041352472878638071518666228109600330630011721412968955794129220900515437949650083539753416976065858168797034438041953432098671933385654459949759916692249948136470581833023646089705937609998186267612873792624931572076651507178185195496822727062814431558558200731675762237505627859229808348408417805267472578881525847798784260164598636480829749664441945255164579524883980750891720882036701554976549860300202114749723141961486427137188380250834468000036033392412994077172956841103525797390019597180655219915585699771458390767411239503401306475092402193456770859145072367454040329
  let exponent = BigInt[2048].fromHex"2a6802a7d7bfdf88b3790a49e0f8244ba89110ef49d96515b3064490f22097aeb2651ad2957c7fbc7d224387f606649b2e7364f0b61919ac71dbd7350503385523342bfca2067188858407dc68c84c15b52690f26572dab6d2bafb14cbe049e9c9b7b6ef90d1bef58a38d2687c207d0545c1cd04970587ef531e4c9788f6bdcf38c3011c4df69c904d761e505c5d4dcc4fa17ccbfa7a8d56750aae38984be4760c637cb2b6f8c68764d985047400e17fac9950701f194cb5c06500ab41ad98282a0b0af6204f15e053a34df72b0b5bf6cdc9afad4bcacd2348d65d1dfa144380262f23be8edad05948472691f6a2131cab76bb1c408f44b5a9fc32d92a9eb109"

  # For now the modulus is required to be a constant in Constantine,
  # as the Montgomery magic constants are expected to be const;
  # this can easily be lifted for RSA.
  # M = c71caeb9c6b1c9048e6c522f70f13f73980d40238e3e21c14934d037563d930f48198a0aa7c14058229493d22530f4dbfa336f6e0ac925139543aed44cce7c3720fd51f69458705ac68cd4fe6b6b13abdc9746512969328454f18faf8c595f642477fe96bb2a941d5bcd1d4ac8cc49880708fa9b378e3c4f3a9060bee67cf9a4a4a695811051907e162753b56b0f6b410dba74d8a84b2a14b3144e0ef1284754fd17ed950d5965b4b9dd46582db1178d169c6bc465b0d6ff9ca3928fef5b9ae4e418fc15e83ebea0f87fa9ff5eed70050ded2849f47bf959d956850ce929851f0d8115f635b105ee2e4e15d04b2454bf6f4fadf034b10403119cd8e3b92fcc5b
  const modulus = BigInt[2048].fromHex"c71caeb9c6b1c9048e6c522f70f13f73980d40238e3e21c14934d037563d930f48198a0aa7c14058229493d22530f4dbfa336f6e0ac925139543aed44cce7c3720fd51f69458705ac68cd4fe6b6b13abdc9746512969328454f18faf8c595f642477fe96bb2a941d5bcd1d4ac8cc49880708fa9b378e3c4f3a9060bee67cf9a4a4a695811051907e162753b56b0f6b410dba74d8a84b2a14b3144e0ef1284754fd17ed950d5965b4b9dd46582db1178d169c6bc465b0d6ff9ca3928fef5b9ae4e418fc15e83ebea0f87fa9ff5eed70050ded2849f47bf959d956850ce929851f0d8115f635b105ee2e4e15d04b2454bf6f4fadf034b10403119cd8e3b92fcc5b"

  # Bench notes: we do not measure hex conversion, because in practice
  # we use raw bytes, which are significantly less costly.
  # We do measure constant computation, but "montyOne" is an
  # intermediate result of r2ModM.
  let start = getMonotime()

  # Montgomery constants
  const m0ninv = negInvModWord(modulus)
  let r2ModM = r2Mod(modulus)
  let montyOne = montyOne(modulus)

  # Can use the "no-carry" fast path
  const hasNoCarryMul = useNoCarryMontyMul(modulus)
  const hasNoCarrySquare = useNoCarryMontySquare(modulus)

  # Montgomery conversion
  var aMont{.noInit.}: BigInt[2048]
  aMont.montyResidue(a, modulus, r2ModM, m0ninv, hasNoCarryMul)

  # Compute the Montgomery exponentiation
  # using a constant-time algorithm to protect
  # against side-channel and timing attacks
  # (similar to Spectre & Meltdown)
  aMont.montyPow( # Alternatively montyPowUnsafeExponent if the exponent is not a secret key
    exponent,     # the exponent can also be raw big-endian bytes, the usual crypto serialization of bigints
    modulus,
    # Montgomery constants
    montyOne, m0ninv,
    # Optimizations
    windowSize = 5, # Windowed exponentiation
    canUseNoCarryMontyMul = hasNoCarryMul,
    canUseNoCarryMontySquare = hasNoCarrySquare
  )
  # TODO: accelerate squarings; currently they use the same codepath as
  # multiplication. This would give a ~15% perf increase for modular
  # exponentiation.

  # Return to the canonical domain
  var r {.noInit.}: BigInt[2048]
  r.redc(aMont, modulus, m0ninv, hasNoCarryMul)

  let stop = getMonotime()

  # Print
  echo r.toHex()
  echo "Time: ", inMilliseconds(stop-start), " ms"

main()
0x730975c8d06d8e84fb96799b65b0b65db62a78d02bce03d1175ec9a55310cce50df956a2f937436d475d86979e9e4f75b931916b4203c68545a023e2faf0f426d6ec5940ddab15148d04ba7cc2fa16a02721a64451d6898202e87a849fcc5832deaa2077dd0a121d4db8083974d4b9aa84c871d4a82084949fa2e69f44a84af44054c24d33b9a7d7ed0571a39bd387cea0e918f28dc6c2b0c97b5e30fd5fea57948113c66d78d3b007f7150ae231c74c68a38220b8a64a3bde9f656a85f9e93043bdeae7dc5486101c1edfec7820b68caa8854ac7a7241da66464a190192644fab08610864ef42752c32f62f4ab4fb162b304ce11644e1ce0e553bca32a0aed0
Regarding RSA, don't use Stint; use nim-bearssl: https://github.com/status-im/nim-bearssl/blob/ba5f468/bearssl/decls.nim#L3672-L3688
I have read your post, but I do not have much time to answer.
Barrett and Montgomery techniques improve the reduction, which is crucial indeed. Unsigned integer arithmetic is slow, but 256-bit primes do not fit in floating-point types, so it is basically our only option. I am surprised that OpenSSL did not, at that time, provide OAEP and PSS (for encryption and signatures respectively) as alternatives to PKCS#1 v1.5.
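To make the Montgomery technique mentioned here concrete: it trades the expensive per-step division by the modulus for a mask and a shift. A textbook (non-constant-time, word-at-once) Python sketch of Montgomery reduction, REDC:

```python
# Textbook Montgomery reduction (REDC): computes t * R^-1 mod n without
# dividing by n. With R = 2^k, "mod R" is a mask and "div R" is a shift.
# Illustrative sketch only - real implementations work limb by limb.
def montgomery_redc(t: int, n: int, k: int) -> int:
    r = 1 << k
    n_prime = pow(-n, -1, r)   # -n^-1 mod R (n must be odd); Python 3.8+
    m = (t * n_prime) % r      # cheap: mod R is a mask
    u = (t + m * n) >> k       # cheap: t + m*n is divisible by R
    return u - n if u >= n else u

n, k = 97, 8                   # R = 256 > n, gcd(R, n) = 1
R = 1 << k
a, b = 45, 67
# Multiply in the Montgomery domain: convert in, multiply + REDC, convert out.
a_mont, b_mont = (a * R) % n, (b * R) % n
prod_mont = montgomery_redc(a_mont * b_mont, n, k)   # = a*b*R mod n
result = montgomery_redc(prod_mont, n, k)            # back to canonical form
assert result == (a * b) % n
```

The conversions in and out cost something, which is why it pays off for exponentiation (many multiplications between one conversion in and one conversion out), as in the Constantine benchmark above.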
The fundamental problem with stint currently is a consequence of its unfortunate recursive design which renders every single operation exponentially slower than it has to be in the number of bytes used for the size of the integer.
The first step for any stint work is to replace the implementation with a simple array-based backend. Only then does it make sense to start thinking about anything beyond the most trivial implementations. Getting to this point is something of a priority that we might look into soon, and this alone would likely make the library "good enough" for basic use, including the one pointed out in this thread. That is, this simple change would get it to a point where the OP likely wouldn't have bothered to start a thread about it, even without any compiler intrinsics, assembly code and so on ;)
> The fundamental problem with stint currently is a consequence of its unfortunate recursive design which renders every single operation exponentially slower than it has to be in the number of bytes used for the size of the integer.
Actually, the main slowness issue is currently that the carry propagation needs to recompute the carry. However, this can be fixed even with a recursive design, following the design here, by adding an add-with-carry primitive: https://github.com/linbox-team/givaro/blob/de943160/src/kernel/recint/radd.h#L65-L70 (which I didn't do because at the time it seemed unnecessarily complex, alas).
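The idea of the add-with-carry primitive in the linked givaro code can be sketched in Python over 64-bit limbs: each limb addition returns (sum, carry out), and the carry is threaded through in a single pass instead of being recomputed afterwards.

```python
# Multi-limb addition with an explicit add-with-carry primitive, the
# approach the linked givaro radd.h takes at the C++ level.
MASK64 = (1 << 64) - 1

def addc(x: int, y: int, carry: int) -> tuple[int, int]:
    s = x + y + carry
    return s & MASK64, s >> 64        # (low 64 bits, carry out)

def add_limbs(a: list[int], b: list[int]) -> tuple[list[int], int]:
    carry, out = 0, []
    for x, y in zip(a, b):            # little-endian limb order
        s, carry = addc(x, y, carry)
        out.append(s)
    return out, carry

# (2^128 - 1) + 1 ripples a carry through every limb:
limbs, carry = add_limbs([MASK64, MASK64], [1, 0])
assert limbs == [0, 0] and carry == 1   # i.e. the result is 2^128
```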
I have finished implementing a new arbitrary-precision backend in Constantine for powmod, i.e. unlike the fixed-precision BigInt[256] or BigInt[1024], it can deal with runtime sizes.
PR: https://github.com/mratsim/constantine/pull/242
Benchmarks show that performance is within 85% to 140% of GMP for 256-bit inputs, exponents and moduli.
When Stint was released, the goal was to quickly have a replacement for ttmath (https://github.com/status-im/nim-ttmath), a C++ header-only library. Because of the C++, it caused many issues for new joiners at Nimbus, with whack-a-mole errors like "flexible array member" preventing usage in a seq (https://github.com/status-im/nim-ttmath/issues/10).
At the time (April 2018) the goal was to have uint256 and modular operations were not a target at all.
> The fundamental problem with stint currently is a consequence of its unfortunate recursive design which renders every single operation exponentially slower than it has to be in the number of bytes used for the size of the integer.
To be more precise, the issue is in addition/subtraction: Stint works by half sizes. With a layout like [a.lo, a.hi] and [b.lo, b.hi], it computes r.lo = a.lo + b.lo and r.hi = a.hi + b.hi + carry. The issue is determining the carry; currently it checks r.lo < a.lo. For uint128 this is fine, it's a u64 comparison; unfortunately, for uint256 it becomes a uint128 comparison, and for uint512 a uint256 comparison.
So the extra cost on uint256 is 3 passes over uint128 instead of 2, a factor of 1.5x. This does not explain an 80x performance difference with the Constantine implementation.
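The split-into-halves scheme described here can be sketched in Python with a full-width integer standing in for the type; the comparison that detects the carry is the costly part, since at each recursion level it is itself a half-width operation.

```python
# Stint-style addition as described above: split the operands into halves,
# add each half, and detect the carry via the comparison r.lo < a.lo.
def add_split(a: int, b: int, bits: int) -> int:
    half = bits // 2
    mask = (1 << half) - 1
    lo = (a & mask) + (b & mask)
    carry = 1 if (lo & mask) < (a & mask) else 0   # the costly comparison
    hi = (a >> half) + (b >> half) + carry
    return ((hi << half) | (lo & mask)) & ((1 << bits) - 1)

# Matches plain wrap-around addition modulo 2^256:
for a, b in [(2**255 - 19, 2**254 + 7), (123, 456), (2**256 - 1, 1)]:
    assert add_split(a, b, 256) == (a + b) % 2**256
```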
As mentioned in the guide, DIV is a very slow operation: 36 cycles, and up to 95 cycles of latency if you want to reuse the result immediately.
In comparison, virtually everything else takes 1 cycle.
Unfortunately for modular arithmetic, DIV is used for both division and modulo/remainder.
Because it's so, so slow, even naively converting a multiplication into a series of modular additions is much faster: https://github.com/status-im/nim-stint/blob/94fc521/stint/modular_arithmetic.nim#L50-L66
func mulmod_internal(a, b, m: StUint): StUint {.inline.} =
  ## Does (a * b) mod m. Assume a < m and b < m
  ## Internal proc - used in powmod
  doAssert a < m
  doAssert b < m

  var (a, b) = (a, b)
  if b > a:
    swap(a, b)

  while not b.isZero:
    if b.isOdd:
      result = result.addmod_internal(a, m)
    a = doublemod_internal(a, m)
    b = b shr 1
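A direct Python transcription of this routine, with addmod written out as an add plus a conditional subtract (no division anywhere):

```python
# Double-and-add mulmod, a Python transcription of the Nim routine above:
# (a * b) mod m using only modular additions and doublings, never DIV.
def addmod(x: int, y: int, m: int) -> int:
    s = x + y                 # x, y < m, so s < 2m
    return s - m if s >= m else s

def mulmod(a: int, b: int, m: int) -> int:
    assert a < m and b < m
    if b > a:
        a, b = b, a           # iterate over the smaller operand's bits
    result = 0
    while b:
        if b & 1:
            result = addmod(result, a, m)
        a = addmod(a, a, m)   # doublemod
        b >>= 1
    return result

assert mulmod(45, 67, 97) == (45 * 67) % 97
```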
For a random 256-bit integer, i.e. 50% of the bits (128) are 1 and the rest are 0, this does 256 modular doublings and 128 additions, and it is still 3x faster than the 4 divisions that nim-bigints does (it uses uint32 words instead of uint64).
Note: this algorithm (double-and-add) is the same algorithm used to build exponentiation from multiplication (and both stint and nim-bigints implement exponentiation with the same algorithm).
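So the same double-and-add pattern appears twice, stacked: exponentiation via square-and-multiply over a mulmod that is itself double-and-add, meaning the whole powmod never executes a DIV. A self-contained sketch (fine for toy sizes; for 2048-bit inputs this pure-Python version would of course be very slow):

```python
# Double-and-add stacked twice: powmod built from a division-free mulmod.
def addmod(x: int, y: int, m: int) -> int:
    s = x + y
    return s - m if s >= m else s

def mulmod(a: int, b: int, m: int) -> int:
    result = 0
    while b:
        if b & 1:
            result = addmod(result, a, m)
        a = addmod(a, a, m)
        b >>= 1
    return result

def powmod(base: int, exp: int, m: int) -> int:
    result = 1 % m
    base %= m
    while exp:
        if exp & 1:
            result = mulmod(result, base, m)   # multiply step
        base = mulmod(base, base, m)           # square step
        exp >>= 1
    return result

assert powmod(3, 5, 7) == pow(3, 5, 7)
```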
If we could use multiplication, which is quadratic in the number of words, i.e. a cost of 16 = 4² (since 256 bits = 4 × 64-bit words), we can account for an asymptotic 24x speedup.
nim-bigints uses explicit modulo operations, which are very slow.
Besides, it uses uint32 instead of uint64, and unfortunately the cost is not a mere 2x. Multiplication costs grow with the square of the number of limbs, so we're looking at 4 uint64 limbs vs 8 uint32 limbs, with a cost of 16 asymptotic operations vs 64, a factor of 4x.
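The quadratic limb cost is easy to see in a schoolbook multiplication sketch that counts word multiplications: 4 × 64-bit limbs need 16 of them, 8 × 32-bit limbs need 64.

```python
# Schoolbook multiplication is quadratic in the limb count; counting the
# word multiplications makes the 16-vs-64 (4x) factor above concrete.
def schoolbook_mul(a: list[int], b: list[int], word_bits: int):
    mask = (1 << word_bits) - 1
    out = [0] * (len(a) + len(b))
    mul_count = 0
    for i, x in enumerate(a):            # little-endian limb order
        carry = 0
        for j, y in enumerate(b):
            t = out[i + j] + x * y + carry
            mul_count += 1
            out[i + j] = t & mask
            carry = t >> word_bits
        out[i + len(b)] += carry
    return out, mul_count

def to_limbs(n: int, word_bits: int, count: int) -> list[int]:
    return [(n >> (word_bits * i)) & ((1 << word_bits) - 1) for i in range(count)]

def from_limbs(limbs: list[int], word_bits: int) -> int:
    return sum(l << (word_bits * i) for i, l in enumerate(limbs))

n = 2**200 + 12345                       # fits in 256 bits
prod64, muls64 = schoolbook_mul(to_limbs(n, 64, 4), to_limbs(n, 64, 4), 64)
prod32, muls32 = schoolbook_mul(to_limbs(n, 32, 8), to_limbs(n, 32, 8), 32)
assert from_limbs(prod64, 64) == n * n   # the product itself is correct
assert (muls64, muls32) == (16, 64)      # 4^2 vs 8^2 word multiplications
```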
Furthermore, nim-bigints does not reuse buffers and always allocates new ones, which is incredibly stressful for allocators. In particular, with CPU speeds nowadays, cache and memory bandwidth are the bottleneck; keeping data hot in cache is extremely important.
See:
All of this together makes nim-bigints 230x slower than Constantine/GMP (with Clang; GCC codegen for bigints is meh).
Thanks :)
On that front, Constantine is being integrated into Google's continuous fuzzing framework, OSS-Fuzz: https://github.com/google/oss-fuzz/pull/10710
The coverage is quite complete, as it is fuzzed on both 32-bit and 64-bit.
> Because it's so so slow, even naively converting multiplication in a series of modular additions is much faster:
Dunno about this one. After fixing the insanely slow recursion, changing the implementation to a simple, natural mul + divrem 100x'd the performance, and that is the textbook implementation that doesn't even try to be efficient (i.e. no special cases for different moduli, small numbers, Montgomery and whatnot). I'm sure this is trivial to both benchmark and implement better, but the iteration approach sure doesn't look like a winner from this sample of one :)
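For clarity, the "simple natural mul + divrem" approach described here is plain square-and-multiply where every step does a full multiplication followed by a divrem for the reduction, one DIV-based reduction per step, with no Montgomery or Barrett machinery:

```python
# Textbook square-and-multiply powmod: each step is a full multiply
# followed by divmod (mul + divrem), the approach described above.
def powmod_divrem(base: int, exp: int, m: int) -> int:
    result = 1 % m
    base %= m
    while exp:
        if exp & 1:
            _, result = divmod(result * base, m)   # mul + divrem
        _, base = divmod(base * base, m)           # square + divrem
        exp >>= 1
    return result

assert powmod_divrem(3, 5, 7) == pow(3, 5, 7)
```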