The algorithm iterates 3790200 times and produces a factor of 535006138814359
- Using the BigInt library: 106.3 seconds
- Using the BigNum library: 1.38 seconds
- Using OCaml with zarith: 1.2 seconds
- Using F#: about 15 seconds
BigInt allows coding that closely follows the algorithm:

```nim
x = (x*x + c) mod n  # all variables are BigInts
```
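For context, here is a minimal sketch of the kind of rho loop being timed, assuming the nim-lang/bigints package (my actual rho1.nim is not shown; the demo modulus below is just a small semiprime):

```nim
import bigints

proc gcdBig(a, b: BigInt): BigInt =
  ## Euclid's algorithm, spelled out in case the package
  ## doesn't export a gcd for BigInt.
  var (x, y) = (a, b)
  let zero = initBigInt(0)
  while y != zero:
    (x, y) = (y, x mod y)
  x

proc pollardRho(n: BigInt): BigInt =
  ## Floyd-cycle variant of Pollard's rho with f(x) = x*x + 1 mod n.
  let c = initBigInt(1)
  let one = initBigInt(1)
  var x = initBigInt(2)
  var y = x
  var d = one
  while d == one:
    x = (x * x + c) mod n          # tortoise: one step
    y = (y * y + c) mod n          # hare: two steps
    y = (y * y + c) mod n
    d = gcdBig(if x > y: x - y else: y - x, n)
  d  # a non-trivial factor, or n if this c was unlucky

echo pollardRho(initBigInt("1000000016000000063"))  # 1000000007 * 1000000009
```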
BigNum is less readable, but acceptable to me:

```nim
x = `mod`(x, add(t2, mul(t1, x, x), c), n)
# or
discard `mod`(x, add(t2, mul(t1, x, x), c), n)
```
I find the discard statement ugly, and the alternative only adds about 0.01 seconds to the run. t1 and t2 are temporary variables whose contents are not otherwise used, since add and mul produce a return value.
So my conclusion is that BigInt is not suitable for rho-type problems with big numbers to factor, but it is very readable and suitable for less intensive problems.
Maybe also of interest:
https://github.com/mratsim/constantine/tree/master/constantine/math/arithmetic
If I compile bigints with

```
nim c -d:release --passC:-flto --passL:-s --mm:arc rho1.nim
```

I get improved timings of:

```
real 1m24.066s
user 1m24.060s
sys  0m0.000s
```
Heya, I don't really have anything to contribute, but I recently uploaded https://fsh.github.io/integers/integers.html which is intended to provide ergonomic integers (wrapping GMP) and seems kind of maybe-possibly-perhaps adjacent to what you're looking for.
No guarantees regarding polish or bugs or other platforms (I make some low-end assumptions), and as far as I know I've been the only user.
As for opinion pieces on bigint libraries in general, I also wrote a little note on the rather sad state of GMP vs "anything else" at https://github.com/fsh/integers#why-gmp
Why GMP?
Unfortunately there is just no beating GMP when it comes to performance. It is the unequivocal gold standard.
All “home-made” or “from scratch” integer implementations by hobbyists, such as Nim’s official bigints (and even Python’s built-in int), are laughably slow compared to GMP, often by orders of magnitude.
Alternative projects such as libtommath come much closer, but are still not quite as optimized as GMP.
The reason projects shy away from GMP is two-fold:
- The “serious” reason is that GMP is dual-licensed with LGPLv3 and GPLv2, and many consider even the LGPL too restrictive or convoluted (e.g. static linking can be problematic), so this makes it simply a no-go for many projects.
- The less practical reason is that implementing one’s own arbitrary-precision integers from scratch is (at the outset) a very attractive and exciting project for many programmers, so it is a common mind trap to fall into.
The first reason is unavoidable and unfortunate. The second reason is very understandable (I feel the same pull), although it is incredibly frustrating when I’m stuck with these cowboy implementations as a user.
Since I am of the opinion that (big) integers are a core data type that should be universally available, which implicitly demands that it be as efficient as humanly possible, I’m always going to bite the bullet on the (L)GPL license.
Being as fast as GMP is a three-fold issue: compiler code generation, asymptotically better algorithms, and engineering details like aliasing and compile-time support. But it is doable; for example, for sizes suitable for cryptography, my library is 3.8x faster than GMP on basic bigint multiplication. The ratio is probably 8x on modular multiplication.
How?
C code for add-with-carry on 256-bit integers:

```c
#include <stdint.h>
#include <x86intrin.h>

// a <- a + b over four 64-bit limbs, chaining the hardware carry flag
void add256(uint64_t a[4], uint64_t b[4]){
    uint8_t carry = 0;
    for (int i = 0; i < 4; ++i)
        carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
}
```
GCC 9.2 ASM:

```asm
add256:
        movq    (%rsi), %rax
        addq    (%rdi), %rax
        setc    %dl
        movq    %rax, (%rdi)
        movq    8(%rdi), %rax
        addb    $-1, %dl
        adcq    8(%rsi), %rax
        setc    %dl
        movq    %rax, 8(%rdi)
        movq    16(%rdi), %rax
        addb    $-1, %dl
        adcq    16(%rsi), %rax
        setc    %dl
        movq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        addb    $-1, %dl
        adcq    %rax, 24(%rdi)
        ret
```
Clang 9.0 ASM:

```asm
add256:
        movq    (%rsi), %rax
        addq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        adcq    %rax, 24(%rdi)
        retq
```

Note how GCC spills the carry to %dl and restores it (setc/addb) between every limb, while Clang keeps the whole chain in adc instructions.
Now, porting my cryptographic optimizations to a generic arbitrary-precision library is something I started in https://github.com/SciNim/theo. However, then you get new time-consuming engineering challenges beyond just the algorithmic ones, for example for add: https://github.com/SciNim/theo/blob/9172900/theo/op_addsub.nim#L27-L35
```nim
func add*(r {.noalias.}: var BigInt, a, b: BigInt) =
  ## BigInt addition
  # TODO:
  # - relax the aliasing constraint
  # - dispatch on fixed size add for add that fits in registers
  #   and build add recursively
  #   to avoid loop counters resetting carry chains.
  # - Support negative inputs
  # - Support compile-time
```
For multiplication, you then need to implement, in increasing levels of overhead/acceleration: product-scanning (Comba) multiplication, Karatsuba multiplication, Toom-Cook multiplication (that one only pays off with thousands of digits, so a stretch goal), and FFT multiplication. A sketch of the first level follows.
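To make the first level concrete, here is a minimal sketch of product-scanning multiplication on 32-bit limbs (little-endian limb order assumed); `Limb` and `mulComba` are illustrative names, not Constantine's API:

```nim
type Limb = uint32

proc mulComba(a, b: openArray[Limb]): seq[Limb] =
  ## Product scanning: produce one output limb (column) per outer
  ## iteration, accumulating partial products in a 3-word accumulator.
  result = newSeq[Limb](a.len + b.len)
  var c0, c1, c2: uint64                        # lowest word first
  for k in 0 ..< result.len:
    for i in max(0, k + 1 - b.len) .. min(k, a.len - 1):
      let t = uint64(a[i]) * uint64(b[k - i])   # 32x32 -> 64 bits
      c0 += t and 0xFFFF_FFFF'u64
      c1 += (t shr 32) + (c0 shr 32)            # high half + carry from c0
      c0 = c0 and 0xFFFF_FFFF'u64
      c2 += c1 shr 32                           # carry from c1
      c1 = c1 and 0xFFFF_FFFF'u64
    result[k] = Limb(c0)
    (c0, c1, c2) = (c1, c2, 0'u64)              # shift the accumulator

# 0xFFFFFFFF * 0xFFFFFFFF = 0xFFFFFFFE_00000001
echo mulComba([0xFFFF_FFFF'u32], [0xFFFF_FFFF'u32])  # @[1, 4294967294]
```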
For division, expect days of debugging; everyone gets it wrong, even Google in Go and JavaScript: https://skanthak.homepage.t-online.de/division.html
Now, if someone is interested in implementing a fast bigint library for Nim, I can guide them; lots of the engineering challenges were already solved in Constantine.
Heya mratsim! When I first got into Nim, you're the person who kept coming up time and time again in the stuff I was searching for (including feature requests), so I think we have a lot of overlapping interests (even Go, judging by your GitHub picture).
But it is doable; for example, for sizes suitable for cryptography
Of course, I totally agree. I didn't say it outright in my note, but I was implicitly thinking of (and being frustrated with) the general use cases, such as Python's homemade bigints, etc. Having statically known sizes is a huge benefit when working with 𝔽_q or elliptic curves for specific applications, naturally. Then again, one could argue the low-level mpn_* API is part of GMP, and that could probably be used in the statically sized case for big speed-ups, though getting things to inline properly might be tricky, so that's a potential barrier. (Also worth a note: in the general case, one potentially big thing I don't think has been taken to its full potential is something like what C++ does with short-string optimization. I was thinking of trying something like that myself at some point.)
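As a sketch of that mpn_* idea, here is roughly what a hand-written Nim binding for a fixed four-limb add could look like; the wrapper and `add256` are assumptions for illustration, only `mpn_add_n` itself is real GMP API:

```nim
# Sketch: using GMP's low-level mpn layer for statically sized bignums.
{.passL: "-lgmp".}

type Limb = culong  # mp_limb_t is typically unsigned long on LP64

proc mpn_add_n(rp, s1p, s2p: ptr Limb, n: clong): Limb
  {.importc, header: "<gmp.h>".}

proc add256(a: var array[4, Limb], b: array[4, Limb]): Limb =
  ## a <- a + b over 4 limbs; returns the carry-out.
  ## GMP allows the destination to alias the first operand.
  mpn_add_n(addr a[0], addr a[0], unsafeAddr b[0], 4)

var x = [high(culong), 0.culong, 0.culong, 0.culong]
echo add256(x, [1.culong, 0.culong, 0.culong, 0.culong])  # carry 0, x[1] = 1
```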
But then you might want your library to work at compile-time as well
This is a specific topic I'm very interested in personally. A big frustration I have with extant languages is how most of the techniques for manipulating the boundary between comptime and runtime -- generics, static[], specialization of such, etc. -- are invariably pretty limiting/buggy to work with compared to what I think they could/should be. I've been toying with some proof-of-concepts (compiling to Zig instead of C) to explore the design space here. I want to be able to write code for Matrix(T: type): type which has run-time rows/cols fields in its representation, and then be able to specialize the type itself by assigning static comptime values to some fields -- also trying to figure out how to promote slices into static arrays etc. when possible. The idea is for fields to be automatically lifted out of the runtime representation into the comptime representation, like we currently would write Matrix[n: static[..], m: static[..], T]. I absolutely hate having to write code for the same algorithm twice: once using runtime parameters, possibly branching on those to more specialized cases, and then again in order to use const generics etc.
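In current Nim terms, the duplication I mean looks roughly like this (all names here are illustrative):

```nim
type
  Matrix[T] = object
    rows, cols: int            # dimensions live in the runtime value
    data: seq[T]
  StaticMatrix[R, C: static int; T] = object
    data: array[R * C, T]      # dimensions lifted into the type

proc trace[T](m: Matrix[T]): T =
  ## Runtime version: squareness is a runtime check.
  assert m.rows == m.cols
  for i in 0 ..< m.rows:
    result = result + m.data[i * m.cols + i]

proc trace[R: static int; T](m: StaticMatrix[R, R, T]): T =
  ## Same algorithm again: squareness is now enforced by the type.
  for i in 0 ..< R:
    result = result + m.data[i * R + i]

let a = Matrix[int](rows: 2, cols: 2, data: @[1, 2, 3, 4])
let b = StaticMatrix[2, 2, int](data: [1, 2, 3, 4])
echo a.trace, " ", b.trace   # 5 5
```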
For division, expect days of debugging; everyone gets it wrong, even Google in Go and JavaScript
Incidentally, it was recently discovered that bigints has a bug in its division implementation: https://github.com/nim-lang/bigints/issues/123
A big frustration I have with extant languages is how most of the techniques for manipulating the boundary between comptime and runtime -- generics, static[], specialization of such, etc. -- are invariably pretty limiting/buggy to work with compared to what I think they could/should be. I've been toying with some proof-of-concepts (compiling to Zig instead of C) to explore the design space here.
For bigint, Stint works and is tested at compile-time: https://github.com/status-im/nim-stint/blob/f366239/tests/test_uint_addsub.nim#L201-L202
But I don't think you can do the same with an arbitrary-precision lib: detecting aliasing is very important for optimization, to understand which buffers to use/reuse, and there are no pointers/addresses at compile-time.
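For illustration, here is a minimal sketch of a bigint kernel that also runs in the VM because it sticks to indexing and never takes an address (names are illustrative, not from Stint or Constantine):

```nim
proc addLimbs(r: var seq[uint64], a, b: seq[uint64]): uint64 =
  ## r <- a + b limb-wise (equal lengths); returns the carry-out.
  ## Uses only indexing, so it works at compile-time too, where
  ## addr/unsafeAddr and pointer tricks are unavailable.
  assert a.len == b.len
  r.setLen(a.len)
  var carry = 0'u64
  for i in 0 ..< a.len:
    let s = a[i] + b[i]                        # may wrap (mod 2^64)
    let c1 = if s < a[i]: 1'u64 else: 0'u64    # carry out of a + b
    r[i] = s + carry
    let c2 = if r[i] < s: 1'u64 else: 0'u64    # carry out of + carry
    carry = c1 + c2
  carry

static:  # evaluated by the Nim VM at compile-time
  var r: seq[uint64]
  doAssert addLimbs(r, @[high(uint64)], @[1'u64]) == 1
  doAssert r == @[0'u64]
```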
I want to be able to write code for Matrix(T: type): type which has run-time rows/cols fields in its representation, and then be able to specialize the type itself by assigning static comptime values to some fields -- also trying to figure out how to promote slices into static arrays etc. when possible.
I don't see where this is worthwhile: