I tried to benchmark the binary splitting algorithm for computing factorials with nim-lang/bigints.
There are more complex (and faster) algorithms, but they require fast polynomial multiplication and multipoint evaluation, which is a lot to implement (and I do not know of any algebra libraries in Nim that provide this).
Here is my code (any feedback would be appreciated):
import bigints
import std/[times, strutils]

template benchmark(benchmarkName: string, code: untyped) =
  block:
    let t0 = epochTime()
    code
    let elapsed = epochTime() - t0
    let elapsedStr = elapsed.formatFloat(format = ffDecimal, precision = 3)
    echo "CPU Time [", benchmarkName, "] ", elapsedStr, "s"

const one = 1.initBigInt

# Naive approach
func factorial(n: int): BigInt =
  ## Computes the factorial of n, i.e. the product of the first n integers
  result = one
  for i in 1 .. n:
    result *= initBigInt(i)

func product(a, b: BigInt): BigInt =
  ## Computes the product of the integers in the range a to b (inclusive)
  if b == a:
    return a
  elif b == (a + one):
    return a * (a + one)
  else:
    let m = (a + b) div 2.initBigInt
    return product(a, m - one) * product(m, b)

proc factorialBinarySplitting(n: int): BigInt =
  ## Computes the factorial of n, i.e. the product of the first n integers
  ## Uses binary splitting to keep the factors of each multiplication balanced
  return product(one, initBigInt(n))

proc echoResults(function: proc (n: int): BigInt) =
  ## Echoes the output of a factorial function for small values
  echo "2!", " ", function(2)
  echo "3!", " ", function(3)
  echo "4!", " ", function(4)
  echo "5!", " ", function(5)
  echo "24!", " ", function(24)

proc testResults(factorial: proc (n: int): BigInt) =
  ## Tests the output of a factorial function for small values
  doAssert factorial(2) == 2.initBigInt
  doAssert factorial(3) == 6.initBigInt
  doAssert factorial(4) == 24.initBigInt
  doAssert factorial(5) == 120.initBigInt
  doAssert factorial(24) == "620448401733239439360000".initBigInt

echo "Naive factorial test"
echoResults(factorial)
testResults(factorial)
echo ""
echo "factorial with Binary Splitting test"
echoResults(factorialBinarySplitting)
testResults(factorialBinarySplitting)

benchmark "Naive factorial":
  discard factorial(123456)
benchmark "Factorial with binary splitting":
  discard factorialBinarySplitting(123456)
I get these timings (with the -d:release flag):
CPU Time [Naive factorial] 8.873s
CPU Time [Factorial with binary splitting] 3.892s
Without the -d:release flag, I get around 20 s and 15 s respectively, so I really appreciate the optimization from that flag! These timings are high compared to Python, which implements Karatsuba multiplication (I get around ~1 s and 230 ms for 123456! with the naive and binary splitting versions, and only 23 ms with the dedicated function in Pari/GP, for example). I would like to know how you would actually compute a big factorial (n > 20) with Nim? (The std/math function is limited by the int type: 21! is greater than 2^64 and does not fit in an int.)
Do you use other tools for benchmarking? I am not sure I understand how profiling works with Nim. I should probably repeat the tests rather than running them once, but with such timings I am not patient enough to get more precision. I have not yet read in detail the benchmarking of the sieve of Eratosthenes on the blog's website; I know it is closely related and may hint at more optimizations.
If you did any similar benchmarking, I would be glad to know.
Performance is surely better with stint than with bigints. Has anyone tried?
I use recursion for binary splitting, but maybe I could use seqs to make the algorithm iterative. I am not sure it would improve performance; it might even do the contrary. The idea is that we multiply the factors two by two, obtain a new list of factors, multiply those two by two again, and so on, halving the size of the list/sequence each time. At the end, we obtain the product of all the integers from the start. I think the memory management of the list would cancel out any gain from the reduction in the binary size of the factors in each multiplication.
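The pairwise-reduction idea described above can be sketched in a few lines. This is a Python illustration of the iterative algorithm (names are mine, not from any library in the thread):

```python
def factorial_pairwise(n: int) -> int:
    # Start from the list of factors 1..n (or [1] for n = 0).
    factors = list(range(1, n + 1)) or [1]
    while len(factors) > 1:
        # Multiply adjacent factors two by two, halving the list each pass.
        nxt = [factors[i] * factors[i + 1] for i in range(0, len(factors) - 1, 2)]
        if len(factors) % 2 == 1:
            nxt.append(factors[-1])  # carry a leftover odd element to the next pass
        factors = nxt
    return factors[0]
```

Like the recursive version, each pass multiplies numbers of roughly equal size, which is what makes binary splitting pay off once sub-quadratic multiplication is available.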
Apart from that, my questions were:
1) What are the best tools to benchmark Nim? @awr1 thanks, I will look into it. @gemath thanks, I did not find any explanation of how to interpret the NimProf output, but it is well explained in your link.
2) How would you actually compute a factorial efficiently in Nim? @gemath answered me, thanks. I know; we have not implemented Karatsuba in the bigints library yet, so only basecase multiplication is used (which is very slow). We also did not make the greatest design choice (uint32 instead of uint64 as the base type). Those bindings seemed old (incomplete and unmaintained) to me, and they would probably be very slow on Windows, as GMP has not been developed with Windows in mind. That defeats, in some sense, the cross-platform goal of Nim. Maybe we should use Boost on Windows?
3) my frustration that Nim is slower than Python.
Here is an implementation using bignum, benchmarked with benchy.
import std/strutils
import benchy
import bignum

template one(): Int = newInt(1)

#[
template benchmark(benchmarkName: string, code: untyped) =
  block:
    let t0 = epochTime()
    code
    let elapsed = epochTime() - t0
    let elapsedStr = elapsed.formatFloat(format = ffDecimal, precision = 3)
    echo "CPU Time [", benchmarkName, "] ", elapsedStr, "s"
]#

# Naive approach
func factorial(n: int): Int =
  ## Computes the factorial of n, i.e. the product of the first n integers
  result = one
  for i in 1 .. n:
    result *= newInt(i)

func product(a, b: Int): Int =
  ## Computes the product of the integers in the range a to b (inclusive)
  if b == a:
    return a
  elif b == (a + one):
    return a * (a + one)
  else:
    let m = (a + b) div 2.newInt
    return product(a, m - one) * product(m, b)

proc factorialBinarySplitting(n: int): Int =
  ## Computes the factorial of n, i.e. the product of the first n integers
  ## Uses binary splitting to keep the factors of each multiplication balanced
  return product(one, newInt(n))

proc echoResults(function: proc (n: int): Int) =
  ## Echoes the output of a factorial function for small values
  echo "2!", " ", function(2)
  echo "3!", " ", function(3)
  echo "4!", " ", function(4)
  echo "5!", " ", function(5)
  echo "24!", " ", function(24)

proc testResults(factorial: proc (n: int): Int) =
  ## Tests the output of a factorial function for small values
  doAssert factorial(2) == 2
  doAssert factorial(3) == 6
  doAssert factorial(4) == 24
  doAssert factorial(5) == 120
  doAssert factorial(24) == "620448401733239439360000".newInt

echo "Naive factorial test"
echoResults(factorial)
testResults(factorial)
echo ""
echo "factorial with Binary Splitting test"
echoResults(factorialBinarySplitting)
testResults(factorialBinarySplitting)

timeIt "Naive factorial":
  keep factorial(123456)
timeIt "Factorial with binary splitting":
  keep factorialBinarySplitting(123456)
I did not find any explanations of how to interpret the NimProf output
Not sure about it either; I use Valgrind. On Windows, Intel VTune might work; I saw somebody mention it here on the forum.
Maybe we should use Boost for Windows platform ?
Unfortunately, the one really fast backend supported by boost-multiprecision is GNU MP...
Maybe one of the other multiprecision libraries (MPFR, MPIR, MPC) is fast enough and easily installable on Windows. Otherwise, one would have to bite the bullet and install GNU MP via Cygwin or use Microsoft's own Linux runtime thingy (forgot its name).
It defeats in some sense the cross-platform goal of Nim.
Easy interfacing with C/C++ libraries is one of the main benefits of Nim. If one uses that, the result is as portable as the C/C++ library; that is not Nim's fault. The best solution would obviously be a Nim implementation of Karatsuba (and of the other algorithms used for longer numbers).
I've implemented the same in Theo (https://github.com/SciNim/theo, renamed Megalo, for "number-theory").
import ../theo/[
  datatypes, io_hex, io_int, op_mul, op_init, op_comparisons
]
import std/[times, strutils]

template benchmark(benchmarkName: string, code: untyped) =
  block:
    let t0 = epochTime()
    code
    let elapsed = epochTime() - t0
    let elapsedStr = elapsed.formatFloat(format = ffDecimal, precision = 3)
    echo "CPU Time [", benchmarkName, "] ", elapsedStr, "s"

# Naive approach
func factorial(n: uint): BigInt =
  ## Computes the factorial of n, i.e. the product of the first n integers
  result.setOne()
  var buf: BigInt
  for i in 1'u .. n:
    buf.fromInt(i)
    result.mul(result, buf)

proc echoResults(function: proc (n: uint): BigInt) =
  ## Echoes the output of a factorial function for small values
  echo "2!", " ", function(2).toHex()
  echo "3!", " ", function(3).toHex()
  echo "4!", " ", function(4).toHex()
  echo "5!", " ", function(5).toHex()
  echo "8!", " ", function(8).toHex()
  echo "10!", " ", function(10).toHex()
  echo "20!", " ", function(20).toHex()
  echo "21!", " ", function(21).toHex()
  echo "22!", " ", function(22).toHex()
  echo "23!", " ", function(23).toHex()
  echo "24!", " ", function(24).toHex()

echo "Naive factorial test"
echoResults(factorial)

benchmark "Naive factorial":
  discard factorial(123456)
I get 3.942s on my i9-11980HK CPU, without binary splitting.
The main difference is the use of uint64 by default. This is not just a x2 improvement: a bigint that requires 10 limbs with uint64 needs 20 limbs with uint32, but multiplication and division are O(n²) algorithms, so this x2 difference in limb count becomes a x4 difference in speed.
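That scaling argument can be checked with a toy helper (mine, not from any of the libraries discussed) that counts the single-limb products done by schoolbook multiplication:

```python
def limb_mults(bits: int, limb_bits: int) -> int:
    # Number of limbs needed to hold a `bits`-bit integer (ceiling division).
    limbs = -(-bits // limb_bits)
    # Schoolbook multiplication does one limb-product per (i, j) pair.
    return limbs * limbs
```

For a 640-bit number, 64-bit limbs give 10 limbs and 100 limb-products, while 32-bit limbs give 20 limbs and 400 limb-products: twice the limbs, four times the work.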
Then there is GC allocation pressure. In the line result *= initBigInt(i), *= is actually lowered to a = a * b, which means a temporary to store a*b, a move into a, and the destruction of the old a; the same goes for initBigInt(i) inside the loop. You should avoid allocating at all costs within a hot loop, and it is really important for performance that a bigint library offers an API like proc binaryOp(r: var BigInt, a, b: BigInt) that can reuse buffers and deal with r aliasing a or b or both (for doubling/squaring).
There are more complex (and faster) algorithms, but they require fast polynomial multiplication and multipoint evaluation, which is a lot to implement (and I do not know of any algebra libraries in Nim that provide this).
Polynomial multiplication is actually simpler than bigint multiplication. If you look into the integer representation, it is like this:
a₀ + a₁·2⁶⁴ + a₂·(2⁶⁴)² + a₃·(2⁶⁴)³ + ... + aₙ·(2⁶⁴)ⁿ
A polynomial is represented like this:
a₀ + a₁x + a₂x² + a₃x³ + ... + aₙxⁿ
The only difference is that integers can carry from the 2⁶⁴ position into the (2⁶⁴)² position. Polynomials can't, so you just need to implement integer multiplication without carries and you have polynomial multiplication.
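A small Python sketch of this observation (illustrative names, little-endian limb/coefficient order assumed): the polynomial product is the inner double loop, and the integer product is the same loop plus carry propagation.

```python
def poly_mul(a, b):
    # Schoolbook product of coefficient lists; coefficients never carry.
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def int_mul(a, b, base):
    # Same loop as poly_mul, then normalize: propagate carries between limbs.
    out = poly_mul(a, b) + [0]
    for k in range(len(out) - 1):
        out[k + 1] += out[k] // base
        out[k] %= base
    return out
```

In base 10 (little-endian), [1, 2] is 21 and [3, 4] is 43; int_mul gives [3, 0, 9, 0], i.e. 903 = 21 * 43.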
Multipoint evaluation usually requires an FFT over the integers / a Number Theoretic Transform. I'm not too sure how hard it is to implement for generic integers, but I have one for modular arithmetic with a prime modulus: https://github.com/mratsim/constantine/blob/c2eb42b/research/kzg_poly_commit/fft_fr.nim#L54-L186
@enthus1ast Thanks, your compilation flags did improve the timings!
with -d:release:
CPU Time [Naive factorial] 6.429s
CPU Time [Factorial with binary splitting] 3.698s
with nim c -d:danger --opt:speed -d:lto -r factorial.nim:
CPU Time [Naive factorial] 3.458s
CPU Time [Factorial with binary splitting] 1.269s
@mratsim I tried your code:
with -d:release:
CPU Time [Naive factorial] 7.989s
with nim c -d:danger --opt:speed -d:lto -r factorial.nim:
CPU Time [Naive factorial] 4.878s
I get faster execution times, but no hasty conclusion can be drawn: I only tried one factorial computation.
The main difference is the use of uint64 by default. This is not just a x2 improvement: a bigint that requires 10 limbs with uint64 needs 20 limbs with uint32, but multiplication and division are O(n²) algorithms, so this x2 difference in limb count becomes a x4 difference in speed.
The algorithms for addition and multiplication are not the same, though. The advantage of using uint32 on a 64-bit architecture is the ability to handle overflow easily: we can multiply two limbs using the base types and handle the overflow afterwards, without tests, by a simple mask on the 32 upper bits. I think we would like to change the base type in bigints, but that would require rewriting the library from scratch. You already said that twice in the bigints GitHub issue tracker; we should definitely take it into account. GC allocation pressure might be easier to handle.
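A Python sketch of the masking trick described above (illustrative; in the library this would be done with native uint64 arithmetic on uint32 limbs):

```python
MASK32 = 0xFFFFFFFF  # low 32 bits of a 64-bit product

def mul_limb(a: int, b: int, carry_in: int = 0):
    # For 32-bit a and b, the full product plus carry always fits in 64 bits,
    # so no overflow test is needed: mask out the low limb, shift out the carry.
    p = a * b + carry_in
    return p & MASK32, p >> 32  # (low limb, carry out)
```

Even in the worst case, 0xFFFFFFFF * 0xFFFFFFFF plus a 32-bit carry stays below 2⁶⁴, which is exactly why 32-bit limbs make overflow handling trivial on a 64-bit machine.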
Multipoint evaluation usually requires FFT for integers / Number Theoretic Transform. I forgot about this. This is definitely quite technical: one has to handle the limbs correctly and find the thresholds between algorithms. The Karatsuba algorithm is not hard to implement; I have a PR in which it is implemented correctly, but it does not work at compile time yet, and I have not handled the switch from basecase to Karatsuba multiplication. The complex part is handling the comparison of limbs.
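For reference, here is a minimal Karatsuba sketch on Python integers (illustrative only; a limb-based implementation like the one in that PR must additionally manage limb arrays, comparisons, and a tuned threshold):

```python
def karatsuba(x: int, y: int, threshold_bits: int = 64) -> int:
    # Base case: small enough for native (schoolbook) multiplication.
    if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    x1, x0 = x >> half, x & ((1 << half) - 1)  # split x into high/low halves
    y1, y0 = y >> half, y & ((1 << half) - 1)
    z2 = karatsuba(x1, y1)
    z0 = karatsuba(x0, y0)
    # Three recursive multiplications instead of four.
    z1 = karatsuba(x1 + x0, y1 + y0) - z2 - z0
    return (z2 << (2 * half)) + (z1 << half) + z0
```

The recombination step is where limb-level code gets tricky: z1 involves a subtraction of multi-limb values, which is where the limb comparisons mentioned above come in.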
There are many reasons for it, but the Python developers refused to natively implement multiplication algorithms better than Karatsuba. You can look at the following Python forum post: https://discuss.python.org/t/faster-large-integer-multiplication/13300 Tim Peters, who implemented Karatsuba, says that he regrets adding it to Python; he prefers specialized libraries like decimal. bigints does not aim to replace standard int multiplication either, and it is better kept as a dedicated library.
Polynomial multiplication is actually simpler than bigint multiplication. If you look into the integer representation, it is like this:
a₀ + a₁·2⁶⁴ + a₂·(2⁶⁴)² + a₃·(2⁶⁴)³ + ... + aₙ·(2⁶⁴)ⁿ
A polynomial is represented like this:
a₀ + a₁x + a₂x² + a₃x³ + ... + aₙxⁿ
The only difference is that integers can carry from the 2⁶⁴ position into the (2⁶⁴)² position. Polynomials can't, so you just need to implement integer multiplication without carries and you have polynomial multiplication.
Yes, that is true, but the implementation depends on the base field. If you multiply polynomials with coefficients in GF(2^k) with k <= 64, this can be done very efficiently within a single limb.
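A tiny sketch of the GF(2) case (Python, for illustration): addition of coefficients is XOR, so multiplying two bit-packed polynomials is a shift-and-XOR loop with no carries at all. Hardware offers this directly (e.g. the x86 PCLMULQDQ carry-less multiply instruction).

```python
def clmul(a: int, b: int) -> int:
    # Carry-less multiplication: polynomials over GF(2) packed into integers.
    acc = 0
    while b:
        if b & 1:
            acc ^= a  # XOR instead of +: coefficient addition in GF(2)
        a <<= 1       # shift = multiply by x
        b >>= 1
    return acc
```

For example, (x + 1)² over GF(2) is x² + 1 (the cross terms cancel), i.e. clmul(0b11, 0b11) gives 0b101, whereas ordinary integer multiplication of 3 * 3 would give 9 because of the carry.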