Hello all
I was trying to optimize some modular arithmetic in a C++ program, and I had the idea to try it in Nim first. The idea is to pack two (or more) small integers into a 64-bit integer, perform a bunch of arithmetic, and check after each addition or multiplication whether one of them has grown too large as an ordinary natural number before performing the modulus. This is a pretty easy way to get parallel modular arithmetic on the cheap, and I think it's pretty common in some high-performance computation software.
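To make the scheme concrete, here is a minimal sketch of the packing idea in Nim (my own illustration, assuming 30-bit segments and a modulus of 2^10; it is not the benchmarked code below):

# Two small residues share one 64-bit word, each in a 30-bit segment.
# The top bit of each segment acts as a guard: when it becomes set,
# the segment has grown too large and must be reduced modulo 2^10.
const
  segBits = 30
  guard = (1'i64 shl (2*segBits - 1)) or (1'i64 shl (segBits - 1))

proc pack(x, y: int64): int64 = (x shl segBits) or y
proc unpack(p: int64): (int64, int64) =
  (p shr segBits, p and ((1'i64 shl segBits) - 1))

var p = pack(101, 7)
p += pack(127, 256)            # one add updates both residues at once
if (p and guard) != 0:         # only reduce when a guard bit is set
  let (x, y) = unpack(p)
  p = pack(x mod 1024, y mod 1024)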
In Nim, the unoptimized version took about 30x longer:
$ time ./test_packed_row
101 7
127 256
101 7
real 0m0.701s
user 0m0.695s
sys 0m0.004s
$ time ./test_unpacked_row
101 7
127 256
101 7
real 0m20.289s
user 0m19.918s
sys 0m0.105s
Great! Much better than I expected. Now to implement it in C... but in C, the un-optimized version took only about 3.5x longer.
$ time ./test_packed_row_c
101 7
real 0m0.729s
user 0m0.720s
sys 0m0.005s
$ time ./test_unpacked_row_c
101 7
real 0m2.635s
user 0m2.616s
sys 0m0.011s
So the un-optimized Nim version is mysteriously 10 times slower than the un-optimized C version. Does anyone know what causes this? Is it the mod operator? As far as I can tell, that's the only operation the optimized version performs much less often.
Here's the un-optimized Nim code. I compiled both Nim versions with the -d:release option.
let
  m2 = 10
  m1 = 3*m2
  b1 = 127
  b2 = 256
  mul = 102
var
  a1 = 101
  a2 = 7
echo a1, " ", a2
echo b1, " ", b2
echo "multiplier: ", mul
let stop = 1000000000
for each in 1..stop:
  a1 += (mul*b1)
  a2 += (mul*b2)
  a1 = a1 mod (1 shl m2)
  a2 = a2 mod (1 shl m2)
echo a1, " ", a2
Here's the un-optimized C code.
#include <stdlib.h>
#include <stdio.h>

int main() {
    long long m2 = 10;
    long long m1 = 3*m2;
    long long b1 = 127;
    long long b2 = 256;
    long long mul = 102;
    long long a1 = 101;
    long long a2 = 7;
    for (long i = 0; i < 1000000000; ++i) {
        a1 += (mul*b1);
        a2 += (mul*b2);
        a1 = a1 % (((long long)1) << m2);
        a2 = a2 % (((long long)1) << m2);
    }
    printf("%lld %lld\n", a1, a2);
}
I can supply the optimized versions if needed, but I don't think that's necessary.
If it matters, the compiler is GCC-8.1.0, on an old-ish MacBook Pro (13-inch, Mid 2012).
Here's the un-optimized Nim code.
Can you show us the optimized Nim code too, so that we can see if we can reproduce this large discrepancy?
Can you show us the optimized Nim code too...
Sure. I left it out because the post was already long.
let
  m2 = 10
  m1 = 3*m2
  b1 = 127
  b2 = 256
  mul = 102
  a1 = 101
  a2 = 7
  a = (a1 shl m1) + a2
  b = (b1 shl m1) + b2
echo a1, " ", a2
echo b1, " ", b2
echo a div (1 shl m1), " ", a mod (1 shl m1)
echo b div (1 shl m1), " ", b mod (1 shl m1)
echo "max in segment: ", (1 shl m1) - 1
echo "work modulo: ", (1 shl m2)
echo "basic number: ", a
echo "multiplier: ", mul
echo "adding: ", (mul*b) div (1 shl m1), " ", (mul*b) mod (1 shl m1)
var c = a
let mask = (1 shl (2*m1 - 1)) + (1 shl (m1 - 1))
let stop = 1000000000
for each in 1..stop:
  c += mul*b
  let overflow = c and mask
  if overflow != 0:
    let r1 = (c div (1 shl m1)) mod (1 shl m2)
    let r2 = (c mod (1 shl m1)) mod (1 shl m2)
    c = (r1 shl m1) + r2
echo (c div (1 shl m1)) mod (1 shl m2), " ", (c mod (1 shl m1)) mod (1 shl m2)
In case you want it, here's the optimized C code.
#include <stdlib.h>
#include <stdio.h>

int main() {
    long long m2 = 10;
    long long m1 = 3*m2;
    long long b1 = 127;
    long long b2 = 256;
    long long mul = 102;
    long long a1 = 101;
    long long a2 = 7;
    long long a = (a1 << m1) + a2;
    long long b = (b1 << m1) + b2;
    long long c = a;
    long long mask = (((long long)1) << (2*m1 - 1)) + (((long long)1) << (m1 - 1));
    for (long i = 0; i < 1000000000; ++i) {
        c += mul*b;
        long long overflow = c & mask;
        if (overflow != 0) {
            long long r1 = (c / (1 << m1)) % (1 << m2);
            long long r2 = (c % (1 << m1)) % (1 << m2);
            c = (r1 << m1) + r2;
        }
    }
    printf("%lld %lld\n", (c / (1 << m1)) % (1 << m2), (c % (1 << m1)) % (1 << m2));
}
Use const instead of let when you can, or just build with --implicitStatic:on. Measurements on my computer:
./optimized_nim 1.01s user 0.00s system 99% cpu 1.012 total
./optimized_c 1.00s user 0.00s system 99% cpu 1.010 total
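For concreteness, that suggestion amounts to declaring the invariants as compile-time constants, something like:

const
  m2 = 10      # known at compile time, so the backend sees literal
  m1 = 3*m2    # values and can fold expressions such as 1 shl m2
  b1 = 127
  b2 = 256
  mul = 102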
Is it the mod operator?
That seems very likely.
For
a1 = a1 % (((long long)1) << m2);
it is easy for the compiler to detect that you are doing a modulo by a power of two, which is a plain bit-masking operation. (Well, when signed numbers are involved, an additional bit test may be necessary.)
For
a1 = a1 mod (1 shl m2)
shl() is a proc call with an unknown result, so there is no chance for the backend compiler to discover the pattern, and a full division is done, which is very slow compared to most other operations. See Agner Fog.
So you may try a mask or shift op in Nim. Modulo by 2^x should be the same as an and-mask with (2^x - 1), so maybe
a1 = a1 and ((1 shl m2) - 1)
Well, that is more of a guess this early in the morning; you may think about it yourself. Be careful when numbers are signed; casts may be necessary. Such masking ops are generally simpler with unsigned numbers.
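A tiny illustration of the signedness caveat (my own toy example): Nim's mod truncates toward zero, so for negative operands it disagrees with plain masking.

let mask = (1 shl 10) - 1                 # 1023
echo 2000 mod 1024, " ", 2000 and mask    # 976 976  -- they agree
echo -5 mod 1024, " ", -5 and mask        # -5 1019  -- they differ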
Modulo by 2^x should be the same as an and-mask with (2^x - 1)...
In the actual application my modulus will be unknown at compile time, so I can't apply any optimizations like that. But if I understand you, you're saying that this might be an artifact of my using a modulus that is a power of 2, because the C compiler recognizes the pattern in the handwritten C, but not in the C generated from Nim.
To test this, I changed the modulus to 500. It does slow down the un-optimized version, but only slightly:
$ time ./test_unpacked_row_c
101 7
real 0m4.821s
user 0m4.789s
sys 0m0.019s
The optimized C version remains at the same speed, as does the optimized Nim code. The unoptimized Nim code now runs at about the same speed as the unoptimized C version:
$ time ./test_unpacked_row
101 7
127 256
101 7
real 0m4.394s
user 0m4.324s
sys 0m0.024s
But if I change it back to (1 shl m2), I get the same explosion in time. So it looks to me as if the bigger problem is Nim's shl operator.
PS The numbers are unsigned for modulo arithmetic, or should be (if you work only with positives). I'm getting the same results when I try it in several languages, so either I'm making the same error in several languages or that's not a problem here. :-)
In the actual application my modulus will be unknown at compile time
I didn't say that the modulus should be a compile-time constant; a plain (1 << x) with a positive integer x should be enough for the C compiler to see that the result is a power of two, so the division can be replaced by masking.
So it looks to me as if the bigger problem is Nim's shl operator.
Yes, in some way, as the compiler knows nothing about its result. Maybe we can try a static variant of shl() for the case shl(1, x), where the first argument is the constant 1? Is shl() at least inlined? (Recently I discovered that the ^ operator is not, and there is no optimization for x ^ 2, so x * x is much faster. I created an issue for this...)
I didn't say that the modulus should be a compile-time constant
I'm pretty sure I misunderstood what you said, but I also think you misunderstood what I said. But never mind; we seem to be on the same wavelength now.
But my guess is that Nim will not emit (1 << n) for shl(1, n) in the C backend.
I looked at the generated C code and saw this:
TM_hYZ6NrhnQO9a1dUvVg04ZAw_9 = modInt(a1_a9aE9aeHBWPDQIgzGTl1UCoA, (NI)((NU64)(((NI) 1)) << (NU64)(m2_S5vIBaVSdEQjyT9byyzUwlg)));
I don't see what the issue is at all.
I modified the Nim code further:
import rdstdin, strutils

let m3 = parseInt(readLineFromStdin("Input the modulus: "))
# ...
a1 = a1 mod m3
a2 = a2 mod m3
This should eliminate all the compiler optimizations you've mentioned, right? Yet the disparity in time remains.
That is interesting, and maybe the Nim devs can comment on it later.
But why don't you first try to avoid mod at all in your Nim code? At least for your first example you can replace mod with an and operation, as in the sketch below; then Nim's speed should be very similar to C's. For your final code, where you need a true modulo, we would have to investigate why Nim needs it and whether C can avoid it.
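For the first example, that suggestion would look something like this (an untested sketch of the modified benchmark):

# The power-of-two modulus becomes a bit mask hoisted out of the loop,
# so the loop body contains no division at all.
let
  m2 = 10
  b1 = 127
  b2 = 256
  mul = 102
  stop = 1000000000
  mask = (1 shl m2) - 1
var
  a1 = 101
  a2 = 7
for each in 1..stop:
  a1 = (a1 + mul*b1) and mask
  a2 = (a2 + mul*b2) and mask
echo a1, " ", a2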
C code is generally contained in functions, so that is what C compilers are optimized for.
Nim wraps the program in a main procedure already, so that in itself can't be the problem. For example, Nim's translation of the program I started this thread with has this function:
int main(int argc, char** args, char** env) {
    cmdLine = args;
    cmdCount = argc;
    gEnv = env;
    NimMain();
    return nim_program_result;
}
There has to be something deeper than that.
You may read the comments of Andreas and Mr. Felsing, for example here:
https://forum.nim-lang.org/t/3788
But as I remember, there is no real explanation.
shl, mod and other basic functions are translated to the raw << and % operators. The C compiler can do the same division-by-magic-numbers or constant-propagation optimizations for both Nim and C.
On the other hand, the way to do fast modular arithmetic is to avoid the modulo operation entirely: test whether you would roll over, and add or subtract accordingly.
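A minimal sketch of that idea (my own illustration, not code from this thread): keep both operands reduced, and a single conditional subtraction replaces the division.

# Mod-free modular addition: since a < m and b < m, the sum is < 2m,
# so at most one subtraction of m is needed. `result < a` catches the
# case where the addition wrapped around the uint64 range.
proc addMod(a, b, m: uint64): uint64 =
  result = a + b
  if result >= m or result < a:
    result -= m

echo addMod(900, 500, 1000)   # 400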
There has to be something deeper than that.
The problem isn't the code, it's the variables. If you don't put them in a function, they are global variables, which limits the optimizer/code generator. And obviously, once you have them as a function's local variables, the code must be put inside that function in order to access them, too.
Part of the problem is that the code generated by Nim does not make module-local variables static (otherwise, the optimizer could infer that they are de facto local). But that's difficult to do, as templates and inline procs couldn't access them if they were actually static.
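In Nim, the usual workaround is to wrap the hot code in a procedure so the variables become C locals; a sketch, applied to the original benchmark:

proc main() =
  # Everything here compiles to local variables in the generated C,
  # which the optimizer can freely keep in registers.
  let
    m2 = 10
    b1 = 127
    b2 = 256
    mul = 102
  var
    a1 = 101
    a2 = 7
  for each in 1 .. 1000000000:
    a1 = (a1 + mul*b1) mod (1 shl m2)
    a2 = (a2 + mul*b2) mod (1 shl m2)
  echo a1, " ", a2

main()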
The problem isn't the code, it's the variables. If you don't put them in a function, they are global variables, which limits the optimizer/code generator.
That makes sense; thank you. Someone once told me something similar about JVM programming. Still, why isn't it a problem in the packed version? Even now, I haven't wrapped it in a procedure, yet it outperforms the unpacked one. It's only the unpacked version that shows the problem.
Variables in Nim outside of procedures are accessible from any procedure in the same module. I think that is why they need to be converted to global variables in the C code.
https://godbolt.org/ shows you the generated assembly code, which might help you find out why some code is slow.
Here is C code that is a simplified version of your code. It seems the compiler optimized the % operator to an and instruction.
I moved all the local variables to global variables. The % operator becomes an idiv instruction, and the values of the variables used in the loop are loaded into registers before the main loop. The compiler might be expecting that these variables could be changed before main is called. Or the compiler doesn't know what printf does, and it might think printf could change these global variables and call the main function again.
When I add static to these variables, the generated asm code looks almost the same as the local-variable version.
Still, why isn't it a problem in the packed version?
It's always difficult to tell what an optimizer does, but I suspect it has to do with the fact that the other version doesn't mutate any global variables other than c. (Both each and overflow are not globally visible.) So there isn't much to be gained by putting c in a register.
C semantics allow you to assume that even global variables do not spontaneously change their values (unless marked volatile), so code using global variables in a read-only fashion allows you to apply normal optimization techniques, such as common subexpression elimination and loop-invariant code motion, to them.
In principle, you can also infer some things about global variables that do get mutated, but those optimization techniques are harder and/or more expensive.
All this is pure speculation, though. I'd have to look at the generated object code. It's always difficult to predict what a modern optimizer will do to any given piece of code.