So I'm cleaning up some code, and came across this problem.
I simplified one line of code, which caused a 4+ second performance hit.
I have an array of bytes defined as a: seq[uint8], initialized to all 0s. Inside a loop, called millions of times, I'm setting the LSB to 1 when required.
The first way I did it was like this:
a[k] = a[k] or 1
Then I said, oh, I can simplify that to this:
a[k] = 1
Yes, it gives the correct answers, but now the program runs 4+ seconds slower with just this one change to the entire codebase.
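For context, here is a minimal sketch of the hot loop (simplified, not the actual twins_sieve proc; the proc name markLoop is made up, and seg, k, Kn, and prime follow the names in the generated C shown further down):
# Minimal sketch of the inner loop, not the real twins_sieve code.
proc markLoop(seg: var seq[uint8]; start, prime, Kn: int) =
  var k = start
  while k < Kn:
    seg[k] = seg[k] or 1   # original version
    # seg[k] = 1           # "simplified" version, 4+ seconds slower overall
    k += prime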
Is this a Nim problem or a GCC problem?
I compiled it with Nim 0.17.2 and 0.18.0, and with GCC 4.9.2, 7.3.0, and 7.3.1, and the behavior is consistent.
I looked at the generated C code for both versions. They have exactly the same structure (reference names differ) and exactly the same total lines of code (LOC).
Here's the C code generated for the two Nim snippets.
/* generated for: a[k] = a[k] or 1 */
while (1) {
  if (!(k < Kn)) goto LA18;
  seg->data[k] = (NU8)(seg->data[k] | ((NI) 1));
  k += prime;
} LA18: ;

/* generated for: a[k] = 1 */
while (1) {
  if (!(k < Kn)) goto LA18;
  seg->data[k] = ((NU8) 1);
  k += prime;
} LA18: ;
So why does the 2nd snippet produce a binary that's 95 bytes bigger and 4+ seconds slower?
I guess more people would take your post seriously if you did not use terms like "the program runs 4+ seconds slower".
What does that mean when the total runtime is maybe a few hours?
And you did not tell us your CPU type and OS (32- or 64-bit), or whether it's an ARM, MIPS, or x86 processor.
Generally, the result is not so surprising. A MOV or an XOR assembly instruction needs roughly the same cycle count, see
http://www.agner.org/optimize/instruction_tables.pdf
And finally, you have to look at the assembly code to see the C compiler magic.
For an example of that magic, you may read one of the latest entries on the great Lemire blog:
https://lemire.me/blog/2018/04/12/for-greater-speed-try-batching-your-out-of-cache-data-accesses/
I am still wondering about the results.
EDIT: s/XOR/OR/
Is this a Nim problem or a GCC problem?
Neither. Modern CPUs are complex beasts and sometimes it's impossible to say what is going on. The or version is slightly longer; this means different instruction-cache behaviour and/or branch-prediction behaviour.
seg->data[k] = (NU8)(seg->data[k] | ((NI) 1));
// vs:
seg->data[k] = ((NU8) 1);
Well, in the first case, you use the NI-sized or operator, as NU8 or NI == NI. Int-sized operations often tend to be faster than their int8-sized analogues (especially if vectorization is involved!). Additionally, if _mm_shuffle_epi8 were used, the int->int8 conversion would be blazingly fast.
As for the second example, if the compiler knew the k range is divisible by 4, it could use ((NU32)(((NU8)1)<<24 | ((NU8)1)<<16 | ((NU8)1)<<8 | ((NU8)1))) as a single int32 value and do an ultra-fast memory write (certainly faster than the first variant). BUT does it really know it is divisible by 4?
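To illustrate the kind of write combining I have in mind, here is a hand-written Nim sketch (illustrative only; the proc name fillOnes is made up, and it assumes a stride of 1 and a 4-byte-aligned start, which is not the case in the sieve, where the stride is prime):
# Illustrative only: packing four byte stores of 1 into one 32-bit store,
# i.e. the 0x01010101 value mentioned above. Assumes a stride of 1 and that
# addr seg[i] is 4-byte aligned; the sieve's stride is `prime`, so the
# compiler cannot apply this there.
proc fillOnes(seg: var seq[uint8]; start, Kn: int) =
  var i = start
  while i + 4 <= Kn:
    cast[ptr uint32](addr seg[i])[] = 0x01010101'u32
    i += 4
  while i < Kn:        # handle any remaining tail bytes one at a time
    seg[i] = 1
    inc i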
In order to be taken super, suPER, SUPER seriously, let me be hermetically sealed, and scientifically rigorous, in presenting this problem.
Base Hardware:
System76 laptop with Intel i7-6700HQ CPU, 2.6-3.5 GHz clock, 4 cores, 8 threads, 16GB of memory, and a 128GB SSD.
Base OS:
Linux kernel 4.11.11, gcc 4.9.2|clang 3.9.1, PCLinuxOS 64-bit.
(This is a pre-Meltdown/Spectre-patch system.)
VirtualBox (VB) OSs: Linux kernels 4.15.x-4.16.x, gcc 7.3.0/7.3.1, clang 3.9.1-6.0
(These are post-Meltdown/Spectre kernel-patched systems.)
The results shown here are from my base system, but the problem is consistent with any combination of kernel and gcc or clang on the VirtualBox-based systems for 64-bit Linux distros.
Here are the gists of the tested code:
The only difference between the two files is line 239 in twins_sieve.
twinprimes_test.nim
https://gist.github.com/jzakiya/e140e9f3d660059631b2bb09487220f9
twinprimes_test1.nim
https://gist.github.com/jzakiya/8f7768c8c9f6e925b200c5f463a2f95c
They were compiled with flags as follows, and run on a quiet system.
(Rebooted, opened only a terminal, and ran tests.)
nim c --cc:gcc --d:release --threads:on twinprimes_test.nim
nim c --cc:gcc --d:release --threads:on twinprimes_test1.nim
then run
echo 500_000_000_000 | ./twinprimes_test
echo 500_000_000_000 | ./twinprimes_test1
echo 1_000_000_000_000 | ./twinprimes_test
echo 1_000_000_000_000 | ./twinprimes_test1
then compile as
nim c --cc:clang --d:release --threads:on twinprimes_test.nim
nim c --cc:clang --d:release --threads:on twinprimes_test1.nim
and run
echo 500_000_000_000 | ./twinprimes_test
echo 500_000_000_000 | ./twinprimes_test1
echo 1_000_000_000_000 | ./twinprimes_test
echo 1_000_000_000_000 | ./twinprimes_test1
Whether compiled with gcc or clang, as the input values become bigger, twinprimes_test1's times become increasingly slower relative to twinprimes_test, approaching roughly 10% for the two data points shown. For bigger inputs the difference grows larger.
(On a good note, I was pleasantly surprised to see clang has faster times for this particular architecture, at least on my base system, as it had always been slower.)
Input Number |     twinprimes_test     |    twinprimes_test1     |
             | gcc 4.9.2 | clang 3.9.1 | gcc 4.9.2 | clang 3.9.1 |
-------------+-----------+-------------+-----------+-------------+
5e11         |   28.926  |    28.241   |   31.759  |    30.970   |
1e12         |   63.285  |    61.042   |   67.842  |    66.678   |
(times in seconds)
Even though the devs have exhibited an extreme lack of curiosity (willful blindness) toward acknowledging this problem, even as just a user you should pay attention to it.
I only discovered this behavioral phenomenon through serendipity. How many places in your (or Nim's) codebase have similar occurrences of just this code pattern lurking, unbeknownst to you, with their potential performance hit?
This is also, obviously, a potential security vector.
I've already identified the Nim source code difference that (re)produces the problem. I've already identified the C output the compiler created from that Nim source code. What is needed is a forensic analysis of the assembly code differences, which I don't know how to do, nor really have the inclination (or time) to do even if I did.
It would also obviously be interesting (and rigorous) to see if|how this phenomenon exists on different hardware (AMD, ARM, PowerPC, etc.) and OS (BSD, Windows, macOS, iOS, Android, etc.) systems.
If the devs don't have even a basic level of intellectual inquisitiveness (pride?) to understand why this phenomenon exists (and would ultimately have to fix it), I don't know what more data, motivation, or incentive is needed.
Mr jzakiya,
this is indeed an interesting observation, and in your last post you gave a very good description of it.
But your attitude is still a bit strange. As you already found out, Nim generates fully valid C code, similar to what a C programmer might write. So it is more a problem of the C compilers and the CPU hardware. (Maybe the Intel or Microsoft compiler backends would create better-optimized code.)
You really should not assume that the Nim devs will spend much time on your observation, as they are busy preparing v1.0 -- and you seem not to be a major monetary sponsor.
Maybe the gcc or clang devs would be interested in your observation, but those people generally need minimal hand-written code samples to investigate issues.
If the devs don't have even a basic level of intellectual inquisitiveness (pride?) to understand why this phenomenon exists (and would ultimately have to fix it), I don't know what more data, motivation, or incentive is needed.
Shrug. I'm sorry, I know so much about problems like these that I'm unwilling to spend time on them.
https://youtu.be/2EWejmkKlxs?t=1401
Chandler Carruth talks for half an hour about a problem like this without having a real explanation for the effects he measured.
@jzakiya
So you DID inspect C code differences and also checked another compiler, but you still blame Nim, not gcc, for that time difference? Even to the point of calling it a security issue?
So you DID inspect C code differences and also checked another compiler, but you still blame Nim, not gcc, for that time difference? Even to the point of calling it a security issue?
Excuse me @Udiknedormin, maybe you aren't fluent in English? I have not blamed Nim or gcc. I have asked the question: is it a Nim or a gcc problem? Since gcc and clang produce the same characteristic behavior for the same Nim code (at least on my Intel i7 system), maybe it's something else. I don't know. To me it's an interesting anomaly that needs fuller investigation.
Also, I merely made the generic statement that this could become a security issue. Any time you have code that behaves in an unexpected way, someone may be able to exploit it. Why do you think Google, Mozilla, et al. pay bounties for people to find security bugs in their code? (In fact, Mozilla funded the creation of (and uses) Rust to eliminate a whole class of coding problems afflicting its browser code.)
It would be nice if you all chilled out, stopped being so defensive, and focused on the technical merits and implications of the issue.
Excuse me @Udiknedormin, maybe you aren't fluent in English?
This is my last warning. Watch your tone.