Language | Time in s | compiled with |
---|---|---|
Nim | 0.456 | nim cpp -d:release fib.nim |
Nim | 3.845 | nim c -d:release fib.nim |
C++ | 3.843 | g++ -O3 -o fib fib.cpp |
C | 3.902 | gcc -O3 -o fib fib.c |
Why in the world is the cpp-compiled Nim version so much faster?
Then I tried a slightly different version of the Nim code:
```nim
proc fib(n: uint64): uint64 =
  if n > 2.uint64: return fib(n - 1) + fib(n - 2)
  return n

echo fib(46)
```
Language | Time in s | compiled with |
---|---|---|
Nim | 0.450 | nim cpp -d:release fib.nim |
Nim | 3.209 | nim c -d:release fib.nim |
It's nice to see Nim perform so well here - but why is there a gap between the C and C++ compilation of the same code?
On my end, nim cpp is twice as slow as nim c.
Oh and btw, whichever version of code you use — the result is not correct ;) Try running fib(2) and you'll see what I'm talking about.
I also got nim c's code running twice as fast as nim cpp's code. See some previous discussion: https://forum.nim-lang.org/t/1779 - there are cases where Nim's generated C code "inspires" the C compiler to "unroll the last few recursions" and save 4x or 8x or maybe even 16x on the number of funcalls (depending on the unroll amount). It's not quite constant folding. Disassembling and looking for callq can be helpful here.
I think the result is correct by the more old school definition starting from "1 1 2 3 5" (with indexing starting at zero), not "0 1 1 2 3 5" as per https://en.wikipedia.org/wiki/Fibonacci_number { or maybe the indexing is not what @miran expects, but he uses a ";-)" }. Anyway, there is sort of "more than one right answer" in a couple of ways depending on index origin or 0 inclusion.
If Nim compiles to C, Nim is not faster than C, but Nim is able to produce faster C code than human programmers do.
Effectively making it faster than C ;)
I am not quite sure what "original" benchmark you mean - perhaps the 4.6 s for Nim vs 6.0 s for C on the original github page at the top of this thread. That may be small scale code layout/caching/branch predictor effects as you say.
However, the difference in this thread of 8X is almost certainly not compile-time evaluation of fib(46), nor a coincidence, nor some generalizable Nim is effectively faster than C { though ;) is noted! :-) }. Compile-time evaluation would likely be many thousands of times or more faster than that - faster even than the memoized/linearized versions of the calculation which are already O(1000x) faster. Indeed, one gets "too many iterations" trying to const x = fib(46) from the Nim VM.
The merely ~8X faster observed twice in this thread is something else which I have already explained in the prior thread and in this thread linking to that. And that explanation/approach does generalize a little, and even across programming languages. Certain iterative/recursive code that does very little work like simple adds or swaps { permutations, I'm looking at you :-) } can often be sped up by either manually or automatically unpacking/unrolling the last few recursions so that more work is done on a per-function call basis or said otherwise there are fewer function calls.
No, gcc doesn't -- at least not for me, with a factor of about 7X, almost identical to @adrianv's results. I used both an earlier gcc-7 and a gcc-8.2 just now.
If you don't want to disassemble your binary, that's ok. You can check this with gcc -pg and gprof which will count calls for you.
I also gave a complete example in C that does the optimization I mention in the prior thread which I will reproduce here because the new forum code hides it behind "load more posts". With fib(30) I get 118793 calls. With fib(40) I get 14612525 calls. With even a low fib(20) I get 950 calls. Does 950 calls seem like compile-time calculation to you? With fib(46) I get 262211393 calls.
```c
#include <stdio.h>

float fibonacci(int x) {
    return x < 2 ? (float)x : fibonacci(x - 1) + fibonacci(x - 2);
}

void NimMainInner() {
    printf("%.0f\n", fibonacci(46));
}

int main() {
#ifdef FIB_SLOW
    NimMainInner();
#else
    void (*volatile inner)(void) = NimMainInner;
    inner();
#endif
    return 0;
}
```
If one simply calls NimMainInner() directly instead of through the volatile function pointer inner(), fib(20) generates 10942 calls instead of 950.
I understand my example uses floating point, not integer arithmetic, but @adrianv's numbers sure make it seem like the same situation. @adrianv can also compile with -pg and run gprof to check. Or he could adapt this C and compile it the same way his nim.cfg or whatever compiles his Nim-generated C.
I'm not quite sure what more evidence you would require. It isn't hard to calculate how many function calls doubly recursive Fibonacci requires.
@cblake thanks for the explanation. I will check with the profiler. From what you said, I tried this and got almost the same results - now for gcc and g++, and for clang a speed-up of about 4x.
```nim
proc fib(n: int): uint64 =
  if n > 1 and n != 30: return fib(n - 1) + fib(n - 2)
  if n <= 1: return 1
  let x {.global.}: auto = fib(29) + fib(28)
  return x

echo fib(46)
```
As a side note: reordering the ifs brought about a 50% speedup.

I think the result is correct by the more old school definition starting from "1 1 2 3 5" (with indexing starting at zero), not "0 1 1 2 3 5"
The OP's code doesn't produce either of those sequences.
there is sort of "more than one right answer" in a couple of ways depending on index origin or 0 inclusion.
This code doesn't produce any of the right answers.
There have been fewer complaints lately, and in some ways that's a sign of a growing community, which is good, but resurrecting zombie threads with irrelevant side points is a pet peeve of others. So, @jibal may want to work on being less easily triggered, or else open new threads & link back in the future. Since @jibal did resurrect this one to the top of the index and follow on to my attempt to appease him with (self-righteous|self-deprecating), defending-the-innocent justification, I felt a reply was warranted, but will add some hopefully interesting, non-redundant information along the way.
The only incorrect thing I said in the paragraph that @jibal quoted is to suggest OP.fib(0) is correct (even that is qualified by "I think", as in "I suspect", and that slight error is already granted just above). Just from the full text of the post @jibal quotes, I clearly and explicitly took @miran's fib(2) example to indicate an issue with index origin (meaning 0 1 1 2.. vs 1 1 2... or even 0 0 0 1 1 2...). OP.fib(n) is fine for n >= 1, matching the "1 1 2 3 5.." variant after the very first slot in the F(0)=F(1)=1 case (vs F(1)=F(2)=1 or whatever). "old school" may be wrong, since Fibonacci himself probably did F(1)=F(2)=1, but that's pretty vague. Maybe the "new school" is the "old school", etc. It's a common enough gotcha to sit right at the beginning of the wiki page I linked to.
Anyway, while it's true the OP boundary condition was wrong, and fib(0) should be 1 there, @miran's text suggests a different issue, since in 0-origin terms Fib(2)=2 as does the OP code. Sure, @miran may have meant both the missing 1 and the indexing shift, but, given the literal text of his reply, my quoted paragraph still seems appropriate. Many apologies to @miran if he knows for sure that he was offended at the time, if he can even remember his thoughts 2 years later - surely partly why zombie thread resurrection is disliked.
Besides resurrecting to correct my "maybe-miscorrection", @jibal's point is also irrelevant to the main question (which @miran very likely understood from his "btw", though more allowance for in-the-moment conversational flow is very natural and I'm not criticizing that). The main topic of this thread was cross-backend comparisons. You can call the 0 1 2 3 5.. series "Fibtastic" if you want, but Fibtastic(46) still equals 0-index-origin Fibonacci(46) and still had a performance mystery. { Fibtastic does one less recursion making it ~1.62x faster for large n, but that only matters for cross-impl comparisons if someone naively thought "fibtastic(46) == fibonacci0origin(46) => runtime should be equal". }
That returns us to the main response to the actual main question, to which I can add some more information that may help someone, somewhere, someday. You really need call counts because the work done per call can vary a lot depending on what backend compilers do, and that's become hard to predict/control. For example, in my FP C program above (yes, it does 0 1 1 2 3 5.., so 46 -> 47, depending/if you care, and yeah, C code in a Nim forum, but the backend has enough noise!), the time can vary tons from many minor code changes.

On gcc-10.2 -O3 on Gentoo Linux x86_64, glibc-2.32-r1, compiled and run as-is, it runs in 0.137 seconds now (i7-6700k at 4.7GHz), doing only 63041957 fibonacci calls. Changing the printf to exit((int)fibonacci(46) & 255); makes it 2.3x slower and does 148876861 calls (2.36x). Instead adding a global count variable, with count += 1 at the top of fibonacci and the count added to the printf in NimMainInner, gives 4.49 seconds (32.8x slower) and 1980780474 calls (31.42x). Etc., etc. Time ratios track call count well enough, but call count varies a lot -- and it's not just the C vs C++ backend and not just the NimMainInner call structure. Many things (possibly like @adrianv's if-reordering) can impact it. But what is it? I may not have explained it (or even understood it) well in the past.
Disassembling NimMainInner (for the fast variant) you can see gcc generated fib(42), fib(41), .. fib(36) or some such and does the (cheap) linear combination. So, what gcc is really doing is "partial inlining of recursion" (that's probably a better term than "unpacking" or "unrolling".. Sorry!). In this case, that pays huge dividends because the recursion cost of fib(42) is exponentially cheaper (base of golden ratio) than fib(46), and the others are cheaper still. gcc applies similar tactics inside the recursive function, compounding the benefit. (You can do a similar thing just by manually expanding the recursion yourself with a row of Pascal's triangle.)
It's distinct from the "tail call" -> iteration style transform (which, e.g., does not change code size much, regardless of the size of the loop) or the "remember the first 34 at compile-time" memoization, which uses more data state. A future compiler could surely inline even more levels or compose the memoization with partial inlining to create >>30x work variations, although 1.5 orders of magnitude is already a lot! These tricks also do not rely upon FP arithmetic or vectorization, though that helps gcc for me. So, this "benchmark" will only become more fraught with peril in the future, which would not be so bad except that it's so darn popular. In its 2nd life, maybe it can become a measure of these tricks, not "basic language costs" as it tends to be now...
The optimization also might apply to other recursive call scenarios that don't have the hyper efficient alternatives Fibonacci does, although rarely given how finicky it seems today. What has kept it so finicky? I don't know. Ask the gcc maintainers (or the clang people why they can't do it, if that's even still the case). I was honestly a little surprised the copy-paste of years old code worked for me here.
Now, maybe @adrianv's results came from something else. I never said I knew for sure, just that this was the best first thing to check. I could only say so much with the limited info and ability to reproduce available. The initial ratios, both my own and @miran's inability to get the same 8.5x ratio, and @adrianv's last-post timing variations all smell like this to me. I stand by my main advice: just Always Measure Calls As Well As Time. That's literally the most (only??) important point in this whole thread that deserves repeating far & wide. (It's also a good exercise to learn to compile the Nim-generated C code manually, to do PGO builds anyway.) @adrianv, if he's even still around Nim/the forum/etc., may well no longer even have access to the same build environment (none of which he told us) to reproduce his found 8.5x integer-code effect - yet another reason to dislike zombie threads. I know I sure don't have easy access to the ancient gcc-5 setup with which I first found the FP behavior myself.
With this I've said about as much on this topic as I likely will ever care to in one convenient link-back. Cheers to all & keep the threads fresh! ;-)
" So, @jibal may want to work on being less easily triggered "
GTH.