Language | Time in s | compiled with |
---|---|---|
Nim | 0.456 | nim cpp -d:release fib.nim |
Nim | 3.845 | nim c -d:release fib.nim |
C++ | 3.843 | g++ -O3 -o fib fib.cpp |
C | 3.902 | gcc -O3 -o fib fib.c |
Why in the world is the cpp-compiled Nim version so much faster?
Then I tried a slightly different version of the Nim code:
```nim
proc fib(n: uint64): uint64 =
  if n > 2.uint64: return fib(n - 1) + fib(n - 2)
  return n

echo fib(46)
```
Language | Time in s | compiled with |
---|---|---|
Nim | 0.450 | nim cpp -d:release fib.nim |
Nim | 3.209 | nim c -d:release fib.nim |
It's nice to see Nim perform so well here - but why is there a gap between the C and C++ compilation of the same code?
On my end, nim cpp is twice as slow as nim c.
Oh and btw, whichever version of code you use — the result is not correct ;) Try running fib(2) and you'll see what I'm talking about.
I also got nim c's code running twice as fast as nim cpp's code. See some previous discussion: https://forum.nim-lang.org/t/1779 - there are cases where Nim's generated C code "inspires" the C compiler to "unroll the last few recursions" and save 4x or 8x or maybe even 16x on the number of funcalls (depending on the unroll amount). It's not quite constant folding. Disassembling and looking for callq can be helpful here.
I think the result is correct by the more old school definition starting from "1 1 2 3 5" (with indexing starting at zero), not "0 1 1 2 3 5" as per https://en.wikipedia.org/wiki/Fibonacci_number { or maybe the indexing is not what @miran expects, but he uses a ";-)" }. Anyway, there is sort of "more than one right answer" in a couple of ways depending on index origin or 0 inclusion.
If Nim compiles to C, Nim is not faster than C, but Nim is able to produce faster C code than human programmers do.
Effectively making it faster than C ;)
I am not quite sure what "original" benchmark you mean - perhaps the 4.6 s for Nim vs 6.0 s for C on the original github page at the top of this thread. That may be small scale code layout/caching/branch predictor effects as you say.
However, the difference in this thread of 8X is almost certainly not compile-time evaluation of fib(46), nor a coincidence, nor some generalizable Nim is effectively faster than C { though ;) is noted! :-) }. Compile-time evaluation would likely be many thousands of times or more faster than that - faster even than the memoized/linearized versions of the calculation which are already O(1000x) faster. Indeed, one gets "too many iterations" trying to const x = fib(46) from the Nim VM.
The merely ~8X faster observed twice in this thread is something else which I have already explained in the prior thread and in this thread linking to that. And that explanation/approach does generalize a little, and even across programming languages. Certain iterative/recursive code that does very little work like simple adds or swaps { permutations, I'm looking at you :-) } can often be sped up by either manually or automatically unpacking/unrolling the last few recursions so that more work is done on a per-function call basis or said otherwise there are fewer function calls.
No, gcc doesn't -- at least not for me, with a factor of about 7X, almost identical to @adrianv's results. I used both an earlier gcc-7 and a gcc-8.2 just now.
If you don't want to disassemble your binary, that's ok. You can check this with gcc -pg and gprof which will count calls for you.
I also gave a complete example in C that does the optimization I mention in the prior thread which I will reproduce here because the new forum code hides it behind "load more posts". With fib(30) I get 118793 calls. With fib(40) I get 14612525 calls. With even a low fib(20) I get 950 calls. Does 950 calls seem like compile-time calculation to you? With fib(46) I get 262211393 calls.
```c
#include <stdio.h>

float fibonacci(int x) {
    return x < 2 ? (float)x : fibonacci(x - 1) + fibonacci(x - 2);
}

void NimMainInner() {
    printf("%.0f\n", fibonacci(46));
}

int main() {
#ifdef FIB_SLOW
    NimMainInner();
#else
    void (*volatile inner)(void) = NimMainInner;
    inner();
#endif
    return 0;
}
```
If one simply calls NimMainInner() directly instead of through the volatile function pointer inner(), fib(20) generates 10942 calls instead of 950.
I understand my example uses floating point, not integer arithmetic, but @adrianv's numbers sure make it seem like the same situation. @adrianv can also compile with -pg and run gprof to check. Or he could adapt this C and compile it the same way his nim.cfg or whatever compiles his Nim-generated C.
I'm not quite sure what more evidence you would require. It isn't hard to calculate how many function calls doubly recursive Fibonacci requires.
@cblake thanks for the explanation. I will check with the profiler. From what you said, I tried this and got almost the same results - now for gcc and g++, and for clang a speed-up of about 4x.
```nim
proc fib(n: int): uint64 =
  if n > 1 and n != 30: return fib(n - 1) + fib(n - 2)
  if n <= 1: return 1
  let x {.global.}: auto = fib(29) + fib(28)
  return x

echo fib(46)
```
As a side note: reordering the ifs brought about a 50% speedup.

I think the result is correct by the more old school definition starting from "1 1 2 3 5" (with indexing starting at zero), not "0 1 1 2 3 5"
The OP's code doesn't produce either of those sequences.
there is sort of "more than one right answer" in a couple of ways depending on index origin or 0 inclusion.
This code doesn't produce any of the right answers.
There have been fewer complaints lately, and in some ways that's a sign of a growing community, which is good, but resurrecting zombie threads with irrelevant side points is a pet peeve of others. So, @jibal may want to work on being less easily triggered, or else open new threads & link back in the future. Since @jibal did resurrect this one to the top of the index and follow on to my attempt to appease him with (self-righteous|self-deprecating), defending-the-innocent justification, I felt a reply was warranted, but will add some hopefully interesting, non-redundant information along the way.
The only incorrect thing I said in the paragraph that @jibal quoted is to suggest OP.fib(0) is correct (even that is qualified by "I think", as in "I suspect", and that slight error is already granted just above). Just from the full text of the post @jibal quotes, I clearly and explicitly took @miran's fib(2) example to indicate an issue with index origin (meaning 0 1 1 2.. vs 1 1 2... or even 0 0 0 1 1 2...). OP.fib(n) is fine for n >= 1, matching the "1 1 2 3 5.." variant after the very first slot in the F(0)=F(1)=1 case (vs F(1)=F(2)=1 or whatever). "old school" may be wrong, since Fibonacci himself probably did F(1)=F(2)=1, but that's pretty vague. Maybe the "new school" is the "old school", etc. It's a common enough gotcha to sit right at the beginning of the wiki page I linked to.
Anyway, while it's true the OP boundary condition was wrong, and fib(0) should be 1 there, @miran's text suggests a different issue, since in 0-origin terms Fib(2)=2 as does the OP code. Sure, @miran may have meant both the missing 1 and the indexing shift, but, given the literal text of his reply, my quoted paragraph still seems appropriate. Many apologies to @miran if he knows for sure that he was offended at the time, if he can even remember his thoughts 2 years later - surely partly why zombie thread resurrection is disliked.
Besides resurrecting to correct my "maybe-miscorrection", @jibal's point is also irrelevant to the main question (which @miran very likely understood from his "btw", though more allowance for in-the-moment conversational flow is very natural and I'm not criticizing that). The main topic of this thread was cross-backend comparisons. You can call the 0 1 2 3 5.. series "Fibtastic" if you want, but Fibtastic(46) still equals 0-index-origin Fibonacci(46) and still had a performance mystery. { Fibtastic does one less recursion making it ~1.62x faster for large n, but that only matters for cross-impl comparisons if someone naively thought "fibtastic(46) == fibonacci0origin(46) => runtime should be equal". }
That returns us to the main response to the actual main question, to which I can add some more information that may help someone, somewhere, someday. You really need call counts because the work done per call can vary a lot depending on what backend compilers do, and that's become hard to predict/control. For example, in my FP C program above (yes, it does 0 1 1 2 3 5.., so 46 -> 47, depending/if you care, and yeah, C code in a Nim forum, but the backend has enough noise!), the time can vary tons from many minor code changes.

On gcc-10.2 -O3 on Gentoo Linux x86_64, glibc-2.32-r1, compiled and run as-is, it runs in 0.137 seconds now (i7-6700k at 4.7GHz), doing only 63041957 fibonacci calls. Changing the printf to exit((int)fibonacci(46) & 255); makes it 2.3x slower and does 148876861 calls (2.36x). Instead adding a global count variable, with count += 1 at the top of fibonacci and the count added to the printf in NimMainInner, gives 4.49 seconds (32.8x slower) and 1980780474 calls (31.42x). Etc., etc. Time ratios track call count well enough, but call count varies a lot -- and it's not just the C vs C++ backend and not just the NimMainInner call structure. Many things (possibly like @adrianv's if-reordering) can impact it. But what is it? I may not have explained it (or even understood it) well in the past.
Disassembling NimMainInner (for the fast variant) you can see gcc generated fib(42), fib(41), .. fib(36) or some such and does the (cheap) linear combination. So, what gcc is really doing is "partial inlining of recursion" (that's probably a better term than "unpacking" or "unrolling".. Sorry!). In this case, that pays huge dividends because the recursion cost of fib(42) is exponentially cheaper (base of golden ratio) than fib(46), and the others are cheaper still. gcc applies similar tactics inside the recursive function, compounding the benefit. (You can do a similar thing just by manually expanding the recursion yourself with a row of Pascal's triangle.)
It's distinct from the "tail call" -> iteration style transform (which, e.g., does not change code size much, regardless of the size of the loop) or the "remember the first 34 at compile-time" memoization, which uses more data state. A future compiler could surely inline even more levels or compose the memoization with partial inlining to create >>30x work variations, although 1.5 orders of magnitude is already a lot! These tricks also do not rely upon FP arithmetic or vectorization, though that helps gcc for me. So, this "benchmark" will only become more fraught with peril in the future, which would not be so bad except that it's so darn popular. In its 2nd life, maybe it can become a measure of these tricks, not "basic language costs" as it tends to be now...
The optimization also might apply to other recursive call scenarios that don't have the hyper efficient alternatives Fibonacci does, although rarely given how finicky it seems today. What has kept it so finicky? I don't know. Ask the gcc maintainers (or the clang people why they can't do it, if that's even still the case). I was honestly a little surprised the copy-paste of years old code worked for me here.
Now, maybe @adrianv's results came from something else. I never said I knew for sure, just that this was the best first thing to check. I could only say so much with the limited info and ability to reproduce available. The initial ratios, both my own and @miran's inability to get the same 8.5x ratio, and @adrianv's last-post timing variations all smell like this to me. I stand by my main advice: just Always Measure Calls As Well As Time. That's literally the most (only??) important point in this whole thread that deserves repeating far & wide. (It's also a good exercise to learn to compile the Nim-generated C code manually, to do PGO builds anyway.) @adrianv, if he's even still around Nim/the forum/etc., may well no longer even have access to the same build environment (none of which he told us) to reproduce his found 8.5x integer-code effect - yet another reason to dislike zombie threads. I know I sure don't have easy access to the ancient gcc-5 setup with which I first found the FP behavior myself.
With this I've said about as much on this topic as I likely will ever care to in one convenient link-back. Cheers to all & keep the threads fresh! ;-)
" So, @jibal may want to work on being less easily triggered "
GTH.