original article: https://atilanevesoncode.wordpress.com/2018/12/31/comparing-pythagorean-triples-in-c-d-and-rust/
here's the Nim version I proposed: see https://github.com/atilaneves/pythagoras/pull/4
sample results from the article for simple version:
Simple            CT (ms)   RT (ms)
clang debug            59      599
clang release          69      154
dmd debug              11      369
dmd release            11      153
ldc debug              31      599
ldc release            38      153
rustc0 debug          100     8445
rustc0 release        103      349
rustc1 debug                  6650
rustc1 release                 217
looks like the Nim release build of simple.d compiles faster and runs a tiny bit faster than the D version (ldc release) of simple.d
one thing I don't like about the way I wrote the Nim version of simple.d is how I translated this:
for (int z = 1; ; ++z)
    foo
into this:
var z = 1
while true:
  foo
  z.inc
it's bad for 2 reasons: z leaks into the enclosing scope, and the loop variable's initialization, test, and increment end up scattered across three places, which adds cognitive load.
My first thought would be to make a simple infinite iterator function for z.
the boilerplate of that iterator code aside, I think this solves the scope problem and the cognitive load problem and makes it look more elegant, but I know iterators can be expensive. I wonder how expensive they are in this case...
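Roughly, an untested sketch of what I mean (the iterator name is just a placeholder):

iterator countFrom(start = 1): int =
  # yields start, start+1, start+2, ... forever
  var z = start
  while true:
    yield z
    inc z

for z in countFrom(1):
  foo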
For what it's worth: I c2nim'd simple.cpp and slightly adapted it to take a limit parameter (using i) that limits the number of computed triples. Compile time on my Ryzen box, using gcc as the backend, was around 1.6 s the first time and about 0.25 s for following runs (said Nim). Execution time of the release-compiled code was about 220 ms.
My main motivation was to follow a hunch, namely that this isn't about generators, lambdas or whatever but about a) a very poor algorithm (no surprise there; after all it's even called "simple" (as in "naive" I presume)) and b) careless loop nesting.
One of the major rules wrt performance is to be careful with loops. Another rule is to help the compiler by good hinting.
So, using a better algo plus having only 2 loops (and using a microsecond timer rather than crude time for measurement), I arrived at the following, which (release mode) does the job - incl. printing! (which amazed me. Kudos to the Nim team!) - in about 2 ms.
const plimit = 1000 # how many triplets to compute (1000 like in the blog code)

proc pyth3s(limit: int) =
  var found = 1
  var m: int = 2
  while true:
    var n: int = 1
    while n < m:
      let m1: int = m * m
      let n1: int = n * n
      let c: int = m1 + n1
      let a: int = m1 - n1
      let b: int = (m * n) shl 1
      echo $a & ", " & $b & ", " & $c
      if found >= limit:
        return
      found.inc
      n.inc
    m.inc

let t0 = getHRtime() # microsecond timer
pyth3s(plimit)
let t1 = getHRtime()
echo "Done. First " & $plimit & " pyth. triplets computed in " & $(t1 - t0) & " musec"
Thoughts:
Obviously reducing the loops to 2 is a very major performance improvement. But I think (might be wrong) that explicitly introducing m1 and n1 is helpful as a hint to the compiler. Moreover, limiting the use of vars to the reasonable minimum and preferring let plays to one of Nim's strengths. Yes, the code looks longer, but wrt performance the decisive factor isn't how it looks to a human but what the compiler can make of it.
Every language has nested loops. My view is that the original article about C++20 ranges conceived this test/benchmark to be about the cost, if any, of abstractions, not exactly the performance of "nested loops however you write it" as suggested by Timothee's code or "the fastest algorithm for generating Pythagorean triples" (see https://en.wikipedia.org/wiki/Pythagorean_triple for an explanation of moerm's much better algorithm - you should definitely use something like moerm's algo if you actually had a need for many triples for some reason).
A lot of what is creating variation in results here, and why I am posting this reply, is that this particular case is more sensitive than most to compiler optimizations, assumptions, code layout, etc. With the program below, I got a full 2.02x speed-up using gcc's profile-guided optimization (145.0 ms to 71.8 ms; both best of 10 trials on an i7-6700K at 4.85 GHz on Linux, gcc-8.2.0 with Gentoo patches rev6). Usually, I only get 1.10x to 1.15x speed-ups. Thus, this case is more sensitive to optimizations.
iterator toInfinity(n=1): int =
  var z = n
  while true:
    yield z
    inc(z)

iterator triples(n=1000): array[3, int] =
  var i = 0
  block all:
    for z in toInfinity(1):
      for x in 1 .. z-1:
        for y in x+1 .. z-1:
          if x*x + y*y == z*z:
            yield [x, y, z]
            inc(i)
            if i == n:
              break all

import times

proc main() =
  let t0 = now()
  for triple in triples():
    echo triple
  echo now() - t0

main()
When Timothee's current code is instrumented with times.now() and not compiled with PGO, it is 1.5x faster than the above program (96 ms). When both programs are similarly compiled with PGO, the performance basically matches, almost to within measurement error (68.13 ms best of 10 trials, but the noise/range over those trials was over 2 ms). (All I did was fiddle with the loop ranges.)
In other words, Nim's iterators seem to be a zero-cost abstraction, as advertised, at least when compiled with PGO on gcc in this use case. This also makes sense if you look at the C code Nim generates -- it's just 3 nested loops anyway. Maybe one could pick nits that 71.8 vs 68.13 is not quite zero. Fair enough, but 5% is much less than the 2x variation sensitivity from PGO. Heck, given the giant 2x sensitivity, one of those "genetic algorithm to optimize the optimization flags" approaches could probably do even better than PGO and have the two Nim versions jockey back & forth for fastest. Looking at the generated C code, that seems likely to me. The point of this post is that in this case, compilation strategy/optimization options matter more than people might expect (though several commenters did sort of seem to know that already).
For the curious, moerm's better algorithm with PGO runs in just 600 microseconds on the same machine, and 680 microseconds without PGO - a more typical 1.13x factor, and a much smaller effect than switching to the better algo in the first place.
Personally, I think it might be a little nicer to be able to use return to exit iterators rather than the block all: ... break all construct. I could see bug-finding arguments (iterators != proc/func) for not allowing that, though. Anyway, I think Nim mostly wins (or at least ties) the "elegance contest" here.
Also, for anyone trying to reproduce these effects/do profile-guided-optimization with Nim in general, the way I drive my PGO compilation is with a script that boils down to:
nim c -c -d:r PROG.nim
$cc -fprofile-generate $nimcache/r/PROG/*.c -o PROG
./PROG
$cc -fprofile-use $nimcache/r/PROG/*.c -o PROG
You may also need some -O3 -fwhole-program, -flto, -march=native in that $cc variable, and you definitely need at least -fno-strict-aliasing.
I fully agree on Nim indeed being a good language. My point though wasn't "I can do faster code than ...".
My point was that one should a) think about optimization starting from "what's actually the point and what's the bottleneck or the most promising approach?" (in this case it was "use a better algorithm") and also "how much optimization do I need and what is it worth? (e.g. in dev time)", b) avoid obvious problems (like nesting loops without need), and c) help the compiler by providing clear hints.
I also copied your code verbatim (modulo the now(); I prefer my own routine because I know it directly reads the monotonic clock), compiled it with the exact switches used by Timothee, and your code above took around 210 ms on my system (Debian, Ryzen (in a VM), gcc 8.2.0).
And I'm not surprised. While you are right and Nim has excellent iterators, the basic problem is still 3 loops and an if in the innermost loop (and a bad algo). Maybe my Ryzen is a bit more or a bit less sensitive than your CPU in that regard, but any kind of branching (loops, ifs) risks trashing the L1 cache, and often enough the L2 too.
And btw, Nim's iterators, as great as they are, are not zero cost. A single change in your code, namely replacing for z in toInfinity(1): with for z in 1 ..< 0x7FFFFFF:, made the code run almost 10% faster.
But I had another interesting point in my first post: Why use any language XYZ? Why not use, for instance, Python? What's the point, the difference? (Looking from the perspective I'm interested in here) the answer is: Python means "way less dev time than C but way slower code (runtime)". Where is Nim on that axis? That (imo) is an immensely important point and one where Nim really shines: you get development time not far from Python -and also- runtime speed very close to C's.
That's why I do not even expect or desire Nim to ever reach 100% of C's (runtime) speed. What I want is a language that makes it easy to think about and focus on my task and my algorithm, and still get near-C speed. From what I see, nobody and nothing gets even close to Nim on that crucial point, which is directly related both to productivity and to code quality - and I can still profile and optimize real hot spots in C.
@moerm: Your algorithm uses Euclid's formula, which (1) does not exhaustively enumerate all non-primitive Pythagorean triples (for example, it'll skip 9^2+12^2=15^2) and (2) does not enumerate them in the same order as the original algorithm. To get that right, you have to jump through a few additional hoops.
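For illustration (my own hedged sketch, not moerm's code or the D code mentioned below): Euclid's formula only produces primitive triples directly, and scaling each primitive triple by a multiplier k recovers the non-primitive ones such as 9, 12, 15 = 3 * (3, 4, 5). This sketch does nothing about matching the original output order.

import math  # for gcd

# Euclid's formula gives the primitive triples (m*m - n*n, 2*m*n, m*m + n*n)
# for m > n with gcd(m, n) == 1 and m - n odd; scaling by k = 1, 2, 3, ...
# also yields the non-primitive ones (e.g. k = 3 on (3, 4, 5) gives (9, 12, 15)).
proc euclidWithMultiplier(maxC: int) =
  for m in 2 .. maxC:
    for n in 1 ..< m:
      if gcd(m, n) == 1 and (m - n) mod 2 == 1:
        let (a, b, c) = (m*m - n*n, 2*m*n, m*m + n*n)
        if c > maxC: break   # c only grows with n for fixed m
        var k = 1
        while k * c <= maxC:
          echo k*a, ", ", k*b, ", ", k*c
          inc k

euclidWithMultiplier(15)   # includes 9, 12, 15, which the bare formula skips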
I threw this D code together a couple of days ago as a proof of concept, which is a bit more complex, but matches the output of the original implementation exactly. It should be pretty trivial to port to Nim for anybody who is interested. It will still beat the pants off the original, simply because that's what happens if you compare an O(n log n) time complexity algorithm to an O(n^3) one.
@cblake: Yes, the original blog post is about the cost of abstractions (or rather, being able to avoid that). However, once you're dealing with doubly or triply nested loops, algorithmic optimization begins to dominate compiler optimization, because the constant factor that compilers can give you quickly becomes secondary for O(n^k) time complexity with k >= 2 (I could write Ruby or Python code that would beat the C++ implementation for not-very-big values of n). And the composability of iterators/ranges is not adequate for the most interesting algorithmic optimizations.
My understanding of the original article was that it was about elegant abstractions and their costs. IMO Nim really shines here. This thread shows how low cost iterators really are in Nim. They are far cheaper in Nim than I thought they were. Kudos to the Nim core team!
@moerm also shows a tangential super power of Nim, which is that you not only get very low cost high level abstractions, you have enough expressiveness to do optimizations on par with C when you need to.
Oh, I got your point and tried to emphasize that. I wasn't arguing against you anywhere that I know of. I totally agree with your a,b,&c. I suspect we don't disagree on anything real at all, but are just discussing different aspects. I tried to express praise for your approach (if you happened to need to do this particular thing) and gave a reference link. :-) Please note that nothing I say below should be taken as any specific rebuttal to something you said.
My point was less about exactly zero vs non-zero abstraction cost (I even noted my own 5% variation) and more about the cost being very sensitive to compilation strategy, PGO as a particular example. I would say, looking at the generated C code, that this is not much about "Nim language iterator overhead aka 'cost'" and more just about "C compiler assembly code generation variability". That's an important distinction as slowness does not necessarily translate into "work for Nim core to improve anything". Backends vary and could change a bunch of stuff and be different in the next version of gcc/clang/whatever, for example. So, Nim kind of needs to target a "broad cross section" of what works well on the various backends, not overfit to one. (Not saying you said otherwise, just backing up importance of the distinction.)
You mentioned compiling my code on Ryzen, but not using gcc's PGO. PGO is mostly just a very strong version of your (c) hints idea where the compiler generates extra code to measure what it thinks will help it make better code generation decisions. (Yeah, a person or some GA picking optim flags/luck/etc. could all still maybe do even better sometimes as I already alluded to, but at ever increasing developer costs.)
To add a little more color quantitatively, I just did the same PGO experiment with the same version of gcc on a Ryzen 2950X clocked at 3.80 GHz and saw times go from 140 ms down to 85 ms (best of 10 trials for both). That's not quite the same 2x speed-up as for the Intel case. Yeah, maybe Ryzen has better CPU predictors or predictor warm-ups, or maybe it gets a bit luckier with the "non-PGO" -O3 yadda-yadda defaults, etc. Whatever the causes, 1.65x is still a lot more than the PGO boosts I usually see of 1.05x to 1.15x (such as with your algo, for example). { Of course, the boost is still not as good as a better algorithm. I never said so, nor meant to imply it. If a better algo is handy and speed matters then by all means use it! At least some of the time, a better algo will not be as easy to come by as PGO. }
Anyway, I realize that it's a tangent from your original interest in this problem, but if you're as serious about (c) as a general guideline as you seem, and if you haven't already, then you should look into PGO sometime. My bet is that for this problem and for most backend compilers that a PGO Nim would be faster than non-PGO "pure C". Even though such a goal is explicitly not your objective, I think it's still interesting. Nim is one of the very few languages I've tried that in "typical" practice is often "about as efficient as C" (with all the usual disclaimers as to algos, etc.)
PGO can also be very little developer time extra cost. Once you get a little script set up, you just type nim-pgo pythag './pythag > /dev/null' for some pythag.nim file instead of nim c -d:release pythag.nim. If the program had less output (say the sum of the c in c^2=a^2+b^2) the shell IO re-direct wouldn't even be there. For this program, it took me almost no time to try PGO. And, of course, YMMV - that 2x on i7-6700k effect is among the biggest PGO boosts that I've ever seen. That seemed noteworthy. So, I noted to y'all. :-)
In terms of comparing Ryzen timings, running in a VM should not make much difference on this problem, as it is a pure user-space CPU hot loop except for the time queries. That's really a best-case scenario for VMs. Of course, if your host OS is time-sharing the CPU between your VM instance and some other process, then that could slow things down a lot - arbitrarily much, really, if the host OS is overloaded. It might help to take the minimum time out of more trials. You should be able to get 100..200 ms of one core all to yourself at some point. :-) My guess is that you could get that 210 ms down to under 100 ms (not that it really matters for anything practical in this exact case).
A better algo is better, but a lot of people - not just in this forum - are, rightly or wrongly, discussing this problem in the context of "programming language performance" which is the only reason I bothered to write both of these posts. The clang/ldc/rust/etc. numbers were all more closely clustered than 1.65x of the Ryzen PGO experiment in Timothee's very first link (217 ms/153 ms = only 1.42x).
TL;DR - A) For this problem, compiler back-end generation effects dominate Nim front end overheads (and possibly front-end overheads for other languages) for which the huge PGO boost is mostly just supporting evidence, and B) PGO - try it someday, it just might make a bigger difference than you expect without much dev work on your part.
We are in agreement if I understand you correctly.
I don't care whether Nim code runs 0.5% or 3% slower than C code. In fact, I think that whole benchmarking is irrelevant except for a rough overview ("Nim is within x% of C's speed").
Reason (and C/C++/D developers might want to read this carefully): Looking at the code created by - even very good - C compilers, everyone with some not insignificant knowledge of ASM will concur that developing in ASM is the way to get the fastest code. Unfortunately though, ASM (next to some other issues like portability problems) also has the major disadvantage of being ridiculously far away from the problem domain. So, if we were serious about it, we would need to make my axis longer to fit ASM quite a bit beyond C, at the extreme end of runtime speed but also at an extreme distance from the problem domain.
In other words: if we really were obsessed with runtime speed, we should choose ASM over C - but we don't, among other things because we would pay with a very significant increase in development time, more bugs that are harder to spot, etc. So the reality is that C developers have already made a compromise, trading development speed for runtime speed. Nim developers do the same - but with a much, much better deal; we get a ton of saved dev time, fewer bugs, etc. for what in the end is a ridiculously low runtime price, even if it happened to be 5% less speed than C.
And btw - albeit largely theoretically right now (I assume) - we could even compensate and reach C's runtime speed, because the Nim compiler has much more information than a C compiler (almost always) has, due to factors like Nim's strong static typing, which allows the Nim compiler to generate C code that in turn lets the C compiler create faster code.
Those are all fine points. Asm can sometimes make a bigger difference than people conditioned not to question compiler output expect (and there are many such people). This is especially true with vector units. A few years back, I wrote an AVX2-vectorized "minimum" function that ran 24x (twenty four times) faster than a C one. That's a much bigger time ratio than most "typical use" comparisons of programming language pairs (though obviously most programs do more things than compute minima). Auto-vectorization in gcc (at least) has gotten better since then for that precise problem, though it can still miss many opportunities.
If you ever do need to write asm, those intrinsic functions like _mm256_add_ps (there are a hundred others) are often an easier entry point/way to integrate with your program than raw asm/.s files. You can use them from Nim, too. :-) See, e.g., https://github.com/numforge/laser README/code/etc. for how to use SIMD intrinsics.
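To make that concrete, here is a minimal hedged sketch (wrapper names are mine, and this is not how laser structures it) of calling one AVX intrinsic from Nim via the C backend, assuming gcc/clang and an AVX-capable CPU:

{.passC: "-mavx".}   # let the backend C compiler emit AVX code

# Import the 256-bit float vector type and a few intrinsics from immintrin.h.
type m256 {.importc: "__m256", header: "immintrin.h".} = object

proc mm256_loadu_ps(p: ptr float32): m256 {.importc: "_mm256_loadu_ps", header: "immintrin.h".}
proc mm256_add_ps(a, b: m256): m256 {.importc: "_mm256_add_ps", header: "immintrin.h".}
proc mm256_storeu_ps(p: ptr float32, a: m256) {.importc: "_mm256_storeu_ps", header: "immintrin.h".}

var x, y, z: array[8, float32]
for i in 0 ..< 8:
  x[i] = float32(i)        # 0 .. 7
  y[i] = float32(8 - i)    # 8 .. 1

# One 8-wide float add: z = x + y, all lanes at once.
mm256_storeu_ps(addr z[0],
                mm256_add_ps(mm256_loadu_ps(addr x[0]),
                             mm256_loadu_ps(addr y[0])))
echo z   # every lane is 8.0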