The YouTube channel Dave's Garage has a series where he is implementing a prime counting algorithm in many different languages to see how the implementations stack up, all the while talking mostly about how the languages differ from one another in the code. Nim, Zig, and Rust seem to be the all-stars in the stats right now according to his most recent video. The code is available for modification on this GitHub repo. The Nim code seems pretty well optimized to me at this point, but there may be room for improvement from various Nim experts.
I just wanted to bring this competition to the attention of the Nim community, as it may lead to a small influx of newcomers to the various forums and such as Nim gets more publicity from doing well on this channel's event.
I don't see the point of so many methods/dynamic dispatch there.
An easy optimization would be to use a seq[uint64] instead of seq[byte] to back the bitvector.
See my prime sieve: https://github.com/mratsim/number-theory/blob/8c071c89/src/primes.nim
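Roughly, that change looks like this (a minimal sketch with illustrative names, not the code from the repo): back the bit vector with 64-bit words, so indexing is i shr 6 for the word and i and 63 for the bit.

type BitSeq = object
  words: seq[uint64]

proc initBitSeq(bits: int): BitSeq =
  # (bits + 63) shr 6 == number of 64-bit words needed
  BitSeq(words: newSeq[uint64]((bits + 63) shr 6))

proc incl(b: var BitSeq; i: int) {.inline.} =
  b.words[i shr 6] = b.words[i shr 6] or (1'u64 shl (i and 63))

proc contains(b: BitSeq; i: int): bool {.inline.} =
  ((b.words[i shr 6] shr (i and 63)) and 1'u64) != 0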
Really cool, thanks for sharing!
It's surprising that he mentions V/Zig so much without even touching on Nim in the first video. I think that speaks to a deficiency in Nim's PR. Any thoughts why this might be?
I would push to get rid of the methods, surely there are languages without classes? What is the actual requirement/restriction here?
Any thoughts why this might be?
Both V and Zig sites mention simplicity as the very first thing. Nim is too complicated.
It uses a class to encapsulate the sieve, or a (closest) equivalent feature in your language if a class construct is not available. This class must contain the full state of the sieve. Each iteration should re-create a new instance of this class from scratch.
My understanding based on the above and the previous example is that this needs to be done using a ref type, but I could be wrong. I did not check whether the alternative implementations heap-allocate the class. I just followed the previous example, adding in a bit list (which, as mratsim pointed out, would be better backed by a larger primitive for speed).
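For illustration, something like the following is how I read that requirement in Nim (hypothetical names and a plain odds-only sieve, not the tuned code from the repo): the ref object holds the full sieve state and a fresh instance is created for every pass.

type PrimeSieve = ref object             # heap-allocated, holds the full sieve state
  limit: int
  composite: seq[uint64]                 # bit i stands for the odd number 2*i + 3

proc newPrimeSieve(limit: int): PrimeSieve =
  let oddCount = max(0, (limit - 1) div 2)
  PrimeSieve(limit: limit, composite: newSeq[uint64]((oddCount + 63) shr 6))

proc isComposite(s: PrimeSieve; i: int): bool {.inline.} =
  ((s.composite[i shr 6] shr (i and 63)) and 1'u64) != 0

proc setComposite(s: PrimeSieve; i: int) {.inline.} =
  s.composite[i shr 6] = s.composite[i shr 6] or (1'u64 shl (i and 63))

proc runSieve(s: PrimeSieve) =
  var p = 3
  while p * p <= s.limit:
    if not s.isComposite((p - 3) shr 1):
      var c = p * p                      # start culling at p squared
      while c <= s.limit:
        s.setComposite((c - 3) shr 1)
        c += 2 * p                       # odd multiples only
    p += 2

proc countPrimes(s: PrimeSieve): int =
  if s.limit < 2: return 0
  result = 1                             # count the prime 2
  for i in 0 ..< max(0, (s.limit - 1) div 2):
    if not s.isComposite(i): inc result

when isMainModule:
  # each benchmark pass re-creates the instance from scratch, per the rules
  let sieve = newPrimeSieve(1_000_000)
  sieve.runSieve()
  echo sieve.countPrimes()               # 78498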
In messing around with various GC options I have been observing somewhat better performance with both --gc:boehm and --gc:arc.
Can someone else validate this too?
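If anyone wants to try, commands along these lines should reproduce the comparison (assuming the source file is primes.nim; --gc:boehm needs the Boehm GC library installed):

nim c -d:danger --gc:arc -r primes.nim
nim c -d:danger --gc:boehm -r primes.nim
nim c -d:danger -r primes.nim      # default GC, for comparison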
There was a time when Nim didn't automatically turn on -d:release when -d:danger was used, so I've gotten into the habit of using both when I want both; it doesn't do any harm to redefine symbols which are already defined.
I'm fairly certain this was never the case: https://github.com/nim-lang/Nim/blame/devel/config/nim.cfg#L68. But maybe there is a bug I missed :)
No reason to use both or make people feel like they should use both "just in case".
Btw thanks for your work again, hope to see a YouTube video with a few "wows" over Nim's speed :D
@dom96:
there was a time when Nim didn't automatically turn on -d:release when -d:danger was used...
I'm fairly certain this was never the case: https://github.com/nim-lang/Nim/blame/devel/config/nim.cfg#L68. But maybe there is a bug I missed :)
Yes, I think it must have been a bug you missed, but I can't remember how far back it was there. IIRC it was something to do with how the .cfg files were handled.
No reason to use both or make people feel like they should use both "just in case".
True, the bug has long been fixed, so I have removed the -d:release define, since it is implied by the -d:danger that I kept.
Btw thanks for your work again, hope to see a YouTube video with a few "wows" over Nim's speed :D
No problem, it has been fun.
The "race" administrators have made me remove cache-sized page culling in order to be true to their "faithful to the base algorithm" specification, which has meant that the current version only runs about 18,000 passes per five seconds instead of the about 20,500 passes per five seconds using page culling (as timed on my machine). I did this because the other category classification is just a "lump" category into which is put all the implementations they don't know how to categorize and thus comparisons with those classifications of implementations is useless since some include various degrees of wheel factorization that the administrators didn't recognize as such.
Even though it is about twelve percent slower than before, my current contribution is still much faster than every other "faithful to base" version in any language on the site, with the nearest competitor the Rust "striped" implementation at about 10,700 passes in five seconds on my machine, followed by a C "faithful to base" version at about 8,625 passes per five seconds.
If that isn't "Wow!!!", I don't know what is ;-)
Of course, once my version is published, there isn't really any reason the smart C and/or Rust programmers can't do as I've done in their languages, once they understand my code (and my blog article may help with that understanding, if they read it).
Just checking in to say when I last played with sieves about 2 years ago (in Rust) it was @GordonBGood's post here and on StackOverflow that gave me the most info. Great to see the existing research in Nim's community is paying off again and again.
I was just exploring different ideas so my code was a total mess. I used a sequence of bit-words with bits corresponding to odd numbers only as the data. I recall one of the most enjoyable things was implementing pre-sieving for a few starting primes. I wanted to do it as a zipped iterator of "wheels" (cycled [infinite] iterators) of masks. The unexpected blocking question for me was: How to calculate the shift (i.e. the first bit that needs to be unset) for the given mask word of the given prime. I was asking around and nothing came back except raised eyebrows. Almost broke my head over this, but figured this out finally. Finding that roll variable below was sooo satisfying.
type Block = usize;

const BLOCK_BYTES: Block = std::mem::size_of::<Block>() as Block;
const BLOCK_BITS: Block = BLOCK_BYTES * 8;

// Map an odd number n to its bit index in the odds-only bit array.
fn n_to_id(n: Block) -> Block {
    (n - 1) / 2
}

fn presieve(mut primes: Vec<Block>) -> Vec<Block> {
    let first_primes = &[3, 5, 7, 11, 13, 17];
    // Toggles a set bit off; safe here because the fold starts from !0 and
    // each bit is hit at most once.
    let unset_bit = |word: Block, n: Block| word ^ (1 << n);
    // One "wheel" per small prime: the p mask words its pattern cycles through.
    let wheels = first_primes
        .iter()
        .map(|p| {
            (0..*p)
                .map(|n| {
                    let shift = n_to_id(*p); // bit index of the prime p itself
                    // First bit to unset in the n-th word of this prime's cycle.
                    let roll = (n * (p - BLOCK_BITS % p) + shift) % p;
                    (roll..BLOCK_BITS).step_by(*p).fold(!0, |w, b| unset_bit(w, b))
                })
                .collect::<Vec<Block>>()
        })
        .collect::<Vec<Vec<Block>>>();
    for (block, (mask3, (mask5, (mask7, (mask11, (mask13, mask17)))))) in primes.iter_mut().zip(
        wheels[0].iter().cycle().zip(
            wheels[1].iter().cycle().zip(
                wheels[2].iter().cycle().zip(
                    wheels[3].iter().cycle().zip(
                        wheels[4].iter().cycle().zip(wheels[5].iter().cycle()),
                    ),
                ),
            ),
        ),
    ) {
        *block &= mask3 & mask5 & mask7 & mask11 & mask13 & mask17;
    }
    // Mark back first primes 3-17 (bits for 3, 5, 7, 11, 13, 17 in the odds-only layout)
    primes[0] |= 0b101101110;
    primes
}
All of this just to avoid baking in the wheels by hand. :D Also, this approach is extensible to further wheels without any manual calculations. I think this is probably useless, but it was fun to concentrate on an isolated part of the bigger problem.
PS: Please, extend the list of languages enabled for the syntax highlighting in nimforum!
@Zoom:
With respect, I don't know what the point of your example code (in Rust) is. One can't use pre-sieving in the "race" or else it wouldn't be "faithful to the base algorithm", and pre-sieving is only part of the story for a "wheeled" version, as per my JavaScript article to which you linked.
Your Rust code actually resembles a recent C version in the "race" which got categorized as "other" because the site administrators perhaps couldn't see the relationship to wheel factorization.
But, yes, breaking through to solutions at whatever level is satisfying :-)
It's been quite some time since I've revisited this project, and the video on the results from the "Top Fuel" class hasn't been published yet, but looking at the results, especially on the fastest desktop AMD Ryzen 5950X, the speed rankings are clear: Mike Barber's Rust contributions, followed by my Nim contribution, followed by Zig, followed by my Haskell contribution, with a bunch of also-rans. All of the top contenders get their last boost in speed from a "dense culling" technique developed by Mike and me: when the composite-number bit-culling pattern fits within one (large) register, it culls by large register and only stores that register and the next one upon completing the prime-value culling sequence for that range, which for Rust, Nim, and Zig is then auto-vectorized by the compiler to produce SIMD instructions. The main reason the Haskell code isn't as fast is that GHC Haskell doesn't produce LLVM IR of a form where the LLVM optimizer/compiler can recognize the SIMD patterns, so that "best-case" optimization is not done, for a loss of about 15 percent in performance on that fast CPU.
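For readers who want to follow the idea, here is a much-simplified Nim sketch of the word-at-a-time culling part (illustrative only, not the contribution's actual loop, which also handles the register-pattern storing described above); a set bit here means "still a prime candidate":

proc denseCull(buf: var openArray[uint64]; p, startBit: int) =
  ## buf is a bit array (e.g. one bit per odd number); p is the small base
  ## prime's step in bits; startBit is the first composite bit to clear.
  var bit = startBit
  var w = bit shr 6
  while w < buf.len:
    var acc = buf[w]                     # load the word once
    while bit shr 6 == w:                # clear every hit inside this word
      acc = acc and not (1'u64 shl (bit and 63))
      bit += p
    buf[w] = acc                         # write it back once
    w = bit shr 6

when isMainModule:
  var bits = newSeq[uint64](4)           # 256 candidate bits
  for w in bits.mitems: w = not 0'u64    # start with every candidate marked prime
  denseCull(bits, 17, 0)                 # cull bits 0, 17, 34, ... as composite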
I could likely match or exceed the performance of the Rust contribution by changing the build process to use the Clang/LLVM back-end compiler rather than GCC and running the same optimizations in the same order as Rust does - this almost certainly would work, as I have experimented with it; however, this would then mean that the code would run slower on some CPUs, just as the Rust contribution does. If I tuned the build process to use the back end most appropriate to the specific CPU, it would likely not get past the Race committee.
An extra two percent or so of performance hardly seems worth the effort, especially as there is more to coding than just raw speed when it is this close: I encourage you to look at the form of the code in each of the top four languages, with the Zig code particularly hard to follow due to its use of comptime notations to avoid the need for macros in the language, and with the Haskell code concise but perhaps a bit hard to understand if one isn't a functional programmer, although I prefer it.
For those wondering (like I was) about how to get the actual results (you need to click the "Leaderboard" under the filters or just click the following link): https://plummerssoftwarellc.github.io/PrimeView/report?id=davepl-1641120327.json&hi=False&hf=False&hp=False&fi=&fp=mt&fa=wh~ot&ff=uf&fb=uk~ot&tp=False&sc=pp&sd=True
So what I'm curious about is: could we port the mike-barber-bit to Nim and get the same or better results? Or is there something else that's limiting the Nim solution from beating the Rust one?
@dom96:
it's really hard to gauge which language is truly faster among the top 10(20?30?). I suppose the take away is: they are all pretty much on par.
How do you see that? In order to qualify for actual comparison, contributions must be single-threaded/base/faithful-to-base, so from your link above for the fastest AMD Ryzen 5950X CPU (at 4.9 GHz turbo single-threaded), the rankings for the qualifying implementations are as follows for today:
4. Haskell: 58367 passes in five seconds, with the next closest being my contribution for the V language at 45657, and only my contributions for Chapel, Julia, and Crystal coming within half the speed of the leaders.
That seems pretty definitive to me that there are only four leaders and only four other languages that come anywhere close.
The relative results vary by about +/- 1000 from day to day for tests on this CPU due to variations in the run-time environment, such as applications in use, memory in use, order of use, state of the Garbage Collector if used, etc., with the variations slightly higher than usual because the tests run on a Hyper-V virtual machine, but the relative rankings stay about the same.
Some languages such as V and Chapel are held back by not having macros, so one can't easily implement the full "dense culling" algorithm for small base prime values up to 127; they are limited to base prime values up to only 19, with the dense culling code produced by a prefix code-generation step that stopped at that level because the code size was getting so large. For Julia, I had to limit the "dense" prime threshold to only 13 because, not being an Ahead-Of-Time (AOT) compiled language, it compiles the code just before it starts running, so too-large macro-generated code would increase the time one waits before the code runs. This is also why there aren't faster C/C++ versions: without true AST macros, one would have to resort to code generation, and no one has seemed interested enough to do that; if someone did it for them, they should rank somewhere close to the V language (or all the way to the top for huge generated code sizes, as from Nim's macros), at the cost of ugly code.
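To make the macro point concrete, here is a toy Nim sketch (purely illustrative, not the actual code generator used): it emits a fully unrolled chain of constant-mask statements for a single word at compile time, which is the kind of code that languages without AST macros have to produce with an external code-generation step.

import std/macros

macro cullWordFor(x: untyped; p: static int): untyped =
  ## Emits `x = x and <constant mask>` once for every multiple of `p`
  ## that falls inside a single 64-bit word, fully unrolled.
  result = newStmtList()
  var bit = 0
  while bit < 64:
    let mask = newLit(not (1'u64 shl bit))
    result.add newAssignment(copyNimTree(x), infix(copyNimTree(x), "and", mask))
    bit += p

when isMainModule:
  var w = not 0'u64
  cullWordFor(w, 17)   # expands to four `w = w and ...` statements at compile time
  echo w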