nimforum mirror - Fastest Prime Sieve, in Nim

BLM2 (orginal) [2019-06-20T00:12:59+02:00] view original

In June (2019) I updated the Nim code to perform the SSoZ for Twin Primes. The new code (with release of Nim 0.20.0) produces the fastest known times, at least for inputs up to 10e14, as compared to Primesieve 7.4: https://github.com/kimwalisch/primesieve

Here's the code: twinprimes_ssoz,nim

https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e

Here's the paper that explains the algorithm.

The Use of Prime Generators to Implement Fast Prime Sieves of Zakiya (SoZ), Applications to Number Theory, and implications for the Riemann Hypothesis

https://www.academia.edu/37952623/The_Use_of_Prime_Generators_to_Implement_Fast_Twin_Primes_Sieve_of_Zakiya_SoZ_Applications_to_Number_Theory_and_Implications_for_the_Riemann_Hypotheses

https://www.scribd.com/document/395415391/The-Use-of-Prime-Generators-to-Implement-Fast-Twin-Primes-Sieve-Of-Zakiya-SoZ-Applications-to-Number-Theory-and-Implications-for-the-Riemann-Hypoth

Here are timing comparisons to Primesieve on two (2) systems.

System 1


➜  ~ inxi -SCM
System:    Host: localhost.localdomain Kernel: 5.1.11-pclos1 x86_64 bits: 64 Desktop: KDE Plasma 5.16.1 Distro: PCLinuxOS 2019
Machine:   Type: Laptop System: System76 product: Gazelle v: gaze10 serial: <root required>
           Mobo: System76 model: Gazelle v: gaze10 serial: <root required> UEFI [Legacy]: American Megatrends v: 1.05.08
           date: 03/31/2016
CPU:       Topology: Quad Core model: Intel Core i7-6700HQ bits: 64 type: MT MCP L2 cache: 6144 KiB
           Speed: 1185 MHz min/max: 800/3500 MHz Core speeds (MHz): 1: 800 2: 800 3: 800 4: 800 5: 800 6: 800 7: 800 8: 800


            N          | twinprimes_ssoz |   primesieve  | Ratio |
    -------------------|-----------------|---------------|-------|
       100_000_000_000 |       4.628     |      5.837    |  1.26 |
       500_000_000_000 |      24.976     |     32.765    |  1.31 |
     1_000_000_000_000 |      50.713     |     69.705    |  1.37 |
     5_000_000_000_000 |     291.095     |    388.233    |  1.33 |
    10_000_000_000_000 |     621.321     |    821.755    |  1.32 |
    50_000_000_000_000 |    3028.456     |   4620.525    |  1.53 |
   100_000_000_000_000 |    6307.521     |   9724.568    |  1.54 |
   200_000_000_000_000 |   13603.531     |  20279.547    |  1.49 |

System 2


inxi -SCM
System:    Host: localhost Kernel: 5.1.9-pclos1 x86_64 bits: 64 Desktop: KDE Plasma 5.16.0 Distro: PCLinuxOS 2019
Machine:   Type: Server System: Supermicro product: C7X99-OCE v: 0123456789 serial: <root required>
           Mobo: Supermicro model: C7X99-OCE v: 1.01 serial: <root required> UEFI [Legacy]: American Megatrends v: 1.0a
           date: 12/11/2014
CPU:       Topology: 6-Core model: Intel Core i7-5820K bits: 64 type: MT MCP L2 cache: 15.0 MiB
           Speed: 2532 MHz min/max: 1200/3600 MHz Core speeds (MHz): 1: 1200 2: 1200 3: 1303 4: 1200 5: 1215 6: 1200 7: 1302
           8: 1200 9: 1201 10: 1201 11: 1376 12: 1201


        N            | twinprimes_ssoz | primesieve |  Ratio |
---------------------+-----------------+------------+---------
    100_000_000_000  |       3.403     |     5.335  |  1.57  |
    500_000_000_000  |      18.019     |    29.918  |  1.66  |
  1_000_000_000_000  |      39.680     |    62.239  |  1.57  |
  5_000_000_000_000  |     235.462     |   346.909  |  1.47  |
 10_000_000_000_000  |     512.071     |   730.726  |  1.43  |

GordonBGood (orginal) [2019-06-21T11:17:14+02:00] view original

@BLM2:

Your results are as valid for AMD or other processors as they are for Intel. The computer I am using is just a Windows tablet with a Intel Atom i5-Z8350 at 1.44 GHz (when multi-threaded), which is a great machine for the price but definitely at the low end of the performance scale: primesieve version 7.4 takes about 40 seconds to find the twin primes to a hundred million whereas your program takes about 25 seconds. I had a brief chance to play with a new laptop with an AMD Ryzen 5 2500U at 2.0 GHz multithreaded (slower than an equivalent Intel i5 in instructions per clock cycle by perhaps 20% to 25% but much less expensive) and the ratio was a little closer than on my Windows tablet but still a little bigger ratio than as you report for your Intel Core i7 notebook.

So I read your articles and your code to see why it's faster...

The following is not really on the topic of Nim but rather on your algorithm...

While it's great that you are doing innovative things with Nim as your chosen language, naming this sieving algorithm after yourself as something unique is stretching things: It is just a Sieve of Eratosthenes with a high degree of wheel factorization applied using residue bit planes as has been well known, at least by Berstein of Berstein and Atkin fame and I recall other implementations that used this same trick of separatiing the modulo residue bit planes prior to him; among them the reference from Professor Chris Caldwell: https://primes.utm.edu/nthprime/algorithm.php with a quote from that page as "The modified algorithm actually sieves many times, once for each residue relatively prime to some number with many small prime factors. (I used 30030 = 2*3*5*7*11*13.) So, for example, in the first sieve, all of the numbers have remainder 1 when divided by 30030, so instead of having the flag bytes represent 0,1,2,... as in the standard sieve algorithm, they represent 1,1+30030,1+2*30030,... This may sound like it creates more work than before, but it turns out to be faster since we only need to consider remainders which are relatively prime to 30030. In this way, the multiples of 2,3,5,7,11, and 13 are automatically eliminated, making the algorithm roughly 6 times faster. What's more, the modified algorithm can be easily parallelized, as many computers can each work on separate residues."

However, you are right that it is a great algorithm and something that Kim Walisch of primesieve missed as he persisted in byte packing the PG5 residues into a byte "because it works out - eight values in eight bits" and missed that by so doing he is reducing the effective range of each CPU L1 cache range with a subsequent loss of efficiency. However, he made up for this loss in other ways that you are currently not using in making the inner composite number culling loop(s) very efficient by using extreme loop unrolling which reduces his time per cull operation by a factor of about two from your implementation, with the technique described in his GitHub repo. This could be applied to your code to get this same gain and isn't that hard in Nim as you can use a hygenic code generation macro for the requisite 32 cases using about 100 lines of code to generate about 1600 lines of code at compile time. So your code could be about twice as fast as it is now, especially for ranges of hundreds of billions or so.

What you could claim as your own original implementation (and have ;-) ) is your method of reducing the number of composite number culls when finding twin primes by using the above bit planes but just not computing those bit planes that don't have an immediately adjacent bit plane (prime residue wise). By doing this, for a range of say a hundred billion you reduce the number of culls, as there are a total of 1485 residues out of a total of 5760 in the PG13 modulo generator you use to this range that qualify as having an immediately adjacent modulo bit plane "to the right" but you calculate two residue bit planes for each so 2970 bit planes worth. Thus you have reduced the number of culls to about 10 billion culls for this range (about 10% of the range) whereas primesieve culls them all then checks for twin primes so does about 30 billion culls (about 30%). So if your inner loop was as efficient as his, you should be about three times as fast but because your inner loop isn't as efficient you are in the order of only about 70% of the execution time (rather than 33%) although you gain more with increasing range due to processing in bigger ranges per segment because of the more efficient use of CPU L1 cache.

Understanding why your algorithm is faster, it's easy to see that it will always be faster due to the reduced amount of work for any CPU when compared to primesieve on that same CPU, with the gains greater for low end processors such as mine or current AMD Zen due to them not being able to take as great an advantage of the optimizations as used by primesieve.

It's also easy to see the optimizations that could make your code about twice as fast as it is currently...

Now, I am in process of writing a Sieve of Eratosthenes that uses all of the optimizations I mention here and more that should be up to about 20% faster than primesieve even for just counting primes, and if I adapted your twin primes technique to that it would also find twin primes much much faster. The code is written somewhat functionally as a Prime Generator so instead of pre-calculating the base primes to the square root of a given range, it uses recursion to add base primes as required as the range increases. As well, instead of dividing the work across the modulo residue bit planes, I divide the work for all bit planes for a given segment and thus it is scale-able to hundreds of thousands of cores if one has them rather than the number of modulo bit planes. The thing that is holding me up on this implementation is the current difficulty in using closures (as used in functional style code) across threads in current Nim with the limitation on the Garbage Collection only working per thread, but there are solutions. Note: I will not be calling the algorithm used "The Sieve of Goodsman", nor an adaptation of the Sieve of Zakiya, because it is exactly a maximally wheel factorized Sieve of Eratosthenes making it the "Ultimate Page Segmented Sieve of Eratosthenes with Multi-Threading". Although written in Nim (and fast), it could be translated to the idioms of C/C++, Rust, Swift, Haskell,, etc. and run just as fast, and will run as fast as possible within the limitations of the languages and the platforms on which they run for any compiled language such as C#, F#, Java, Kotlin, Scala, Clojure, JavaScript (when JIT compiled by such as Chromes V8 JavaScript engine or Mozilla's SpiderMonkey), and so on.

BLM2 (orginal) [2019-06-21T22:50:19+02:00] view original

Couldn't render post #31018.

GordonBGood (orginal) [2019-06-22T06:33:52+02:00] view original

@BLM2:

I'll reply here one more time before switching to email if necessary, as a few peripheral issues are open and readers may want to know the outcome:

The Classical Sieve of Eratothenes (CSoE) needs to search the whole integer number space up to some N (see https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes). The Optimized SoE (OSoe) seeks to reduce the number space (by skipping even number, factors of 3, 5, etc) but fundamentally still use as their starting/end points the whole number space up to N.

In fact, the linked Sieve of Eratosthenes article specifically mentions use of wheel factorization and has tables showing the effect of "reduced number space"; for instance, for what you would call the "PG7" as wheel factorization using the prime factors of two, three, five, and seven for a range of a hundred billion (10^11) reduces the required number of cull operations from about 276 billion for the "classical" SoE to about 36.2 billion operations due to "reduced number space" and further to about 22.3 billion culling operations for what you would call "PG19". The odds-only SoE is just a trivial case of "PG2" and reduces the number of culling operations to about 113 billion for this range as per the table: https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes#Algorithmic_complexity.

The OSoEs talk nothing about Prime Generators, residues, etc.

The article doesn't talk about actual implementation methods for wheel factorization because that is beyond its scope of describing the SoE algorithm; however, it has a formula whereby one can calculate the estimated number of culling operations for a given range and can then confirm that estimate using whatever implementation one chooses. It also goes beyond your implementations as in using optional "pre-culling" whereby for a given "PGX" one can pre-initialize with a wheel pattern for even larger primes than "X" and shows the resulting reduction in culling operations by doing this.

My methods are fundamentally different in conception and implementation. They are based on Prime Generator math...

Your methods use one of the most efficient implementations (perhaps the best in general terms) but as discussed above "Prime Generator Math" is just wheel factorization math as has been well known for many years in the field and is thus nothing unique.

They are also modern, as they are inherently structured to be optimally done in parallel.

As I mentioned in my first post, your method of doing this in parallel is just one possible implementation as is exactly the one chosen by Professor Caldwell in the article I quoted as he has used for perhaps the last 20 years. Before that, there was a 1971 paper on an implementation for a IBG 370/136 mainframe that implemented page segmented SoE and proposed refinements for what is essentially wheel factorization for "reduced number space" at the end of the article at: http://www.math.ubc.ca/~gerg/teaching/592-Fall2018/papers/1977.Bays.pdf. Application of wheel factorization to the SoE is well known in the art.

Your implementation of running the individual modulo residue bit planes in parallel is not necessarily the best one in all situations, although it may be the best for the specific problem of finding the twin primes over the ranges you propose. Running each successive segment over all modulo residue bit planes per thread has the advantage that it scales to any number of threads as per a supercomputer or compute platform such as the computation facilities of a modern GPU; although the code in calculation of starting addresses per segment is a little more complex, it doesn't then need to be done as often.

It's interesting you raise an issue of naming my work after myself. I mean after all, if you look on Wikipedia, besides the SoE, it has pages for the Sieve of Atkins, even though Daniel J. Bernstein as a grad student did all the programming for it https://en.wikipedia.org/wiki/Sieve_of_Atkin, and Sieve of Sundaram https://en.wikipedia.org/wiki/Sieve_of_Sundaram...

First, the Sieve of Atkin is deservedly named after Professor Atkin because it is a unique sieve with an academic paper which has been defended and recognized as being distinctly different than the Sieve of Eratosthenes. In fact, it also uses "baked in" wheel factorization of what you would call "PG5" but that comes about naturally due to the unique algorithm based on quadratic equations. What the paper and reference implementation by Daniel Bernstein fail to acknowledge is that a practical page segmented version of the SoA is very difficult to realize and does not show the stated O(N) asymptotic execution performance or even close and also that if one allows the compared SoE implementation to use maximum optimization including maximum wheel factorization it can never be faster for any practical range on any given CPU architecture. Bernstein did an excellent job in his reference implementation, but other than being mentioned in the paper, he has nothing more to do with the SoA named after Atkin than if I took his code and improved it (which I have done, but it still can never be faster than a maximally optimized SoE due to the added complexities of multiple varying span "culling" operations as compared to the constant span operations of SoE).

As to the Sieve of Sundaram, its only usefulness is that it is slightly simpler to implement than the SoE but is much slower due to culling by all odd numbers less than the square root of the range rather than culling by the base primes less than the square root of the range; if one does the slight modifications so as to cull only by the base primes, it is transformed into the "odds-only" SoE, which is a "PG2" wheel factorized sieve. Letting Sundaram's name be associated with the simplification can be argued as being justified as his simplification is a modification to the SoE algotithm, not just the implementation.

However, calling the SoA the Sieve of Bernstein due to his implementation or the Sieve of Goodsman if I published my improvements to his implementation is not appropriate because the algorithm that makes it work is entirely due to Professor Atkin. Kim Walisch does not call "primesieve" the Sieve of Walisch even though he combines many unique refinements from many sources to the basic Sieve of Eratosthenes algorithm (including but not limited to "PG7" wheel factorization in a different form, pre-culling up to 19, and many other "tricks"). The implementation I call Ultimate which also uses "PG7" or PG11" and pre-culling up to 19 or 23 does not deserve to be called the Sieve of Goodsman as it combines the history of improvements to the basic Sieve of Eratosthenes including Wheel Factorization, Page Segmentation, and the culminating touch that makes it faster than "primesieve" in the application of dividing the work into modulo residue bit planes as was used by Caldwell/Bernstein et all instead of packing multiple modulo residues into a byte as used by Walisch and others).

Thus, I don't think your implementation of the SoE with your implementation refinements should be named after you, but I'm not going to argue it further; it's up to you.

OTOH, your implementation "trick" of reducing the work done in finding the twin primes over a range is probably at least as unique as Sundaram's SoE simplification and I don't recall that being mentioned anywhere, so you can likely justify laying claim to that...

BLM2 (orginal) [2019-06-27T14:02:45+02:00] view original

A quick update.

I did some simple coarse optimizations that give some appreciable speed increases. But first some Nim issues to be aware of/correct.

Erata and Nim bugs

In proc twins_sieve (line 254), the seg array is inititalized as

var seg = newSeq[uint8](KB shr 3)

This could cause a runtime error. It could be one byte short, or be initialized effectively to var seg = newSeq[uint8](0) for small Kb. Compiling with -d:danger creates no runtime errors, and everything still works, but I don't know why (the D version produces runtime errors where expected, link below).

https://gist.github.com/jzakiya/ae93bfa03dbc8b25ccc7f97ff8ad0f61

I've now coded it correctly (as originally done) so it's now always at least 1 byte long and gives the correct length for all Kb values.

var seg = newSeq[uint8]((KB shr 3) + 1)

Starting with Nim 0.19.6 to present (0.20.0), the program sometimes hangs (needing a hard abort) for some small values (< 1e10). At first I thought it had something to do with the progress indicator on line 341 in the main proc twinprimes_ssoz, but even after I commented that line out it still hangs, so it must be a pure threading issue. Something must have changed since 0.19.4.0.

Performance Increases

As stated previously, a simple straightforward way (no big code changes needed) to increase speed is to better fine tune the segment sizes. Line 309

let B = Bn * 1024 * 8 # set seg size to optimize for selected

provides a very coarse way to simply experiment with segment sizes. Below are updated results of change to factor n for Bn * 1024 * n. I found using smaller values 4 and 6 for n gives appreciable better times for lower values than using 8 as a constant factor. Some results below.


            N          |   n = 8    |   n = 6    |   n = 4   |
    -------------------|------------|------------|-----------|
       100_000_000_000 |     4.628  |     4.624  |     4.684 |
       500_000_000_000 |    24.976  |    23.711  |    23.308 |
     1_000_000_000_000 |    50.713  |    48.298  |    47.981 |
     5_000_000_000_000 |   291.095  |   284.336  |   265.746 |
    10_000_000_000_000 |   621.321  |   618.032  |   575.923 |
    50_000_000_000_000 |  3028.456  |  3009.485  |  3038.179 |
   100_000_000_000_000 |  6307.521  |  6371.226  |           |

Basically n = 4 is fastest for N ~< 4e13, n = 6 for ~4e13 < N ~< 1e14, and n = 8, N >~ 1e14, with possibly n increasing as N increases.

I've now incorporated this switching scenario in the current code to reflect this.


  let range = Kmax - Kmin + 1      # set number of range resgroups, min of 1
  let n = if range < 37_500_000_000_000'u: 4 elif range < 975_000_000_000_000'u: 6 else: 8
  let B = Bn * 1024 * n            # set seg size to optimize for selected PG

https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e

What I really need to do is recalibrate the settings in selectPG. Those settings are for the old (pre June 2019) implementation, and need to be fine tuned for the current implementation. This coarse tweaking shows there are likely more speedups from merely tweaking the PG cross-overs and segment sizes for given range values.

Another likely speedup is for P5, which now uses just 3 threads for its 3 twinpairs residues. My I7 has 8 threads, so I can divide the range for P5 into 2 equal sections for each twinpair and use 6 threads in parallel. This needs just a small change to the code to achieve.

More extensive speedups are achievable by eliminating all/most/many of the calculations done in nextp_init, similar to what primesieve does.

However, primesieve actually performs 3 different algorithms, for small, medium, and large ranges, using (large) precomputed wheel (PG) tables of constants. The large algorithm is very fast doing this, while trading off code size. (primesieve is a very large code base, with dozens of files totaling into thousands of lines of C++ code.)

Some people have noted appreciation that I can perform in 1 file of ~300 lines of Nim code (minus line comments) so much better. (At least up to around 1e14+ range. It takes so much time to run tests above that, I'm resource limited to fully verify optimizing it over the full 64-bit range.)

However, with the small simple changes presented here, the implementation is now even faster than primesieve over the tested ranges.

GordonBGood (orginal) [2019-06-27T17:33:40+02:00] view original

the implementation is now even faster than primesieve over the tested ranges.

As I've noted before, your implementation is only faster than "primesieve" when it limited to the task of finding twin primes, for which your trick reduces the work. If one took "primesieve"'s implementation and used your trick to only find twin primes, it would be about two and a half to three times faster. Even better, my implementation is simpler and already divides the work into 48 or 480 modulo residue bit planes, so it would be even easier to adapt it to only sieving the limited planes needed for twin/triplet/etc. primes.

primesieve is a very large code base,

You are right that "primesieve" is a large complex program. However, it has grown to be overly complex and that is part of what is costing an extra 20% in execution time. My program is just over 500 lines of code including comments and performs about 20% better than "primesieve" for just counting primes even though it also makes (automatic) adjustments in steps for different size sieving ranges.

fine tuning ... fine tuning...

A sign that one doesn't very will understand one's algorithm is when extensive empirical adjustments are made for performance; in contrast, one can predict and knows exactly why "primesieve" and my algorithm are as fast as they are. One of your bottlenecks is in your inner loop, which would run at about 3.5 CPU cycles per culling operation if it were run within a single CPU L1 cache size most of the time. When run multi-threaded on a Hyper Threaded CPU with a 32 Kilobyte L1 cache as most current Intel CPU's have, you can only use 16 Kilobytes per thread (half) for the highest culling speed. Even 3.5 cycles per cull isn't fast, as extreme loop unrolling gives as little as an average of about 1.25 cycles per culling operation for small base prime sizes and there is an intermediate method that can reduce cull operation cycles for smaller base prime values to about 1.875 cycles.

segment sizes...

As above, making your segment sizes too small limits your range and/or makes culling over larger ranges too inefficient as in too many base primes don't "hit" the segment with your naive implementation; too large a segment size with your simple implementation will mean more/most base primes "hit" but are slower because of cache associativity. That is why programs use different schemes for different ranges of base primes.

It takes so much time to run tests above that, I'm resource limited to fully verify optimizing it over the full 64-bit range.

One doesn't actually test over the 64-bit number range, which would take hundreds of years at your rate; one uses the segmentation to test over a limited range every order of magnitude of growth. But your current algorithm is limited in it's use of fixed size segments so it won't be very efficient over ranges of about 1e14.

You are putting a lot of effort into this what is a very limited use that is only proven useful for finding twin primes (not triples and higher) faster. It seems to me that the effort could be better spent in doing something more general, but YMMV.

BLM2 (orginal) [2019-06-27T19:51:14+02:00] view original

@GordonBGood

Please provide:

Link to full source code to your work.

Timing results comparisons between your code, my code, and primesieve.

Link to detailed writeup on your methodology and implementation details.

GordonBGood (orginal) [2019-06-28T01:28:34+02:00] view original

Please provide...

Working on it. I am currently sidetracked by considering the implications of the new run time memory management proposal to see how it impacts my work as to if its application can make things better. The old memory model's lack of support for threading is very likely the reason you are experiencing some very serious crashes when using parallel execution.

For the algorithm as to methodology, the general idea is pretty well described for in "Chapter 4" of my answer on how to implement the Page Segmented Sieve of Eratosthenes in JavaScript at: https://stackoverflow.com/questions/39312107/implementing-the-page-segmented-sieve-of-eratosthenes-in-javascript/55761023#55761023.

The JavaScript version of Chapter 4 isn't included because it wouldn't fit and because I stepped back from JavaScript (I hate it for anything more than about a 100 lines) in favour of finishing the Nim version with the plan to compile that as JavaScript as a quick way to do it. It's an algorithm, so it is applicable to any programming language within the limitations of what the language allows. For instance, JavaScript doesn't have pointers and thus can't have the extreme loop unrolling applied, but even the "odds-only" code does have the optimizations to cull about twice as fast by culling over one of the mask patterns at a time for the eight different mask patterns

Even the JavaScript version in odds-only can sieve to a billion in about a second on a fast modern desktop with a recent good JavaScript engine such as Chrome V8/FireFox Spider Monkey/Safari. Since using the modulo residue bit planes discussed here leaves the complexity per bit plane almost exactly the same, and the number of operations is reduced exactly as per the Wikipedia tables, this will be almost exactly a factor of four faster when wheel factorization is done (including using the "combo" technique with pre-culling some larger primes than the basic residue wheel) or about a quarter of a second in JavaScript according to the above environment.

With Nim and extreme loop unrolling using pointers without JIT compilation, it should be at least two and a half times faster than that, then multi-threading is added for a further factor of the number of effective cores...

BLM2 (orginal) [2019-06-29T14:53:39+02:00] view original

FYI.

I finished running timing comparisons with D. They are very consistent, with D a tad slower at a constant ratio. Since most of the work is done in the threading operations I suspect that's where the difference is. I used the latest LDC 1.16.0 D compiler (it uses LLVM), and Nim was compiled using GCC 8.3.1.

D source: https://gist.github.com/jzakiya/ae93bfa03dbc8b25ccc7f97ff8ad0f61


            N          |       Nim       |       D       | Ratio |
    -------------------|-----------------|---------------|-------|
       100_000_000_000 |       4.628     |      4.619    |  1.00 |
       500_000_000_000 |      23.308     |     24.071    |  1.03 |
     1_000_000_000_000 |      47.966     |     49.703    |  1.04 |
     5_000_000_000_000 |     265.746     |    276.847    |  1.04 |
    10_000_000_000_000 |     572.803     |    590.881    |  1.03 |
    50_000_000_000_000 |    3009.485     |   3046.913    |  1.01 |
   100_000_000_000_000 |    6307.521     |   6594.283    |  1.04 |
   200_000_000_000_000 |   13603.531     |  14706.743    |  1.08 |

I have a Rust version of the previous (older) algorithm I will convert (at some time), and may finally try to do a C++, since people want to see a C++ version.

Actually, this algorithm is a very good general benchmark because it exercises multiple components of a language (threading, memory allocation/management, compile time features, looping structures, standard libs, etc).

GordonBGood (orginal) [2019-08-04T07:04:32+02:00] view original

@BLM2, I'm still working on this, but since comparing single threaded performance is like racing cars where each is limited to firing on only one cylinder when they have at least four and maybe six, eight, twelve, thirty-two or sixty-four: one can compare performance per cylinder but some cars and driving techniques may not scale across the different cars for this limitation. Anyway, even if one is going to compare this way, it should be on the basis of performance in CPU clock cycles per core-pair since HyperThreading (HT)/Simultaneous Multi Threading (SMT) thread pairs usually share L1 and L2 cache resources for modern CPU's.

That is the problem with Dana Jacobsens's single threaded sieve comparison chart (http://ntheory.org/sieves/benchmarks.html) although it is somewhat useful for comparison by scaling clock speeds as described below.

The top contenders in her Chart are likely no longer the fastest, as Huang YuanBing's "ktprime" (K Tuple Prime) written in C (https://github.com/ktprime/ktprime) is actually up to about 10% faster than Kim Walish's "primesieve" when run single threaded producing 'k'=1 (the non-k) as one can determine by scaling his (single-threaded) 1e10 and 1e12 results on his i7-6700 CPU running at 3.9 Gigahertz with Dana Jacobsen's single threaded tests run on a i7-6700K CPU running at 4.2 Gigahertz. Unfortunately, Dana doesn't seem to see comparing single threaded results the way I do (or more likely she is comparing many algorithms that haven't been written to be multi-threaded) and "ktprime" hasn't yet had multi-threading added anyway, but I see no reason it wouldn't hold its lead over "primesieve" when multi-threading is added.

The code for "ktprime" is much more complex than yours as it is more general in being able to be used to count/find "K-Tuple Primes" in the general case and not just Twin Primes as your code does and also part of the extra lines of code are due to it being written in C and not Nim and as well having a full command line interface rather than just the basics, but the general principles of using "PGX" wheel factorization generators are the same. Your trick of boiling down the performance by eliminating some sieving of residual bit planes that aren't required for Twin Primes will still beat it for that special case if the same optimizations were applied to your code as he uses, but that is very much a special case.

Note that most newer algorithms don't even bother with sieving to a billion as that is now such a trivial case (about 40 milliseconds multi-threaded on a modern desktop) for modern algorithms and simple tricks that only work for these "smaller" sieving ranges can change the outcome but aren't generally applicable to larger ranges: thus, the comparisons should really start at ten billion or (even better) a hundred billion. For instance, "primesieve" runs about 20% faster sieving to a billion on a CPU with a 32 Kilobyte sieve buffer size than with a 16 Kilobyte sieve buffer size when run single threaded, but these gains don't translate to multi-threaded use as two threads share this same cache meaning that it effectively becomes about the same performance as if it were only 16 Kilobytes per thread. Also, the problem is that these better algorithms are too "grainy" at small ranges with possibly a single thread task sieving up to about a four billion number range at a time and a processor with eight threads thus processing a range of about 32 billion per step! What really counts in speed is the ability to sieve these larger ranges efficiently and in a way that scales well with multiple threads. Kim Walich's "primesieve" only scales moderately well with increasing range due to conflicting memory access HT/SMT threads sharing the same CPU L1 cache as described with "ktprimes" possibly scaling better and my own work where I always use just only half the L1 cache definitely scaling for multi-threads in a direct way.

...and may finally try to do a C++, since people want to see a C++ version.

I don't know that you need to go that far if you don't want to - I personally dislike C/C++ intensely and never write anything in them more than a couple of hundred lines of code, which is why I am with Nim: most people capable of reading and understanding C++ code should be able to easily translate your few hundred lines of Nim code and if it is just for the purposes of comparative benchmarks, it is a trivial task to install the Nim compiler that basically just generates C or C++ code from your code anyway. But "have at it" if you must and it will be an interesting exercise anyway in comparing the code.

In my experience, Nim can run just as fast for the same algorithm as C/C++ as long as one pays attention to memory allocation as in not using the Garbage Collector (GC) and is aware of when deepCopy operations will be triggered and how to avoid them. The main difficulty with writing multi-threaded code in Nim (which you are experiencing in your crashes) is not passing GC'ed values across threads, which can lead to some extra code needing to be used to define ones own non-GC'ed data types including closures (which are verboten across threads!). Other than that, Nim code approaches some of the most concise while performant code I have ever written!

Hopefully future improvements to the Nim memory management model will make writing multi-threading code much easier. The "newruntime" switch could fix the ability to use closures across threads, but unfortunately I can't try it currently as it doesn't work when threading is turned on.

Actually, this algorithm is a very good general benchmark...

I agree with you that primes sieving can be an excellent CPU and compiler benchmark if used knowledgeably; the problem is that many don't use it this way, comparing using "monolithic" sieves (one huge sieving array without page segmentation) which only measures main RAM access speed when used with large sieving ranges, or confusing the result determination time with the actual sieving time as in the time to count the number of found primes with the time to do the sieving, or the time to read the time with the actual thing being benchmarked (as in reading the time too often in a micro benchmark).

GordonBGood (orginal) [2019-08-05T12:27:47+02:00] view original

@mratsim

Not sure if it's the only bug but for now this is the one to fix for multithreading on --newruntime #11844.

It seems that @Araq has pushed a commit to fix that issue; all that remains is to compile the DEVEL branch and test or wait for a new release...

As to exploring different alternative multi-threading:

approaching the limits of OpenMP (nested parallelism especially)

I rejected OpenMP early as it is too C'ish and therefore somewhat cludgy although I have used it before with C and see how its use can be expanded beyond the basics available "out-of-the-box".

raw Threads is a too low-level abstraction

At first I thought using the low-level would be the answer as in easily emulating GoLang's green threads and that channels would work like GoLang channels; however, it kept getting more complex due to the limitations of not having GoLang's multi-threaded GC and therefore all the Nim limiations on the use of channels, meaning that one has to stoop to the use of low level features as in use of Lock and Cond and essentially building a Nim version of Haskell's "MVar" to pass through the channels. This has caused me to explore @Araq's recommended threadpool.

Nim's FlowVars have contention issues and no load balancing

Funny you should say that about FlowVar's as your link to the "pibench2" project doesn't use theadpool and therefore doesn't use FlowVar's. I agree that the linked project using the raw thread's isn't very efficient and should be re-written to use threadpool and spawn as per the following:

# Compute PI in an inefficient way but multi-theaded...
from cpuinfo import countProcessors
from threadpool import FlowVar, spawn, `^`
from times import epochTime

const DELTA = 1_000_000 # number of terms calculated per thread call!

let numprocs = countProcessors()

proc pi(numterms: int64): float =
  result = 0
  proc terms(k, klmt: int64): float {.inline.} =
    result = 0
    for i in countup(k, klmt, 2):
      let frst = i + i + 1; result += 1 / frst.float - 1 / (frst + 2).float
# [ # take out the space between the `#` and the `]` to comment this out!
  var rslt = newSeq[FlowVar[float]](numprocs) # acts as a rotary queue!
  for i in 0 ..< numprocs: # preload waiting queue!
    let k = i * DELTA; let klmt = k + (if k + DELTA > numterms: numterms - k else: DELTA - 1)
    rslt[i] = terms(k, klmt).spawn
  for i in 0 .. numterms div DELTA:
    let id = i mod rslt.len; result += ^rslt[id] # get values from queue!
    let k = (i + numprocs) * DELTA
    let klmt = k + (if k + DELTA > numterms: numterms - k else: DELTA - 1)
    rslt[id] = terms(k, klmt).spawn # add new tasks to queue! # ]#
#[ # put a space between the `#` and the `]` to uncomment this!
  for k in countup(0, numterms, DELTA): # do it without threading for comparison
    let xtra = (if k + DELTA > numterms: numterms - k else: DELTA - 1)
    result += terms(k, k + xtra) # ]#
  result *= 4

let start = epochTime()
let answr = pi(10_000_000_000)
let elpsd = (epochTime() - start) * 1000.0
echo answr, " in ", elpsd, " milliseconds."

The above code does the same calculations as the "pibench2" with default settings which calculates 10^10 = 10 billion terms for 10 "digits". One can vary DELTA to see that it starts to lose scaling across threads at something below about a hundred thousand when the thread task is something about a millisecond.

The above takes about 3.X seconds on a Sandy Bridge CPU with three cores at 2.5 Gigahertz to do a tenth of the job as per the following WandBox link: https://wandbox.org/permlink/9gOlb1WsjjgWFGIm but that isn't too certain as the two of the three threads will likely be shared cores; it takes about 54.2 seconds on my Intel Atom i5-Z8350 Windows tablet CPU (Silvermont) at 1.44 GHz and four cores and about 36.25 seconds on my Intel i3-2100 at 3.1 GHz with two cores and four threads. From all of that, your processor being a half again faster clock speed than this last and with nine times the number of cores as well as being seven generations newer might make your 18 core/36 thread machine able to run this in a half second but I would more expect it to be about two to three seconds. One can comment out the multi-threading code and uncomment the single threaded code to see that the code scales with the number of cores other than the uncertainty of Hyperthreading/Simultaneous Multi-Threading threads and the number of floating point execution units there are per core and how they are used by the CPU. It is easy to calculate from these times (especially single threaded ones) that it is taking about 30 CPU clock cycles per term calculation on these older generation CPU's so that seems about right and explains why we need at least about a hundred thousand terms calculated per thread task to make task switching reasonably efficient. However, number of clock cycles for floating point calculations could be greatly reduced in more modern CPU's.

spawn doesn't load balance any more than green threads on GoLang or Haskell do and it is up to the programmer to implement the algorithm to do this themselves; spawn threads do have the advantage that they avoid a lot of context switching "thread spin-up" time since the threads are allocated from the pool and thus multi-threading can be "finer" - still efficient for thread tasks taking shorter amounts of time such as less than about a millisecond rather than about ten milliseconds. It would be nice if channels worked with them, but it isn't that hard to implement a pool of FlowVar's as done in the above example code, so it isn't absolutely necessary.

It looks to me that the use of threadpool/spawn/FlowVar may be the best option, and if some automatic load balancing is required, to build a module on top that does this.

However, we are still left with the deepCopy for ref's and nested ref's such as in closures not being compatible, which is why I am hoping that the new run time may help. But then again, it may not, as the concept of ownership doesn't really fly well across threads unless ownership can be transferred to each thread, and such ownership transferring pretty much always mean a deepCopy, a reference counted memory management, or a multi-threaded Garbage Collection (which we are trying to avoid).

I'll do a little more work and report back as to it's use for this primes sieve...

GordonBGood (orginal) [2019-08-06T10:14:51+02:00] view original

@mratsim:

Nim's FlowVars have contention issues and no load balancing.

I have been doing some more thinking about your performance problem with the "pibench2" program and think that the problem may have been your 36 threads being run in with pinned threads in thread affinity - there should be very good reasons to pin threads to a core and this code isn't a good example of one of them. As I show in my code above, it isn't necessary, and the above code could reasonable easily be converted to use raw threads rather than spawn.

That said, there are advantages in reducing context switching overhead by using spawn and as you said in your comment on using raw theads, they really are too low level if one doesn't need those extra cababilities as one doesn't here.

For your use, you might use setMinPoolSize to your countProcessors() value to be sure that enough threads are "spun up" in the thread pool.

GordonBGood (orginal) [2020-06-13T22:42:38+02:00] view original

@BLM2:

@GordonBGood Please provide:

Link to full source code to your work.

Timing results comparisons between your code, my code, and primesieve.

Link to detailed writeup on your methodology and implementation details.

It's been some time, but I have done quite a lot of work on this across various languages.

Answers to your requests as follows:

In Nim: here is a Wandbox of a fast version of a true Sieve of Eratosthenes with wheel factorization as discussed. This is not fully optimized to the level of Kim Walisch's "primesieve", but it shows the principles of using wheel modulo residual bit planes in efficient wheel factorization sieving using something you would call PG4 (wheel factors of 2/3/5/7) but with the added refinement of using preculling for the primes of 11/13/17/19.

This has many more optimizations to the innermost culling loop than your code so is actually faster, but as your code isn't really a Sieve of Eratosthenes as it can't count or output the found primes in order, they are not really comparable. It is about as fast as "primesieve" even though it isn't complete as to optimizations extending it's range above about 16 billion, but you can compare it to primesieve over the range of a billion to 16 billion yourself.

I have written quite a full tutorial on how it works as a StackOverflow answer on implementing this in JavaScript which describes everything that is done here other than the macro that does extreme loop unrolling (not possible to use in Javascript) and also describes the other changes necessary to extend the work to much higher sieving ranges.

Now as your work:

No, it's not really unique as Huang Yuanbing's KTPrime is really the same thing other than his code isn't limited to only counting Twin Primes and can handle all prime tuples. His code is written in C++ but could easily enough be translated into Nim or any other language producing efficient native code. As his code is much more optimized than yours, I suspect that his code is at least as fast as yours even though it is written for the more general finding of any/all prime tuples and not just Twin Primes.

Huang Yuanbing doesn't claim that his code is a full Sieve of Eratosthenes as it doesn't count or find primes in order but rather across all modulo residual bit planes with only a range of bit planes in memory at any given time. In order to produce results in order, they would need to be stored and sorted at an extra computational cost. Your code has the same limitation and in fact isn't written for the general case of finding or counting all primes at all in any order, making it not really a full Sieve of Eratosthenes although it uses some of the SoE culling principles.

As you can see from my linked example in 1., your code still misses many culling loop optimizations that would make it much faster if they were applied. If I cared to re-write your code using my optimizations, it would run up to about three times faster, but I'm not going to do that because the end result is a Twin Prime finder with the results found out-of-order. My efforts are better spent on extending the work as per the posted link in 1. to extended sieving ranges.

I could adapt my work to finding prime K tuples as well as individual primes and prime gaps in order and will likely do that for the general case so my code is more comparable to "primesieve". The fact that your code may be slightly faster (or quite a bit faster if it were better optimized) means little as your code is for a limited use case whereas "primesieve" and my code are for the more general full sieve use case.

jlhouchin (orginal) [2021-12-22T20:28:45+01:00] view original

jzakiya posted in the Crystal forum a comparison of a Sieve implementation in which in the post the Nim implementation crashes at certain point.

https://forum.crystal-lang.org/t/kudos-to-crystal/4161

I like for each language to have the best implementation that the community can put forward and let the chips fall where they may.

Can someone help the Nim implementation better represent Nim?

Thanks.

DeletedUser (orginal) [2021-12-22T23:42:05+01:00] view original

The option --gc:arc which has been added since would probably help with the seq allocations & memory usage but not sure how much.

shirleyquirk (orginal) [2021-12-23T04:30:38+01:00] view original

i was able to replicate the oom crash and i know what needs to be done to fix it, but I don't like my hack, hopefully someone else can help make it cleaner.

The problem boils down to this:

proc foo(primes:seq[int]) =
  var even_bigger = newSeq[int](primes.len*2)
  for p in primes:
    discard #primes is read but not modified

proc main() =
  let primes = newSeq[int](203034842)
  for i in 0..2: #3 threads for this particular input but could well be more
    spawn foo(primes)

spawn makes a copy of primes for each thread, which, combined with the giant seqs created within each thread, is just too much.

my solution is:

var g_primes:seq[int]
proc foo() =
  {.gcsafe.}: #we're using a global using gc memory!
    var even_bigger = newSeq[int](g_primes.len*2)
    for p in g_primes:
      discard #primes is read but not modified

proc main() =
  g_primes = newSeq[int](203034842)
  for i in 0..2: #3 threads for this particular input but could well be more
    spawn foo()

and to compile with --gc:arc

is there a better way? one that doesn't require that {.gcsafe.} cast?

Knaque (orginal) [2021-12-23T06:24:37+01:00] view original

Well, as long as g_primes is read-only, I think this is a perfectly reasonable solution. {.gcsafe.} is essentially a way of saying "I know this code looks unsafe to the compiler, but I, the programmer, promise that it is actually safe." So, as long as you can keep that promise, there's really nothing wrong with doing it this way.

DeletedUser (orginal) [2021-12-23T15:18:07+01:00] view original

I didn't catch this at all. I believe --gc:arc is fine with managed globals so if you only compile with --gc:arc the gcsafe cast can be removed. I think there are plans to add a {.global.}: block for using globals as well to serve as a more specific version of gcsafe, but it might not be needed.

GordonBGood (orginal) [2021-12-24T08:42:16+01:00] view original

@shirleyquirk:

I don't like my hack, hopefully someone else can help make it cleaner...

I don't like it either, or this algorithm in general, which is why I haven't worked on it directly, at least so far...

As you have identified, the biggest problem with the implementation is that it uses Nim's experimental spawn feature that requires copying seq's between threads, which feature will likely soon be replaced with something better to avoid the copying between the foreground and background threads. As it currently exists, it will never be very satisfactory, so I would likely implement this manually using raw Threads and passing the required data across using pointers, which although still not "pretty" means it is more likely to work reliably.

However, using 12.2 Gigabytes for this algorithm is extremely high (although I don't know the number of threads you are using). Also, it looks like the Crystal implementation automatically limits the number of threads used, perhaps for this same reason so that the print out shows Crystal only using three threads where eight seem to be available. The base prime value`seq` stores eight-byte int's and for sieving over the maximum range there should be about 200 million of these so 1.6 Gigabytes per thread. The page-segment memory used per thread is only 384 Kilobytes maximum an a pair of these per thread would only be six Megabytes, so that is negligible. Not copying the array is a huge step in the right direction, as then the one base primes array serves all threads.

Another thing that would be easy to do would be reduce the stored base prime seq array to uint32 as that is always enough and just cast them up as required, which would reduce the memory use by a factor of two. A slightly harder thing to do would be encode the base primes as single byte delta's since the base prime values are all odd and the larges prime gap up to the 32-bit number range is 354 but if encoded knowing all gaps are even then the larges delta is only 177, which is within the 255 range of a single byte. This would reduce memory use for the base prime value seq another factor of four to about 200 Megabytes, which wouldn't be a problem. However, this isn't likely going to entirely fix the memory use problem, as it looks like the implementation uses huge "nextp" seq's of the same length as the number of base primes, with one entry for each of the two bit planes each thread handles. This is over three Gigabytes per thread (and they can't be shared) although there is no real need for each entry to be 64-bit values and 32 -bit values would reduce this by a factor of two as above. A better solution would be to use a more efficient page-segment start address calculation, although that would require fairly extensive re-writing and I would recommend my alternative implementation described below.

Taking 22 seconds to initialize to this range, which should mostly be sieving the maximum base prime value range, is also ridiculously slow as it should take only a second or two even single-threaded and can be optimized to take only a fraction of a second; of course part of the reason this is so slow is that, although it uses some wheel factorization, it uses the "one-huge-memory array" technique which suffers horribly from "cache-thrashing". In a similar vein, taking almost 20 seconds to do a tiny page-segment sieve of a few hundred totative bit planes is also very slow, especially using several threads to do so. For instance, Kim Walisch's "primesieve" counts the twin primes over the small range at near the high limit in a couple of seconds, even single threaded.

As to the problems with the algorithm, near the top of this thread, I replied to the OP, BLM2 (which is yet another account name by the Crystal forum's contributor, Zakiya, and the author of this algorithm) giving my objections to his work, which I will state again as follows:

For years he has persisted in claiming the basis of the algorithm as his own, now calling it the Segmented Sieve of Zakiya (SSoZ), when it is nothing but a version of the Page-Segmented Sieve of Eratosthenes (SoE) with a slightly different form of wheel factorization, and refuses to acknowledge previous work that also uses this slightly different form such as that of Andrew Booker or as per the SoE reference implementations for comparison with the implementation of the Sieve of Atkin as per Daniel Bernstein's work in the eratspeed program as part of that project.

Wheel factorization is often implemented with the wheel totatives each represented by one binary bit in a byte, word, or several successive words; for instance, for the wheel formed from the primes 2/3/5 which has eight totatives of 1/7/11/13/17/19/23/29 if including one or 7/11/13/17/19/23/29/31 if skipping past one, all totatives fit in one byte with each bit representing one binary value, either composite or not, and with successive bytes containing these values incremented by steps of 30.

In Zakiya's version in common with other previous work such as the above links, instead of combining all wheel factor totatives into one word or set of words, a separate bit plane is used for each of the totatives, so there would be eight bit-packed arrays uses for the above example.

For finding twin primes as per this implementation, two of these bit-packed totative arraysare used at a time by each thread so they can be checked for twins. For example, if the above wheel using primes up to 5 were used and the value one were skipped, the twin prime check pairs would be (1,2)/(3/4)/(6,7), which could be run on three threads

As the working range goes up, the size of the totative bit plane arrays would increase to ineffecient sizes, so his algorithm divides this into smaller page segments which are a maximum of 384 Kilobytes or about the CPU L2 cache size.

For very large upper limits, the implementation increases the "degree" of wheel factorization up to and including the prime of 17, which increases the number of totatives to 92160 and provides many more possible thread applications with one threaded task per pair of admissible bit planes, also keeping the number of page-segments required per bit plane to something quite small. Page-segmentation was required as the implementation allows looking for twin primes between a small range between a start value and an end value as well as to keep the memory use low.

If implemented correctly with a thread pool of only the maximum number of CPU threads, the maximum memory use per thread should be the maximum page segment size of 384 Kilobytes times two for the twin-primes search, plus storage for all of the base primes up to whatever upper limit as described above, but with the problem that the implementation requires a huge set of "nextp" seq's.

My preferred more general way of implementing this would be as follows:

Instead of a variable number of bit planes with range, I would choose a reasonable wheel factorization using the primes up to say 7, so 48 totative bit planes.

I would choose a page-segment size of fixed increments whose size again goes up with range as its span in bits needs to be at least as big as the square root of the represented top range per page, which is the maximum value of base prime required to cull the composites from the page; this isn't that different the the Zakiya implementation other than the above fixed number of totatives.

Then, each page-segment would consist of all of the admissible totative bit planes using the entire page-segment bit plane buffer for each, with successive pages assigned to different threads from the thread pool, whose work would include comparing the admissible pairs for bit planes and counting the occurrences where a given bit position is prime for pairs of totative bit planes. In other words, instead of each thread handling only one pair of totative bit planes, each thread would handle all admissible totative bit planes and include the finding and counting of twin primes in adjacent pairs.

The feeding and retrieving of results from the thread pool would continue until the entire range has been processed.

The method is somewhat superior to that of Zakiya as it allows better optimizations in culling, but especially in calculation of the culling start positions per totative bit plane largely through Look Up Table's as I do for my implementation of the Page-Segmented Sieve of Eratosthenes in a StackOverflow answer, or the same in F# on another SO answer, which includes multi-threading and other optimizations (although limited by running on a JIT platform).

Adding the processing to, rather than just count primes, skip the inadmissible totatives not able to be a part of twin primes and to or/and bit planes together to count twin primes rather than just primes is quite simple and could be easily accomplished by translating that code.

I have written this in Nim, but I need to update it for the latest Nim versions and do a little work to improve the multi-threading. It is about as fast as Kim Walisch's prime sieve, and if I added the logic to skip the inadmissible totative bit planes would be even faster, although if one wants the very highest speeds with upper limits above about 1e14, there is no avoiding using prime arrays totaling up to about 1.6 Gigabytes per thread for the highest limit, although these aren't used in the same way as the "nextp" arrays (they are part of a "bucket sieve" implementation).

Mirror of forum.nim-lang.org

4950 :: Fastest Prime Sieve, in Nim