So the parallelization of SmallPT uses dynamic scheduling:

```cpp
#pragma omp parallel for schedule(dynamic, 1) private(r) // OpenMP
for (int y=0; y<h; y++){ // Loop over image rows
```
This matters a lot because some rays miss everything and terminate immediately, while others bounce all over the place. Worse, scenes often contain wide expanses of "easy" regions (sky, walls, ...), so with a naive row assignment some threads would be handed only the easy parts while the threads assigned the complex parts would be left to grind through them alone.
Example: https://www.cs.brandeis.edu/~dilant/WebPage_TA160/initialsllides.pdf
A runtime without any load balancing wouldn't be able to scale a raytracing application (which is what makes raytracing an interesting load-balancing benchmark).
I expect the GCC team broke dynamic scheduling and it now defaults to static. An easy way to confirm that is to change the schedule from dynamic to static, rerun with GCC 8 and Clang, and see if the performance degradation matches GCC 10.
I leave that as an exercise to the reader (just joking).
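For reference, the static variant is just a schedule-clause swap. A minimal sketch, with a hypothetical `render` signature standing in for the smallpt loop quoted above:

```cpp
#include <omp.h>

// Static scheduling variant: each thread receives a fixed block of rows up
// front, so a thread stuck on a complex region gets no help from threads
// that finished their easy rows early.
void render(int w, int h, double* image) {
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            image[y * w + x] = 0.0; // placeholder for the per-pixel ray work
        }
    }
}
```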
I checked dynamic vs static scheduling, but it seems to be something else.
If someone can reproduce this, that would be helpful. You probably need at least 4-8 cores.
Limitations:
Note: Feel free to reuse the video code to record the NimConf, but be warned that 6 seconds at 576x324 took 53 MB ;)
How about using a counter-based random number generator like Philox? https://www.thesalmons.org/john/random123/
It works like a hash function: to generate a random number, you build a unique counter value and pass it to the function. You can build unique counters from the pixel coordinates, item index, loop counter, line number, etc. (a sketch follows after the pros/cons below).
Pros: no RNG state to carry around, and results are reproducible regardless of which thread handles which pixel.
Cons: each draw costs several mixing rounds, so it is typically slower than a simple stateful PRNG, and you must guarantee that counters are unique.
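To make the idea concrete, here is a minimal C++ sketch. The names `mix64` and `rand01` are made up for illustration, the splitmix64-style finalizer stands in for the real Philox rounds, and the bit-packing assumes an image under 2^20 pixels per axis and under 2^24 samples:

```cpp
#include <cstdint>

// Counter-based RNG sketch: the output is a pure function of the counter,
// so any thread can recompute the same stream for the same pixel/sample.
// A splitmix64-style finalizer is used as the mixer here; real Philox
// applies several multiply-xor rounds with a key schedule instead.
uint64_t mix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ull;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ull;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBull;
    return x ^ (x >> 31);
}

// Assumed packing: x and y below 2^20, sample index below 2^24,
// so every (x, y, sample) triple maps to a distinct counter.
double rand01(uint32_t x, uint32_t y, uint32_t sample) {
    uint64_t counter = (uint64_t(y) << 44) | (uint64_t(x) << 24) | sample;
    return (mix64(counter) >> 11) * 0x1.0p-53; // top 53 bits -> [0, 1)
}
```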
> I have found a parallel RNG scheme that allows reproducible multithreaded results, for those who want to do parallel reproducible Monte-Carlo simulations:
Each RNG generates a different stream, but I think two streams can overlap. All RNGs are seeded with a unique random number in your code, but the state of one RNG right after seeding can be identical to the state of another RNG after several state updates. In that case, the former RNG generates the same stream that the latter produces after those updates. The probability of this happening can be very small, but it behaves like the birthday problem: you need a PRNG whose period is much longer than the number of random numbers actually consumed to avoid stream overlap.
> How about using a counter-based random number generator like Philox? https://www.thesalmons.org/john/random123/
The Random123 paper is mentioned in JAX (another Google deep learning framework), which was brought to my attention here: https://github.com/mratsim/weave/issues/147#issuecomment-631864783
https://github.com/google/jax/blob/master/design_notes/prng.md
This is what I'm doing with my pair function + reseed. JAX focuses more on the splittable-RNG part, so I may have reinvented the counter RNG :P.
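For illustration only, here is one shape a "pair function + reseed" scheme can take. This assumes Szudzik's pairing function and a MurmurHash3-style finalizer; the names `szudzikPair` and `childSeed` are hypothetical and this is not necessarily Weave's actual scheme:

```cpp
#include <cstdint>

// "Pair function + reseed" sketch: derive a deterministic child seed from
// (parent seed, child index), then reseed the child's RNG with it, so every
// task gets a reproducible stream regardless of thread scheduling.
// Szudzik's pairing is injective for inputs below 2^32; the finalizer
// decorrelates nearby pairs. Illustrative only, not Weave's exact scheme.
uint64_t szudzikPair(uint64_t a, uint64_t b) {
    return a >= b ? a * a + a + b : b * b + a;
}

uint64_t childSeed(uint64_t parentSeed, uint64_t childIndex) {
    uint64_t s = szudzikPair(parentSeed & 0xFFFFFFFFull, childIndex);
    s ^= s >> 33; s *= 0xFF51AFD7ED558CCDull; // MurmurHash3 fmix64
    s ^= s >> 33; s *= 0xC4CEB9FE1A85EC53ull;
    return s ^ (s >> 33);
}
```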
> The probability of this happening can be very small, but it behaves like the birthday problem: you need a PRNG whose period is much longer than the number of random numbers actually consumed to avoid stream overlap.
The birthday paradox takes the square root of your probability: if you have 2^-256, the birthday paradox turns that into 2^-128. I use a 2^256 period in my RNGs because of this, and Weave also uses 2^256 for task scheduling.
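The standard birthday bound makes this concrete: for n independently seeded states in a state space of size N,

```latex
% Birthday bound: probability that two of n seeded states collide
% in a state space of size N
P(\text{collision}) \approx 1 - e^{-n(n-1)/(2N)} \approx \frac{n^2}{2N}
% With N = 2^{256} and n = 2^{64} streams: P \approx 2^{128} / 2^{257} = 2^{-129}.
% With N = 2^{128} and the same n:         P \approx 2^{128} / 2^{129} = 2^{-1}.
```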
Thank you for your suggestions, it's nice to see people interested in RNG quality in the community :)
I wanted to use Option[T], which was removed anyway because it made the code 5x slower
Doing exactly the same thing (but taking a Rust port as a reference), I also had the same problem with Option[T] as a return type being 3.5x slower than a (bool, T) tuple in some cases but not others. :(
Also, my threaded version was quite ugly and I was unable to use "parallel for". I will definitely give Weave a try. Thank you!