So the parallelization of SmallPT uses dynamic scheduling:

```cpp
#pragma omp parallel for schedule(dynamic, 1) private(r) // OpenMP
for (int y=0; y<h; y++){ // Loop over image rows
```
This matters a lot because some rays miss everything and terminate immediately, while others bounce all over the place. Worse, scenes often contain wide expanses of "easy" regions (sky, walls, ...), so with a naive row assignment some threads would be handed only the easy parts while the threads assigned the complex parts would be left to grind through them alone.
Example: https://www.cs.brandeis.edu/~dilant/WebPage_TA160/initialsllides.pdf
A runtime without any load balancing wouldn't be able to scale a raytracing application (which is what makes raytracing an interesting load-balancing benchmark).
I expect the GCC team broke dynamic scheduling and it now defaults to static. An easy way to confirm that is to change the schedule from dynamic to static, rerun with GCC 8 and Clang, and see if the performance degradation matches GCC 10.
I leave that as an exercise to the reader (just joking).
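For reference, the static variant is just a schedule-clause swap. A minimal sketch, with a hypothetical `render` signature standing in for the smallpt loop quoted above:

```cpp
#include <omp.h>

// Static scheduling variant: each thread receives a fixed block of rows up
// front, so a thread stuck on a complex region gets no help from threads
// that finished their easy rows early.
void render(int w, int h, double* image) {
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            image[y * w + x] = 0.0; // placeholder for the per-pixel ray work
        }
    }
}
```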
I checked dynamic vs static scheduling, but it seems to be something else.
If someone can reproduce this, that would be helpful. You probably need at least 4-8 cores.
Limitations:
Note: Feel free to reuse the video code to record the NimConf, but be warned that 6 seconds at 576x324 took 53 MB ;)
How about using a counter-based random number generator like Philox? https://www.thesalmons.org/john/random123/
It works like a hash function: to generate a random number, you build a unique counter value and pass it to the function. You can build unique counters from the pixel coordinates, item index, loop counter, line number, etc. (a sketch follows after the pros/cons below).
Pros: no RNG state to carry around, and results are reproducible regardless of which thread handles which pixel.
Cons: each draw costs several mixing rounds, so it is typically slower than a simple stateful PRNG, and you must guarantee that counters are unique.
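To make the idea concrete, here is a minimal C++ sketch. The names `mix64` and `rand01` are made up for illustration, the splitmix64-style finalizer stands in for the real Philox rounds, and the bit-packing assumes an image under 2^20 pixels per axis and under 2^24 samples:

```cpp
#include <cstdint>

// Counter-based RNG sketch: the output is a pure function of the counter,
// so any thread can recompute the same stream for the same pixel/sample.
// A splitmix64-style finalizer is used as the mixer here; real Philox
// applies several multiply-xor rounds with a key schedule instead.
uint64_t mix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ull;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ull;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBull;
    return x ^ (x >> 31);
}

// Assumed packing: x and y below 2^20, sample index below 2^24,
// so every (x, y, sample) triple maps to a distinct counter.
double rand01(uint32_t x, uint32_t y, uint32_t sample) {
    uint64_t counter = (uint64_t(y) << 44) | (uint64_t(x) << 24) | sample;
    return (mix64(counter) >> 11) * 0x1.0p-53; // top 53 bits -> [0, 1)
}
```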
> I have found a parallel RNG scheme that allows reproducible multithreaded results, for those who want to do parallel reproducible Monte-Carlo simulations:
Each RNG generates a different stream, but I think two streams can overlap. All RNGs are seeded with a unique random number in your code, but the state of one RNG right after seeding can be identical to the state of another RNG after several state updates. In that case, the former RNG generates the same stream that the latter produces after those updates. The probability of this happening can be very small, but it behaves like the birthday problem: you need a PRNG whose period is much longer than the number of random numbers actually consumed to avoid stream overlap.
> How about using a counter-based random number generator like Philox? https://www.thesalmons.org/john/random123/
The Random123 paper is mentioned in JAX (another Google deep learning framework), which was brought to my attention here: https://github.com/mratsim/weave/issues/147#issuecomment-631864783
https://github.com/google/jax/blob/master/design_notes/prng.md
This is what I'm doing with my pair function + reseed. JAX focuses more on the splittable-RNG part, so I may have reinvented the counter RNG :P.
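For illustration only, here is one shape a "pair function + reseed" scheme can take. This assumes Szudzik's pairing function and a MurmurHash3-style finalizer; the names `szudzikPair` and `childSeed` are hypothetical and this is not necessarily Weave's actual scheme:

```cpp
#include <cstdint>

// "Pair function + reseed" sketch: derive a deterministic child seed from
// (parent seed, child index), then reseed the child's RNG with it, so every
// task gets a reproducible stream regardless of thread scheduling.
// Szudzik's pairing is injective for inputs below 2^32; the finalizer
// decorrelates nearby pairs. Illustrative only, not Weave's exact scheme.
uint64_t szudzikPair(uint64_t a, uint64_t b) {
    return a >= b ? a * a + a + b : b * b + a;
}

uint64_t childSeed(uint64_t parentSeed, uint64_t childIndex) {
    uint64_t s = szudzikPair(parentSeed & 0xFFFFFFFFull, childIndex);
    s ^= s >> 33; s *= 0xFF51AFD7ED558CCDull; // MurmurHash3 fmix64
    s ^= s >> 33; s *= 0xC4CEB9FE1A85EC53ull;
    return s ^ (s >> 33);
}
```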
> The probability of this happening can be very small, but it behaves like the birthday problem: you need a PRNG whose period is much longer than the number of random numbers actually consumed to avoid stream overlap.
The birthday paradox takes the square root of your probability: if you have 2^-256, the birthday paradox turns that into 2^-128. I use a 2^256 period in my RNGs because of this, and Weave also uses 2^256 for task scheduling.
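The standard birthday bound makes this concrete: for n independently seeded states in a state space of size N,

```latex
% Birthday bound: probability that two of n seeded states collide
% in a state space of size N
P(\text{collision}) \approx 1 - e^{-n(n-1)/(2N)} \approx \frac{n^2}{2N}
% With N = 2^{256} and n = 2^{64} streams: P \approx 2^{128} / 2^{257} = 2^{-129}.
% With N = 2^{128} and the same n:         P \approx 2^{128} / 2^{129} = 2^{-1}.
```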
Thank you for your suggestions, it's nice to see people interested in RNG quality in the community :)
I wanted to use Option[T], which was removed anyway because it made the code 5x slower
Doing exactly the same thing (but taking a Rust port as a reference), I also had the same problem with Option[T] as a return type being 3.5x slower than a (bool, T) tuple in some cases but not others. :(
Also, my threaded version was quite ugly and I was unable to use "parallel for". I will definitely give Weave a try. Thank you!