Hi,
Please see https://github.com/edin/raytracer
I think there are several benchmarks of this type, but in this one Nim doesn't place very well.
The comments say it is not optimized for Nim, but how can it be made quicker?
Just a question; I want to see if there are good and bad ways of doing things.
I use Nim at work for small projects and like the language a lot. It is fast enough for me, but this benchmark surprised me.
Best regards, Fabien
I didn't look at the code, but run.bat doesn't build with release optimizations:
nim c -r RayTracer.nim
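The usual fix is a single define; either of these builds with optimizations on (-d:danger additionally drops runtime checks):

nim c -r -d:release RayTracer.nim
nim c -r -d:danger RayTracer.nim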
I created this PR in the repository. https://github.com/edin/raytracer/pull/1
@mratsim made an efficient ray tracer implementation: https://github.com/mratsim/weave/tree/master/demos/raytracing
@Vindaar also made one : https://github.com/Vindaar/rayTracingInOneWeekend
You could take a look at how they did it to see what you can improve; their code is generally clean.
Here is what I looked at before I got bored: https://github.com/treeform/raytracer/blob/master/nim/story.md
Nim slower than C? How is that possible? Let's see.
CPU time [ms] 2018.0
Oh, Nim compiled in debug mode versus C with -O3... that will not do. Debug mode inserts huge stack-trace bookkeeping into every function call: much easier to debug, but so slow!
nim c -r -d:release
CPU time [ms] 188.0
WOW, a 10x improvement!
But you know what is even better than release mode? Danger mode. You've got to live dangerously!
nim c -r -d:danger
CPU time [ms] 196.0
Wait, danger mode is slower? Let's try running it a couple more times...
CPU time [ms] 198.0
CPU time [ms] 209.0
CPU time [ms] 242.0
CPU time [ms] 181.0
CPU time [ms] 195.0
Wow, there is so much variance in this test. You can't really know anything from a single run... bench, benchy? That's right, I wrote a benchmarking library exactly for this reason!
import benchy
Then let's put the work into a function. Did you know that code inside a function can be optimized better, because it is more isolated from global state?
proc main(): float =
  var t1 = cpuTime()
  var scene = CreateScene()
  var width = 500
  var height = 500
  var stride = width * 4
  var bitmapData = newSeq[RgbColor](width * height)
  RenderScene(scene, bitmapData, stride, width, height)
  var t2 = cpuTime()
  var diff = (t2 - t1) * 1000
  return diff
timeIt "ray trace":
  keep main()
name ............................... min time avg time std dv runs
ray trace ........................ 181.237 ms 191.066 ms ±10.801 x26
OK, now we can actually measure this. Let's see what VTune says the bottleneck is. Don't forget to add --debugger:native so that we get symbols in VTune.
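Something like this produces a danger build that still carries symbols for VTune (the same build as above plus the debugger switch):

nim c -d:danger --debugger:native RayTracer.nim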
Wow, ObjectIntersect... wait, what? Why are we setting result to something, just to clobber it again in the case branches?
proc ObjectIntersect(obj: Thing, ray: Ray): Intersection =
  result = Intersection(thing: nil, ray: ray, dist: 0) # <---- slow part
  case obj.objectType:
  of Sphere:
    ...
    result.thing = obj
    result.ray = ray
    result.dist = dist
  of Plane:
    ...
    result.thing = obj
    result.ray = ray
We can just not do that. By default, Nim initializes objects to all zeros anyway:
# result = Intersection(thing: nil, ray: ray, dist: 0)
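As a quick standalone aside (hypothetical Foo type, just to illustrate the default), a plain var really does start out zeroed:

type Foo = object
  a: int
  b: float

var f: Foo  # no initializer needed; fields are zeroed by default
echo f      # prints (a: 0, b: 0.0)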
Let's run the benchmark again:
name ............................... min time avg time std dv runs
ray trace ........................ 160.613 ms 164.691 ms ±6.493 x30
Wow, we saved 21 ms on that one line! That's huge. What's next, VTune? Fight me, bro!
ObjectIntersect is still at the top, but much better now. Which parts of ObjectIntersect are slow?
My fear is that those functions are not getting inlined properly. Let's throw {.inline.} in there. The rule: if a function is small enough and is called often enough, we can inline it.
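For example, on a small vector helper (hypothetical code, not the repo's actual functions), the pragma looks like this:

type Vector = object
  x, y, z: float

proc dot(a, b: Vector): float {.inline.} =
  # Hint the Nim compiler to emit this as an inline C function,
  # avoiding call overhead in the hot intersection loop.
  a.x * b.x + a.y * b.y + a.z * b.z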
name ............................... min time avg time std dv runs
ray trace ........................ 160.544 ms 162.430 ms ±3.049 x31
No change, I guess the compiler was smart enough to inline it all.
Let's try SIMD. We can just add --passC:"-march=native" and change nothing:
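Assuming the danger build from before, the full command is something like:

nim c -r -d:danger --passC:"-march=native" RayTracer.nim

--passC just forwards the flag to the backing C compiler, so GCC/Clang can auto-vectorize using whatever instructions the host CPU supports.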
name ............................... min time avg time std dv runs
ray trace ........................ 157.290 ms 164.497 ms ±8.950 x30
Oh great, a 3 ms win. Now we are faster than C. Great! Job done.
Next steps would be to review the algorithm, and maybe hand-roll the SIMD instructions. But I am happy with the speedups.
Why is it using float64 everywhere? This is computer graphics, not computational physics! Changing everything to use float32, I get:
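One low-churn way to make that switch (a sketch; it assumes the scalar type is aliased in one place, which the repo's code may not do):

type
  Float = float32  # flip back to float64 here for double precision
  Vector = object
    x, y, z: Float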
name ............................... min time avg time std dv runs
ray trace ........................ 137.195 ms 140.204 ms ±4.783 x36
Man, I totally forgot about --gc:arc. Adding that yields more speed:
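At this point the accumulated command line looks something like this (all standard Nim flags):

nim c -r -d:danger --gc:arc --passC:"-march=native" RayTracer.nim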
name ............................... min time avg time std dv runs
ray trace ........................ 110.665 ms 119.449 ms ±9.403 x41
I don't know what the rules are, but shouldn't you count all the included libraries in the SLOC?
I also think using float32 is 'cheating'
I love {.push noinit, checks: off.}
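For anyone who hasn't seen that pattern: {.push.} applies the listed pragmas to every declaration until the matching {.pop.}. A minimal sketch with a hypothetical helper:

{.push noinit, checks: off.}

proc lengthSq(x, y, z: float32): float32 =
  # Compiled with runtime checks off and without implicit zero-init.
  x * x + y * y + z * z

{.pop.}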
On my machine, with devel, running sh run.bat I get 100 ms for treeform/raytracer and 68 ms for edin/raytracer (after removing quakesqrt and intpow)
But I would still much rather use your vmath in real code
Could there be something platform- or compiler-dependent going on? I tried this on macOS 10.15.7 with Nim 1.4.4 and get 141 ms for Nim vs. 127 ms for C++.
I should perhaps note that compiling the Nim code with -d:lto gave "clang: error: invalid linker name in argument '-fuse-ld=lld'". Switching to gcc makes that work, but gives 120 ms for Nim, which doesn't match your claim that it takes less than half as much time.
(I didn't bother with C because the last time I checked that code was broken and didn't produce a correct image.)
@cantanima clang on Linux is slower than gcc with the upstream (edin) code. For me it's 78 ms (vs. 74 ms for C++ and 62 ms for Nim/gcc), but the error bars on those numbers are like ±10 ms.
@kcvinu the D command line is dmd RayTracer.d -m64 -O -inline -release -noboundscheck, and I'm getting 400 ms or so.
For anyone cloning the upstream: don't compile the Nim example with the included run.bat, it's not fair. nim r -d:lto -d:danger --passC:"-march=native" RayTracer.nim is a more appropriate comparison.
Why complain? On the repo, Nim (+GCC) is the fastest. Plain C is a bit slower, C++ ditto. Crystal is a surprise, almost as fast as C++.
I got 570 ms with -d:debug, 68 ms with release, and 62 ms with danger. Very nice.
"Why complain?"
If that's directed at me, I'm not complaining. I'm trying to understand. When someone publishes that his implementation is 50+% faster than C, as @treeform has in the GitHub repo, it's worth looking into the reasons.
More generally, it's not fair to compare different algorithms and then conclude that one language or compiler is faster! Some of that has gone on here; for instance, if your implementation uses Quake's inverse square root optimization and no one else's does, then the comparison is invalid. If you do that merely to say "it's easier to implement optimization O in language L," then fine, that's another matter; but one shouldn't implement optimization O and then claim that language L is faster, when no other language's implementation of the algorithm uses O.
(And of course when the output is plain incorrect, that says something too.)
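For readers who haven't seen it, the optimization in question is the famous Quake III fast inverse square root. A rough Nim transcription of the well-known trick (illustration only, not any repo's actual code):

proc quakeInvSqrt(x: float32): float32 =
  ## Approximates 1/sqrt(x) with a bit-level hack plus one Newton-Raphson step.
  let xhalf = 0.5'f32 * x
  var i = cast[uint32](x)            # reinterpret the float's bits as an integer
  i = 0x5f3759df'u32 - (i shr 1)     # the famous magic constant
  result = cast[float32](i)
  result = result * (1.5'f32 - xhalf * result * result)  # refine the guess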
DMD is the reference implementation of the D compiler; its backend does relatively little optimization, and it has the highest compilation speed. Two other compilers are built on its frontend: GDC (GCC backend) and LDC (LLVM backend). Obviously, DMD will be the slowest.
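So if someone redoes the D numbers, an LDC invocation along these lines should be a fairer comparison (untested sketch; LDC accepts DMD-style flags):

ldc2 -O3 -release -boundscheck=off RayTracer.d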
I'm surprised that PR was accepted, this comparison doesn't make any sense now.
I agree; I regret the oversight of leaving them turned on. You should submit a PR changing run.bat to something more appropriate, and edit the D command line while you're at it. Except the points are all made up, treeform's version is what one should emulate, and benchmarks are nonsense.