Hi,
Please see https://github.com/edin/raytracer
I think there are several benchmarks of this type, but in this one Nim doesn't place very well.
The comments say it is not optimized for Nim, but how can it be made quicker?
Just a question; I want to see if there are good and bad ways of doing things.
I use Nim at work for small projects and like the language a lot. It is fast enough for me, but this benchmark surprised me.
Best regards, Fabien
I didn't look at the code, but run.bat doesn't build with release optimizations:
nim c -r RayTracer.nim
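The usual fix is a single define; either of these builds with optimizations on (-d:danger additionally drops runtime checks):

nim c -r -d:release RayTracer.nim
nim c -r -d:danger RayTracer.nim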
I created this PR in the repository. https://github.com/edin/raytracer/pull/1
@mratsim made an efficient ray tracer implementation: https://github.com/mratsim/weave/tree/master/demos/raytracing
@Vindaar also made one : https://github.com/Vindaar/rayTracingInOneWeekend
You could take a look at how they did it to see what you can improve; their code is generally clean.
Here is what I looked at before I got bored: https://github.com/treeform/raytracer/blob/master/nim/story.md
Nim slower than C? How is that possible? Let's see.
CPU time [ms] 2018.0
Oh, Nim compiled in debug mode versus C with -O3... that will not do. Debug mode inserts huge stack-trace bookkeeping into every function call: much easier to debug, but so slow!
nim c -r -d:release
CPU time [ms] 188.0
WOW, a 10x improvement!
But you know what is even better than release mode? Danger mode. You've got to live dangerously!
nim c -r -d:danger
CPU time [ms] 196.0
Wait, danger mode is slower? Let's try running it a couple more times...
CPU time [ms] 198.0
CPU time [ms] 209.0
CPU time [ms] 242.0
CPU time [ms] 181.0
CPU time [ms] 195.0
Wow, there is so much variance in this test. You can't really know anything from a single run... bench, benchy? That's right, I wrote a benchmarking library exactly for this reason!
import benchy
Then let's put the work into a function. Did you know that code inside a function can be optimized better, because it is more isolated from global state?
proc main(): float =
  var t1 = cpuTime()
  var scene = CreateScene()
  var width = 500
  var height = 500
  var stride = width * 4
  var bitmapData = newSeq[RgbColor](width * height)
  RenderScene(scene, bitmapData, stride, width, height)
  var t2 = cpuTime()
  var diff = (t2 - t1) * 1000
  return diff
timeIt "ray trace":
  keep main()
name ............................... min time avg time std dv runs
ray trace ........................ 181.237 ms 191.066 ms ±10.801 x26
OK, now we can actually measure this. Let's see what VTune says the bottleneck is. Don't forget to add --debugger:native so that we get symbols in VTune.
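Something like this produces a danger build that still carries symbols for VTune (the same build as above plus the debugger switch):

nim c -d:danger --debugger:native RayTracer.nim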
Wow, ObjectIntersect... wait, what? Why are we setting result to something, just to clobber it again in the case branches?
proc ObjectIntersect(obj: Thing, ray: Ray): Intersection =
  result = Intersection(thing: nil, ray: ray, dist: 0) # <---- slow part
  case obj.objectType:
  of Sphere:
    ...
    result.thing = obj
    result.ray = ray
    result.dist = dist
  of Plane:
    ...
    result.thing = obj
    result.ray = ray
We can just not do that. By default, Nim initializes objects to all zeros anyway:
# result = Intersection(thing: nil, ray: ray, dist: 0)
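As a quick standalone aside (hypothetical Foo type, just to illustrate the default), a plain var really does start out zeroed:

type Foo = object
  a: int
  b: float

var f: Foo  # no initializer needed; fields are zeroed by default
echo f      # prints (a: 0, b: 0.0)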
Let's run the benchmark again:
name ............................... min time avg time std dv runs
ray trace ........................ 160.613 ms 164.691 ms ±6.493 x30
Wow, we saved 21 ms on that one line! That's huge. What's next, VTune? Fight me, bro!
ObjectIntersect is still at the top, but much better now. Which parts of ObjectIntersect are slow?
My fear is that those functions are not getting inlined properly. Let's throw {.inline.} in there. The rule: if a function is small enough and is called often enough, we can inline it.
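For example, on a small vector helper (hypothetical code, not the repo's actual functions), the pragma looks like this:

type Vector = object
  x, y, z: float

proc dot(a, b: Vector): float {.inline.} =
  # Hint the Nim compiler to emit this as an inline C function,
  # avoiding call overhead in the hot intersection loop.
  a.x * b.x + a.y * b.y + a.z * b.z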
name ............................... min time avg time std dv runs
ray trace ........................ 160.544 ms 162.430 ms ±3.049 x31
No change, I guess the compiler was smart enough to inline it all.
Let's try SIMD. We can just add --passC:"-march=native" and change nothing:
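Assuming the danger build from before, the full command is something like:

nim c -r -d:danger --passC:"-march=native" RayTracer.nim

--passC just forwards the flag to the backing C compiler, so GCC/Clang can auto-vectorize using whatever instructions the host CPU supports.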
name ............................... min time avg time std dv runs
ray trace ........................ 157.290 ms 164.497 ms ±8.950 x30
Oh great, a 3 ms win. Now we are faster than C. Great! Job done.
Next steps would be to review the algorithm, and maybe hand-roll the SIMD instructions. But I am happy with the speedups.
Why is it using float64 everywhere? This is computer graphics, not computational physics! Changing everything to use float32, I get:
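One low-churn way to make that switch (a sketch; it assumes the scalar type is aliased in one place, which the repo's code may not do):

type
  Float = float32  # flip back to float64 here for double precision
  Vector = object
    x, y, z: Float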
name ............................... min time avg time std dv runs
ray trace ........................ 137.195 ms 140.204 ms ±4.783 x36
Man, I totally forgot about --gc:arc. Adding that yields more speed:
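At this point the accumulated command line looks something like this (all standard Nim flags):

nim c -r -d:danger --gc:arc --passC:"-march=native" RayTracer.nim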
name ............................... min time avg time std dv runs
ray trace ........................ 110.665 ms 119.449 ms ±9.403 x41
I don't know what the rules are, but shouldn't you count all the included libraries in the SLOC?
I also think using float32 is 'cheating'
I love {.push noinit, checks: off.}
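For anyone who hasn't seen that pattern: {.push.} applies the listed pragmas to every declaration until the matching {.pop.}. A minimal sketch with a hypothetical helper:

{.push noinit, checks: off.}

proc lengthSq(x, y, z: float32): float32 =
  # Compiled with runtime checks off and without implicit zero-init.
  x * x + y * y + z * z

{.pop.}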
On my machine, with devel, running sh run.bat I get 100 ms for treeform/raytracer and 68 ms for edin/raytracer (after removing quakesqrt and intpow)
But I would still much rather use your vmath in real code
Could there be something platform- or compiler-dependent going on? I tried this on macOS 10.15.7 with Nim 1.4.4 and get 141 ms for Nim vs. 127 ms for C++.
I should perhaps note that compiling the Nim code with -d:lto gave "clang: error: invalid linker name in argument '-fuse-ld=lld'". Switching to gcc makes that work, but gives 120 ms for Nim, which doesn't match your claim that it takes less than half as much time.
(I didn't bother with C because the last time I checked that code was broken and didn't produce a correct image.)
@cantanima clang on Linux is slower than gcc with the upstream (edin) code. For me it's 78 ms (vs. 74 ms for C++ and 62 ms for Nim/gcc), but the error bars on those numbers are like ±10 ms.
@kcvinu the D command line is dmd RayTracer.d -m64 -O -inline -release -noboundscheck, and I'm getting 400 ms or so.
For anyone cloning the upstream: don't compile the Nim example with the included run.bat, it's not fair. nim r -d:lto -d:danger --passC:"-march=native" RayTracer.nim is a more appropriate comparison.
Why complain? On the repo, Nim (+GCC) is the fastest. Plain C is a bit slower, C++ ditto. Crystal is a surprise, almost as fast as C++.
I got 570 ms with -d:debug, 68 ms with release, and 62 ms with danger. Very nice.
"Why complain?"
If that's directed at me, I'm not complaining. I'm trying to understand. When someone publishes that his implementation is 50+% faster than C, as @treeform has in the GitHub repo, it's worth looking into the reasons.
More generally, it's not fair to compare different algorithms and then conclude that one language or compiler is faster! Some of that has gone on here; for instance, if your implementation uses Quake's inverse square root optimization and no one else's does, then the comparison is invalid. If you do that merely to say "it's easier to implement optimization O in language L," then fine, that's another matter; but one shouldn't implement optimization O and then claim that language L is faster, when no other language's implementation of the algorithm uses O.
(And of course when the output is plain incorrect, that says something too.)
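For readers who haven't seen it, the optimization in question is the famous Quake III fast inverse square root. A rough Nim transcription of the well-known trick (illustration only, not any repo's actual code):

proc quakeInvSqrt(x: float32): float32 =
  ## Approximates 1/sqrt(x) with a bit-level hack plus one Newton-Raphson step.
  let xhalf = 0.5'f32 * x
  var i = cast[uint32](x)            # reinterpret the float's bits as an integer
  i = 0x5f3759df'u32 - (i shr 1)     # the famous magic constant
  result = cast[float32](i)
  result = result * (1.5'f32 - xhalf * result * result)  # refine the guess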
DMD is the reference implementation of the D compiler; its backend does relatively little optimization, and it has the highest compilation speed. Two other compilers are built on its frontend: GDC (GCC backend) and LDC (LLVM backend). Obviously, DMD will be the slowest.
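So if someone redoes the D numbers, an LDC invocation along these lines should be a fairer comparison (untested sketch; LDC accepts DMD-style flags):

ldc2 -O3 -release -boundscheck=off RayTracer.d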
I'm surprised that PR was accepted, this comparison doesn't make any sense now.
I agree; I regret the oversight of leaving them turned on. You should submit a PR changing run.bat to something more appropriate, and edit the D command line while you're at it. Except the points are all made up, treeform's version is what one should emulate, and benchmarks are nonsense.