I stumbled across this project today: https://github.com/shish/rosettaboy
It is a Game Boy emulator implemented in a variety of languages (C++, Rust, Python, Zig, PHP, Cython and of course Nim). The benchmark results are included below; the Nim "speed" build is nimble --accept build -d:danger --opt:speed -d:lto --mm:arc --panics:on, and the fastest Nim build is almost twice as slow as the fastest Rust and C++ ones. It is using -d:danger correctly, but not LTO, which I think would need --passC:"-flto" --passL:"-flto" (but see https://github.com/nim-lang/Nim/issues/18608 - they are using a Mac). Also, --opt:speed is redundant after -d:danger AFAIK - could it actually hurt?
It doesn't say which Nim version was used; I'd guess 1.6.
Other than LTO (which is likely to give 10%-20%), does anyone have any insight into how to speed up the Nim version? A quick look at the Rust, C++ and Nim code shows that all of them are about equally idiomatic (none is perfectly idiomatic, but they are all more or less so).
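For reference, here is a minimal sketch of how those flags could be wired up explicitly from a config.nims (my own illustrative snippet, not something from the rosettaboy repo; it assumes a gcc/clang backend):

# config.nims - illustrative only
when defined(lto):            # activated by -d:lto on the command line
  switch("passC", "-flto")    # compile the generated C with LTO
  switch("passL", "-flto")    # and pass the flag at link time too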
$ ./utils/bench.py
rs / lto : Emulated 15763 frames in 10.00s (1576fps)
cpp / lto : Emulated 14737 frames in 10.00s (1474fps)
rs / release: Emulated 13183 frames in 10.00s (1318fps)
cpp / release: Emulated 12966 frames in 10.00s (1297fps)
zig / release: Emulated 8792 frames in 10.00s (879fps)
nim / speed : Emulated 8127 frames in 10.00s (812fps)
nim / release: Emulated 6161 frames in 10.00s (616fps)
cpp / debug : Emulated 5693 frames in 10.00s (569fps)
go / release: Emulated 5040 frames in 10.00s (504fps)
pxd / release: Emulated 3792 frames in 10.00s (379fps)
nim / debug : Emulated 1968 frames in 10.00s (196fps)
rs / debug : Emulated 1676 frames in 10.00s (168fps)
py / mypyc : Emulated 887 frames in 10.01s (89fps)
php / opcache: Emulated 613 frames in 10.01s (61fps)
php / release: Emulated 255 frames in 10.01s (25fps)
py / release: Emulated 101 frames in 10.06s (10fps)
zig / safe : Emulated 40 frames in 10.00s (4fps)
It is an unfortunate truism that those who put the most energy into many-programming-language benchmarks often have little expertise in optimizing more than about two of said languages (never mind the experience and/or patience for the many compilation modes usually available, and/or for more careful measurement). Indeed, it is often a language-learning experiment for the authors.
What is worse, sloppier writing & methodology yields bigger deltas (more surprise!), which then yield more attention. The initial splash of attention also gets far more readers than follow-on corrections. In some ways, it can often become almost "sincere trolling" (not quite the contradiction in terms it may seem).
All that can be done is to hope that any big splashes (often hard to predict) generate enough community effort to correct major errors. Perhaps, if you have the interest, you could re-architect the emulator to be more efficient? { The author would ideally accept your patches into his repo, but even that is far from certain (optimization is often an almost unbounded time sink, and sometimes originators shut down changes / discussion)... maybe at least a link to your repo could happen. }
Is that true? If so, frankly, that's a legit criticism.
I can say that I see some improvement, but I cannot guarantee it 100%, because single runs vary quite a bit from run to run. I also added two inlines (similar to what I found in the [rs] solution).
My current run: https://github.com/shish/rosettaboy/pull/117
CPU: i5-12600
cpp / debug : Emulated 7307 frames in 10.00s (731fps)
cpp / release: Emulated 18307 frames in 10.00s (1831fps)
cpp / lto : Emulated 20463 frames in 10.00s (2046fps)
nim / debug : Emulated 2560 frames in 10.00s (255fps)
nim / release: Emulated 15234 frames in 10.00s (1523fps)
nim / speed : Emulated 16137 frames in 10.00s (1613fps)
rs / debug : Emulated 1739 frames in 10.01s (174fps)
rs / release: Emulated 16828 frames in 10.00s (1683fps)
rs / lto : Emulated 18112 frames in 10.00s (1811fps)
Callgrind shows that the mem_get function is still not inlined, so I forced always_inline via codegenDecl:
proc get*(self: RAM, address: uint16): uint8 {.codegenDecl:"__attribute__((always_inline)) $# $#$#".} =
proc set*(self: RAM, address: uint16, val: uint8) {.codegenDecl:"__attribute__((always_inline)) $# $#$#".} =
cpp / lto : Emulated 20435 frames in 10.00s (2044fps)
nim / speed : Emulated 19210 frames in 10.00s (1920fps)
rs / lto : Emulated 18022 frames in 10.00s (1802fps)
But I cannot compile in debug/release mode, because the function body is not always available in the translation unit that calls it:
error: inlining failed in call to always_inline ‘get__ram_118’: function body not available
Probably someone more experienced can help with the inlining.
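One thing that might help in the non-LTO builds is Nim's standard {.inline.} pragma instead of the always_inline attribute: the generated C function should then be emitted in each module that uses it, so the backend can inline it even without LTO. A minimal sketch (the RAM type here is a simplified stand-in, not the real rosettaboy one):

# Simplified stand-in for rosettaboy's RAM, for illustration only.
type
  RAM* = ref object
    data*: array[0x10000, uint8]

# {.inline.} asks Nim to emit these procs so the C compiler can inline
# them in every module that calls them, without needing LTO.
proc get*(self: RAM, address: uint16): uint8 {.inline.} =
  self.data[address]

proc set*(self: RAM, address: uint16, val: uint8) {.inline.} =
  self.data[address] = val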
https://github.com/shish/rosettaboy/pull/117
cpp / debug : Emulated 7345 frames in 10.00s (734fps)
cpp / release: Emulated 18445 frames in 10.00s (1844fps)
cpp / lto : Emulated 20492 frames in 10.00s (2049fps)
nim / debug : Emulated 3863 frames in 10.00s (386fps)
nim / release: Emulated 16764 frames in 10.00s (1676fps)
nim / speed : Emulated 21339 frames in 10.00s (2133fps)
rs / debug : Emulated 1771 frames in 10.00s (177fps)
rs / release: Emulated 16852 frames in 10.00s (1685fps)
rs / lto : Emulated 18160 frames in 10.00s (1816fps)
I did not add -march=native, because it looks like the Rust build does not use it.
On a Linux Skylake i7-6700k, -march=native made Rust slower but made Nim with gcc-12 faster. Even on one platform, fiddling with backend flags or PGO often moves things by far more than the ±6% deltas you (now) see cross-language. { But good work! This is more as it should be all around. :-) And good luck with the repo owner & PR! }
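For anyone who wants to repeat that experiment, the CPU-specific flags can be gated behind a define in the same way as the LTO sketch above (again purely illustrative; the "native" define name is my own, not something the repo uses):

# config.nims - illustrative only
when defined(native):                # e.g. nim c -d:native ...
  switch("passC", "-march=native")   # tune the generated C for the host CPU
  switch("passL", "-march=native")   # relevant at link time when combined with -flto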