I stumbled across this project today: https://github.com/shish/rosettaboy
It is a Game Boy emulator implemented in a variety of languages (C++, Rust, Python, Zig, PHP, Cython and of course Nim). The benchmark results are included below; the Nim "speed" build is nimble --accept build -d:danger --opt:speed -d:lto --mm:arc --panics:on, and the fastest Nim build is almost twice as slow as the fastest Rust and C++ ones. It is using -d:danger correctly, but not LTO, which I think would need --passC:"-flto" --passL:"-flto" (but see https://github.com/nim-lang/Nim/issues/18608 - they are using a Mac). Also, --opt:speed is redundant after -d:danger AFAIK - could it actually hurt?
It doesn't say which Nim version was used; I'd guess 1.6.
Other than LTO (which is likely to give 10%-20%), does anyone have any insight into how to speed up the Nim version? A quick look at the Rust, C++ and Nim code shows that all of them are about equally idiomatic (none is perfectly idiomatic, but they are all more or less so).
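For reference, here is a minimal sketch of how those flags could be wired up explicitly from a config.nims (my own illustrative snippet, not something from the rosettaboy repo; it assumes a gcc/clang backend):

# config.nims - illustrative only
when defined(lto):            # activated by -d:lto on the command line
  switch("passC", "-flto")    # compile the generated C with LTO
  switch("passL", "-flto")    # and pass the flag at link time too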
$ ./utils/bench.py
rs / lto : Emulated 15763 frames in 10.00s (1576fps)
cpp / lto : Emulated 14737 frames in 10.00s (1474fps)
rs / release: Emulated 13183 frames in 10.00s (1318fps)
cpp / release: Emulated 12966 frames in 10.00s (1297fps)
zig / release: Emulated 8792 frames in 10.00s (879fps)
nim / speed : Emulated 8127 frames in 10.00s (812fps)
nim / release: Emulated 6161 frames in 10.00s (616fps)
cpp / debug : Emulated 5693 frames in 10.00s (569fps)
go / release: Emulated 5040 frames in 10.00s (504fps)
pxd / release: Emulated 3792 frames in 10.00s (379fps)
nim / debug : Emulated 1968 frames in 10.00s (196fps)
rs / debug : Emulated 1676 frames in 10.00s (168fps)
py / mypyc : Emulated 887 frames in 10.01s (89fps)
php / opcache: Emulated 613 frames in 10.01s (61fps)
php / release: Emulated 255 frames in 10.01s (25fps)
py / release: Emulated 101 frames in 10.06s (10fps)
zig / safe : Emulated 40 frames in 10.00s (4fps)
It is an unfortunate truism that those who put the most energy into many-programming-language benchmarks often have little expertise in optimizing more than about two of said languages (never mind the experience and/or patience for the many compilation modes usually available, and/or for more careful measurement). Indeed, it is often a language-learning experiment for the authors.
What is worse, sloppier writing & methodology yields bigger deltas (more surprise!), which then yield more attention. The initial splash of attention also gets far more readers than follow-on corrections. In some ways, it can often become almost "sincere trolling" (not quite the contradiction in terms it may seem).
All that can be done is to hope that any big splashes (often hard to predict) generate enough community effort to correct major errors. Perhaps, if you have the interest, you could re-architect the emulator to be more efficient? { The author would ideally accept your patches into his repo, but even that is far from certain (optimization is often an almost unbounded time sink, and sometimes originators shut down changes / discussion)... maybe at least a link to your repo could happen. }
Is that true? If so, frankly, that's a legit criticism.
I can say that I see some improvement, but I cannot guarantee it 100%, because single runs vary quite a bit from run to run. I also added two inlines (similar to what I found in the [rs] solution).
My current run: https://github.com/shish/rosettaboy/pull/117
CPU: i5-12600
cpp / debug : Emulated 7307 frames in 10.00s (731fps)
cpp / release: Emulated 18307 frames in 10.00s (1831fps)
cpp / lto : Emulated 20463 frames in 10.00s (2046fps)
nim / debug : Emulated 2560 frames in 10.00s (255fps)
nim / release: Emulated 15234 frames in 10.00s (1523fps)
nim / speed : Emulated 16137 frames in 10.00s (1613fps)
rs / debug : Emulated 1739 frames in 10.01s (174fps)
rs / release: Emulated 16828 frames in 10.00s (1683fps)
rs / lto : Emulated 18112 frames in 10.00s (1811fps)
Callgrind shows that the mem_get function is still not inlined, so I forced always_inline via codegenDecl:
proc get*(self: RAM, address: uint16): uint8 {.codegenDecl:"__attribute__((always_inline)) $# $#$#".} =
proc set*(self: RAM, address: uint16, val: uint8) {.codegenDecl:"__attribute__((always_inline)) $# $#$#".} =
cpp / lto : Emulated 20435 frames in 10.00s (2044fps)
nim / speed : Emulated 19210 frames in 10.00s (1920fps)
rs / lto : Emulated 18022 frames in 10.00s (1802fps)
But I cannot compile in debug/release mode, because the function body is not always available in the translation unit that calls it:
error: inlining failed in call to always_inline ‘get__ram_118’: function body not available
Probably someone more experienced can help with the inlining.
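One thing that might help in the non-LTO builds is Nim's standard {.inline.} pragma instead of the always_inline attribute: the generated C function should then be emitted in each module that uses it, so the backend can inline it even without LTO. A minimal sketch (the RAM type here is a simplified stand-in, not the real rosettaboy one):

# Simplified stand-in for rosettaboy's RAM, for illustration only.
type
  RAM* = ref object
    data*: array[0x10000, uint8]

# {.inline.} asks Nim to emit these procs so the C compiler can inline
# them in every module that calls them, without needing LTO.
proc get*(self: RAM, address: uint16): uint8 {.inline.} =
  self.data[address]

proc set*(self: RAM, address: uint16, val: uint8) {.inline.} =
  self.data[address] = val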
https://github.com/shish/rosettaboy/pull/117
cpp / debug : Emulated 7345 frames in 10.00s (734fps)
cpp / release: Emulated 18445 frames in 10.00s (1844fps)
cpp / lto : Emulated 20492 frames in 10.00s (2049fps)
nim / debug : Emulated 3863 frames in 10.00s (386fps)
nim / release: Emulated 16764 frames in 10.00s (1676fps)
nim / speed : Emulated 21339 frames in 10.00s (2133fps)
rs / debug : Emulated 1771 frames in 10.00s (177fps)
rs / release: Emulated 16852 frames in 10.00s (1685fps)
rs / lto : Emulated 18160 frames in 10.00s (1816fps)
I did not add -march=native, because it looks like the Rust build does not use it.
On a Linux Skylake i7-6700k, -march=native made Rust slower but made Nim with gcc-12 faster. Even on one platform, fiddling with backend flags or PGO often moves things by far more than the ±6% deltas you (now) see cross-language. { But good work! This is more as it should be all around. :-) And good luck with the repo owner & PR! }
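For anyone who wants to repeat that experiment, the CPU-specific flags can be gated behind a define in the same way as the LTO sketch above (again purely illustrative; the "native" define name is my own, not something the repo uses):

# config.nims - illustrative only
when defined(native):                # e.g. nim c -d:native ...
  switch("passC", "-march=native")   # tune the generated C for the host CPU
  switch("passL", "-march=native")   # relevant at link time when combined with -flto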