I was doing some testing, comparing Nim and Rust for some string operations, and I somehow managed to cut Nim's runtime in half by adding a call to my reverse function outside of the loop.
```nim
proc reverse(s: string): string =
  result = s
  let size = s.len
  for (i, c) in s.pairs():
    result[size - i - 1] = c

proc main() =
  let s = "Hello world"
  when defined(faster):
    discard reverse(s)
  for i in 1..200_000_000:
    let _ = s.reverse()

when isMainModule:
  main()
```
Rust:

```
$ time ./target/release/str_reverse

real    0m10.385s
user    0m10.362s
sys     0m0.019s
```

Nim:

```
$ nim c -d:release src/str_reverse.nim
$ time src/str_reverse

real    0m11.957s
user    0m11.777s
sys     0m0.049s
```

Nim with my custom flag:

```
$ nim c -d:release -d:faster src/str_reverse.nim
$ time src/str_reverse

real    0m6.325s
user    0m6.145s
sys     0m0.050s
```
Could someone explain why the speedup is occurring? Thanks!
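(The Rust side of the comparison was never posted. Purely as a guess at what it may have looked like, a reconstruction that, like the Nim proc, allocates a fresh string on every iteration could be:)

```rust
// Hypothetical reconstruction -- the actual Rust benchmark was not shown.
// Like the Nim version, this builds a brand-new String on every call.
fn reverse(s: &str) -> String {
    s.chars().rev().collect()
}

fn main() {
    let s = "Hello world";
    // The thread's benchmark used 200_000_000 iterations; reduced here.
    for _ in 0..1_000 {
        let _ = reverse(s);
    }
    assert_eq!(reverse(s), "dlrow olleH");
}
```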
Testing speed is really hard. I recommend something like [benchy](https://github.com/treeform/benchy) rather than the time utility.
Also, a sufficiently smart compiler can figure out that you are doing nothing and optimize your program out. What you are really testing is not which code is faster but how smart your compiler is, and how far it goes before it bails. Ultimately I think it's a bad test. In this case you are just testing the underlying C compiler, not even Nim.
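(This dead-code-elimination hazard is why benchmark harnesses route results through an optimization barrier, as benchy's keep() does. A minimal Rust sketch of the same idea, using std::hint::black_box:)

```rust
// std::hint::black_box tells the optimizer the value is observed,
// so the work inside the loop cannot be deleted as dead code.
use std::hint::black_box;

fn reverse(s: &str) -> String {
    s.chars().rev().collect()
}

fn main() {
    let s = "Hello world";
    for _ in 0..1_000 {
        // Without black_box, an optimizer is free to remove this loop.
        let r = reverse(black_box(s));
        black_box(r);
    }
}
```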
Nevertheless, I did paste your program into Compiler Explorer: https://godbolt.org/ . I even diffed the two versions, and they do produce different code, but nothing major jumps out at me.
But when I try it on Windows I get TotalSeconds : 21.0486674, while the version that does more work is twice as fast: TotalSeconds : 10.8038881.
My only thought is that the extra discard reverse() gives the compiler a hint that reverse() can be optimized out, and that it then optimizes some but not all reverse() calls? But on the other hand that doesn't hold up: https://godbolt.org/ shows way more code for the faster case...
I would say it's a mystery, but not one I want to keep chasing...
Very much seems like a compiler edge case. Here is a benchy benchmark:
```nim
import benchy

proc reverse(s: string): string =
  result = s
  let size = s.len
  for (i, c) in s.pairs():
    result[size - i - 1] = c

let s = "Hello world"

timeIt "With Discard":
  discard reverse(s)
  for i in 1..200_000:
    let x = s.reverse()
    keep(x)

timeIt "No Discard":
  for i in 1..200_000:
    let x = s.reverse()
    keep(x)
```
```
   min time    avg time   std dv   runs name
  44.975 ms   47.967 ms   ±1.599   x104 With Discard
  45.317 ms   48.030 ms   ±1.566   x104 No Discard
```
```
$ nim c -d:danger -d:lto --listcmd reverse.nim
......................................................................
$ time ./reverse

real    0m31.493s
user    0m31.491s
sys     0m0.001s
$ nim c -d:danger -d:lto --listcmd -d:faster reverse.nim
......................................................................
$ time ./reverse

real    0m3.487s
user    0m3.485s
sys     0m0.002s
```
The -d:lto flag makes the code much faster.
If you really want to know why, you need to read the Nim-generated C code and the assembly code generated by the backend C compiler. Nim generates assembly code with the --asm option, but if you use the -d:lto flag, see: https://internet-of-tomohiro.netlify.app/nim/faq.en.html#nim-compiler-how-to-produce-assembler-codeqmark
You can get simpler assembly code with -d:danger, as it removes runtime checks.
https://godbolt.org also shows assembly code, but it is less readable because it doesn't show which function each call instruction calls, and it is hard to see where jmp instructions jump to.
Surprised no one has yet mentioned:
```nim
import std/algorithm

proc main =
  var s = "Hello world"
  when defined(faster): reverse(s, 0, s.len - 1)
  for i in 1..200_000_000: reverse(s, 0, s.len - 1)

main()
```
which sidesteps all the allocation stuff. Someone coming from Rust / new to Nim might be unaware of var vs. let, or that this even was an allocation benchmark, as @treeform observes. As with any measurement, it depends on what the point of said measurement is. Not having shown his Rust, I suspect it was not actually intended to measure alloc/dealloc cycles, which Rust folk tend to avoid. So, the above might help @walkr.
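(The allocation-free shape looks much the same on the Rust side. As a sketch, not the OP's code: because "Hello world" is pure ASCII, an in-place byte reversal in a mutable buffer avoids any per-iteration allocation, analogous to std/algorithm's reverse on a var string:)

```rust
fn main() {
    // In-place byte reversal, no heap allocation per iteration.
    // Byte-wise reversal is only correct here because the test
    // string "Hello world" is pure ASCII (one byte per character).
    let mut buf = *b"Hello world";
    buf.reverse();
    assert_eq!(&buf, b"dlrow olleH");
}
```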
FWIW, at least my backend gcc-13.2 compiler is NOT smart enough to optimize out the work, even with LTO & PGO (which do make that program about 2.2X faster, incidentally, down to about 0.316 seconds on an otherwise idle i7-6700k, in a not atypical pattern of speed-up @Yardanico and I tend to observe; he could try clang if he likes :-) ). It is a good point to check that, though. Also, you may do much better with --passC:-march=native, since my perf report shows vpshufb (i.e. AVX2) in the inner loop; that netted 1.6X for me. { Also, tim is more careful about "reproduction" / "meaningfulness of ±" on time-sharing OSes (i.e. almost all of 'em) than benchy currently is, but both times & deltas here are big enough that timing is not subtle. }
As far as that allocation stuff goes, it seems like what is going on is that the extra call before the loop suppresses a destructor call (in both Nim-1.6 and Nim-2.0/Nim-devel). Compiling without and then with -d:faster and diffing the ~/.cache/nim files got me this (I just moved the first version's ~/.cache/nim/rWalkr_r to ~/.cache/nim/rWalkr_r0):
```diff
diff -Nu rWalkr_r0/@mrWalkr.nim.c rWalkr_r/@mrWalkr.nim.c
--- rWalkr_r0/@mrWalkr.nim.c    2023-08-08 06:08:32.887747209 -0400
+++ rWalkr_r/@mrWalkr.nim.c     2023-08-08 06:08:50.344965932 -0400
@@ -122,9 +122,20 @@
 #line 6 "/tmp/k/rWalkr.nim"
 N_LIB_PRIVATE N_NIMCALL(void, main__r87alkr_u7)(void) {
 #line 7
-NimStringV2 s;NIM_BOOL* nimErr_;{nimErr_ = nimErrorFlag(); s.len = 0; s.p = NIM_NIL;// let s = "Hello world"
+NimStringV2 s;
+#line 8
+NimStringV2 colontmpD_;NIM_BOOL* nimErr_;{nimErr_ = nimErrorFlag(); s.len = 0; s.p = NIM_NIL; colontmpD_.len = 0; colontmpD_.p = NIM_NIL;// let s = "Hello world"
+
+#line 7
 // let s = "Hello world"
- s = TM__3QhDSXtO082ujkyrhqLedQ_3; {
+ s = TM__3QhDSXtO082ujkyrhqLedQ_3;// when defined(faster): discard reverse(s)
+
+#line 8
+// when defined(faster): discard reverse(s)
+// when defined(faster): discard reverse(s)
+// when defined(faster): discard reverse(s)
+ colontmpD_ = reverse__r87alkr_u1(s); if (NIM_UNLIKELY(*nimErr_)) goto BeforeRet_; (void)(colontmpD_);
+ {
 #line 9
 NI i;
 #line 90 "/usr/lib/nim/lib/system/iterators_1.nim"
@@ -153,7 +164,8 @@
 if (_.p && !(_.p->cap & NIM_STRLIT_FLAG)) { dealloc(_.p);} } LA3: ;
 }
 }
-}BeforeRet_: ;
+// proc `=destroy`*(x: string) {.inline, magic: "Destroy".} =
+ if (colontmpD_.p && !(colontmpD_.p->cap & NIM_STRLIT_FLAG)) { dealloc(colontmpD_.p);} }BeforeRet_: ;
 }
 N_LIB_PRIVATE void PreMainInner(void) {
```
nim c --mm:arc --expandArc:reverse does not show that =destroy either with or without -d:faster { not sure if it is supposed to show every =op.. }. So, there may be a real issue or two here...
```nim
let s = "Thank you all for your inputs! (:"
```
Since compiling modern programming languages involves so many moving parts, there's lots of useful information in here to learn from.
I'm not going to be spending much more time on this microbenchmark, but it's good to know now which layers to look under.