I was doing some testing, comparing Nim and Rust for some string operations, and I somehow managed to cut Nim's runtime in half by adding a call to my reverse function outside of the loop.
```nim
proc reverse(s: string): string =
  result = s
  let size = s.len
  for (i, c) in s.pairs():
    result[size - i - 1] = c

proc main() =
  let s = "Hello world"
  when defined(faster):
    discard reverse(s)
  for i in 1..200_000_000:
    let _ = s.reverse()

when isMainModule:
  main()
```
Rust:

```
$ time ./target/release/str_reverse

real    0m10.385s
user    0m10.362s
sys     0m0.019s
```

Nim:

```
$ nim c -d:release src/str_reverse.nim
$ time src/str_reverse

real    0m11.957s
user    0m11.777s
sys     0m0.049s
```

Nim with my custom flag:

```
$ nim c -d:release -d:faster src/str_reverse.nim
$ time src/str_reverse

real    0m6.325s
user    0m6.145s
sys     0m0.050s
```
Could someone explain why the speedup is occurring? Thanks!
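(The Rust side of the comparison was never posted. Purely as a guess at what it may have looked like, a reconstruction that, like the Nim proc, allocates a fresh string on every iteration could be:)

```rust
// Hypothetical reconstruction -- the actual Rust benchmark was not shown.
// Like the Nim version, this builds a brand-new String on every call.
fn reverse(s: &str) -> String {
    s.chars().rev().collect()
}

fn main() {
    let s = "Hello world";
    // The thread's benchmark used 200_000_000 iterations; reduced here.
    for _ in 0..1_000 {
        let _ = reverse(s);
    }
    assert_eq!(reverse(s), "dlrow olleH");
}
```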
Testing speed is really hard. I recommend something like [benchy](https://github.com/treeform/benchy) rather than the time utility.
Also, a sufficiently smart compiler can figure out that you are doing nothing and optimize your program out. What you are really testing is not which code is faster but how smart your compiler is, and how far it goes before it bails. Ultimately I think it's a bad test. In this case you are just testing the underlying C compiler, not even Nim.
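(This dead-code-elimination hazard is why benchmark harnesses route results through an optimization barrier, as benchy's keep() does. A minimal Rust sketch of the same idea, using std::hint::black_box:)

```rust
// std::hint::black_box tells the optimizer the value is observed,
// so the work inside the loop cannot be deleted as dead code.
use std::hint::black_box;

fn reverse(s: &str) -> String {
    s.chars().rev().collect()
}

fn main() {
    let s = "Hello world";
    for _ in 0..1_000 {
        // Without black_box, an optimizer is free to remove this loop.
        let r = reverse(black_box(s));
        black_box(r);
    }
}
```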
Nevertheless, I did paste your program into Compiler Explorer: https://godbolt.org/ . I even diffed the two versions, and they do produce different code, but nothing major jumps out at me.
But when I try it on Windows I get TotalSeconds : 21.0486674, while the version that does more work is twice as fast: TotalSeconds : 10.8038881.
My only thought is that the extra discard reverse() gives the compiler a hint that reverse() can be optimized out, and that it then optimizes some but not all reverse() calls? But on the other hand that doesn't hold up: https://godbolt.org/ shows way more code for the faster case...
I would say it's a mystery, but not one I want to keep chasing...
Very much seems like a compiler edge case. Here is a benchy benchmark:
```nim
import benchy

proc reverse(s: string): string =
  result = s
  let size = s.len
  for (i, c) in s.pairs():
    result[size - i - 1] = c

let s = "Hello world"

timeIt "With Discard":
  discard reverse(s)
  for i in 1..200_000:
    let x = s.reverse()
    keep(x)

timeIt "No Discard":
  for i in 1..200_000:
    let x = s.reverse()
    keep(x)
```
```
   min time    avg time   std dv   runs name
  44.975 ms   47.967 ms   ±1.599   x104 With Discard
  45.317 ms   48.030 ms   ±1.566   x104 No Discard
```
```
$ nim c -d:danger -d:lto --listcmd reverse.nim
......................................................................
$ time ./reverse

real    0m31.493s
user    0m31.491s
sys     0m0.001s
$ nim c -d:danger -d:lto --listcmd -d:faster reverse.nim
......................................................................
$ time ./reverse

real    0m3.487s
user    0m3.485s
sys     0m0.002s
```
The -d:lto flag makes the code much faster.
If you really want to know why, you need to read the Nim-generated C code and the assembly code generated by the backend C compiler. Nim generates assembly code with the --asm option, but if you use the -d:lto flag, see: https://internet-of-tomohiro.netlify.app/nim/faq.en.html#nim-compiler-how-to-produce-assembler-codeqmark
You can get simpler assembly code with -d:danger, as it removes runtime checks.
https://godbolt.org also shows assembly code, but it is less readable because it doesn't show which function each call instruction calls, and it is hard to see where jmp instructions jump to.
Surprised no one has yet mentioned:
```nim
import std/algorithm

proc main =
  var s = "Hello world"
  when defined(faster): reverse(s, 0, s.len - 1)
  for i in 1..200_000_000: reverse(s, 0, s.len - 1)

main()
```
which sidesteps all the allocation stuff. Someone coming from Rust / new to Nim might be unaware of var vs. let, or that this even was an allocation benchmark, as @treeform observes. As with any measurement, it depends on what the point of said measurement is. Not having shown his Rust, I suspect it was not actually intended to measure alloc/dealloc cycles, which Rust folk tend to avoid. So, the above might help @walkr.
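(The allocation-free shape looks much the same on the Rust side. As a sketch, not the OP's code: because "Hello world" is pure ASCII, an in-place byte reversal in a mutable buffer avoids any per-iteration allocation, analogous to std/algorithm's reverse on a var string:)

```rust
fn main() {
    // In-place byte reversal, no heap allocation per iteration.
    // Byte-wise reversal is only correct here because the test
    // string "Hello world" is pure ASCII (one byte per character).
    let mut buf = *b"Hello world";
    buf.reverse();
    assert_eq!(&buf, b"dlrow olleH");
}
```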
FWIW, at least my backend gcc-13.2 compiler is NOT smart enough to optimize out the work, even with LTO & PGO (which do make that program about 2.2X faster, incidentally, down to about 0.316 seconds on an otherwise idle i7-6700k, in a not atypical pattern of speed-up @Yardanico and I tend to observe; he could try clang if he likes :-) ). It is a good point to check that, though. Also, you may do much better with --passC:-march=native, since my perf report shows vpshufb (i.e. AVX2) in the inner loop; that netted 1.6X for me. { Also, tim is more careful about "reproduction" / "meaningfulness of ±" on time-sharing OSes (i.e. almost all of 'em) than benchy currently is, but both times & deltas here are big enough that timing is not subtle. }
As far as that allocation stuff goes, it seems like what is going on is that the extra call before the loop suppresses a destructor call (in both Nim-1.6 and Nim-2.0/Nim-devel). Compiling without and then with -d:faster and diffing the ~/.cache/nim files got me this (I just moved the first version's ~/.cache/nim/rWalkr_r to ~/.cache/nim/rWalkr_r0):
```diff
diff -Nu rWalkr_r0/@mrWalkr.nim.c rWalkr_r/@mrWalkr.nim.c
--- rWalkr_r0/@mrWalkr.nim.c    2023-08-08 06:08:32.887747209 -0400
+++ rWalkr_r/@mrWalkr.nim.c     2023-08-08 06:08:50.344965932 -0400
@@ -122,9 +122,20 @@
 #line 6 "/tmp/k/rWalkr.nim"
 N_LIB_PRIVATE N_NIMCALL(void, main__r87alkr_u7)(void) {
 #line 7
-NimStringV2 s;NIM_BOOL* nimErr_;{nimErr_ = nimErrorFlag(); s.len = 0; s.p = NIM_NIL;// let s = "Hello world"
+NimStringV2 s;
+#line 8
+NimStringV2 colontmpD_;NIM_BOOL* nimErr_;{nimErr_ = nimErrorFlag(); s.len = 0; s.p = NIM_NIL; colontmpD_.len = 0; colontmpD_.p = NIM_NIL;// let s = "Hello world"
+
+#line 7
 // let s = "Hello world"
- s = TM__3QhDSXtO082ujkyrhqLedQ_3; {
+ s = TM__3QhDSXtO082ujkyrhqLedQ_3;// when defined(faster): discard reverse(s)
+
+#line 8
+// when defined(faster): discard reverse(s)
+// when defined(faster): discard reverse(s)
+// when defined(faster): discard reverse(s)
+ colontmpD_ = reverse__r87alkr_u1(s); if (NIM_UNLIKELY(*nimErr_)) goto BeforeRet_; (void)(colontmpD_);
+ {
 #line 9
 NI i;
 #line 90 "/usr/lib/nim/lib/system/iterators_1.nim"
@@ -153,7 +164,8 @@
 if (_.p && !(_.p->cap & NIM_STRLIT_FLAG)) { dealloc(_.p);} } LA3: ;
 }
 }
-}BeforeRet_: ;
+// proc `=destroy`*(x: string) {.inline, magic: "Destroy".} =
+ if (colontmpD_.p && !(colontmpD_.p->cap & NIM_STRLIT_FLAG)) { dealloc(colontmpD_.p);} }BeforeRet_: ;
 }
 N_LIB_PRIVATE void PreMainInner(void) {
```
nim c --mm:arc --expandArc:reverse does not show that =destroy either with or without -d:faster { not sure if it is supposed to show every =op.. }. So, there may be a real issue or two here...
```nim
let s = "Thank you all for your inputs! (:"
```
Since compiling modern programming languages involves so many moving parts, there's lots of useful information in here to learn from.
I'm not going to be spending much more time on this microbenchmark, but it's good to know now which layers to look under.