Hi, before I start I just want to preface this question by saying that this post is in no way, shape, or form a jab at Nim. I'm simply trying to better understand this so I can learn something. Thanks!
The code for the Nim and Rust versions, as well as the disassembly, is here:
As you can tell, this is (slightly modified) code from the Nim Profiling article.
CPU: AMD Ryzen 2700X
OS: Windows 10 64-bit
Compiler flags:
Nim (GCC)
nim c -d:danger -d:release --gc:none perfnim.nim
Rust
rustc -O perfrust.rs
Running both executables on my machine, I get the following results:
Nim
> Measure-Command { ./perfnim.exe }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 31
Ticks : 319615
TotalDays : 3.69924768518519E-07
TotalHours : 8.87819444444444E-06
TotalMinutes : 0.000532691666666667
TotalSeconds : 0.0319615
TotalMilliseconds : 31.9615
> Measure-Command { ./perfnim.exe }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 32
Ticks : 324051
TotalDays : 3.75059027777778E-07
TotalHours : 9.00141666666667E-06
TotalMinutes : 0.000540085
TotalSeconds : 0.0324051
TotalMilliseconds : 32.4051
> Measure-Command { ./perfnim.exe }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 33
Ticks : 330303
TotalDays : 3.82295138888889E-07
TotalHours : 9.17508333333333E-06
TotalMinutes : 0.000550505
TotalSeconds : 0.0330303
TotalMilliseconds : 33.0303
Rust
> Measure-Command { ./perfrust.exe }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 9
Ticks : 92074
TotalDays : 1.0656712962963E-07
TotalHours : 2.55761111111111E-06
TotalMinutes : 0.000153456666666667
TotalSeconds : 0.0092074
TotalMilliseconds : 9.2074
> Measure-Command { ./perfrust.exe }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 9
Ticks : 95374
TotalDays : 1.10386574074074E-07
TotalHours : 2.64927777777778E-06
TotalMinutes : 0.000158956666666667
TotalSeconds : 0.0095374
TotalMilliseconds : 9.5374
> Measure-Command { ./perfrust.exe }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 9
Ticks : 98446
TotalDays : 1.1394212962963E-07
TotalHours : 2.73461111111111E-06
TotalMinutes : 0.000164076666666667
TotalSeconds : 0.0098446
TotalMilliseconds : 9.8446
As you can see, the Rust version is faster by a factor of 3.5-4x.
Comments/questions:
3) I tried adding the {.inline.} pragma to the analyze proc, to no effect. I also tried manually inlining the analyze proc, but I got worse performance by a factor of about 2; I'm not sure why:
type
  Result = object   # Result type as defined in the full code
    letters: int
    numbers: int

const input = "uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha"
var res = Result()
for _ in 0 ..< 1000000:
  for c in input:
    case c:
    of 'A' .. 'z':
      res.letters.inc
    of '0' .. '9':
      res.numbers.inc
    else:
      {.linearScanEnd.}
      discard
Thanks in advance!
I'm too lazy to do a performance analysis, and I have no Rust installed anyway. Could you try this?
type
  Result = object
    letters: int
    numbers: int

proc analyze(x: string): Result =
  for c in x:
    case c:
    of 'A'..'z':
      result.letters.inc
    of '0'..'9':
      result.numbers.inc
    else:
      discard

var total = Result()

proc main() =
  const input = "uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha"
  for _ in 0 ..< 1000000:
    let res = analyze(input)
    total.letters += res.letters
    total.numbers += res.numbers
  echo "Numbers: ", total.numbers, " Letters: ", total.letters

main()
Thanks for the input! I tried that with no measurable difference in performance.
I then decided to open both executables in Ghidra, since the assembly may differ on my machine compared to that on Godbolt...
And boy is it surprising.
Nim:
Rust:
Judging by the _avx instructions, it seems that Nim/C vectorizes the algorithm on my machine. Nice!
However, if I'm not mistaken (and I'm certainly no expert at reading assembly), it seems that Rust/LLVM optimizes the whole thing away by not even using the input string: it analyzes everything at compile time, reduces it to a couple of add operations, and just prints out the precomputed result. If that's the case, wow.
I do realize that, if I'm right, this makes the execution-speed comparison between the languages/compiler chains a complete non-factor, since what Rust is doing is kind of cheating and wouldn't hold up with any runtime-provided input or real-world scenario. Still, a cool trick :)
Awesome! Thank you!
One more question.
Why does
proc analyze(x: string): Result {.compiletime.}
compile, while
proc analyze(x: static string): Result {.compiletime.}
... doesn't? The error I get is:
Error: request to generate code for .compileTime proc: analyze
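For reference, here is a minimal self-contained sketch of how the variant that compiles gets used; as far as I understand, a {.compileTime.} proc can only be invoked in compile-time contexts such as a const expression:

type
  Result = object
    letters: int
    numbers: int

proc analyze(x: string): Result {.compileTime.} =
  for c in x:
    case c:
    of 'A' .. 'z':
      result.letters.inc
    of '0' .. '9':
      result.numbers.inc
    else:
      discard

# The call happens entirely at compile time; only the
# precomputed constant survives into the binary.
const res = analyze("uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha")
echo "Numbers: ", res.numbers, " Letters: ", res.letters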
I was thinking of something like @shirleyquirk's suggestion... and indeed it runs much faster. I won't post absolute values because I have a slower PC and an older Nim version, but in relative terms the time reduction is evident. In the end I guess it's like providing a lookup table to the compiler. Modifying @sky_khan's version a little:
type
  Result = object
    letters: int
    numbers: int

proc analyze(x: string): Result =
  for c in x:
    case c:
    of 'A'..'z':
      result.letters.inc
    of '0'..'9':
      result.numbers.inc
    else:
      discard

proc main() =
  const input = "uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha"
  const res = analyze(input)
  var total = Result()
  for _ in 0 ..< 1000000:
    total.letters += res.letters
    total.numbers += res.numbers
  echo "Numbers: ", total.numbers, " Letters: ", total.letters

main()
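In effect, once res is a compile-time constant, the million-iteration loop just adds the same two numbers over and over, and the C optimizer can collapse that into a single multiply. A sketch of what the whole program effectively reduces to (my illustration; the per-pass counts of 39 letters and 10 digits follow from the totals reported elsewhere in this thread):

# What the backend can fold the benchmark down to:
echo "Numbers: ", 10 * 1_000_000, " Letters: ", 39 * 1_000_000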
It would be wrong in the first place to do it this way for that purpose.
A complete program or piece of code must go through intensive profiling analysis, using known, reliable methods and tools, before its optimal performance can be understood. That would involve, among other things, some refactoring to obtain maximum efficiency. Only after doing this for both languages' code should their total runtimes be compared.
You need to master the use of perf: https://perf.wiki.kernel.org/index.php/Tutorial
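For example, a typical starting point looks like this (on Linux; perfnim is a hypothetical binary name):

perf stat -r 10 ./perfnim    # aggregate hardware counters over 10 runs
perf record ./perfnim        # sample where the time is spent
perf report                  # browse the recorded profile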
I love optimizing things. Here is my take.
In my experience, once you get to this low level, you get what I call "load-bearing" ifs, memory reads, and memory writes. It can sometimes get very strange as to why moving an if/read/write makes run times 3x faster or slower. It's just the way it is. The only thing you can do is measure, measure, measure and move things around. Different compilers, CPUs, and OSes will have different "load-bearing" ifs/reads/writes. Run your program on the exact hardware you care about.
This is not a good performance example, as in theory a compiler can just fold everything into a constant output. Maybe Rust can guess further up the tree, while the Nim/C compiler can't?
My big question is why both programs don't run in zero time?
This is what I start with:
const input = "uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha"

var
  letters: int
  numbers: int

for _ in 0 ..< 1000000:
  for c in input:
    case c:
    of 'A' .. 'z':
      inc letters
    of '0' .. '9':
      inc numbers
    else:
      {.linearScanEnd.}
      discard

echo letters
echo numbers
I want to start measuring right away, so I wrap it in benchy, a benchmarking library I wrote:
import benchy

timeIt "couter1", 100:
  const input = "uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha"
  var
    letters: int
    numbers: int
  for _ in 0 ..< 1000000:
    for c in input:
      case c:
      of 'A' .. 'z':
        inc letters
      of '0' .. '9':
        inc numbers
      else:
        {.linearScanEnd.}
        discard
  doAssert letters == 39000000
  doAssert numbers == 10000000
name ............................... min time avg time std dv runs
couter ............................ 21.048 ms 21.084 ms ±0.073 x237
A funny thing happens if I remove the doAsserts:
import benchy

timeIt "couter":
  const input = "uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha"
  var
    letters: int
    numbers: int
  for _ in 0 ..< 1000000:
    for c in input:
      case c:
      of 'A' .. 'z':
        inc letters
      of '0' .. '9':
        inc numbers
      else:
        {.linearScanEnd.}
        discard
  # doAssert letters == 39000000
  # doAssert numbers == 10000000
The compiler realizes we are doing nothing and just does nothing:
name ............................... min time avg time std dv runs
couter ............................. 0.000 ms 0.000 ms ±0.000 x1000
I think this is the crux of what is happening here: different operations can be folded.
If you know the input only has letters and numbers, you can eliminate a lot of branches by just checking whether the ASCII value is above or below 57 ('9').
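A minimal sketch of that idea (my own; it only works if the input really contains nothing but ASCII letters and digits, and it relies on '9' being ASCII 57):

const input = "uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha"

var letters, numbers: int
for _ in 0 ..< 1000000:
  for c in input:
    # ord(true) == 1 and ord(false) == 0, so no branch is needed:
    let isDigit = ord(c <= '9')
    numbers += isDigit
    letters += 1 - isDigit

echo letters
echo numbers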
This reads memory one byte at a time, but you can also read faster, e.g. 64 bits at a time.
There is probably some sort of bit/shift pattern that can go faster than branches.
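Combining the last two ideas, here is a rough SWAR-style sketch (my own, not from the code above; it assumes pure-ASCII input made up only of letters and digits):

import std/bitops

proc countSwar(s: string): tuple[letters, numbers: int] =
  var i = 0
  # Process 8 input bytes per iteration as one uint64.
  while i + 8 <= s.len:
    var w: uint64
    copyMem(addr w, unsafeAddr s[i], 8)
    # Adding 0x46 (70) to every byte sets that byte's top bit exactly
    # when the byte is greater than 57 ('9'), i.e. when it is a letter.
    let letterMask = (w + 0x4646464646464646'u64) and 0x8080808080808080'u64
    let l = countSetBits(letterMask)
    result.letters += l
    result.numbers += 8 - l
    i += 8
  # Handle the leftover tail byte by byte.
  for j in i ..< s.len:
    if s[j] <= '9': inc result.numbers
    else: inc result.letters

echo countSwar("uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha")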
One could use SIMD next, as it can read memory a little faster still.
A smart compiler can guess and do that.
I wish there was more to dig into this example, but really the correct answer to how to optimize the code above is:
# Just optimize it all away...
Nim's int is an 8-byte (64-bit) integer; in your Rust code you are using u32, which is a 4-byte (32-bit) unsigned integer.
Can you change the Rust example to i64 and see what happens?