Hi, before I start I just want to preface this question by saying that this post is in no way, shape, or form a jab at Nim. I'm simply trying to better understand this so I can learn something. Thanks!
The code for the Nim and Rust versions, as well as the disassembly, is here:
As you can tell, this is (slightly modified) code from the Nim Profiling article.
CPU: AMD Ryzen 2700X
OS: Windows 10 64-bit
Compiler flags:
Nim (GCC)
nim c -d:danger -d:release --gc:none perfnim.nim
Rust
rustc -O perfrust.rs
Running both executables on my machine, I get the following results:
Nim
> Measure-Command { ./perfnim.exe }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 31
Ticks : 319615
TotalDays : 3.69924768518519E-07
TotalHours : 8.87819444444444E-06
TotalMinutes : 0.000532691666666667
TotalSeconds : 0.0319615
TotalMilliseconds : 31.9615
> Measure-Command { ./perfnim.exe }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 32
Ticks : 324051
TotalDays : 3.75059027777778E-07
TotalHours : 9.00141666666667E-06
TotalMinutes : 0.000540085
TotalSeconds : 0.0324051
TotalMilliseconds : 32.4051
> Measure-Command { ./perfnim.exe }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 33
Ticks : 330303
TotalDays : 3.82295138888889E-07
TotalHours : 9.17508333333333E-06
TotalMinutes : 0.000550505
TotalSeconds : 0.0330303
TotalMilliseconds : 33.0303
Rust
> Measure-Command { ./perfrust.exe }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 9
Ticks : 92074
TotalDays : 1.0656712962963E-07
TotalHours : 2.55761111111111E-06
TotalMinutes : 0.000153456666666667
TotalSeconds : 0.0092074
TotalMilliseconds : 9.2074
> Measure-Command { ./perfrust.exe }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 9
Ticks : 95374
TotalDays : 1.10386574074074E-07
TotalHours : 2.64927777777778E-06
TotalMinutes : 0.000158956666666667
TotalSeconds : 0.0095374
TotalMilliseconds : 9.5374
> Measure-Command { ./perfrust.exe }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 9
Ticks : 98446
TotalDays : 1.1394212962963E-07
TotalHours : 2.73461111111111E-06
TotalMinutes : 0.000164076666666667
TotalSeconds : 0.0098446
TotalMilliseconds : 9.8446
As you can see, the Rust version is faster by a factor of 3.5-4x.
Comments/questions:
3) I tried adding the {.inline.} pragma to the analyze proc, to no effect. I also tried manually inlining the analyze proc, but I got worse performance by a factor of about 2; I'm not sure why:
type
  Result = object   # Result type as defined in the full code
    letters: int
    numbers: int

const input = "uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha"
var res = Result()
for _ in 0 ..< 1000000:
  for c in input:
    case c:
    of 'A' .. 'z':
      res.letters.inc
    of '0' .. '9':
      res.numbers.inc
    else:
      {.linearScanEnd.}
      discard
Thanks in advance!
I'm too lazy to do a performance analysis, and I have no Rust installed anyway. Could you try this?
type
  Result = object
    letters: int
    numbers: int

proc analyze(x: string): Result =
  for c in x:
    case c:
    of 'A'..'z':
      result.letters.inc
    of '0'..'9':
      result.numbers.inc
    else:
      discard

var total = Result()

proc main() =
  const input = "uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha"
  for _ in 0 ..< 1000000:
    let res = analyze(input)
    total.letters += res.letters
    total.numbers += res.numbers
  echo "Numbers: ", total.numbers, " Letters: ", total.letters

main()
Thanks for the input! I tried that with no measurable difference in performance.
I then decided to open both executables in Ghidra, since the assembly may differ on my machine compared to that on Godbolt...
And boy is it surprising.
Nim:
Rust:
Judging by the _avx instructions, it seems that Nim/C vectorizes the algorithm on my machine. Nice!
However, if I'm not mistaken (and I'm certainly no expert at reading assembly), it seems that Rust/LLVM optimizes the whole thing away by not even using the input string: it analyzes everything at compile time, reduces it to a couple of add operations, and just prints out the precomputed result. If that's the case, wow.
I do realize that, if I'm right, this makes the execution-speed comparison between the languages/compiler chains a complete non-factor, since what Rust is doing is kind of cheating and wouldn't hold up with any runtime-provided input or real-world scenario. Still, a cool trick :)
Awesome! Thank you!
One more question.
Why does
proc analyze(x: string): Result {.compiletime.}
compile, while
proc analyze(x: static string): Result {.compiletime.}
... doesn't? The error I get is:
Error: request to generate code for .compileTime proc: analyze
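For reference, here is a minimal self-contained sketch of how the variant that compiles gets used; as far as I understand, a {.compileTime.} proc can only be invoked in compile-time contexts such as a const expression:

type
  Result = object
    letters: int
    numbers: int

proc analyze(x: string): Result {.compileTime.} =
  for c in x:
    case c:
    of 'A' .. 'z':
      result.letters.inc
    of '0' .. '9':
      result.numbers.inc
    else:
      discard

# The call happens entirely at compile time; only the
# precomputed constant survives into the binary.
const res = analyze("uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha")
echo "Numbers: ", res.numbers, " Letters: ", res.letters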
I was thinking of something like @shirleyquirk's suggestion... and indeed it runs much faster. I won't post absolute values because I have a slower PC and an older Nim version, but in relative terms the time reduction is evident. In the end I guess it's like providing a lookup table to the compiler. Modifying @sky_khan's version a little:
type
  Result = object
    letters: int
    numbers: int

proc analyze(x: string): Result =
  for c in x:
    case c:
    of 'A'..'z':
      result.letters.inc
    of '0'..'9':
      result.numbers.inc
    else:
      discard

proc main() =
  const input = "uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha"
  const res = analyze(input)
  var total = Result()
  for _ in 0 ..< 1000000:
    total.letters += res.letters
    total.numbers += res.numbers
  echo "Numbers: ", total.numbers, " Letters: ", total.letters

main()
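In effect, once res is a compile-time constant, the million-iteration loop just adds the same two numbers over and over, and the C optimizer can collapse that into a single multiply. A sketch of what the whole program effectively reduces to (my illustration; the per-pass counts of 39 letters and 10 digits follow from the totals reported elsewhere in this thread):

# What the backend can fold the benchmark down to:
echo "Numbers: ", 10 * 1_000_000, " Letters: ", 39 * 1_000_000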
It would be wrong in the first place to do it this way for that purpose.
A complete program or piece of code must go through intensive profiling analysis, using known, reliable methods and tools, before its optimal performance can be understood. That would involve, among other things, some refactoring to obtain maximum efficiency. Only after doing this for both languages' code should their total runtimes be compared.
You need to master the use of perf: https://perf.wiki.kernel.org/index.php/Tutorial
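For example, a typical starting point looks like this (on Linux; perfnim is a hypothetical binary name):

perf stat -r 10 ./perfnim    # aggregate hardware counters over 10 runs
perf record ./perfnim        # sample where the time is spent
perf report                  # browse the recorded profile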
I love optimizing things. Here is my take.
In my experience, once you get to this low level, you get what I call "load-bearing" ifs, memory reads, and memory writes. It can sometimes get very strange as to why moving an if/read/write makes run times 3x faster or slower. It's just the way it is. The only thing you can do is measure, measure, measure and move things around. Different compilers, CPUs, and OSes will have different "load-bearing" ifs/reads/writes. Run your program on the exact hardware you care about.
This is not a good performance example, as in theory a compiler can just fold everything into a constant output. Maybe Rust can guess further up the tree, while the Nim/C compiler can't?
My big question is why both programs don't run in zero time?
This is what I start with:
const input = "uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha"

var
  letters: int
  numbers: int

for _ in 0 ..< 1000000:
  for c in input:
    case c:
    of 'A' .. 'z':
      inc letters
    of '0' .. '9':
      inc numbers
    else:
      {.linearScanEnd.}
      discard

echo letters
echo numbers
I want to start measuring right away, so I wrap it in benchy, a benchmarking library I wrote:
import benchy

timeIt "couter1", 100:
  const input = "uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha"
  var
    letters: int
    numbers: int
  for _ in 0 ..< 1000000:
    for c in input:
      case c:
      of 'A' .. 'z':
        inc letters
      of '0' .. '9':
        inc numbers
      else:
        {.linearScanEnd.}
        discard
  doAssert letters == 39000000
  doAssert numbers == 10000000
name ............................... min time avg time std dv runs
couter ............................ 21.048 ms 21.084 ms ±0.073 x237
A funny thing happens if I remove the doAsserts:
import benchy

timeIt "couter":
  const input = "uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha"
  var
    letters: int
    numbers: int
  for _ in 0 ..< 1000000:
    for c in input:
      case c:
      of 'A' .. 'z':
        inc letters
      of '0' .. '9':
        inc numbers
      else:
        {.linearScanEnd.}
        discard
  # doAssert letters == 39000000
  # doAssert numbers == 10000000
The compiler realizes we are doing nothing and just does nothing:
name ............................... min time avg time std dv runs
couter ............................. 0.000 ms 0.000 ms ±0.000 x1000
I think this is the crux of what is happening here: different operations can be folded.
If you know the input only has letters and numbers, you can eliminate a lot of branches by just checking whether the ASCII value is above or below 57 ('9').
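A minimal sketch of that idea (my own; it only works if the input really contains nothing but ASCII letters and digits, and it relies on '9' being ASCII 57):

const input = "uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha"

var letters, numbers: int
for _ in 0 ..< 1000000:
  for c in input:
    # ord(true) == 1 and ord(false) == 0, so no branch is needed:
    let isDigit = ord(c <= '9')
    numbers += isDigit
    letters += 1 - isDigit

echo letters
echo numbers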
This reads memory one byte at a time, but you can also read faster, e.g. 64 bits at a time.
There is probably some sort of bit/shift pattern that can go faster than branches.
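Combining the last two ideas, here is a rough SWAR-style sketch (my own, not from the code above; it assumes pure-ASCII input made up only of letters and digits):

import std/bitops

proc countSwar(s: string): tuple[letters, numbers: int] =
  var i = 0
  # Process 8 input bytes per iteration as one uint64.
  while i + 8 <= s.len:
    var w: uint64
    copyMem(addr w, unsafeAddr s[i], 8)
    # Adding 0x46 (70) to every byte sets that byte's top bit exactly
    # when the byte is greater than 57 ('9'), i.e. when it is a letter.
    let letterMask = (w + 0x4646464646464646'u64) and 0x8080808080808080'u64
    let l = countSetBits(letterMask)
    result.letters += l
    result.numbers += 8 - l
    i += 8
  # Handle the leftover tail byte by byte.
  for j in i ..< s.len:
    if s[j] <= '9': inc result.numbers
    else: inc result.letters

echo countSwar("uyguhijkmnbdv44354gasuygiuiolknchyqudsayd12635uha")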
One could use SIMD next, as it can read memory a little faster still.
A smart compiler can guess and do that.
I wish there was more to dig into this example, but really the correct answer to how to optimize the code above is:
# Just optimize it all away...
Nim's int is an 8-byte (64-bit) integer; in your Rust code you are using u32, which is a 4-byte (32-bit) unsigned integer.
Can you change the Rust example to i64 and see what happens?