Just wanted to share some info about using LTO (link-time optimization) + PGO (profile-guided optimization) with Clang for Nim.
First of all, you should know that PGO is not always a win: it optimizes the code paths exercised during the profiling run, so corner cases that weren't exercised may end up even slower.
The process:
nim c -d:danger --cc:clang --passC:"-flto -fprofile-instr-generate" --passL:"-flto -fprofile-instr-generate" file.nim
Run the resulting binary on a representative workload. After that you should have a file named default.profraw in the folder where you ran your program.
llvm-profdata merge default.profraw -output data.profdata to merge the profiling data into a form Clang can use
nim c -d:danger --cc:clang --passC:"-flto -fprofile-instr-use=data.profdata" --passL:"-flto -fprofile-instr-use=data.profdata" file.nim
After that the process is done; you can now benchmark your binary to see if you got any performance boost :)
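To recap the whole pipeline in one place (assuming your program is file.nim and running ./file exercises a representative workload):

# 1. instrumented build
nim c -d:danger --cc:clang --passC:"-flto -fprofile-instr-generate" --passL:"-flto -fprofile-instr-generate" file.nim
# 2. training run, writes default.profraw to the current directory
./file
# 3. merge the raw profile into a form Clang can consume
llvm-profdata merge default.profraw -output data.profdata
# 4. optimized rebuild using the collected profile
nim c -d:danger --cc:clang --passC:"-flto -fprofile-instr-use=data.profdata" --passL:"-flto -fprofile-instr-use=data.profdata" file.nim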
I tried doing that for my mathexpr library:
# Don't mind the nimbench, I know I shouldn't use it :P
import mathexpr, nimbench

let e = newEvaluator()
e.addVars({"a": 3.0, "b": 5.7})

bench("test", m):
  for x in 1..m:
    var c = e.eval("(a^a + b * 2 - 3*4.2412+5335^2-4e3)^2")
    if c == 0:
      echo "can't"

runBenchmarks()
No LTO/PGO (also yeah, I'm using gc:arc since it's faster :P) - nim c -d:danger --gc:arc --cc:clang -r tests/bench.nim:
============================================================================
bench.nim                                       relative  time/iter  iters/s
============================================================================
"test"                                                     435.24ns    2.30M
LTO only - nim c -d:danger --gc:arc --cc:clang --passC:"-flto" --passL:"-flto" -r tests/bench.nim:
"test" 332.45ns 3.01M
LTO+PGO (I won't show all the commands, just the last one) - nim c -r -d:danger --gc:arc --cc:clang --passC:"-flto -fprofile-instr-use=perf.profdata" --passL:"-flto -fprofile-instr-use=perf.profdata" tests/bench.nim:
"test"                                                     266.02ns    3.76M
That's roughly a 1.64x speed-up over the plain -d:danger baseline.
Thanks for reading :)
That is interesting.
Note that CBlake recently gave instructions for gcc:
https://forum.nim-lang.org/t/6283#38755
We really should collect these instructions somewhere, maybe in the wiki.
I think simplifying life with a couple of wrapper scripts (nim-pgo-gcc and nim-pgo-clang, say) would help would-be PGO'ers, even if said scripts were only a starting point for the users' own scripts.
In theory, the Nim compiler could itself grow a 2-phase compilation mode. Things are already set up so you can say nim c --run prog args. So, all we would really need to do is add a --pgo that works just like --run but A) uses a modified namespace/extra name for compiler.options (a place to put -fprofile-generate) and B) re-compiles everything again afterward with another modified namespace/extra name for compiler.options (a place to put -fprofile-use).
I'd say that kind of fits well with the existing nim CLI workflow. There are only minor risks, like biasing folks to write programs that do a benchmark run when given no args, or accidental infinite loops/giant benchmarks, but those are both intrinsic to PGO in general.
Maybe there is already a way to hack config.nims to be this smart (see the sketch below)? I think a lot more people would use it if it were as simple as nim c --pgo -d:danger prog args.
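Something along these lines might already get close. This is an untested NimScript sketch: the pgo task name and the prog module name are my own invention, and it hardcodes Clang's -fprofile-instr-generate/-fprofile-instr-use flags rather than also handling gcc:

# config.nims - hypothetical two-phase PGO helper, invoked as: nim pgo
task pgo, "instrumented build, training run, then PGO rebuild":
  let prog = "prog"  # hypothetical: name of your main module
  # Phase 1: build with profiling instrumentation, then run a training workload
  exec "nim c -d:danger --cc:clang --passC:-fprofile-instr-generate --passL:-fprofile-instr-generate " & prog
  exec "./" & prog
  exec "llvm-profdata merge default.profraw -output data.profdata"
  # Phase 2: force a full rebuild (-f), letting Clang use the collected profile
  exec "nim c -f -d:danger --cc:clang --passC:-fprofile-instr-use=data.profdata --passL:-fprofile-instr-use=data.profdata " & prog

The flags are the same ones from the top of this thread, just split into the two phases behind a single nim pgo invocation.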
Results for the PGO of the Nim compiler - https://gist.github.com/Yardanico/e5ef5130b43f3d4e6f8c308ee910c1c3
In short:
Updated the compiler PGO script a bit (it's still messy, I know I need to reduce all the repetition or just rewrite it in Nim itself)
https://github.com/Yardanico/nim-snippets/blob/master/compile_nim_pgo.sh
Now it also builds npeg with --gc:orc and the compiler itself (with -d:leanCompiler) with --gc:arc, so that the modules used for ARC analysis (dfa, cursor inference, etc.) can also benefit from PGO.
I'm not sure what it is, exactly, about Nim-generated C code, but PGO tends to benefit it more than "my typical C".
Interesting. Makes me wonder if the generated code could benefit from some annotations like __builtin_expect, i.e. in places where Nim knows a code path is very likely/unlikely (exceptions?). Other useful annotations are __builtin_assume and __attribute__((__optimize__(...))). Of course these are compiler-dependent, so they'd need to be wrapped in some macros. I've got a fair amount of experience at this.
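For what it's worth, Nim's system module already exposes likely/unlikely templates that expand to __builtin_expect on GCC/Clang (and to a plain condition elsewhere), so part of this exists today. A small example of my own:

proc parsePositive(x: int): int =
  # hint to the optimizer that the error branch is rarely taken
  if unlikely(x < 0):
    raise newException(ValueError, "negative input")
  x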
@snej - the gcc people say that the biggest boosts from PGO come from better inlining decisions. A function call can cost something like 14 cycles (or, at 2..6 issue slots per cycle, 28..84 superscalar dynamic instructions). That's a lot of (potential) work, and much Nim code might have small non-inlined functions that only do around 3-15 dynamic instructions. So the speed-up potential might be 1.86x..28x, which is a lot, but the L1 i-cache and the uop cache (on Intel, anyhow) are also very scarce and also have large speed multipliers. So it's kind of a tricky "dynamic stew", and the claims of the gcc people are plausible.
In light of that claim, you may be able to get --passC:-flto performance closer to PGO performance via better static inlining decisions, such as adding the {.inline.} pragma (more?) judiciously. Or there could be some core system procs that just need an {.inline.} annotation as "low hanging fruit".
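As a toy illustration of the kind of proc I mean (my own example, not from the stdlib; whether a given proc actually benefits still needs measuring):

# Small enough that call overhead can dominate the actual work.
# {.inline.} puts the body in the generated C header so the C compiler
# can inline it across Nim modules even without LTO.
proc dot3(a, b: array[3, float]): float {.inline.} =
  for i in 0 .. 2:
    result += a[i] * b[i]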
In a more perfect world, these C/C++ compilers might have option flags to "show their work" in PGO mode. Then it might be easier to investigate on a case-by-case basis. You can diff the output, of course, but that's not easy work. I haven't really looked. Maybe clang or gcc have such "show their work" features.