Just wanted to share some info about using LTO (link-time optimization) + PGO (profile-guided optimization) with Clang for Nim.
First of all, you should know that PGO is not always a win: it optimizes the code paths exercised during the profiling run, so corner cases that weren't exercised may end up even slower.
The process:
nim c -d:danger --cc:clang --passC:"-flto -fprofile-instr-generate" --passL:"-flto -fprofile-instr-generate" file.nim
Run the resulting binary on a representative workload. After that you should have a file named default.profraw in the folder where you ran your program.
llvm-profdata merge default.profraw -output data.profdata to merge the profiling data into a form Clang can use
nim c -d:danger --cc:clang --passC:"-flto -fprofile-instr-use=data.profdata" --passL:"-flto -fprofile-instr-use=data.profdata" file.nim
After that the process is done; you can now benchmark your binary to see if you got any performance boost :)
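To recap the whole pipeline in one place (assuming your program is file.nim and running ./file exercises a representative workload):

# 1. instrumented build
nim c -d:danger --cc:clang --passC:"-flto -fprofile-instr-generate" --passL:"-flto -fprofile-instr-generate" file.nim
# 2. training run, writes default.profraw to the current directory
./file
# 3. merge the raw profile into a form Clang can consume
llvm-profdata merge default.profraw -output data.profdata
# 4. optimized rebuild using the collected profile
nim c -d:danger --cc:clang --passC:"-flto -fprofile-instr-use=data.profdata" --passL:"-flto -fprofile-instr-use=data.profdata" file.nim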
I tried doing that for my mathexpr library:
# Don't mind the nimbench, I know I shouldn't use it :P
import mathexpr, nimbench

let e = newEvaluator()
e.addVars({"a": 3.0, "b": 5.7})

bench("test", m):
  for x in 1..m:
    var c = e.eval("(a^a + b * 2 - 3*4.2412+5335^2-4e3)^2")
    if c == 0:
      echo "can't"

runBenchmarks()
No LTO/PGO (also yeah, I'm using gc:arc since it's faster :P) - nim c -d:danger --gc:arc --cc:clang -r tests/bench.nim:
============================================================================
bench.nim                                       relative  time/iter  iters/s
============================================================================
"test"                                                     435.24ns    2.30M
LTO only - nim c -d:danger --gc:arc --cc:clang --passC:"-flto" --passL:"-flto" -r tests/bench.nim:
"test" 332.45ns 3.01M
LTO+PGO (I won't show all the commands, just the last one) - nim c -r -d:danger --gc:arc --cc:clang --passC:"-flto -fprofile-instr-use=perf.profdata" --passL:"-flto -fprofile-instr-use=perf.profdata" tests/bench.nim:
"test"                                                     266.02ns    3.76M
That's roughly a 1.64x speed-up over the plain -d:danger baseline.
Thanks for reading :)
That is interesting.
Note that CBlake recently gave instructions for gcc:
https://forum.nim-lang.org/t/6283#38755
We really should collect these instructions somewhere, maybe in the wiki.
I think simplifying life with a couple of wrapper scripts (nim-pgo-gcc and nim-pgo-clang, say) would help would-be PGO'ers, even if said scripts were only a starting point for the users' own scripts.
In theory, the Nim compiler could itself grow a 2-phase compilation mode. Things are already set up so you can say nim c --run prog args. So, all we would really need to do is add a --pgo that works just like --run but A) uses a modified namespace/extra name for compiler.options (a place to put -fprofile-generate) and B) re-compiles everything again afterward with another modified namespace/extra name for compiler.options (a place to put -fprofile-use).
I'd say that kind of fits well with the existing nim CLI workflow. There are only minor risks, like biasing folks to write programs that do a benchmark run when given no args, or accidental infinite loops/giant benchmarks, but those are both intrinsic to PGO in general.
Maybe there is already a way to hack config.nims to be this smart (see the sketch below)? I think a lot more people would use it if it were as simple as nim c --pgo -d:danger prog args.
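Something along these lines might already get close. This is an untested NimScript sketch: the pgo task name and the prog module name are my own invention, and it hardcodes Clang's -fprofile-instr-generate/-fprofile-instr-use flags rather than also handling gcc:

# config.nims - hypothetical two-phase PGO helper, invoked as: nim pgo
task pgo, "instrumented build, training run, then PGO rebuild":
  let prog = "prog"  # hypothetical: name of your main module
  # Phase 1: build with profiling instrumentation, then run a training workload
  exec "nim c -d:danger --cc:clang --passC:-fprofile-instr-generate --passL:-fprofile-instr-generate " & prog
  exec "./" & prog
  exec "llvm-profdata merge default.profraw -output data.profdata"
  # Phase 2: force a full rebuild (-f), letting Clang use the collected profile
  exec "nim c -f -d:danger --cc:clang --passC:-fprofile-instr-use=data.profdata --passL:-fprofile-instr-use=data.profdata " & prog

The flags are the same ones from the top of this thread, just split into the two phases behind a single nim pgo invocation.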
Results for the PGO of the Nim compiler - https://gist.github.com/Yardanico/e5ef5130b43f3d4e6f8c308ee910c1c3
In short:
Updated the compiler PGO script a bit (it's still messy, I know I need to reduce all the repetition or just rewrite it in Nim itself)
https://github.com/Yardanico/nim-snippets/blob/master/compile_nim_pgo.sh
Now it also builds npeg with --gc:orc and the compiler itself (with -d:leanCompiler) with --gc:arc, so that the modules used for ARC analysis (dfa, cursor inference, etc.) can also benefit from PGO.
I'm not sure what it is, exactly, about Nim-generated C code, but PGO tends to benefit it more than "my typical C".
Interesting. Makes me wonder if the generated code could benefit from some annotations like __builtin_expect, i.e. in places where Nim knows a code path is very likely/unlikely (exceptions?). Other useful annotations are __builtin_assume and __attribute__((__optimize__(...))). Of course these are compiler-dependent, so they'd need to be wrapped in some macros. I've got a fair amount of experience at this.
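For what it's worth, Nim's system module already exposes likely/unlikely templates that expand to __builtin_expect on GCC/Clang (and to a plain condition elsewhere), so part of this exists today. A small example of my own:

proc parsePositive(x: int): int =
  # hint to the optimizer that the error branch is rarely taken
  if unlikely(x < 0):
    raise newException(ValueError, "negative input")
  x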
@snej - the gcc people say that the biggest boosts from PGO come from better inlining decisions. A function call can cost something like 14 cycles (or, at 2..6 issue slots per cycle, 28..84 superscalar dynamic instructions). That's a lot of (potential) work, and much Nim code might have small non-inlined functions that only do around 3-15 dynamic instructions. So the speed-up potential might be 1.86x..28x, which is a lot, but the L1 i-cache and the uop cache (on Intel, anyhow) are also very scarce and also have large speed multipliers. So it's kind of a tricky "dynamic stew", and the claims of the gcc people are plausible.
In light of that claim, you may be able to get --passC:-flto performance closer to PGO performance via better static inlining decisions, such as adding the {.inline.} pragma (more?) judiciously. Or there could be some core system procs that just need an {.inline.} annotation as "low hanging fruit".
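As a toy illustration of the kind of proc I mean (my own example, not from the stdlib; whether a given proc actually benefits still needs measuring):

# Small enough that call overhead can dominate the actual work.
# {.inline.} puts the body in the generated C header so the C compiler
# can inline it across Nim modules even without LTO.
proc dot3(a, b: array[3, float]): float {.inline.} =
  for i in 0 .. 2:
    result += a[i] * b[i]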
In a more perfect world, these C/C++ compilers might have option flags to "show their work" in PGO mode. Then it might be easier to investigate on a case-by-case basis. You can diff the output, of course, but that's not easy work. I haven't really looked. Maybe clang or gcc have such "show their work" features.