nimforum mirror - orc mm slower than markandsweep in my experience

enaaab460 [2024-01-17T08:32:29+01:00]

I will preface this by saying I just picked up Nim 2.0.2 this week, coming from Python and Go. I started experimenting with nim.cfg to find the settings that give the fastest performance. This is how it looks now:


cc = gcc
passC = "-flto -march=native -O4"
passL = "-flto -s"
d=danger
d=useMalloc
d=lto

---  the rest of the file is untouched ---
# additional options always passed to the compiler:
--parallel_build: "0"
....

I have benchmarked two scenarios where the default orc mm is clearly slower than markandsweep (roughly 3.6x in the first, 1.3x in the second).

The first script is recursive Fibonacci up to 50, which looks as follows:

proc fibonacci*(n: int32) : int =
    if n <= 1: return n
    fibonacci(n-1) + fibonacci(n-2)
discard fibonacci(50)

Benchmarking it with hyperfine using the default mm:


nim\vsgo> nim c --hints:off --o:fibonacciorc.exe fibonacci.nim | hyperfine fibonacciorc.exe
Benchmark 1: fibonacciorc.exe
  Time (mean ± σ):     56.496 s ±  1.153 s    [User: 56.143 s, System: 0.025 s]
  Range (min … max):   53.893 s … 57.685 s    10 runs

Benchmarking it with markandsweep:


nim\vsgo> nim c --hints:off --mm:markandsweep --o:fibonaccimarkandsweep.exe fibonacci.nim | hyperfine fibonaccimarkandsweep.exe
Benchmark 1: fibonaccimarkandsweep.exe
  Time (mean ± σ):     15.840 s ±  0.141 s    [User: 15.797 s, System: 0.013 s]
  Range (min … max):   15.675 s … 16.074 s    10 runs
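Since the two executables differ only in the --mm switch, it can help to have each binary report which memory manager it was built with, so results do not get mixed up. A minimal sketch, assuming the compiler defines the symbols gcOrc, gcArc, gcMarkAndSweep and gcRefc for the corresponding --mm values (the proc name memoryManager is made up for this example):

```nim
proc memoryManager(): string =
  ## Name of the memory manager selected at compile time.
  ## Assumption: these symbols are set by the matching --mm switch.
  when defined(gcOrc):
    result = "orc"
  elif defined(gcArc):
    result = "arc"
  elif defined(gcMarkAndSweep):
    result = "markandsweep"
  elif defined(gcRefc):
    result = "refc"
  else:
    result = "other"

echo "built with --mm:", memoryManager()
```

Compiled with --mm:markandsweep this should print markandsweep, and with no --mm switch on Nim 2.x it should print orc.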

The second benchmark is Advent of Code 2023 day 2: read a file, parse the game moves, and calculate some parameters from that input. The original input file was 100 lines long; I copied its contents repeatedly until it reached 20,000 lines.

I used the same nim commands as above and ran hyperfine on both executables:


nim\aoc23\day02> hyperfine day2markandsweep.exe day2orc.exe --warmup 10
Benchmark 1: day2markandsweep.exe
  Time (mean ± σ):     134.8 ms ±   5.7 ms    [User: 121.6 ms, System: 13.5 ms]
  Range (min … max):   125.8 ms … 146.5 ms    21 runs

Benchmark 2: day2orc.exe
  Time (mean ± σ):     176.4 ms ±   3.1 ms    [User: 167.1 ms, System: 10.9 ms]
  Range (min … max):   172.0 ms … 182.3 ms    16 runs

Summary
  'day2markandsweep.exe' ran
    1.31 ± 0.06 times faster than 'day2orc.exe'

Finally: when should I choose which mm, and what options can I add to nim.cfg to further improve performance?

enaaab460 [2024-01-17T09:03:35+01:00]

Update: I tried benchmarking an older version of the Advent of Code script, which parses the input in one go, then loops through the parsed sequences once for the first solution and again for the second solution:

import strutils,std/strformat,tables

type ball = tuple[number: int, color: string]
type set = seq[ball]
type game = seq[set]
var allgames = newSeq[game]()

proc processInputs(inputLines: seq[string]) =
  var gameId,ballnum: int
  var sets,balls,thisball: seq[string]
  var games,ballcolor: string
  for lineI,lineV in inputLines:
    # echo inputLines
    gameId = lineI + 1
    games = lineV.split(": ")[1]
    sets = games.split("; ")
    var thisgame = newSeq[set]()
    for setI,setV in sets.pairs:
      # echo "set " & setI.string
      var thisset = newSeq[ball]()
      balls = setV.split(", ")
      for ball in balls:
        thisball = ball.split(" ")
        ballnum = parseInt(thisball[0])
        ballcolor = thisball[1]
        thisset.add((ballnum,ballcolor))
      thisgame.add(thisset)
    allgames.add(thisgame)
  # echo allgames


proc Level1(): int=
  var sum,gameId: int
  const colorLimits = {"red":12,"green":13,"blue":14}.toTable
  for gameI,gameV in allgames.pairs:
    # echo inputLines
    var impossible = false
    gameId = gameI + 1
    for set in gameV:
      for ball in set:
        if ball.number > colorLimits[ball.color]:
          impossible = true
    if impossible == false:
      sum += gameId
  return sum

proc Level2(): int=
  var gameId,sum: int
  for gameI,gameV in allgames.pairs:
    var minBalls = {"red":0,"green":0,"blue":0}.toTable
    gameId = gameI + 1
    for set in gameV:
      # echo "set " & setI.string
      for ball in set:
        if ball.number > minBalls[ball.color]:
          minBalls[ball.color] = ball.number
    var product = 1
    for color,value in minBalls:
      product *= value
    # echo product
    sum += product
  return sum


proc main() =
  let input = readFile("input.txt")
  let inputLines = input.splitLines
  processInputs(inputLines)
  echo fmt"Level1: {Level1()}"
  echo fmt"Level2: {Level2()}"

main()

orc is now faster:


nim\aoc23\day02> hyperfine day2orc.exe day2markandsweep.exe --warmup 10
Benchmark 1: day2orc.exe
  Time (mean ± σ):     245.4 ms ±   3.7 ms    [User: 233.1 ms, System: 17.0 ms]
  Range (min … max):   240.8 ms … 254.9 ms    11 runs

Benchmark 2: day2markandsweep.exe
  Time (mean ± σ):     354.9 ms ±   6.6 ms    [User: 333.1 ms, System: 22.3 ms]
  Range (min … max):   348.3 ms … 365.5 ms    10 runs

Summary
  'day2orc.exe' ran
    1.45 ± 0.03 times faster than 'day2markandsweep.exe'

My understanding is that the older version has more unnecessary loops and unnecessary sequences, so it is the less efficient script. Is the garbage collector doing more work now, and is that why orc comes out ahead? Is markandsweep better when less garbage collection is needed?

Araq [2024-01-17T09:16:39+01:00]

fibonacci is a terrible benchmark and does not use the GC anyway. The difference that you're seeing for that one might be related to the different exception handling implementations. You can try nim cpp for a "zero overhead" exception handling implementation that works with ORC.
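To put a number on what the recursive Fibonacci benchmark measures: the proc only does integer arithmetic and calls itself, so the work is essentially the number of calls. Without any inlining, computing fib(n) enters the proc 2 * fib(n + 1) - 1 times, which for n = 50 is about 4 * 10^10 calls. A small sketch that counts the calls for small n and compares them with that closed form (the names calls and countCalls are made up for this example):

```nim
var calls = 0  # global counter, used only for this illustration

proc fib(n: int): int =
  inc calls
  if n <= 1: return n
  fib(n - 1) + fib(n - 2)

proc countCalls(n: int): int =
  ## Number of times `fib` is entered while computing fib(n).
  calls = 0
  discard fib(n)
  result = calls

for n in [10, 20, 30]:
  let counted = countCalls(n)
  let predicted = 2 * fib(n + 1) - 1  # closed form for the call count
  echo "n=", n, " counted=", counted, " predicted=", predicted
```

No heap allocation happens anywhere in this code, so differences between builds come from how the backend compiles the calls rather than from the memory manager itself.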

The other benchmark ... I don't know. It's naive code and you're at the mercy of Nim's optimizer which got better after the release of 2.0 but in general is pretty unpredictable. :-)

enaaab460 [2024-01-17T09:23:09+01:00]

Update 2: I recompiled the older script with clang for both mm; now they are neck and neck:

nim\aoc23\day02> hyperfine day2orc.exe day2markandsweep.exe --warmup 10
Benchmark 1: day2orc.exe
  Time (mean ± σ):     237.1 ms ±   1.5 ms    [User: 220.8 ms, System: 19.0 ms]
  Range (min … max):   234.7 ms … 239.7 ms    12 runs

Benchmark 2: day2markandsweep.exe
  Time (mean ± σ):     248.5 ms ±   4.0 ms    [User: 232.5 ms, System: 12.0 ms]
  Range (min … max):   241.8 ms … 253.9 ms    11 runs

Summary
  'day2orc.exe' ran
    1.05 ± 0.02 times faster than 'day2markandsweep.exe'

enaaab460 [2024-01-17T11:28:07+01:00]

Thanks!

Zoom [2024-01-17T12:40:07+01:00]

Now comes another question, when to use gcc and when to use clang?

Use gcc unless you need clang. If performance is critical, measure every time.

enaaab460 [2024-01-17T12:47:58+01:00]

Thanks

PMunch [2024-01-17T13:14:49+01:00]

In general, if performance is critical you have to benchmark.

cblake [2024-01-17T15:31:09+01:00]

@enaaab460 - WELCOME!

Also, 3 things:

PGO can also be very useful (2..3X) - or not at all or even hurt! - and may be worth a try: https://forum.nim-lang.org/t/6295

To expand on why recursive Fibonacci is a terrible benchmark (it absolutely is; @Araq is 100% right): some backend compilers can partially inline the recursion, which for recursive Fibonacci makes performance exponentially sensitive (1.62^n) to how much inlining happens. To compound the problem (and to explain why the difference may or may not be due to exceptions, or get better with C++), backend compiler optimizers are very finicky about the exact shape of the code needed to engage this optimization. Unless you are doing gprof-style call counting, you have not known how much actual work recursive Fibonacci is doing for something like 10..15 years. On modern CPUs the number of function calls could be off by something like 32X. If you could steer the amount of inlining, you could tune the result to almost anything. { This bad benchmark will live on forever, as they all do, just as I was trying to understand literally 50-year-old Fortran code yesterday. I think people underestimate how much f2py made Python take off for numerical work. }

If you care enough about timing precision to do 10 warm-up runs, you will probably get better accuracy for the happy path/hot cache time via bu/tim.nim than via hyperfine. Specifically, the minimum time is the measurement least contaminated by the huge network of queues and caches that any modern CPU/timesharing OS has. For an A/B time comparison one does want an error bar, so you want some error estimate on the estimator of the minimum; tim just uses repeated runs for that. { Otherwise you probably don't want a summary number at all, but rather the whole distribution function to display the aforementioned vulnerability, or at least 20 or more quantiles. A good estimate of that requires far more data than the minimum and may also be very hard to make reproducible, given its sensitivity to the entire state of a test system. }
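The "minimum plus an error estimate from repeated runs" idea can be sketched in a few lines. This is only an in-process illustration under stated assumptions, not how bu/tim is implemented (tim times whole programs); work is a made-up stand-in workload and minTimeNs is a hypothetical helper name:

```nim
import std/[monotimes, times]

proc work(): int =
  # Hypothetical workload standing in for the program under test.
  for i in 1 .. 1_000_000:
    result += i mod 7

proc minTimeNs(runs: int): int64 =
  ## Smallest wall-clock time over `runs` executions of `work`, in nanoseconds.
  result = high(int64)
  for _ in 1 .. runs:
    let t0 = getMonoTime()
    discard work()
    let dt = inNanoseconds(getMonoTime() - t0)
    result = min(result, dt)

# Repeat the whole "minimum of 10 runs" estimate several times; the spread
# of those minima serves as a crude error bar on the minimum.
var minima: seq[int64]
for _ in 1 .. 5:
  minima.add minTimeNs(10)

echo "best: ", min(minima), " ns, spread of minima: ",
  max(minima) - min(minima), " ns"
```

Two builds can then be called different only if their best times are further apart than the spread. Note that with aggressive optimization a backend may remove a workload whose result is discarded, so a real measurement should use the result somehow.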

enaaab460 [2024-01-17T16:12:14+01:00]

Thanks! I appreciate the writeup. I was mainly trying to find a one-size-fits-all compile command, but things seem far more complicated than I thought. I am new to Nim (and Go, and intermediate at Python), and Nim was faster than Go (Fibonacci took 58 seconds in Go, and I'm not foolish enough to benchmark Python). Fibonacci with markandsweep was even faster than Rust 1.67 (19.5 seconds), so I got greedy trying to find how far I can push performance (without too much of a headache). For now I will keep my nim.cfg (and occasionally mess with clang or markandsweep); it is already sufficiently fast for what I could use it for.

Thanks again!

enaaab460 [2024-01-18T20:42:37+01:00]

Update 3: I made a script to automate the benchmarking to some degree, using hyperfine:

import std/[osproc,os],strformat,strutils,times,algorithm
let starttime = cpuTime()
type paramOpt = tuple[name,cmd: string]
let allparameters:seq[paramOpt]= @[("danger","-d:danger"),("clang","--cc:clang"),("mas","--mm:markandsweep"),("threadsoff","--threads:off")]
var possibleCombos = newSeq[paramOpt]()
proc getpossibleCombos =
  var base:paramOpt
  let allparameterslen = allparameters.len
  var basepoint,afterbase,length:int
  proc appendTuple(b:seq[paramOpt]): paramOpt=
    var a:paramOpt
    for x in b:
      a.name.add(x.name & " ")
      a.cmd.add(x.cmd & " ")
    return a
  while length < allparameterslen-2:
    afterbase = basepoint+length
    while basepoint<allparameterslen-length:
      for thisoption in allparameters[afterbase..<allparameterslen]: possibleCombos.add((base.name & thisoption.name,base.cmd & thisoption.cmd))
      base = appendTuple(allparameters[basepoint..basepoint+length])
      inc afterbase
      inc basepoint
    basepoint = 0
    inc length
    base = appendTuple(allparameters[basepoint..basepoint+length])
    inc basepoint
  var lastitem:paramOpt
  lastitem = appendTuple(allparameters)
  possibleCombos.add lastitem
getpossibleCombos()

proc findmean(output: string): (float,string)=
# proc findmean(output: string): string=
  var lines = output.splitLines(false)
  for line in lines:
    if line.contains("mean"):
      var tmp1 = line.split(":")[1]
      var tmp2 = tmp1.splitWhitespace()
      var tmp3 = tmp2[1..4].join " "
      # return tmp2.join " "
      return (tmp2[0].parseFloat,tmp3)

proc sysexec(command: string): tuple=
  var res = execCmdEx(command)
  if res.exitCode != 0:
    echo 1, " " & command
    echo res.output
    system.quit(1)
  return res

# type result= tuple[time,name: string]
type result= tuple[time:(float,string),name: string]
var benchSlice: seq[result]
var benchdir = "bench" & format(times.now(),"yyyyMMdd-HHmm")
os.createDir(benchdir)
var sourcenim = os.paramStr(1)
var res = sysexec(fmt"nim --hints:off c -r --o:{benchdir}\base.exe {sourcenim}")
echo "base:  ",res.output
# var baseoutput = res.output
res = sysexec(fmt"hyperfine {benchdir}\base.exe")
benchSlice.add((res.output.findmean,"base.exe"))
# var params:string
# var warmup = if (os.paramCount() == 2): os.paramStr(2) else: 10.intToStr
for lang in ["c","cpp"]:
  for param in possibleCombos:
    var targetexe = "\"" & benchdir & "\\" & param.name & ".exe\""
    res = sysexec(&"nim {lang} --hints:off -r {param.cmd} --o:{targetexe} {sourcenim}")
    echo param.cmd,":  ",res.output
    #   system.quit(1)
    res = sysexec(fmt"hyperfine -i --show-output {targetexe}")
    # var res = sysexec(fmt"hyperfine {targetexe} --warmup {warmup}")
    benchSlice.add((res.output.findmean,lang & " " & param.name))
benchSlice.sort(proc (a,b:result):int = cmp(a.time[0],b.time[0]))
let benchmarkfile = benchdir & "\\" & "benchmark.txt"
try: os.removeFile(benchmarkfile)
except: discard
var benchstr = benchSlice.join("\n")
echo benchstr
echo "Benchmarked ", possibleCombos.len * 2 + 1, " programs in ", cpuTime() - starttime, "s"
writeFile(benchmarkfile,benchstr)
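The combination-building loop in the script is fairly intricate, and the results listed for it contain 14 combinations per backend, whereas four switches have 15 non-empty subsets (danger mas threadsoff does not appear). A simpler way to enumerate every combination is a bitmask over the option list. A minimal sketch of that alternative, not the method the script uses; ParamOpt mirrors the script's paramOpt type and allCombos is a made-up name:

```nim
import std/strutils

type ParamOpt = tuple[name, cmd: string]

proc allCombos(opts: seq[ParamOpt]): seq[ParamOpt] =
  ## Every non-empty subset of `opts`, with names and switches
  ## joined by single spaces.
  for mask in 1 ..< (1 shl opts.len):
    var names, cmds: seq[string]
    for i, opt in opts:
      # Bit i of the mask decides whether option i is in this subset.
      if (mask and (1 shl i)) != 0:
        names.add opt.name
        cmds.add opt.cmd
    result.add((names.join(" "), cmds.join(" ")))

let opts: seq[ParamOpt] = @[("danger", "-d:danger"), ("clang", "--cc:clang"),
                            ("mas", "--mm:markandsweep"),
                            ("threadsoff", "--threads:off")]
let combos = allCombos(opts)
echo combos.len, " combinations"
for c in combos:
  echo c.name, " -> ", c.cmd
```

With n options this yields 2^n - 1 entries, so the number of compile-and-benchmark runs doubles with each added switch.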

Fibonacci 50 fares as follows:


(time: (92.3, "92.3 ms ± 4.2 ms"), name: "c danger clang mas")
(time: (101.0, "101.0 ms ± 7.2 ms"), name: "c danger clang mas threadsoff ")
(time: (105.9, "105.9 ms ± 11.3 ms"), name: "cpp danger clang mas")
(time: (107.9, "107.9 ms ± 7.4 ms"), name: "c clang mas")
(time: (109.6, "109.6 ms ± 9.6 ms"), name: "c clang mas threadsoff")
(time: (118.8, "118.8 ms ± 9.7 ms"), name: "cpp clang mas")
(time: (122.4, "122.4 ms ± 8.2 ms"), name: "cpp clang mas threadsoff")
(time: (123.9, "123.9 ms ± 10.1 ms"), name: "cpp danger clang mas threadsoff ")
(time: (135.9, "135.9 ms ± 9.3 ms"), name: "c danger mas")
(time: (146.4, "146.4 ms ± 5.6 ms"), name: "cpp danger mas")
(time: (172.0, "172.0 ms ± 5.9 ms"), name: "c danger clang threadsoff")
(time: (173.5, "173.5 ms ± 6.9 ms"), name: "c danger threadsoff")
(time: (175.3, "175.3 ms ± 10.4 ms"), name: "c mas")
(time: (176.7, "176.7 ms ± 6.1 ms"), name: "cpp danger threadsoff")
(time: (178.9, "178.9 ms ± 11.4 ms"), name: "cpp danger clang threadsoff")
(time: (179.3, "179.3 ms ± 6.5 ms"), name: "cpp danger clang")
(time: (179.4, "179.4 ms ± 5.2 ms"), name: "c danger clang")
(time: (180.8, "180.8 ms ± 3.0 ms"), name: "c mas threadsoff")
(time: (181.5, "181.5 ms ± 4.6 ms"), name: "cpp danger")
(time: (181.6, "181.6 ms ± 7.4 ms"), name: "c danger")
(time: (184.8, "184.8 ms ± 6.2 ms"), name: "c clang threadsoff")
(time: (188.8, "188.8 ms ± 8.2 ms"), name: "c threadsoff")
(time: (188.9, "188.9 ms ± 10.7 ms"), name: "cpp clang threadsoff")
(time: (189.9, "189.9 ms ± 12.3 ms"), name: "cpp mas")
(time: (191.3, "191.3 ms ± 7.8 ms"), name: "cpp mas threadsoff")
(time: (191.4, "191.4 ms ± 5.6 ms"), name: "c clang")
(time: (194.6, "194.6 ms ± 5.5 ms"), name: "cpp clang")
(time: (209.6, "209.6 ms ± 3.0 ms"), name: "cpp threadsoff")
(time: (327.6, "327.6 ms ± 6.5 ms"), name: "base.exe")

where mas is markandsweep

Mirror of forum.nim-lang.org

10880 :: orc mm slower than markandsweep in my experience