I will preface this by saying I just picked up nim 2.0.2 this week, coming from python and golang. I started experimenting with nim.cfg to find the settings that give the fastest performance. This is how it looks now:
cc = gcc
passC = "-flto -march=native -O4"
passL = "-flto -s"
d=danger
d=useMalloc
d=lto
--- the rest of the file is untouched ---
# additional options always passed to the compiler:
--parallel_build: "0"
....
I have benchmarked 2 scenarios where the default orc mm is clearly slower than markandsweep (roughly 3.6x in the first, 1.3x in the second).
The first script is a recursive Fibonacci up to 50, which looks as follows:
proc fibonacci*(n: int32): int =
  if n <= 1: return n
  fibonacci(n-1) + fibonacci(n-2)
discard fibonacci(50)
Benchmarking it with hyperfine using the default mm:
nim\vsgo> nim c --hints:off --o:fibonacciorc.exe fibonacci.nim | hyperfine fibonacciorc.exe
Benchmark 1: fibonacciorc.exe
Time (mean ± σ): 56.496 s ± 1.153 s [User: 56.143 s, System: 0.025 s]
Range (min … max): 53.893 s … 57.685 s 10 runs
Benchmarking it with markandsweep:
nim\vsgo> nim c --hints:off --mm:markandsweep --o:fibonaccimarkandsweep.exe fibonacci.nim | hyperfine fibonaccimarkandsweep.exe
Benchmark 1: fibonaccimarkandsweep.exe
Time (mean ± σ): 15.840 s ± 0.141 s [User: 15.797 s, System: 0.013 s]
Range (min … max): 15.675 s … 16.074 s 10 runs
The second benchmark is this year's Advent of Code day 2: read a file, parse the game moves, and calculate some parameters from that input. The original input file was 100 lines long; I copied its contents until it reached 20,000 lines.
I used the same nim commands as above and ran hyperfine on both exes:
nim\aoc23\day02> hyperfine day2markandsweep.exe day2orc.exe --warmup 10
Benchmark 1: day2markandsweep.exe
Time (mean ± σ): 134.8 ms ± 5.7 ms [User: 121.6 ms, System: 13.5 ms]
Range (min … max): 125.8 ms … 146.5 ms 21 runs
Benchmark 2: day2orc.exe
Time (mean ± σ): 176.4 ms ± 3.1 ms [User: 167.1 ms, System: 10.9 ms]
Range (min … max): 172.0 ms … 182.3 ms 16 runs
Summary
'day2markandsweep.exe' ran
1.31 ± 0.06 times faster than 'day2orc.exe'
Finally: when should I choose which mm, and what options can I add to nim.cfg to further improve performance?
Update: I tried benchmarking an older version of the Advent of Code script, which parses the input in one go, then loops through the parsed sequences once for the first solution, then again for the second solution:
import strutils,std/strformat,tables
type ball = tuple[number: int, color: string]
type set = seq[ball]
type game = seq[set]
var allgames = newSeq[game]()
proc processInputs(inputLines: seq[string]) =
  var gameId,ballnum: int
  var sets,balls,thisball: seq[string]
  var games,ballcolor: string
  for lineI,lineV in inputLines:
    # echo inputLines
    gameId = lineI + 1
    games = lineV.split(": ")[1]
    sets = games.split("; ")
    var thisgame = newSeq[set]()
    for setI,setV in sets.pairs:
      # echo "set " & setI.string
      var thisset = newSeq[ball]()
      balls = setV.split(", ")
      for ball in balls:
        thisball = ball.split(" ")
        ballnum = parseInt(thisball[0])
        ballcolor = thisball[1]
        thisset.add((ballnum,ballcolor))
      thisgame.add(thisset)
    allgames.add(thisgame)
  # echo allgames
proc Level1(): int=
  var sum,gameId: int
  const colorLimits = {"red":12,"green":13,"blue":14}.toTable
  for gameI,gameV in allgames.pairs:
    # echo inputLines
    var impossible = false
    gameId = gameI + 1
    for set in gameV:
      for ball in set:
        if ball.number > colorLimits[ball.color]:
          impossible = true
    if impossible == false:
      sum += gameId
  return sum
proc Level2(): int=
  var gameId,sum: int
  for gameI,gameV in allgames.pairs:
    var minBalls = {"red":0,"green":0,"blue":0}.toTable
    gameId = gameI + 1
    for set in gameV:
      # echo "set " & setI.string
      for ball in set:
        if ball.number > minBalls[ball.color]:
          minBalls[ball.color] = ball.number
    var product = 1
    for color,value in minBalls:
      product *= value
    # echo product
    sum += product
  return sum
proc main() =
  let input = readFile("input.txt")
  let inputLines = input.splitLines
  processInputs(inputLines)
  echo fmt"Level1: {Level1()}"
  echo fmt"Level2: {Level2()}"
main()
With this version, orc is now faster:
nim\aoc23\day02> hyperfine day2orc.exe day2markandsweep.exe --warmup 10
Benchmark 1: day2orc.exe
Time (mean ± σ): 245.4 ms ± 3.7 ms [User: 233.1 ms, System: 17.0 ms]
Range (min … max): 240.8 ms … 254.9 ms 11 runs
Benchmark 2: day2markandsweep.exe
Time (mean ± σ): 354.9 ms ± 6.6 ms [User: 333.1 ms, System: 22.3 ms]
Range (min … max): 348.3 ms … 365.5 ms 10 runs
Summary
'day2orc.exe' ran
1.45 ± 0.03 times faster than 'day2markandsweep.exe'
My understanding is that the older version has more unnecessary loops and unnecessary sequences, so it is the less efficient script. Is the garbage collection working more now, and therefore orc is better? Is markandsweep better if less garbage collection is needed?
fibonacci is a terrible benchmark and does not use the GC anyway. The difference that you're seeing for that one might be related to the different exception handling implementations. You can try nim cpp for a "zero overhead" exception handling implementation that works with ORC.
The other benchmark ... I don't know. It's naive code and you're at the mercy of Nim's optimizer which got better after the release of 2.0 but in general is pretty unpredictable. :-)
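To make "naive code" concrete, here is a minimal sketch of the same puzzle logic that allocates far less: colors become an enum instead of strings, and each game keeps only three running maxima instead of nested sequences of tuples. The names `Color`, `limits`, `maxPerColor` and `solve` are hypothetical and not taken from the script above, and whether this changes the orc versus markandsweep gap is untested here.

```nim
# Sketch only: same Level1/Level2 rules as the script above, fewer allocations.
import std/strutils

type Color = enum red, green, blue

# Per-color limits used by Level1 in the original script.
const limits: array[Color, int] = [red: 12, green: 13, blue: 14]

proc maxPerColor(line: string): array[Color, int] =
  ## Largest count seen for each color in one "Game N: ..." line.
  let body = line.split(": ")[1]
  for draw in body.split("; "):
    for item in draw.split(", "):
      let parts = item.split(" ")
      let color = parseEnum[Color](parts[1])
      result[color] = max(result[color], parseInt(parts[0]))

proc solve(lines: seq[string]): tuple[level1, level2: int] =
  ## level1: sum of ids of possible games; level2: sum of max products.
  for i, line in lines:
    if line.len == 0: continue  # tolerate a trailing empty line
    let m = maxPerColor(line)
    if m[red] <= limits[red] and m[green] <= limits[green] and
       m[blue] <= limits[blue]:
      result.level1 += i + 1
    result.level2 += m[red] * m[green] * m[blue]

echo solve(@["Game 1: 3 blue, 4 red; 1 red, 2 green, 6 blue; 2 green",
             "Game 2: 1 blue, 2 green; 3 green, 4 blue, 15 red"])
```

A game is impossible exactly when one of its per-color maxima exceeds the limit, so both answers can be computed in a single pass without storing the parsed games.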
Update 2: I recompiled the older script with clang for both mm, and now they are neck and neck:
nim\aoc23\day02> hyperfine day2orc.exe day2markandsweep.exe --warmup 10
Benchmark 1: day2orc.exe
Time (mean ± σ): 237.1 ms ± 1.5 ms [User: 220.8 ms, System: 19.0 ms]
Range (min … max): 234.7 ms … 239.7 ms 12 runs
Benchmark 2: day2markandsweep.exe
Time (mean ± σ): 248.5 ms ± 4.0 ms [User: 232.5 ms, System: 12.0 ms]
Range (min … max): 241.8 ms … 253.9 ms 11 runs
Summary
'day2orc.exe' ran
1.05 ± 0.02 times faster than 'day2markandsweep.exe'
Now comes another question, when to use gcc and when to use clang?
Use gcc unless you need clang. If performance is critical, measure every time.
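For quick checks between hyperfine runs, a rough in-process timer can help with "measure every time". This is a minimal sketch using `std/monotimes`; the helper name `bestOf` and the run count are assumptions for illustration. It measures only the call itself and ignores process startup, so the numbers are not directly comparable to hyperfine's.

```nim
# Sketch only: time a proc inside the program and keep the fastest run.
import std/[monotimes, times]

proc fibonacci(n: int): int =
  if n <= 1: return n
  fibonacci(n - 1) + fibonacci(n - 2)

proc bestOf(runs, n: int): tuple[best: Duration, value: int] =
  ## Runs fibonacci(n) `runs` times and returns the smallest elapsed time.
  ## The computed value is returned too, so the call cannot be optimized away.
  result.best = initDuration(days = 1)
  for _ in 1 .. runs:
    let start = getMonoTime()
    result.value = fibonacci(n)
    let elapsed = getMonoTime() - start
    if elapsed < result.best: result.best = elapsed

let r = bestOf(5, 25)
echo "fibonacci(25) = ", r.value, ", best of 5: ", r.best
```

Compile the same file once per compiler or mm setting and compare the reported best times.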
@enaaaab450 - WELCOME!
Also, 3 things:
Thanks! I appreciate the writeup. I was mainly trying to find a one-size-fits-all compile command, but things seem far more complicated than I thought. I am new to Nim (and golang, and intermediate at python), and it was faster than golang (fibonacci took 58 seconds in Go, and I'm not foolish enough to benchmark python). Fibonacci with markandsweep was even faster than rust 1.67 (19.5 seconds), so I got greedy trying to find how far I can push performance (without too much of a headache). For now I will keep my nim.cfg (and occasionally mess with clang or markandsweep); it is already sufficiently fast for what I could use it for.
Thanks again!
Update 3: I made a script to automate the benchmarking to some degree, using hyperfine:
import std/[osproc,os],strformat,strutils,times,algorithm
let starttime = cpuTime()
type paramOpt = tuple[name,cmd: string]
let allparameters:seq[paramOpt]= @[("danger","-d:danger"),("clang","--cc:clang"),("mas","--mm:markandsweep"),("threadsoff","--threads:off")]
var possibleCombos = newSeq[paramOpt]()
proc getpossibleCombos =
  var base:paramOpt
  let allparameterslen = allparameters.len
  var basepoint,afterbase,length:int
  proc appendTuple(b:seq[paramOpt]): paramOpt=
    var a:paramOpt
    for x in b:
      a.name.add(x.name & " ")
      a.cmd.add(x.cmd & " ")
    return a
  while length < allparameterslen-2:
    afterbase = basepoint+length
    while basepoint<allparameterslen-length:
      for thisoption in allparameters[afterbase..<allparameterslen]: possibleCombos.add((base.name & thisoption.name,base.cmd & thisoption.cmd))
      base = appendTuple(allparameters[basepoint..basepoint+length])
      inc afterbase
      inc basepoint
    basepoint = 0
    inc length
    base = appendTuple(allparameters[basepoint..basepoint+length])
    inc basepoint
  var lastitem:paramOpt
  lastitem = appendTuple(allparameters)
  possibleCombos.add lastitem
getpossibleCombos()
proc findmean(output: string): (float,string)=
  # proc findmean(output: string): string=
  var lines = output.splitLines(false)
  for line in lines:
    if line.contains("mean"):
      var tmp1 = line.split(":")[1]
      var tmp2 = tmp1.splitWhitespace()
      var tmp3 = tmp2[1..4].join " "
      # return tmp2.join " "
      return (tmp2[0].parseFloat,tmp3)
proc sysexec(command: string): tuple=
  var res = execCmdEx(command)
  if res.exitCode != 0:
    echo 1, " " & command
    echo res.output
    system.quit(1)
  return res
# type result= tuple[time,name: string]
type result= tuple[time:(float,string),name: string]
var benchSlice: seq[result]
var benchdir = "bench" & format(times.now(),"yyyyMMdd-HHmm")
os.createDir(benchdir)
var sourcenim = os.paramStr(1)
var res = sysexec(fmt"nim --hints:off c -r --o:{benchdir}\base.exe {sourcenim}")
echo "base: ",res.output
# var baseoutput = res.output
res = sysexec(fmt"hyperfine {benchdir}\base.exe")
benchSlice.add((res.output.findmean,"base.exe"))
# var params:string
# var warmup = if (os.paramCount() == 2): os.paramStr(2) else: 10.intToStr
for lang in ["c","cpp"]:
  for param in possibleCombos:
    var targetexe = "\"" & benchdir & "\\" & param.name & ".exe\""
    res = sysexec(&"nim {lang} --hints:off -r {param.cmd} --o:{targetexe} {sourcenim}")
    echo param.cmd,": ",res.output
    # system.quit(1)
    res = sysexec(fmt"hyperfine -i --show-output {targetexe}")
    # var res = sysexec(fmt"hyperfine {targetexe} --warmup {warmup}")
    benchSlice.add((res.output.findmean,lang & " " & param.name))
benchSlice.sort(proc (a,b:result):int = cmp(a.time[0],b.time[0]))
let benchmarkfile = benchdir & "\\" & "benchmark.txt"
try: os.removeFile(benchmarkfile)
except: discard
var benchstr = benchSlice.join("\n")
echo benchstr
echo "Benchmarked ", possibleCombos.len * 2 + 1, " programs in ", cpuTime() - starttime, "s"
writeFile(benchmarkfile,benchstr)
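The nested loops in getpossibleCombos appear to yield 14 flag combinations, while four flags have 15 non-empty subsets (the combination of danger, mas and threadsoff seems to be the one left out). If the goal is to cover every combination, a bitmask enumeration is one simple alternative. The sketch below uses a different technique from the script; `ParamOpt`, `flags` and `allCombos` are hypothetical names that mirror `paramOpt` and `allparameters`.

```nim
# Sketch only: enumerate every non-empty subset of the flag list.
import std/strutils

type ParamOpt = tuple[name, cmd: string]

let flags: seq[ParamOpt] = @[
  ("danger", "-d:danger"),
  ("clang", "--cc:clang"),
  ("mas", "--mm:markandsweep"),
  ("threadsoff", "--threads:off")]

proc allCombos(opts: seq[ParamOpt]): seq[ParamOpt] =
  ## Bit i of `mask` decides whether opts[i] is part of the combination.
  for mask in 1 ..< (1 shl opts.len):
    var names, cmds: seq[string]
    for i, opt in opts:
      if (mask and (1 shl i)) != 0:
        names.add opt.name
        cmds.add opt.cmd
    result.add((names.join(" "), cmds.join(" ")))

let combos = allCombos(flags)
echo combos.len, " combinations"
for c in combos:
  echo c.name, " => ", c.cmd
```

Joining with `join(" ")` also avoids the trailing space that ends up in the executable name when every flag is selected.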
Fibonacci 50 fares as follows:
(time: (92.3, "92.3 ms ± 4.2 ms"), name: "c danger clang mas")
(time: (101.0, "101.0 ms ± 7.2 ms"), name: "c danger clang mas threadsoff ")
(time: (105.9, "105.9 ms ± 11.3 ms"), name: "cpp danger clang mas")
(time: (107.9, "107.9 ms ± 7.4 ms"), name: "c clang mas")
(time: (109.6, "109.6 ms ± 9.6 ms"), name: "c clang mas threadsoff")
(time: (118.8, "118.8 ms ± 9.7 ms"), name: "cpp clang mas")
(time: (122.4, "122.4 ms ± 8.2 ms"), name: "cpp clang mas threadsoff")
(time: (123.9, "123.9 ms ± 10.1 ms"), name: "cpp danger clang mas threadsoff ")
(time: (135.9, "135.9 ms ± 9.3 ms"), name: "c danger mas")
(time: (146.4, "146.4 ms ± 5.6 ms"), name: "cpp danger mas")
(time: (172.0, "172.0 ms ± 5.9 ms"), name: "c danger clang threadsoff")
(time: (173.5, "173.5 ms ± 6.9 ms"), name: "c danger threadsoff")
(time: (175.3, "175.3 ms ± 10.4 ms"), name: "c mas")
(time: (176.7, "176.7 ms ± 6.1 ms"), name: "cpp danger threadsoff")
(time: (178.9, "178.9 ms ± 11.4 ms"), name: "cpp danger clang threadsoff")
(time: (179.3, "179.3 ms ± 6.5 ms"), name: "cpp danger clang")
(time: (179.4, "179.4 ms ± 5.2 ms"), name: "c danger clang")
(time: (180.8, "180.8 ms ± 3.0 ms"), name: "c mas threadsoff")
(time: (181.5, "181.5 ms ± 4.6 ms"), name: "cpp danger")
(time: (181.6, "181.6 ms ± 7.4 ms"), name: "c danger")
(time: (184.8, "184.8 ms ± 6.2 ms"), name: "c clang threadsoff")
(time: (188.8, "188.8 ms ± 8.2 ms"), name: "c threadsoff")
(time: (188.9, "188.9 ms ± 10.7 ms"), name: "cpp clang threadsoff")
(time: (189.9, "189.9 ms ± 12.3 ms"), name: "cpp mas")
(time: (191.3, "191.3 ms ± 7.8 ms"), name: "cpp mas threadsoff")
(time: (191.4, "191.4 ms ± 5.6 ms"), name: "c clang")
(time: (194.6, "194.6 ms ± 5.5 ms"), name: "cpp clang")
(time: (209.6, "209.6 ms ± 3.0 ms"), name: "cpp threadsoff")
(time: (327.6, "327.6 ms ± 6.5 ms"), name: "base.exe")
where mas is markandsweep