Awesome! Some ideas:
txs
Why the emits instead of Nim's asm blocks?
I used emits because I couldn't get straight asm blocks to work. I started on 10.2 and I'm now on 10.3, and it's simpler than when I first tried, so I'll try again.
Measure the measuring overhead and subtract it from the results.
Good idea
Maybe you could also use suite and test as names, then you can use benchmark.suite and unittest.suite to differentiate.
Hmm, I didn't want to have to use a full name when I'm using unittest to test benchmark, but if you think that would be a better choice it's fine with me.
My system doesn't seem to have rdtscp yet (only rdtsc). Could be detected with cpuid I guess.
Yes, I'll work on detecting that, there is an alternative that Intel suggests when rdtscp isn't available.
Maybe some more platform independent way than rdtsc for other platforms, maybe clock_gettime for posix: https://github.com/def-/nim-bench/blob/master/tests.nim#L21-L24
Definitely, but getting to this stage took me a long time since I started at nearly zero and wanted to get feedback such as this. Also, I wanted to be able to consistently measure the performance difference of changing a single instruction when testing very small snippets of code, and that probably needs something like rdtsc. But yes, this needs to be able to run on all platforms, although YMMV :)
> Hmm, I didn't want to have to use a full name when I'm using unittest to test benchmark, but if you think that would be a better choice it's fine with me.
I think module namespaces should be preferred to such prefixes. You can alias module names:
import unittest as ut

ut.suite "foo":
  echo "foo"
I guess you've seen the ideas on the GSoC page: https://github.com/Araq/Nim/wiki/GSoC-2015-Ideas#benchmarking-library
I've created a change to subtract the overhead (https://github.com/winksaville/nim-benchmark/tree/Subtract-overhead) and I'd like people to review it; is anyone interested?
Is there a way to add reviewers? For the life of me I can't see how.
Finally, is it considered rude to add a reviewer to a change without asking first?
@def: Yes, I'd seen the GSoC proposal, hopefully what I've done will help.
@jester: I'm not sure what you mean by your post. Are you suggesting a mechanism to record the best measurement for later comparison to changes in the future?
What does "Stop forcing other people to validate your dreams for you." mean?
@winksaville: really enjoying reading your code, learning a lot (e.g. your ref to http://goo.gl/tzKu65).
I could not access
https://github.com/winksaville/nim-benchmark/tree/Subtract-overhead
Did you push the branch? Is it public? I can't review your code, it's over my head; I'm just following along.
@Renesac: Actually, in modern Intel processors the TSC runs at a constant frequency, although you are correct that if a thread migrates to a new "core" the TSC can be incorrect. But the RDTSCP instruction also returns the "logical" core id, so this can be detected. So as far as I can tell, using TSC should be the preferred method when it's available. See this paper from Intel, http://goo.gl/tzKu65, which is where this information comes from; I've tried to follow its suggestions as best as possible.
@def: I've just updated the code to detect if RDTSCP is supported, please see if it works for you.
> I've just updated the code to detect if RDTSCP is supported, please see if it works for you.
Works fine, all tests succeed.
But the benchmark results seem to fluctuate quite a lot. Not sure how to solve this. That's also where I stopped investigating when I tried to get some micro-benchmarking code running. Maybe criterion's source code has some insights.
If it runs at a constant frequency then it isn't measuring actual execution cycles, but time... Well, that is what I meant when I said it is difficult to know what it really measures. I bet the equivalent of RDTSCP on ARM or MIPS will behave differently.
@def: what if you put the code into a long loop?
repax: gettimeofday() wraps at midnight.
No. "The gettimeofday() function shall obtain the current time, expressed as seconds and microseconds since the Epoch, and store it in the timeval structure pointed to by tp."
There is no wrapping involved.
> @def: what if you put the code into a long loop?
Then the compiler is smart enough to optimize it away. I'm just trying the bminc.nim example.
@def and anyone else interested in testing: I added more debug output and the capability to control it; see the latest commit at https://github.com/winksaville/nim-benchmark.
I found that rdtscp was returning negative values quite often and for long periods of time, though it would generally fix itself. I've changed the routine getEndCycles to use rdtsc instead of rdtscp; it's slightly different from getEndCyclesNoRdtscp in that it uses rdTscAux(), which is rdtscp.
Anyway, @def, I'd like to see the output of bminc2 from your computer, which doesn't have rdtscp, to see if I can understand why you might be getting inconsistent results. Or maybe you've got your own code you're benchmarking for which you can enable verbose and DEBUG. The file will likely be huge; you can upload it to my Google Drive at <http://goo.gl/2dPJt9> if you wish.
@def, thanks for the data. It didn't have "everything" because you didn't uncomment line 3 of exmpl/bminc2.nim:
bmDefaultVerbosity = Verbosity.verbose
But actually it looks like you got a good run to me ignoring the "rdtsc:-".
Looking at the first line we see your TSC cyclesPerSec is 2.5GHz (i.e. the speed of your cpu), verbosity is "normal" (i.e. you didn't uncomment line 3), overhead=862cy (this is large because of the debug output in rdtsc) and hasRDTSCP=false.
Looking at the 10 buckets for "increment.inc" we see the min values vary between 8 and 30 cycles and the means from 34 to 802 cycles. That's pretty good really; a lot of variability is to be expected. The question is: if we wanted to choose one number for the performance of inc, what would it be?
The answer I'd give is probably 23cy. I wouldn't say 8 or 16 because they only occurred 4 times, whereas 23cy was the minimum nearly 10% of the time. Of course you could also choose one of the "mean" values from bms[4] or bms[5]. I would ignore the upper buckets, especially bms[9], which has a mean of 802ns and thus represents times when the routine was significantly interrupted.
$ grep -v "rdtsc:-" def-data/x.txt
bmso={suiteName=increment testName=nil fullName=nil cyclesPerSec=2499540692.000s cyclesToSecThreshold=1000.000s verbosity=normal overhead=862cy hasRDTSCP=false}
increment.inc: bms[0]={min=8cy mean=34cy minC=1 n=39699}
increment.inc: bms[1]={min=16cy mean=37cy minC=3 n=39699}
increment.inc: bms[2]={min=23cy mean=39cy minC=1579 n=39699}
increment.inc: bms[3]={min=23cy mean=41cy minC=815 n=39699}
increment.inc: bms[4]={min=23cy mean=44cy minC=353 n=39699}
increment.inc: bms[5]={min=23cy mean=47cy minC=122 n=39699}
increment.inc: bms[6]={min=23cy mean=50cy minC=34 n=39699}
increment.inc: bms[7]={min=23cy mean=56cy minC=6 n=39699}
increment.inc: bms[8]={min=30cy mean=95cy minC=26 n=39699}
increment.inc: bms[9]={min=30cy mean=802.039ns minC=2 n=39699}
increment.atomicInc: bms[0]={min=30cy mean=50cy minC=18 n=39264}
increment.atomicInc: bms[1]={min=38cy mean=54cy minC=1606 n=39264}
increment.atomicInc: bms[2]={min=38cy mean=56cy minC=640 n=39264}
increment.atomicInc: bms[3]={min=38cy mean=58cy minC=220 n=39264}
increment.atomicInc: bms[4]={min=38cy mean=60cy minC=43 n=39264}
increment.atomicInc: bms[5]={min=38cy mean=63cy minC=4 n=39264}
increment.atomicInc: bms[6]={min=45cy mean=67cy minC=124 n=39264}
increment.atomicInc: bms[7]={min=45cy mean=73cy minC=21 n=39264}
increment.atomicInc: bms[8]={min=45cy mean=126cy minC=1 n=39264}
increment.atomicInc: bms[9]={min=46cy mean=873.708ns minC=2 n=39264}
But in actual fact, if you run with DEBUG false and Verbosity.normal, I think you'll see a smaller min, maybe 0cy as I see on my MacBook Pro. So an increment in a tight loop with a few other instructions costs "nothing", whereas atomicInc costs 14cy.
$ exmpl/bminc2
bmso={suiteName=increment testName=nil fullName=nil cyclesPerSec=2594770888.000s cyclesToSecThreshold=1000.000s verbosity=normal overhead=12cy hasRDTSCP=false}
increment.inc: bms[0]={min=0cy mean=0cy minC=406838 n=407705}
increment.inc: bms[1]={min=0cy mean=0cy minC=406815 n=407705}
increment.inc: bms[2]={min=0cy mean=0cy minC=406624 n=407705}
increment.inc: bms[3]={min=0cy mean=0cy minC=405335 n=407705}
increment.inc: bms[4]={min=0cy mean=0cy minC=404328 n=407705}
increment.inc: bms[5]={min=0cy mean=1cy minC=363867 n=407705}
increment.inc: bms[6]={min=0cy mean=7cy minC=193280 n=407705}
increment.inc: bms[7]={min=0cy mean=10cy minC=129561 n=407705}
increment.inc: bms[8]={min=0cy mean=12cy minC=60184 n=407705}
increment.inc: bms[9]={min=0cy mean=15cy minC=6532 n=407705}
increment.atomicInc: bms[0]={min=14cy mean=15cy minC=174848 n=354413}
increment.atomicInc: bms[1]={min=14cy mean=16cy minC=80212 n=354413}
increment.atomicInc: bms[2]={min=14cy mean=20cy minC=28754 n=354413}
increment.atomicInc: bms[3]={min=14cy mean=27cy minC=5978 n=354413}
increment.atomicInc: bms[4]={min=14cy mean=28cy minC=1862 n=354413}
increment.atomicInc: bms[5]={min=14cy mean=29cy minC=524 n=354413}
increment.atomicInc: bms[6]={min=14cy mean=30cy minC=47 n=354413}
increment.atomicInc: bms[7]={min=16cy mean=31cy minC=836 n=354413}
increment.atomicInc: bms[8]={min=28cy mean=31cy minC=15102 n=354413}
increment.atomicInc: bms[9]={min=28cy mean=33cy minC=836 n=354413}
On the other hand, a different interpretation might be used for longer running tests such as the nim compiler in exmpl/bmnimc. Below are two back-to-back runs of exmpl/bmnimc; I modified the file to have more buckets, 9 total, and to run for 30 seconds:
import benchmark, os

suite "test compile", 1.0:
  var
    bmsArray: array[0..8, BmStats]
  test "bminc.nim", 30.0, bmsArray:
    discard execShellCmd("nim c -d:release --verbosity:0 --hints:off --warnings:off --threads:on exmpl/bminc.nim")
And compile it as follows:
$ nim c -d:release --threads:on exmpl/bmnimc.nim
The back-to-back runs are below. I chose 9 buckets because I think the best interpretation is to use the "mean" of the median bucket, i.e. bms[4]. So I'd suggest reporting something like 235ms +-X%. I'm not sure what X should be, but maybe 2-3%.
Hopefully a stats whiz can propose something better for interpreting the data, and together maybe we can refine what's here.
In any case, I want to work on some wait-free algorithms, and benchmark should provide some useful performance numbers. If not, oh well; I tried and have learned quite a bit about benchmarking and nim :)
wink-macbookpro2:benchmark wink$ exmpl/bmnimc
test compile.bminc.nim: bms[0]={min=217.661ms mean=223.921ms minC=1 n=14}
test compile.bminc.nim: bms[1]={min=224.468ms mean=229.366ms minC=1 n=14}
test compile.bminc.nim: bms[2]={min=226.128ms mean=232.194ms minC=1 n=14}
test compile.bminc.nim: bms[3]={min=228.346ms mean=234.254ms minC=1 n=14}
test compile.bminc.nim: bms[4]={min=230.603ms mean=236.416ms minC=1 n=14}
test compile.bminc.nim: bms[5]={min=230.825ms mean=238.047ms minC=1 n=14}
test compile.bminc.nim: bms[6]={min=233.388ms mean=240.638ms minC=1 n=14}
test compile.bminc.nim: bms[7]={min=238.707ms mean=243.930ms minC=1 n=14}
test compile.bminc.nim: bms[8]={min=242.305ms mean=268.945ms minC=1 n=14}
wink-macbookpro2:benchmark wink$ exmpl/bmnimc
test compile.bminc.nim: bms[0]={min=215.127ms mean=224.721ms minC=1 n=15}
test compile.bminc.nim: bms[1]={min=220.704ms mean=228.352ms minC=1 n=15}
test compile.bminc.nim: bms[2]={min=225.508ms mean=231.041ms minC=1 n=15}
test compile.bminc.nim: bms[3]={min=228.669ms mean=233.378ms minC=1 n=15}
test compile.bminc.nim: bms[4]={min=230.660ms mean=234.908ms minC=1 n=15}
test compile.bminc.nim: bms[5]={min=231.265ms mean=237.694ms minC=1 n=15}
test compile.bminc.nim: bms[6]={min=233.844ms mean=239.389ms minC=1 n=15}
test compile.bminc.nim: bms[7]={min=236.037ms mean=242.360ms minC=1 n=15}
test compile.bminc.nim: bms[8]={min=241.683ms mean=247.321ms minC=1 n=15}