I've got some networking code running now (an implementation of a protocol based on WebSockets), and I have a little client and server that send each other messages as fast as they can. Now I get to profile it! (With -d:release, of course.)
Disappointingly, about 85% of the total time is being spent in a single call to ``del(table, key)``, where the Table's keys are uint64 and the values are a ref object type. Here's a snippet of what the macOS sample tool is showing me:
1614 pendingResponse__9bNzW9abL3s854sblP5selTw_2 (in client) + 434 [0x10d342b72]
  1096 del__F9aUzT5FfSHbDaq0vvIxNng (in client) + 1935,1769,... [0x10d2e292f,0x10d2e2889,...]
  518 del__F9aUzT5FfSHbDaq0vvIxNng (in client) + 1419 [0x10d2e272b]
    277 isEmpty__5Qpmu5QqwDY0esVXn6wf7w_2tables (in client) + 48,23,... [0x10d2e2cc0,0x10d2e2ca7,...]
    180 isEmpty__5Qpmu5QqwDY0esVXn6wf7w_2tables (in client) + 57 [0x10d2e2cc9]
      180 nimFrame (in client) + 1,15,... [0x10d2e04d1,0x10d2e04df,...]
    61 isEmpty__5Qpmu5QqwDY0esVXn6wf7w_2tables (in client) + 102 [0x10d2e2cf6]
      61 popFrame (in client) + 1,14,... [0x10d2e0651,0x10d2e065e,...]
(Indentation shows the call nesting. The numbers on the left are the number of times the given stack frame was observed; there were 1886 samples total over 2 seconds.)
I haven't dug into the Table source code yet, but it looks like there may be something really suboptimal there.
The name mangling is weird and very random-looking: are those base64 digests being appended to the function names? Is there a tool to demangle them so I can find out exactly which function a given symbol is?
PS: From my past experiences with Go, I was expecting GC to be the culprit; there are a few stack frames that look like GC but they're wayyyy down low, only a handful of samples. Nice! Is Nim using ARC by default now?
Sorry, false alarm! I was misled by the nim command's help text:
--opt:none|speed|size optimize not at all or for speed|size
Note: use -d:release for a release build!
I took this to mean that ``-d:release`` was *sufficient* to get a fully optimized release build. Nope! Once I started to look at the x86 disassembly it was painfully non-optimal. 🤮
I changed the flags to -d:release --opt:size --stackTrace:off --lineTrace:off --assertions:off --checks:off, and everything got about 20 times faster... Now the code is spending almost all its time in kevent waiting for I/O, as it should be.
Question: Is this the optimal set of flags for a release build? Is there an easy way to toggle between debug and release builds, without having to edit a bunch of lines in my nim.cfg file?
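Ideally I'd keep a single nim.cfg with a conditional section and just flip a define. nim.cfg does seem to support @if conditionals, so maybe something like this (an untested sketch on my part):

--path:src
--outdir:bin
--hints:off

# only when the release symbol is defined
@if release:
  --opt:size
  --stackTrace:off
  --lineTrace:off
  --assertions:off
  --checks:off
@end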
and everything got about 20 times faster.
We have never seen such behaviour before, so it would be great if you could investigate it in more detail.
Maybe a strange bug? Did you use gcc 10 or clang, or maybe the Microsoft or Intel compiler as the backend?
Try --passC:-flto to enable LTO; that gives you inlining across all procs.
I generally enable -march=native as well. And you may try PGO; there was a post about it recently on the forum.
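Something like this on the command line is what I have in mind (a sketch; the module name is a placeholder, and --passL:-flto is usually needed too so the link step also runs with LTO):

nim c -d:danger --passC:"-flto -march=native" --passL:"-flto" client.nim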
I looked at the Compiler User Guide, and it also implies that -d:release is sufficient to turn on compiler optimizations. As do your comments above. Weird...
Here's my environment:
I'm building a Nimble package that I created a few days ago, which imports asynchttpserver and the 'news' WebSocket library. I'm using nimble build to compile. My nim.cfg file contains:
--path:src
--outdir:bin
--hints:off
I'll use binary size as a rough indicator of optimization, for now.
Conclusions: It definitely appears -d:release is not turning on optimizations. And it's weird that -d:danger results in a larger binary; it should be the other way around since it disables generating runtime checks, right?
But specifying --opt:size at the same time may not be that good. Do you pass -O3 and -Os to gcc at the same time? Most people do not.
They're mutually exclusive, so the compiler is going to pick one if both are given. It looks as though -Os wins, since I do see a large reduction in code size.
I'm of the belief that -Os is the best setting in most cases. (That was definitely the rule when I worked at Apple in the '00s.) Smaller code is more cache-friendly, launches faster, and contributes less to memory issues. If necessary you can use Clang or GCC function attributes to force individual functions to optimize for speed.
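If I needed that here, one route in Nim would be the codegenDecl pragma to attach GCC's function-level optimize attribute to a single hot proc. A rough sketch (the attribute is GCC-specific and Clang may simply ignore it; the proc itself is made up):

# Build the whole program with --opt:size, but ask gcc to compile this one proc at -O3.
proc hotInnerLoop(data: openArray[byte]): int {.codegenDecl: "__attribute__((optimize(\"O3\"))) $# $#$#".} =
  for b in data:
    result += int(b)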
use binary size as a rough indicator of optimization
That doesn't make much sense. With gcc, -O3 generally gives large but fast executables: large because gcc unrolls loops, applies SIMD instructions, and much more.
Have you tested -d:danger? That option should really give you the fastest code and turn off all checks. If not, then there is something wrong. If -d:danger is fine, then you may try without danger and turn off the checks one by one to find out which check makes it slow. Or create a minimal example for Table that we can investigate.
(I'm assuming it's 1.2.2 but please double check)
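To bisect which check is responsible, one option (a sketch; the file name is a placeholder, flag names as listed by nim --fullhelp) is to start from -d:release and disable one category of runtime checks per run:

nim c -d:release --boundChecks:off client.nim      # array/seq index checks
nim c -d:release --overflowChecks:off client.nim   # integer overflow checks
nim c -d:release --rangeChecks:off client.nim      # range-type checks
nim c -d:release --objChecks:off client.nim        # object conversion checks
nim c -d:release --fieldChecks:off client.nim      # case-object field checks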
Just wanted to add to what @treeform mentioned: key distribution matters a lot when deleting from tables. Check out the timings of sequential keys vs. more randomly distributed keys.
import tables, times, random, sugar

# Small helper: time a block and report the line it was instantiated on.
template bench(code: untyped) =
  let t1 = cpuTime()
  code
  let t2 = cpuTime()
  echo "Line ", instantiationInfo().line, " took ", t2 - t1

var t: Table[int, int]
const tests = 100_000

# Insert and then delete sequential keys.
bench:
  for i in 0 ..< tests:
    t.add i, 1
bench:
  for i in 0 ..< tests:
    t.del i

# Now the same with randomly distributed keys.
let randomKeys = collect(newSeqOfCap(tests)):
  for i in 0 ..< tests:
    rand int.high
bench:
  for i in 0 ..< tests:
    t[randomKeys[i]] = 1
bench:
  for i in 0 ..< tests:
    t.del randomKeys[i]
I get the following output with -d:danger -d:release:
In this example, deleting sequential keys is about 2400 times slower than deleting keys with a more random distribution.
key distribution matters a lot
That is true, but it is obviously not the problem for Mr. snej. He gets good performance when he disables all the checks one by one, but not with -d:danger. So his configuration is somewhat broken.
And note: -d:danger is enough; there is no need for -d:danger -d:release. We should not confuse people too much. The same goes for -d:release --opt:speed: -d:release already includes --opt:speed.
Are you putting -d:release in the config file by any chance? If you do, that won't work; you must specify it in the call to the compiler / to nimble build.
That's it! I put it in nim.cfg like all the other compile options. If instead I run nimble build -d:release, I get an optimized build.
The help text for nimble doesn't describe the options to build; it just says [opts, ...] [bin]. And since I already knew I could put nim c option flags in the config file, it seemed obvious to put -d:release there.
Help me understand the build process better -- why is it different putting it on the command line vs. in the config file?