I've got some networking code running now (an implementation of a protocol based on WebSockets), and I have a little client and server that send each other messages as fast as they can. Now I get to profile it! (With -d:release, of course.)
Disappointingly, about 85% of the total time is being spent in a single call to ``del(table, key)``, where the Table's keys are uint64 and the values are a ref object type. Here's a snippet of what the macOS sample tool is showing me:
1614 pendingResponse__9bNzW9abL3s854sblP5selTw_2 (in client) + 434 [0x10d342b72]
  1096 del__F9aUzT5FfSHbDaq0vvIxNng (in client) + 1935,1769,... [0x10d2e292f,0x10d2e2889,...]
  518 del__F9aUzT5FfSHbDaq0vvIxNng (in client) + 1419 [0x10d2e272b]
    277 isEmpty__5Qpmu5QqwDY0esVXn6wf7w_2tables (in client) + 48,23,... [0x10d2e2cc0,0x10d2e2ca7,...]
    180 isEmpty__5Qpmu5QqwDY0esVXn6wf7w_2tables (in client) + 57 [0x10d2e2cc9]
      180 nimFrame (in client) + 1,15,... [0x10d2e04d1,0x10d2e04df,...]
    61 isEmpty__5Qpmu5QqwDY0esVXn6wf7w_2tables (in client) + 102 [0x10d2e2cf6]
      61 popFrame (in client) + 1,14,... [0x10d2e0651,0x10d2e065e,...]
(Indentation shows the call nesting. The numbers on the left are the number of times the given stack frame was observed; there were 1886 samples total over 2 seconds.)
I haven't dug into the Table source code yet, but it looks like there may be something really suboptimal there.
The name mangling is weird and very random-looking: are those base64 digests being appended to the function names? Is there a tool to demangle them so I can find out exactly which function a given symbol is?
PS: From my past experiences with Go, I was expecting GC to be the culprit; there are a few stack frames that look like GC but they're wayyyy down low, only a handful of samples. Nice! Is Nim using ARC by default now?
Sorry, false alarm! I was misled by the nim command's help text:
--opt:none|speed|size optimize not at all or for speed|size
Note: use -d:release for a release build!
I took this to mean that ``-d:release`` was *sufficient* to get a fully optimized release build. Nope! Once I started to look at the x86 disassembly it was painfully non-optimal. 🤮
I changed the flags to -d:release --opt:size --stackTrace:off --lineTrace:off --assertions:off --checks:off, and everything got about 20 times faster... Now the code is spending almost all its time in kevent waiting for I/O, as it should be.
Question: Is this the optimal set of flags for a release build? Is there an easy way to toggle between debug and release builds, without having to edit a bunch of lines in my nim.cfg file?
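Ideally I'd keep a single nim.cfg with a conditional section and just flip a define. nim.cfg does seem to support @if conditionals, so maybe something like this (an untested sketch on my part):

--path:src
--outdir:bin
--hints:off

# only when the release symbol is defined
@if release:
  --opt:size
  --stackTrace:off
  --lineTrace:off
  --assertions:off
  --checks:off
@end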
and everything got about 20 times faster.
We have never seen such behaviour before, so it would be great if you could investigate it in more detail.
Maybe a strange bug? Did you use gcc 10 or clang, or maybe the Microsoft or Intel compiler as the backend?
Try --passC:-flto to enable LTO; that gives you inlining across all procs.
I generally enable -march=native as well. And you may try PGO; there was a post about it recently on the forum.
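Something like this on the command line is what I have in mind (a sketch; the module name is a placeholder, and --passL:-flto is usually needed too so the link step also runs with LTO):

nim c -d:danger --passC:"-flto -march=native" --passL:"-flto" client.nim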
I looked at the Compiler User Guide, and it also implies that -d:release is sufficient to turn on compiler optimizations. As do your comments above. Weird...
Here's my environment:
I'm building a Nimble package that I created a few days ago, which imports asynchttpserver and the 'news' WebSocket library. I'm using nimble build to compile. My nim.cfg file contains:
--path:src
--outdir:bin
--hints:off
I'll use binary size as a rough indicator of optimization, for now.
Conclusions: It definitely appears -d:release is not turning on optimizations. And it's weird that -d:danger results in a larger binary; it should be the other way around since it disables generating runtime checks, right?
But specifying --opt:size at the same time may not be that good. Do you pass -O3 and -Os to gcc at the same time? Most people do not.
They're mutually exclusive, so the compiler is going to pick one if both are given. It looks as though -Os wins, since I do see a large reduction in code size.
I'm of the belief that -Os is the best setting in most cases. (That was definitely the rule when I worked at Apple in the '00s.) Smaller code is more cache-friendly, launches faster, and contributes less to memory issues. If necessary you can use Clang or GCC function attributes to force individual functions to optimize for speed.
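If I needed that here, one route in Nim would be the codegenDecl pragma to attach GCC's function-level optimize attribute to a single hot proc. A rough sketch (the attribute is GCC-specific and Clang may simply ignore it; the proc itself is made up):

# Build the whole program with --opt:size, but ask gcc to compile this one proc at -O3.
proc hotInnerLoop(data: openArray[byte]): int {.codegenDecl: "__attribute__((optimize(\"O3\"))) $# $#$#".} =
  for b in data:
    result += int(b)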
use binary size as a rough indicator of optimization
That doesn't make much sense. With gcc, -O3 generally gives large but fast executables: large because gcc unrolls loops, applies SIMD instructions, and much more.
Have you tested -d:danger? That option should really give you the fastest code and turn off all checks. If not, then there is something wrong. If -d:danger is fine, then you may try without danger and turn off the checks one by one to find out which check makes it slow. Or create a minimal example for Table that we can investigate.
(I'm assuming it's 1.2.2 but please double check)
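To bisect which check is responsible, one option (a sketch; the file name is a placeholder, flag names as listed by nim --fullhelp) is to start from -d:release and disable one category of runtime checks per run:

nim c -d:release --boundChecks:off client.nim      # array/seq index checks
nim c -d:release --overflowChecks:off client.nim   # integer overflow checks
nim c -d:release --rangeChecks:off client.nim      # range-type checks
nim c -d:release --objChecks:off client.nim        # object conversion checks
nim c -d:release --fieldChecks:off client.nim      # case-object field checks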
Just wanted to add to what @treeform mentioned: key distribution matters a lot when deleting from tables. Check out the timings of sequential keys vs. more randomly distributed keys.
import tables, times, random, sugar

# Small helper: time a block and report the line it was instantiated on.
template bench(code: untyped) =
  let t1 = cpuTime()
  code
  let t2 = cpuTime()
  echo "Line ", instantiationInfo().line, " took ", t2 - t1

var t: Table[int, int]
const tests = 100_000

# Insert and then delete sequential keys.
bench:
  for i in 0 ..< tests:
    t.add i, 1
bench:
  for i in 0 ..< tests:
    t.del i

# Now the same with randomly distributed keys.
let randomKeys = collect(newSeqOfCap(tests)):
  for i in 0 ..< tests:
    rand int.high
bench:
  for i in 0 ..< tests:
    t[randomKeys[i]] = 1
bench:
  for i in 0 ..< tests:
    t.del randomKeys[i]
I get the following output with -d:danger -d:release:
In this example, deleting sequential keys is about 2400 times slower than deleting keys with a more random distribution.
key distribution matters a lot
That is true, but it is obviously not the problem for Mr. snej. He gets good performance when he disables all the checks one by one, but not with -d:danger. So his configuration is somewhat broken.
And note: -d:danger is enough; there is no need for -d:danger -d:release. We should not confuse people too much. The same goes for -d:release --opt:speed: -d:release already includes --opt:speed.
Are you putting -d:release in the config file by any chance? If you do, that won't work; you must specify it in the call to the compiler / to nimble build.
That's it! I put it in nim.cfg like all the other compile options. If instead I run nimble build -d:release, I get an optimized build.
The help text for nimble doesn't describe the options to build; it just says [opts, ...] [bin]. And since I already knew I could put nim c option flags in the config file, it seemed obvious to put -d:release there.
Help me understand the build process better -- why is it different putting it on the command line vs. in the config file?