So, this comment by Araq was recently brought to my attention over on disruptek's matrix channel.
In light of discussion there lionizing zig cc's compile times, I did some tests that might be of broader interest than just that channel's membership (which also seems more about random topics/"shallow analysis & quips" for material like the below -- maybe that's chat rooms in general). Yes, maybe this should be an RFC, but I feel like new users might search the forum for "slow compile times" -- too many threads to mention -- or "zig cc", as here, before looking elsewhere.
These tests use a simple hrtm.nim timing program (and are on Linux 5.15 on an i7-6700k):
import times, os, posix # dumb little hi-res time
let clp = commandLineParams()
let cmd = allocCStringArray(clp)
if (let pid = vfork(); pid != 0):
  var status: cint
  let t0 = epochTime()
  discard waitpid(pid, status, 0.cint)
  stderr.write epochTime() - t0, "\n"
else:
  discard execvp(clp[0], cmd); quit(1)
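To reproduce, compile it once and put the binary on your $PATH; it writes the wall-clock seconds of whatever command follows to stderr. A sketch (any optimized build mode works; -d:danger is just my habit):
nim c -d:danger hrtm.nim  # one-time compile of the timer
./hrtm sleep 0.1          # prints about 0.1 to stderr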
Here is a very simple C program in bench.c:
#include <stdio.h> /* transitive closure = 27 files */
int main(int ac, char **av) { return 0; }
Here are some of my times:
/tmp/zig$ hrtm gcc -O0 bench.c # gcc-11.2
0.020973682403564453
/tmp/zig$ hrtm clang -O0 bench.c # clang-12.0.1
0.04278683662414551
/tmp/zig$ hrtm zig cc bench.c # zig-0.8.1
1.0602991580963135
/tmp/zig$ hrtm zig cc bench.c # hot cache
0.015044212341308594
/tmp/zig$ hrtm tcc bench.c # mob branch; gcc PGO compiled
0.0020494461059570312
So, tcc is about 7.3X faster than zig cc with zig's compile cache hot, and over 500X faster with a cold zig cache. In terms of "total work", it's a bit worse for zig cc since it tries to go multicore (just via multiple processes), so wall-clock time gives only a lower bound; but it does not achieve much utilization (about 132% of one CPU on bench.c, or ~700X the work of tcc on that).
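Spelling out the arithmetic from the timings above (132% being zig cc's observed CPU utilization):
1.0603 / 0.0020494        ≈ 517   # cold zig cache vs tcc
0.0150 / 0.0020494        ≈ 7.3   # hot zig cache vs tcc
1.0603 * 1.32 / 0.0020494 ≈ 683   # ~700X total CPU work vs tcc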
The natural follow-on question is: what about "in context", as a Nim C compiler backend? Well, the nim c program does a lot more work, which shrinks ratios like 500X or 7.3X way down. Ultimately, it depends on the number of C files generated and such things, and it's probably also harder to reproduce. So, I will just mention some stylized observations.
First, for would-be experimenters: I did not need disruptek's "tools" (shim or cfg changes) on Unix with mainline Nim. A script zigcc in your $PATH containing (and made executable with chmod +x):
#!/bin/sh
exec zig cc "$@" # above statically linked dash adds ~0.1ms overhead
and CC=zigcc hrtm nim c --cc:env foo.nim seemed to work fine.
Second, in various experiments I could not get the cached mode of zigcc to compile faster than tcc beyond measurement error. Cached zigcc (with the nim-to-C step also cached) mostly just matched tcc's performance. So, zig cc is not a performance disaster - IF you never clear the cache, the cache stays hot in the OS buffer cache, it is not over NFS like $HOME sometimes is, etc.
Meanwhile, the uncached mode was unsurprisingly much slower (slower even than gcc -O0, as the bench.c times suggest). For example, with suggest.nim, uncached zigcc took 14.1 sec vs 1.19 sec for tcc -- 12X slower and a real, human-noticeable difference.
In short, a zig cc backend seems to really magnify the "risk of slowness" with no obvious speed-up over tcc, and tcc on its own is fast even with no cache at all.
It should also not be much work to keep supporting tcc, IMO. To resolve all real open issues on the Nim issue tracker that match a "tcc" query, all I need to do is add --passL:-lm. I am happy to help as long as I am able.
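I.e., something like this (foo.nim being any program hitting the missing libm symbols):
nim c --cc:tcc --passL:-lm foo.nim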
This is also not just about me: over the years I have had several cligen users complain about compile speeds, only to be happily satisfied by switching to tcc for their edit-compile-debug cycles. So, I think it would be great to keep supporting tcc. Anyway, thanks for listening.
First, I really agree with the 'manage expectations' part. That also applies to the quality of tcc code generation, which is.. not great.
I was also wondering, given the almost 70X more work than gcc, if I was missing some (non-default) zig cc --work-faster flag or if my build of zig is somehow broken. Maybe some less-newbie-than-me Zigger-Nimmer might take a couple of minutes to reproduce the C-only test? If I did mess up my zig cc, that risk also supports shying away from it as a backend (especially as a default for newbies, if UX matters). zigcc -O0 -c bench.c was only 56 ms uncached, so the extra time seems to be mostly in the linking, but it is still markedly slower than gcc -O0.
Re justifying: you are right, though I think I may personally be the one mentioning tcc the most. So, there might be a diversity-of-interest issue, which is why I thought I should back things up. :-)
Re: triggerhappy - also, I was only recently told/reminded that @disruptek is banned here; I had thought it was only on chat. IMO, non-1-on-1 "unstructured" chat formats lead directly to conversational chaos. (BTW, google "scare quotes" per my highly qualified "shallow" above; I typically use them to indicate 'extra interpretational difficulty', but offense was taken. I did not mean it to apply to every writer/subthread in every context.)
I'm probably not the only one who finds it hard to keep track of who is banned where. Someone should really create a website with a table of long-time participants & ban venues (maybe with ban reasons). Lol. Or maybe unban him/them here/there after a time-out/cooling-off period?
I couldn't get zig to magically speed up my compile times, either; I've been using zapcc for some time now, which is noticeably quicker than gcc. I tried it on your suggest.nim and got:
gcc: 4.5s
zapcc: 3.4s
tcc: 1.7s
of that, the Nim compilation is like 1.7s. tcc is just stupidly fast.
on bench.c i get:
gcc: 0.03
tcc: 0.004
zapcc(cold): 0.2
zapcc(hot): 0.04
so, not a replacement for what tcc does, but i can recommend it for when you want a 'real' compiler.
Isn't the main attraction of zig the promise of easy cross-compilation, not the performance?
For me, tcc gives a significant speedup in total compilation time, ~5-7X, so it's nice to have as an option. On Windows, though, the error from #16326 really breaks the fun. Not Nim's problem, I suppose.
@cblake Thank you for the reminder of tcc. I found that tcc works noticeably faster for small tasks like AoC, because you do a lot of runs - tcc makes it feel like you are using an interpreter, which is good for that kind of task.
Unfortunately, tcc does not work ideally, and in some tasks I get errors like tcc: error: undefined symbol '_ftelli64'.
@cblake Thank you.
Unfortunately, I found in other issues that there was a decision to drop tcc support. I understand if it creates more problems.
"nim c -r test.nim":
clang.exe -c -w -ferror-limit=3 -O3 -IC:\Users\User\.choosenim\toolchains\nim-1.6.0\lib -ID:\Amjad\AoC2021 -o C:\Users\User\nimcache\test_r\@mtest.nim.c.o C:\Users\User\nimcache\test_r\@mtest.nim.c
"nim c -r -d:danger test.nim":
clang.exe -c -w -ferror-limit=3 -IC:\Users\User\.choosenim\toolchains\nim-1.6.0\lib -ID:\Amjad\AoC2021 -o C:\Users\User\nimcache\test_d\@mtest.nim.c.o C:\Users\User\nimcache\test_d\@mtest.nim.c
The difference is the -O3 in the release build. Yet clang, on Windows, in the debug build creates test.ilk (3272 kB) and test.pdb (6236 kB). How do I disable this? Thanks a lot!
Just to ironically support Araq's view that the only users of TCC are writing "hello-world-like code", here's a comparison of the running speed of my AoC solutions up to this day. Compiled on Windows with TCC and GCC with -d:danger and -flto; times include process start-up and loading the inputs.
# TCC:         x̄        σ        Δ vs GCC (ms)
aoc01.exe:    5.7 ms ±  0.7 ms    +2.60
aoc02.exe:   20.7 ms ±  1.1 ms   +15.30
aoc03.exe:    3.6 ms ±  0.7 ms    +0.80
aoc04.exe:    8.5 ms ±  0.8 ms    +4.20
aoc05.exe:   88.2 ms ±  2.9 ms   +67.30
aoc06.exe:    2.6 ms ±  0.7 ms    +0.20
aoc07.exe:   88.9 ms ±  3.0 ms   +84.70
aoc08.exe:    7.7 ms ±  0.7 ms    +4.40
aoc09.exe:   27.8 ms ±  1.6 ms   +21.80
aoc10.exe:    3.1 ms ±  0.4 ms    +0.40
aoc11.exe:   15.3 ms ±  1.0 ms   +11.40
aoc12.exe:  721.7 ms ± 48.8 ms  +619.80
aoc13.exe:   10.6 ms ±  0.9 ms    +7.00
Total:      1004.4 ms
# GCC:         x̄        σ        Δ vs TCC (ms)
aoc01.exe:    3.1 ms ±  0.9 ms    -2.60
aoc02.exe:    5.4 ms ±  0.8 ms   -15.30
aoc03.exe:    2.8 ms ±  0.5 ms    -0.80
aoc04.exe:    4.3 ms ±  1.1 ms    -4.20
aoc05.exe:   20.9 ms ±  1.0 ms   -67.30
aoc06.exe:    2.4 ms ±  0.5 ms    -0.20
aoc07.exe:    4.2 ms ±  0.7 ms   -84.70
aoc08.exe:    3.3 ms ±  0.8 ms    -4.40
aoc09.exe:    6.0 ms ±  0.6 ms   -21.80
aoc10.exe:    2.7 ms ±  0.7 ms    -0.40
aoc11.exe:    3.9 ms ±  0.5 ms   -11.40
aoc12.exe:  101.9 ms ±  2.9 ms  -619.80
aoc13.exe:    3.6 ms ±  0.6 ms    -7.00
Total:       164.5 ms
3. Abandonware is too big a price for reduced compile time, in my opinion. If you like Windows programming and fast compile times, try using Pelles C as a backend - http://www.smorgasbordet.com/pellesc/ - at least it releases regularly...
Corrections: per the tcc Wikipedia entry, the "mob" branch has x64 support and is not abandonware but simply "community maintained", which seems to be a workable open-source model:
git clone https://repo.or.cz/tinycc.git/
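Building it is the usual autoconf-style dance (a sketch; prefix and flags to taste):
cd tinycc
./configure --prefix=$HOME/.local  # plain ./configure also works
make && make install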
I usually use it on Linux, but I have used it on Windows years ago. As to modern intrinsics - I am not sure what you mean, but you can always do what you need via inline assembly. You can probably even steal the output from gcc -S or clang -S.
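For example, here is a minimal sketch of Nim's asm statement using GCC extended-asm constraint syntax (backticks interpolate Nim symbols; x86 only, and whether a given backend accepts this form is compiler-dependent):
proc rdtsc(): uint64 =
  var lo, hi: uint32
  asm """
    rdtsc
    :"=a"(`lo`), "=d"(`hi`)
  """
  (hi.uint64 shl 32) or lo.uint64

echo rdtsc()  # raw CPU timestamp counter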
I have never tried Pelles C, but it seems Windows-only. Being based on the Fraser-Hansen lcc suggests, from my past experience, that it compiles no faster than gcc -O0, though. { I'm sure someone somewhere maintains some Unix fork, but I could not get a copy I have from around Y2K working on modern Linux to report comparable numbers on the same machine as the initial post. Sorry. }
With this Nim program generator, gen.bash:
#!/bin/bash
for i in {0..99}; do echo "proc foo$i(x=2)=discard"; done
echo "when isMainModule:"
echo " import cligen"
echo -n " dispatchMulti"
for i in {0..98}; do echo " [ foo$i ],"; done
echo " [ foo99 ]"
(This is not even an artificial example - it is very close to a real example I once got a complaint about as a cligen issue, from @cdunn2001. 100 may be a slight exaggeration, but I'd bet 20-50 are not so rare) and this driver script:
#!/bin/bash
for how in tcc gcc; do
./gen.bash > j.nim
hrtm nim c --cc:$how j.nim # Same `hrtm` as in opener
sed -i 's/99/100/g' j.nim # tweak one byte
hrtm nim c --cc:$how j.nim
done
and doing sed -i 's/-Og/-O0/' on ~/.config/nim/nim.cfg (and similarly for -O1) between rows, using the same driver script, I get these results:
compiler | 1st Run (s) | 2nd Run (s) | 1st Ratio | 2nd Ratio
---|---|---|---|---
tcc | 2.244 | 2.126 | 1.00 | 1.00
gccO0 | 7.883 | 7.415 | 3.51 | 3.49
gccOg | 33.026 | 31.714 | 14.72 | 14.92
gccO1 | 39.612 | 38.048 | 17.65 | 17.90
Preliminary conclusion: the Nim compiler is pretty fast. I.e., it does not seem to be the case that "most time is in nim c translating to C", at least for very large single-file sources like this example. Such sources are not unnatural (IMO) given Nim's macro system. For this example, 1 - 2.244/7.883 ≈ 5/7 ≈ 71.5% of the time is in C compilation for gcc -O0.
Secondary conclusion: maybe the distributed/default nim.cfg should use -O0, not -Og? -Og is supposed to optimize a lot - as much as keeps debugging 'easy' (according to "someone"). That 5/7 grows to 93% even with a cached compilation followed by a single-byte edit. -O1 is even worse, but that should not surprise much.
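If someone wants to experiment without patching the distributed config, a user-level ~/.config/nim/nim.cfg override along these lines should work (gcc.options.debug being the knob the default config sets for non-release builds):
gcc.options.debug = "-O0 -g"  # replace the distributed -Og for debug builds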
It used to be -O0. It seems to have been changed without any specific comment about it in https://github.com/nim-lang/Nim/commit/721534119000c2bd53cc72b531726a6104381222. I realize there is some chance gcc -O0 breaks tests, given the commit comment and a feature or two being disabled at -O0, but given the large 4..5X speed delta it may be worth someone tracking down the exact single -fturn-me-on flag needed to fix things. (Of course, this will still do no better than -O0, compilation-speed-wise.)
While I see you already figured out your answer, since you revived this ancient thread, I can perhaps say some other things that might be helpful for fast compiles. I do this in my $HOME/.config/nim/nim.cfg:
# Ideally, each Nim compiler flag would have an associated defineSymbol, BUT we
# can: git log -p compiler/condsyms.nim lib/system.nim & find NimMinor edits and
# then search forward/backward for defineSymbol instead to get:
# surrogate for 1.0.0 nimMacrosGetNodeId
# surrogate for 1.2.0 nimHasInvariant
# surrogate for 1.4.0 nimHasCastPragmaBlocks
# surrogate for 1.6.0 nimHasEffectsOf
# surrogate for 1.7.0 nimHasEnforceNoRaises
# nimHasOutParams >20221006 (10-09 intro'd atomics) mm=markAndSweep for tcc
# This is fragile if any such features vanish, but avoids NimScript slowness.
#NOTE 2023-12-14: cc=tcc breaks nimsuggest w/nil destructor assert, but adding
# '-d=release' to 'args' in autoload/nim/suggest/manager.vim fixes it.
@if release or danger: cc=gcc @else cc=tcc @end # Rapid edit-compile-test
.
.
.
@if big chain of various compilers
.
.
.
@elif tcc: # tcc for debugging
line_dir=off # infinite loops tcc lately; github.com/nim-lang/Nim/pull/23488
@if threads: tlsEmulation=on @end
passL="-lm"
@if nimHasNoReturnError: mm=arc @else
@if nimHasCastExtendedVm: @else # <2.1 broke mm:arc on tcc for a time
@if nimHasEffectsOf: mm=markAndSweep # >= 1.6.0; Use mm to nix deprecation
@else: gc=markAndSweep # else use --gc
@end
@end
@end
@end
To save some more time, NimScript is slower than nim.cfg parsing/eval'ing. So, for even faster compiles, I also typically rename /usr/lib/nim/config/config.nims to /usr/lib/nim/config/config.nims- (it does almost nothing anyway, and I also almost never use the nim cpp backend). With all that, I get compile times for "the empty .nim" around 113.21 +- 0.19 ms on just the P-cores of my laptop (i7-1370P). Obviously the more you import/compile, the longer it takes.
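For concreteness, that rename is just (the path varies by distro/choosenim install):
mv /usr/lib/nim/config/config.nims /usr/lib/nim/config/config.nims-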
$ tim 'rm -rf ~/.cache/nim/*; nim c j >/n 2>&1'
97.9 +- 1.6 μs (AlreadySubtracted)Overhead
(1.1321 +- 0.0019)e+05 μs rm -rf ~/.cache/nim/*; nim c j >/n 2>&1
tim is bu/tim, and "/n" is a symlink to /dev/null since I type it way too often; I also have time-unit = "μs" in my ~/.config/tim.
Oh, and just to support my config.nims- point, I renamed it back to config.nims and then re-ran the above to get:
tim 'rm -rf ~/.cache/nim/*; nim c j >/n 2>&1'
99.7 +- 1.5 μs (AlreadySubtracted)Overhead
(1.4926 +- 0.0061)e+05 μs rm -rf ~/.cache/nim/*; nim c j >/n 2>&1
So, that /usr/lib/nim/config/config.nims is costing about (1.4926 +- 0.0061)e+05 μs - (1.1321 +- 0.0019)e+05 μs or (using a little wrapper I have around @Vindaar's Measuremancer) 36.05 +- 0.64 ms, at least on that machine - in this case around a 1.3184 +- 0.0058 (very best case) speed ratio. Again, this is only for compiling the empty file - so a minimal case { unless you do one of those "no system.nim" style builds }.
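For the curious, that error propagation can be reproduced directly with Measuremancer itself (a sketch, sans my wrapper):
import measuremancer  # nimble install measuremancer
let uncached = 1.4926e5 ± 0.0061e5  # μs, stock config.nims
let cached   = 1.1321e5 ± 0.0019e5  # μs, config.nims renamed away
echo uncached - cached  # ≈ 36050 ± 640 μs = 36.05 ± 0.64 ms
echo uncached / cached  # ≈ 1.3184 ± 0.0058 (best-case speed ratio)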
These times may seem trivial, but they add up when you have wrappers that generate code, compile it, and run it - e.g. the awk-like row processor rp - or when you are iterating on some cligen help message text.