About a month ago a post by Ben Hoyt with his performance review of programming languages via a variation on a classic Knuth-McIlroy problem made a noticeable splash. The post itself didn't contain any Nim code, but he opened a repo which quickly gathered enough PRs for a comprehensive list of languages to be added to the comparison table in the original article, Nim included.
The comparison is especially interesting because it strives to show both the basic, idiomatic solution to the problem and a comparable optimized one. Good news: the most straightforward solution in Nim beats almost all other competitors. However, what caught my attention is that the optimized version wasn't fast enough compared to other languages.
Specifically, optimized.nim was not only slower than C, C++ and Rust, but even showed a 23% speed loss to Go, which I didn't expect, so I was interested enough to poke around in the code a bit. The repo was closed to contributions (the author being swamped) rather soon after the post, so I made another repo:
https://github.com/ZoomRmc/countwords_nim
This is a rewrite made under the constraint of not changing the basic approach taken by almost all optimized solutions in the original repo. The results are positive: the code I ended up with is faster than Go by a factor that depends on the input size:
While my version IMO looks cleaner and reads better than the optimized.nim from the benhoyt repo, the credit for the speed gain really goes to @leorize, who showed a nice trick to avoid excessive string copying during table insertion, instantly bumping the speed by ~20%. (Big thanks to him for being so helpful.) This is a significant difference which I think deserves consideration for improving the tables in Nim's stdlib.
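Roughly, the trick is to keep one preallocated word buffer alive across iterations and refill it in place before each lookup, instead of slicing a fresh string for every word. A sketch of the idea (illustrative names, not the exact code from the repo):

var wordBuf = newStringOfCap(64)   # reused for every word; no per-word allocation

proc setWord(dst: var string, src: openArray[char]) =
  ## Overwrites `dst` with the bytes of `src`, reusing its capacity.
  dst.setLen(src.len)
  if src.len > 0:
    copyMem(addr dst[0], unsafeAddr src[0], src.len)

# inside the hot loop, for a word spanning buf[wstart ..< wend]:
#   setWord(wordBuf, buf.toOpenArray(wstart, wend - 1))
#   counts.inc(wordBuf)   # the table only copies the key when it stores a new word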
The other improvements to the code were using FNV-1a hashing and rewriting the hot loop as a tidy FSM. After those changes it felt natural to avoid filling the word buffer iteratively and only then hashing the whole buffer, considering the hashing function works char by char; however, the current tables implementation prohibits computing the hash on the fly, since you can't just pass it in. For this I had to open a "rear hatch" for the hash by including tables.nim and modifying the inc proc. This went a bit beyond the original goals, so I separated this hack into a separate branch.
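For reference, FNV-1a is the kind of hash that naturally updates one character at a time, which is what makes computing it on the fly while scanning attractive. A minimal sketch with the standard 64-bit FNV-1a parameters (not the exact code from the repo):

const
  FnvOffsetBasis = 0xcbf29ce484222325'u64   # standard 64-bit FNV offset basis
  FnvPrime = 0x100000001b3'u64              # standard 64-bit FNV prime

func fnv1a(data: openArray[char]): uint64 =
  ## Whole-buffer variant; the loop body is the per-character update step.
  result = FnvOffsetBasis
  for c in data:
    result = (result xor uint64(ord(c))) * FnvPrime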
I was a bit late to the party, so my repo wasn't seen by many people, but I think it's a nice educational example of basic Nim. I'm not doing much coding these days, so I'm always eager to hear other people's reactions; bring them on.
Also, are there any performance gains you see which are achievable without changing the approach entirely and, especially, without inflating the line count or code complexity?
I'll probably repost this in my personal blog just to document my dabblings, it would be nice to improve the results a bit and incorporate something from your possible suggestions. Thanks for your attention.
This "benchmark" while classic because of the historical importance of both Knuth & McIlroy is not very good because it is very sensitive to the size of CPU caches which vary on test/comparison machines. IIRC, the particular example text of King James bible from Gutenberg*10 can, with a lot of effort, just get the table fitting into 256 KiB which is the L2 cache for many, but 1/2 or 1/4 the L2 cache for others. So, the usual "you better run it on the same machine" caveats are much, much more true here.
As one suggestion, you can get a denser layout by using a bump allocator for strings and 4-byte pointers in 8-byte hash table cells. There is a real issue of over-engineering "against the data set" in this one, though.
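To make that a bit more concrete, here is a rough sketch of the idea (my illustrative layout, not adix's actual code): distinct words are bump-allocated back to back in one arena string, and each hash cell is 8 bytes, a 4-byte offset into the arena plus a 4-byte count, with the word length kept as a one-byte prefix in the arena itself:

type
  Cell = object
    off: uint32     # 4-byte "pointer": offset of the word inside the arena
    count: uint32   # occurrence count; the whole cell is 8 bytes
  WordArena = object
    data: string    # [len byte][word bytes][len byte][word bytes]...

proc intern(a: var WordArena, w: openArray[char]): uint32 =
  ## Bump-allocates `w` (assumed shorter than 256 bytes) and returns its offset.
  result = uint32(a.data.len)
  a.data.add chr(w.len)
  for c in w:
    a.data.add c

proc word(a: WordArena, off: uint32): string =
  ## Reads back the word stored at `off`.
  let n = a.data[off.int].ord
  let start = off.int + 1
  a.data[start ..< start + n]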
Also, if you are interested, I have a parallel version here that I did long before the Ben Hoyt article made a few waves. It was mostly just example code to demo a cligen API that I added to help some new user @Domino in this thread and to sort of show off/demo some of the flexibility of adix/LPTabz, such as not always saving hash codes.
In general, this kind of "benchmark" can be extremely misleading. It is mostly a way to measure developer effort and/or problem space knowledge not "programming language speed". To even approach the latter, people feel the need to add a bunch of constraints and subjective judgements that would not be in force in "The Real World" if performance actually mattered. These constraints then nullify the value of the "test" as "PL speed". Processors have wide vector units that can make a 5-20X speed boost. Either you have access to that or you don't. If you do, and performance is paramount, you'll very likely need to use that access "to win" at the cost of portability. If you don't, then yes, programming languages "cleave" into have or have not. Nim has. Here is one of many examples.
As just one less abstract example, everyone knows pure Python is slow (not its only problem), but anyone who needs performance (conditioned upon staying in Python) has many options like Cython, Pythran, PyPy, etc., and you can just directly invoke SIMD intrinsics from at least Cython. An even less abstract example is the SSE-ification of all the Alioth Prog Lang Benchmarks Game.
As far as I can tell, "pure XYZ" was not a popular thing until the Java "write once, run anywhere" hype cycle. Before that, less purity/more balanced attitudes seemed to prevail, like the great ones in Nim with its easy C and C++ FFIs. That said, sure, portability can matter, but then you have a big discussion about trade-offs in the actual application setting where it is not a toy problem and performance matters.
Anyway, the TL;DR of all this is that people should not conclude much, if anything, about 2x or bigger time deltas in tests like these, especially when people have "gone to town optimizing". Literally this morning I was trying to chase down why gcc's PGO was making 1.3x speed deltas just from me simply re-arranging the order of Nim variable declarations in a proc header block. My jaw fell. Well, not really. I've seen that kind of thing a lot. As well as 2x speed-ups from gcc PGO vs not for Nim-compiled C code. So, your Nim may be faster than you think if you aren't doing PGO..but really 1.5 to 2x is not as big a performance delta as most people evaluating things like this think.
Thanks for your thoughtful (as usual) reply. You can see I deliberately avoid using the word "benchmark" for the whole affair, as I understand the usual concerns: "better run it on the same machine", "this compares a dev's familiarity with the language, not the language itself", etc. In my view the original post works well for the same purpose as code snippets on rosettacode.org: showcasing how idiomatic problem solving differs across languages. Comparing the speed is just a bonus measurement which gives the reader an additional perspective to weigh against syntax readability/flexibility/available tooling. (BTW, speaking in the context of How to Make Nim More Popular, this is a nice example of the kind of content Nim should have more of.)
I saw your example and your comments in that HN thread and it's very impressive, even though it takes considerable effort on my part to read into it. To be frank, I was a bit nervous you'd show up here bringing gifts I can't really comprehend ;) Again, going full-performance on this problem is tangential, and there's already a great list of solutions, like in this codegolf.stackexchange thread.
Speaking of the simple.nim, I too would write it slightly differently. Last time I checked, the code for the same problem on Rosetta could also be improved.
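For readers who haven't looked at it, the simple approach boils down to roughly the following (a sketch of the general shape, not the repo's simple.nim and not exactly how I'd write it):

import std/tables
from std/strutils import toLowerAscii, splitWhitespace

proc simpleCount() =
  var counts = initCountTable[string]()
  for line in stdin.lines:
    for word in line.toLowerAscii.splitWhitespace:
      counts.inc word
  counts.sort()                    # descending by count is the default order
  for word, count in counts.pairs:
    echo word, " ", count

when isMainModule:
  simpleCount()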
Taking all of the above into consideration doesn't change the fact that a straightforward implementation of the "chunk -> iterate -> fill word buffer -> CountTable -> sort" algorithm (on which most of the solutions converged) bumped into suboptimal allocation behaviour in Nim. This is a known problem and, as far as I understand, it is planned to be addressed by bringing Nim's standard library up to date with the latest language improvements.
To be clear, I didn't mean to be overcritical of you or your efforts -- good job on the stdlib tuning (and thanks!) -- but of the general "1.25x faster in lang X than lang Y" kind of subculture. For this particular case, I couldn't even reproduce the language-to-language ratios across various CPUs to anywhere near the precision implied by Ben Hoyt's table. So, are you measuring the language, or the CPU, or what? Mix in any uncontrolled cache competition, no repeated trials, no noise control, etc., etc. To me, that level of irreproducibility means you really just get a couple of clusters of languages, which is fine as far as it goes, but then people just see one table, don't understand this, don't know enough about measurement to think critically, and overinterpretation ensues. Even pros mess this up..routinely, even.
Anyway, sorry for the rant. I am not specifically attributing any of these problems to you/your post/work. I also analogized it to RosettaCode. I do see real value in that..even just style/API/language learning, but Rosetta Code itself has waaaay more languages on that score. :-)
FWIW, the lowCaseWords tokenizer with the KM def in that adix/test/wf.nim example is not so complicated...Just a 13 line iterator..Easy to "port from other lang examples", but maybe just past "too tricky" in a job interview with all the other moving parts. In my experience, job interview questions evolve as you test them against people. You also get a lot of feedback from candidates. So, FNV1a is probably not a "random" choice, but email feedback from the 37th candidate who tried a slew of hash functions. It happens to work really well on that KJ text.. ;-)
In practice, in my experience, when these things start to matter (= when you consider which PL to pick) is when you have lots of data, so your oh-so-efficient O(1) slices into mmap'ed memory fail as there is not enough RAM available... ;-) So you need to use streaming and your stdlib's split proc becomes hardly usable (not that it ever was the right choice anyway)...
You then need to write custom code (so competitions that focus on "lines of code" are pretty misleading) and it helps if the language has some type checking or that local variables are not accidentally introduced via assignments. (And no, an annotation like "this one here is an array of ints" is not type checking...) ;-)
Well, ok, page replacement dynamics (and control thereof) can be real issues. Very true, but very "algo dependent". :-) Many algorithms read all the input, like DB "full table scans", while others can save themselves some of the reading. There are also things like madvise on Unix.
Your argument can find support from other things that can push you to streaming in big data contexts, though, like uncompressed files not fitting on your drives. Needing to read from a pipe connected to a decompressor implies streaming.
A response to that is that random access & compression are not logically exclusive. Block size differences (and variation) create a lot of complexity making it a less supported feature (and one's compression ratio usually takes a hit). I know Btrfs, ZFS, etc. have been making inroads into compressed random access files that could probably be memory mapped, but I don't know how efficient they are/if Windows supports similar. Similarly, Zstd, bgzip, etc. have independent compressed blocks for parallel decompression which is a 3/4 step towards full random access..You just need an index to tell you which compressed bytes to decompress. I personally look forward to a day when all that support is "standard". :)
The "draw"/reason for engaging with the complexity (which we 100% agree makes those language competitions misleading) is that these days you might have 64..128 CPU cores and especially with multi-GB/s NVMe IO smoothly shared among parallel readers (i.e. no seek contention). So, random access may get you two orders of magnitude speed-up over serial streaming, and when it matters 100x might be 360 seconds instead of 100 hours { or a coffee break vs most of a week..and one can sometimes write a lot of software in a week. :-) }. OTOH, maybe all your big data is trapped behind Winchester drives/spinning rust or a network. It all just depends.
It is true that my wf.nim is imperfect (I never meant to suggest otherwise)...Moving from pointing backward into the input file to a little "string repository" with back-to-back words would have better memory locality. I'm not sure when people started calling such things "bump allocators". It was "string interning" for most of my life, but people like to make up new terms and "cute" often wins out. :-)
Anyway, it's all just food for thought and I'm not trying to be overly "contrary" or suggest @Araq doesn't know all of this. One always needs to analyze one's particular case thoroughly and he & I are mostly just saying that in different ways and agreeing it undermines the relevance of "optimize away competitions". I just didn't want people left with a too easy conclusion of "avoid mmap" when I think it much more often simplifies life at the application layer. Beats me what makes people choose PL's or the best way to market them. :)
It is true that my wf.nim is imperfect (I never meant to suggest otherwise).
Ha, I didn't look at your wf.nim and didn't mean to criticize it, I only made some general remarks.
Since no one here actually suggested any changes to the actual code, I tried PGO just to see for myself how much of a difference it can make. I tried both LLVM and GCC; the former gave negative results, but GCC works great:
#!/usr/bin/env bash
set -e
# Substitute your paths to the Nim lib and program's C files
NIMLIB="/home/user/.choosenim/toolchains/nim-#devel/lib/"
SRC=(~/.cache/nim/optimized_r/*.c)
# Regular compilation to generate .c files and a baseline executable:
nim c --gc:orc -d:danger --passC:"-flto" --passL:"-flto" -o:optimized_nim optimized.nim
# Compilation of an exec which will produce the profiling data
gcc -O3 -flto -fprofile-generate -I$NIMLIB ${SRC[@]} -o pg
# Generate a test file
BIBLE=/tmp/kjvbible_x10.txt
if [ ! -f "$BIBLE" ]; then
for i in {1..10}; do
cat kjvbible.txt >>"$BIBLE"
done
fi
# Run the profiling
./pg <"$BIBLE" >/dev/null
# This produces "@moptimized.nim.gcda" and a few .gcda files for the stdlib
# Compile an optimized program
gcc -O3 -flto -fprofile-use -I$NIMLIB ${SRC[@]} -o optimized_nim_pgo
# And an even more optimized one
gcc -O3 -march=native -mtune=native -flto -fprofile-use -I$NIMLIB ${SRC[@]} -o optimized_nim_pgo_native
# You obviously need hyperfine installed for this
echo -Go optimized
hyperfine -r 10 "./optimized-go <$BIBLE >/dev/null"
echo -Nim optimized
hyperfine -r 10 "./optimized_nim <$BIBLE >/dev/null"
echo -Nim optimized + pgo
hyperfine -r 10 "./optimized_nim_pgo <$BIBLE >/dev/null"
echo -Nim optimized + pgo + native
hyperfine -r 10 "./optimized_nim_pgo_native <$BIBLE >/dev/null"
(btw, why no Bash highlighting support here?)
I ran the script on the latest version of the code (from the hashrearhatch branch) and the condensed results are:
-Go optimized
Time (mean ± σ): 2.075 s ± 0.048 s [User: 1.994 s, System: 0.100 s]
-Nim optimized
Time (mean ± σ): 1.924 s ± 0.068 s [User: 1.838 s, System: 0.068 s]
-Nim optimized + pgo
Time (mean ± σ): 1.785 s ± 0.033 s [User: 1.697 s, System: 0.067 s]
-Nim optimized + pgo + native
Time (mean ± σ): 1.739 s ± 0.028 s [User: 1.658 s, System: 0.063 s]
Which is a 10% speed bump for the Nim version, making Go about 19% slower on this size of testing data. Pretty neat for zero manual work.

Here's another optimized version. Benchmark performance is the same as the official one on my machine, but it uses an iterator for chunking and zero_functional for the main loop. Looks a lot tidier IMHO:
import zero_functional
from tables import newCountTable, inc, sort, pairs
from algorithm import SortOrder
from strutils import split, toLowerAscii
const
  minChunkSize = 64*1024
  whiteSpace = {' ', '\n'}

iterator chunks(buffer: var string, inFile: File): var string =
  while not inFile.endOfFile:
    buffer.setLen minChunkSize
    buffer.setLen inFile.readChars(buffer, 0, minChunkSize)
    var rest: string
    if inFile.readLine(rest):
      buffer.add rest
    yield buffer

proc main() =
  var
    buf = newString(minChunkSize)
    table = newCountTable[string]()
  chunks(buf, stdin) --> map(it.toLowerAscii).foreach(
    strutils.split(it, whiteSpace) --> filter(it.len > 0).foreach(
      table.inc it
    )
  )
  table.sort SortOrder.Descending
  table.pairs --> foreach(
    echo(it[0], ' ', it[1])
  )

when isMainModule:
  main()
Some casual optimization rules for mere mortals:
Thanks for providing your version, @gemath. On my machine your code is about 1.5% slower than the original optimized version by GH user tauplus from Ben Hoyt's repo, which is a negligible difference, and of course your version is the logical and proper next step after the simple version (for line, for word, table inc, sort, for pairs echo). However, we're a bit past that here (your version is ~50% slower than mine). The issue which prompted me to try to "fix" the original optimized version is that while the algorithm was basic (and almost identical to the one nearly all other languages converged on), the measurements showed Nim lagging behind. We identified why that happened, @leorize provided a workaround hack, I made a couple of additional tiny fixes here and there, and everything came together nicely. This thread is just a report and a question of whether anything obvious was missed and we could squeeze a bit more out of the same algo. Seems not.
I feel a bit embarrassed, because I actually prefer this style and it's the second time here on the forums that you've demonstrated a more elegant solution where I had previously posted a more imperative version of the code. My excuse for this particular case is in the first paragraph (+ no use of third-party libs), and for the other occasion, in the Star Voting thread, I was trying to showcase some basic control structures to a newbie.
If Nim supported out of the box what zero_functional provides I would definitely advocate for this kind of code to be considered idiomatic. I see a great ally in you ;) Thanks for sharing and I'd love to hear your suggestions on the topic.
This thread is just a report and a question of whether anything obvious was missed and we could squeeze a bit more out of the same algo. Seems not.
That's what I didn't get. My post is about getting the performance of the "official" optimized version with more maintainable high-level code. It wasn't a comment on what you and @cblake achieved here at all; it was for people who cannot or don't want to put in the effort you two invested. Should have made that clear, sry.
My excuse for this particular case is in the first paragraph (+ no use of third-party libs), and for the other occasion, in the Star Voting thread, I was trying to showcase some basic control structures to a newbie.
Valid points, especially the last one. As elegant and powerful as it is, zero_functional has a few quirks and newbies without experience in functional programming can have a hard time with it. E.g. the error messages can be horrible.
Hey, so I did some additional investigation and here are my findings:
First, some timings with nim-devel (git hash: 2ca6169ba10dca704712b59339a24208e30e79ac). Compiling with c --gc:arc -d:danger:
The time measurements were taken by wrapping the processing loop and table sorting (but not the final printing step) with two calls to getMonoTime and then calculating the difference between their results. As input I just used the kjvbible.txt from the repository.
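In sketch form, the timing harness looks roughly like this (processAndSort is just a placeholder for the actual hot loop plus table sorting):

import std/monotimes, std/times

proc processAndSort() =
  ## Placeholder for the code being timed: the processing loop and the
  ## table sort, but not the final printing.
  discard

let t0 = getMonoTime()
processAndSort()
let t1 = getMonoTime()
echo "elapsed: ", (t1 - t0).inMilliseconds, " ms"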
While Zoom's version addresses the creation and destruction of a new string object every time a word is counted, it still has to do a string resize/copy every time. In theory, we would only need to create a new string when a new word is introduced into the CountTable, but since the current implementation does not support heterogeneous lookup, we have to have a string available for every lookup.
Here's a working (albeit limited) version of CountTable that supports heterogeneous lookup for the inc proc:
# hl_ctables.nim
from algorithm import SortOrder, sort
from std/hashes import Hash
import std/math

# UNMODIFIED
type KeyValuePairSeq[A, B] = seq[(A, B)]

# UNMODIFIED
template maxHash(t): untyped = high(t.data)
template dataLen(t): untyped = len(t.data)

# UNMODIFIED
const defaultInitialSize = 32

include std/tableimpl

# UNMODIFIED
type
  CountTable[A] = object
    data: seq[tuple[key: A, val: int]]
    counter: int
    isSorted: bool

# UNMODIFIED
proc initCountTable*[A](initialSize = 32): CountTable[A] =
  initImpl(result, initialSize)

# UNMODIFIED
template cntMakeEmpty(i) = t.data[i].val = 0
template cntCellEmpty(i) = t.data[i].val == 0
template cntCellHash(i) = hash(t.data[i].key)

# PROC SIGNATURE MODIFICATION: Add B type param, use B for key's type
proc ctRawInsert[A, B](t: CountTable[A], data: var seq[tuple[key: A, val: int]],
                       key: B, val: int) =
  mixin hash # MODIFICATION
  var h: Hash = hash(key) and high(data)
  while data[h].val != 0: h = nextTry(h, high(data))
  # MODIFICATION: use construct[A](key) instead of key
  mixin construct
  data[h].key = construct[A](key)
  data[h].val = val

# PROC SIGNATURE MODIFICATION: name change, use sink annotation for `key`.
# Otherwise same as ctRawInsert
proc ctRawInsertDirect[A](t: CountTable[A], data: var seq[tuple[key: A, val: int]],
                          key: sink A, val: int) =
  mixin hash # MODIFICATION
  var h: Hash = hash(key) and high(data)
  while data[h].val != 0: h = nextTry(h, high(data))
  data[h].key = key
  data[h].val = val

# MODIFIED
proc enlarge[A](t: var CountTable[A]) =
  var n: seq[tuple[key: A, val: int]]
  newSeq(n, len(t.data) * growthFactor)
  for i in countup(0, high(t.data)):
    # MODIFICATION: use ctRawInsertDirect instead of ctRawInsert
    if t.data[i].val != 0: ctRawInsertDirect(t, n, move t.data[i].key, move t.data[i].val)
  swap(t.data, n)

# MODIFIED
proc rawGet[A, B](t: CountTable[A], key: B): int =
  mixin hash # MODIFICATION
  if t.data.len == 0:
    return -1
  var h: Hash = hash(key) and high(t.data)
  while t.data[h].val != 0:
    if t.data[h].key == key: return h
    h = nextTry(h, high(t.data))
  result = -1 - h

# MODIFIED: signature change. Add K type param, use K for key's type
proc inc*[A, K](t: var CountTable[A], key: K, val = 1) =
  assert(not t.isSorted, "CountTable must not be used after sorting")
  mixin hash # MODIFICATION
  var index = rawGet(t, key)
  if index >= 0:
    inc(t.data[index].val, val)
    if t.data[index].val == 0:
      delImplIdx(t, index, cntMakeEmpty, cntCellEmpty, cntCellHash)
  else:
    if val != 0:
      insertImpl()

# UNMODIFIED
func ctCmp[T](a, b: tuple[key: T, val: int]): int =
  result = system.cmp(a.val, b.val)

# UNMODIFIED
proc sort*[A](t: var CountTable[A], order = SortOrder.Descending) =
  t.data.sort(cmp = ctCmp, order = order)
  t.isSorted = true

iterator pairs*[A](t: CountTable[A]): (lent A, int) =
  var i = 0
  let L = t.data.len()
  while i < L:
    if not cntCellEmpty(i):
      yield (t.data[i].key, t.data[i].val)
    inc i
This is mostly a copy'n'paste of the CountTable implementation from std/tables.nim with some changes done here and there. Every routine marked with # UNMODIFIED was pasted verbatim. Changed routines are marked with # MODIFIED and the changes with # MODIFICATION.
One thing to note is that in the original implementation the sink inference for ctRawInsert seems to fail, which results in enlarge doing an unnecessary copy for every table item move. The above version fixes the issue by manually annotating the key parameter with sink.
To use the modified CountTable in optimized.nim, import it and also add a hash(openArray[char]) routine that is compatible with hash(string). Copying and pasting from std/hashes, we get:
from std/hashes import Hash

# UNMODIFIED
template imul(a, b: uint32): untyped = a * b

# UNMODIFIED
proc rotl32(x: uint32, r: int): uint32 {.inline.} =
  (x shl r) or (x shr (32 - r))

# UNMODIFIED
proc murmurHash(x: openArray[byte]): Hash =
  # https://github.com/PeterScott/murmur3/blob/master/murmur3.c
  const
    c1 = 0xcc9e2d51'u32
    c2 = 0x1b873593'u32
    n1 = 0xe6546b64'u32
    m1 = 0x85ebca6b'u32
    m2 = 0xc2b2ae35'u32
  let
    size = len(x)
    stepSize = 4 # 32-bit
    n = size div stepSize
  var
    h1: uint32
    i = 0
  # body
  while i < n * stepSize:
    var k1: uint32
    when defined(js) or defined(sparc) or defined(sparc64):
      var j = stepSize
      while j > 0:
        dec j
        k1 = (k1 shl 8) or (ord(x[i+j])).uint32
    else:
      k1 = cast[ptr uint32](unsafeAddr x[i])[]
    inc i, stepSize
    k1 = imul(k1, c1)
    k1 = rotl32(k1, 15)
    k1 = imul(k1, c2)
    h1 = h1 xor k1
    h1 = rotl32(h1, 13)
    h1 = h1*5 + n1
  # tail
  var k1: uint32
  var rem = size mod stepSize
  while rem > 0:
    dec rem
    k1 = (k1 shl 8) or (ord(x[i+rem])).uint32
  k1 = imul(k1, c1)
  k1 = rotl32(k1, 15)
  k1 = imul(k1, c2)
  h1 = h1 xor k1
  # finalization
  h1 = h1 xor size.uint32
  h1 = h1 xor (h1 shr 16)
  h1 = imul(h1, m1)
  h1 = h1 xor (h1 shr 13)
  h1 = imul(h1, m2)
  h1 = h1 xor (h1 shr 16)
  return cast[Hash](h1)

# Same as stdlib's hash for string, except that it takes a string view
func hash*(x: openArray[char]): Hash {.inline.} =
  result = murmurHash(toOpenArrayByte(x, 0, high(x)))
For the modified CountTable implementation to work, we have to provide it with a construct routine (a bit more work, but works around the need to use a converter).
For our use case, those would be:
func construct[T: string](x: openArray[char]): T {.inline.} =
  result = newString(x.len)
  copyMem(addr result[0], unsafeAddr x[0], x.len)

template construct[T: string](x: string): T =
  x
Then replace the string slicing operations with corresponding toOpenArray calls, and done. Doing all that, the execution time I now get for optimized.nim is around 39.5ms.
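For illustration, the call-site change looks roughly like this (assuming the modified CountTable, the hash(openArray[char]) overload and the construct routines above are all in scope; identifiers mirror the discussion and the repo's exact code may differ):

var words = initCountTable[string]()
let buf = "the quick brown the"
let (wstart, wend) = (0, 3)                    # the word "the" spans buf[0 ..< 3]

# before: words.inc(buf[wstart ..< wend])      # slices a fresh string per word
words.inc(buf.toOpenArray(wstart, wend - 1))   # passes a view; a string is only
                                               # constructed for new table entries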
Changing @Zoom's version to use the modified CountTable (just replace wordBuf.copy(buf{wstart..<wend}); words.inc(wordBuf) with words.inc(buf{wstart..<wend}) and add the construct routines) and timing it, I now get ~34ms. Just using the modified CountTable with no other changes, I get ~45ms.
To sum it up:
By having the ability to pass a string view (openArray[char]) to inc, a new string only needs to be created when a new word is introduced into the table, instead of on every call to inc. The total number of words in the input file for our test case is 821,133, while the number of entries in the final table is just 32,187! So we basically reduced the number of string copies from 821,133 down to 32,187!
Additionally, I also measured how long the optimized C version from benhoyt/countwords takes. Using the gcc from Nim's MinGW64 distribution, I get:
~30ms
The measurement was done by wrapping the processing loop and sorting (but not the printing) with two calls to Windows' QueryPerformanceCounter function and then calculating the difference, basically the same as how I did it in Nim (getMonoTime uses QueryPerformanceCounter on Windows). The optimization level used was -O3.
Comparing the performance of single-file C/C++ projects and single-file Nim projects like that is a bit unfair, since in the case of C/C++ the whole program more or less resides in one translation unit, while with Nim the program is usually spread across multiple translation units, thus reducing the knowledge available to the C/C++ optimizer. To make the comparison a bit fairer, we can use whole-program optimization (also called link-time optimization), so that the optimizer has access to the whole program's code at once. Compiling the optimized.c file with -O3 -flto, the execution time I get is ~28.2ms. The small speedup probably comes from the fact that the optimizer now also has access to libc function implementations.
When using lto for compiling the Nim programs - by passing the additional command line arguments --passC:"-flto" and --passL:"-flto" to nim - I get:
So in conclusion, the further-optimized Nim version is - with LTO - almost as fast as the optimized C version (on the small input), while, IMO, being a lot more readable!