https://github.com/jinyus/related_post_gen/tree/main
Nim implementation: https://github.com/jinyus/related_post_gen/blob/main/nim/src/related.nim
I'm wondering why Nim did not end up in the top five, or under ~100 ms. Is std/json too slow, or is it something else?
Note that the clock starts only after the file is read in, so it's not std/json.
Sure, you could likely do optimizations on that: see whether ref types in various places fare better by causing fewer copies, think the algorithm through, do a flamegraph to check where the slow points are, and more.
It just seems like a time-burner to me personally, for little gain.
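If anyone wants to try the flamegraph route, the usual perf-based recipe looks something like this (a sketch; paths and tooling vary, and stackcollapse-perf.pl/flamegraph.pl come from Brendan Gregg's FlameGraph scripts):

nim c -d:release --debugger:native src/related.nim   # keep debug info for readable stacks
perf record --call-graph dwarf ./src/related
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg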
for i, post in posts.pairs:
  for tag in post.tags:
    var lst = tagMap.getOrDefault(tag, @[])
    lst.add(i)
    tagMap[tag] = lst
This part will cause a double lookup; changing it to
for i, post in posts.pairs:
  for tag in post.tags:
    tagMap.mgetOrPut(tag, @[]).add(i)
That being said, it will probably not have the biggest impact. I'm guessing this is mostly a Table[string, X] benchmark, and other languages with better Table implementations will win.
Yep, mgetOrPut was the first thing I noticed in the code, but I'm pretty sure its performance impact is insignificant. I bet the hottest part is string allocations.
I don't know the size of the input data, but keeping the tags as a string seq for each post smells. It's better to keep them in a single container, with links stored in the Post objects; see the sketch below.
Also, tags seem like a great fit for SSO (small string optimization).
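A minimal sketch of that layout, assuming a simplified Post type (TagId and internTag are made-up names for illustration):

import std/tables

type
  TagId = int32
  Post = object
    title: string
    tagIds: seq[TagId]   # small indices into tagNames instead of per-post strings

var
  tagNames: seq[string]            # the single container; each tag string stored once
  tagIndex: Table[string, TagId]   # tag -> its index in tagNames

proc internTag(tag: string): TagId =
  # mgetOrPut inserts tagNames.len as the id when the tag is new;
  # any existing tag already has an id below tagNames.len
  result = tagIndex.mgetOrPut(tag, TagId(tagNames.len))
  if result == TagId(tagNames.len):
    tagNames.add(tag)

Comparing tags across posts then becomes integer comparisons instead of string hashing, and each tag string is allocated exactly once.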
I made a couple of tweaks and added LTO and PGO to the compilation, which roughly halved the runtime on my machine: https://github.com/jinyus/related_post_gen/pull/91
I also tried -d:danger --boundChecks:on, which ran faster than -d:release, but I didn't add it because I wasn't sure whether it somehow disabled something else.
Using int32 also sped it up slightly, but that of course messes with readability.
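For anyone wanting to reproduce this, the general shape of an LTO+PGO build with gcc is something like the following (a sketch, not the exact invocation from the PR):

nim c -d:release --passC:-flto --passL:-flto src/related.nim   # LTO alone
# PGO is a two-step build: instrument, run on representative input, then rebuild
nim c -d:release --passC:"-flto -fprofile-generate" --passL:"-flto -fprofile-generate" src/related.nim
./src/related                      # writes .gcda profile data
nim c -d:release --passC:"-flto -fprofile-use" --passL:"-flto -fprofile-use" src/related.nim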
While I haven't profiled this, I believe the table operations are likely the slowest part. @Araq, you mentioned faster table implementations; anything particular in mind?
Update for anyone interested:
Nim is currently #5 for single-threaded, or #4 if you don't count the Julia HO entry that was written by a space alien (I think). The top 10 are all very close in the 5k runs.
Nim is currently #8 for multi-threaded, edged out just slightly by the normal Julia entry (ha!).
The official benchmarking VM is an Azure Standard F4s v2 (4 vCPUs, 8 GiB memory; std/cpuinfo's countProcessors() == 4), and the runs are done in a Docker container (see the repo's Dockerfile) on that VM.
Nim implementations on main:
https://github.com/jinyus/related_post_gen/blob/main/nim/src/related.nim
https://github.com/jinyus/related_post_gen/blob/main/nim_con/related_con.nim
I switched nim_con from malebolgia to nim-taskpools because malebolgia wasn't saturating the cores; not sure if I misunderstood or did something incorrectly with malebolgia or if it's a bug.
For all the discussion earlier about the hash, it doesn't seem to make that much of a difference. I ended up with the fastest times for nim_con when I switched back to (automatically) using Nim's built-in murmur hash, but note that's in combination with inlining hints and clang instead of gcc.
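For anyone who still wants to experiment with the hashing side, the usual hook is a hash overload for the key type; a made-up example (the Tag type is hypothetical, not from the repo):

import std/hashes

type Tag = distinct string   # hypothetical wrapper just for illustration

proc hash(t: Tag): Hash {.inline.} =
  # delegates to Nim's built-in murmur-based string hash;
  # a custom implementation would go here instead
  hash(string(t))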
Anyway, it would be nice to shave off a few more milliseconds in both the single-threaded and multi-threaded impls. The multi-threaded D impl is using SIMD, so maybe nim_con could go that route; hints and tips appreciated. Can we at least beat Julia?
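For context, the basic shape of the nim-taskpools version is something like this (a simplified sketch, not the actual related_con code):

import taskpools

proc process(chunkStart, chunkLen: int) =
  discard   # placeholder for the per-chunk similarity work

var tp = Taskpool.new(numThreads = 4)
for chunk in 0 ..< 4:
  tp.spawn process(chunk * 1250, 1250)   # e.g. splitting 5000 posts across 4 tasks
tp.syncAll()    # barrier: wait for every spawned task
tp.shutdown()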
I switched nim_con from malebolgia to nim-taskpools because malebolgia wasn't saturating the cores; not sure if I misunderstood or did something incorrectly with malebolgia or if it's a bug.
I believe that's because currently the main thread participates as a worker, so if the main thread is still working, no new work is started.
I believe that's because currently the main thread participates as a worker, so if the main thread is still working, no new work is started.
The commit is here https://github.com/jinyus/related_post_gen/commit/925e3502a47dace7b28494b4beb25d41b8ada834#diff-71e8b983955d084e323c92579bfab73f29e7b5a348a072004f3914e8b329a3b8R143
There is no difference: Taskpools also uses the main thread for work. In both libraries, all the work is enqueued and then the main thread participates once the barrier (awaitAll/syncAll) is reached.
Now I'm not sure about the timing difference between the two. There are large architectural differences (work stealing with task-local queues vs a global task queue, malloc vs the Nim memory allocator, ...), but just in case: do you have more than 8 cores? malebolgia will only use 8 by default: https://github.com/Araq/malebolgia/blob/32fd877d46478735f1dbe9c937bddcff9255b3e0/src/malebolgia.nim#L88
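For reference, the malebolgia pattern in question looks roughly like this (a simplified sketch); inside awaitAll the main thread also helps execute spawned tasks until the barrier is satisfied:

import malebolgia

proc work(i: int) =
  discard   # placeholder for the per-task work

var m = createMaster()
m.awaitAll:
  for i in 0 ..< 100:
    m.spawn work(i)   # the main thread runs some of these itself before leaving awaitAll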
When the numbers with malebolgia seemed a bit off to me, I did a larger run over 250k random posts. Activity Monitor on macOS (or htop on Linux) showed that for a pool size of N (4, 8, 16), only N - 1 vCPUs were pegged at 100%.
When I tried with taskpools, all N were pegged, and the completion times were, unsurprisingly, faster.
I've continued to tinker with the nim and nim_con implementations, trying some changes based on the D and Go impls, etc. Sometimes it works out and Nim moves up in the rankings.
However, I've run into a problem and am unsure how to proceed. If anyone has any ideas...