https://github.com/jinyus/related_post_gen/tree/main
Nim implementation: https://github.com/jinyus/related_post_gen/blob/main/nim/src/related.nim
I'm wondering why Nim did not end up in the top five, or under ~100 ms. Is std/json too slow, or is it something else?
Note that the clock starts only after the file is read in, so it's not std/json.
Sure, you could likely do optimizations on that: see whether ref types in various places fare better by causing fewer copies, think the algorithm through, do a flamegraph to check where the slow points are, and more.
It just seems like a time-burner to me personally, for little gain.
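If anyone wants to try the flamegraph route, the usual perf-based recipe looks something like this (a sketch; paths and tooling vary, and stackcollapse-perf.pl/flamegraph.pl come from Brendan Gregg's FlameGraph scripts):

nim c -d:release --debugger:native src/related.nim   # keep debug info for readable stacks
perf record --call-graph dwarf ./src/related
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg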
for i, post in posts.pairs:
  for tag in post.tags:
    var lst = tagMap.getOrDefault(tag, @[])
    lst.add(i)
    tagMap[tag] = lst
This part will cause a double lookup; changing it to
for i, post in posts.pairs:
  for tag in post.tags:
    tagMap.mgetOrPut(tag, @[]).add(i)
That being said, it will probably not have the biggest impact. I'm guessing this is mostly a Table[string, X] benchmark, and other languages with better Table implementations will win.
Yep, mgetOrPut was the first thing I noticed in the code, but I'm pretty sure its performance impact is insignificant. I bet the hottest part is string allocations.
I don't know the size of the input data, but keeping the tags as a string seq for each post smells. It's better to keep them in a single container, with links stored in the Post objects; see the sketch below.
Also, tags seem like a great fit for SSO (small string optimization).
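A minimal sketch of that layout, assuming a simplified Post type (TagId and internTag are made-up names for illustration):

import std/tables

type
  TagId = int32
  Post = object
    title: string
    tagIds: seq[TagId]   # small indices into tagNames instead of per-post strings

var
  tagNames: seq[string]            # the single container; each tag string stored once
  tagIndex: Table[string, TagId]   # tag -> its index in tagNames

proc internTag(tag: string): TagId =
  # mgetOrPut inserts tagNames.len as the id when the tag is new;
  # any existing tag already has an id below tagNames.len
  result = tagIndex.mgetOrPut(tag, TagId(tagNames.len))
  if result == TagId(tagNames.len):
    tagNames.add(tag)

Comparing tags across posts then becomes integer comparisons instead of string hashing, and each tag string is allocated exactly once.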
I made a couple of tweaks and added LTO and PGO to the compilation, which roughly halved the runtime on my machine: https://github.com/jinyus/related_post_gen/pull/91
I also tried -d:danger --boundChecks:on, which ran faster than -d:release, but I didn't add it because I wasn't sure whether it somehow disabled something else.
Using int32 also sped it up slightly, but that of course messes with readability.
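For anyone wanting to reproduce this, the general shape of an LTO+PGO build with gcc is something like the following (a sketch, not the exact invocation from the PR):

nim c -d:release --passC:-flto --passL:-flto src/related.nim   # LTO alone
# PGO is a two-step build: instrument, run on representative input, then rebuild
nim c -d:release --passC:"-flto -fprofile-generate" --passL:"-flto -fprofile-generate" src/related.nim
./src/related                      # writes .gcda profile data
nim c -d:release --passC:"-flto -fprofile-use" --passL:"-flto -fprofile-use" src/related.nim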
While I haven't profiled this, I believe the table operations are likely the slowest part. @Araq, you mentioned faster table implementations; anything particular in mind?
Update for anyone interested:
Nim is currently #5 for single-threaded, or #4 if you don't count the Julia HO entry that was written by a space alien (I think). The top 10 are all very close in the 5k runs.
Nim is currently #8 for multi-threaded, edged out just slightly by the normal Julia entry (ha!).
The official benchmarking VM is an Azure Standard F4s v2 (4 vCPUs, 8 GiB memory; std/cpuinfo's countProcessors() == 4), and the runs are done in a Docker container (see the repo's Dockerfile) on that VM.
Nim implementations on main:
https://github.com/jinyus/related_post_gen/blob/main/nim/src/related.nim
https://github.com/jinyus/related_post_gen/blob/main/nim_con/related_con.nim
I switched nim_con from malebolgia to nim-taskpools because malebolgia wasn't saturating the cores; not sure if I misunderstood or did something incorrectly with malebolgia or if it's a bug.
For all the discussion earlier about the hash, it doesn't seem to make that much of a difference. I ended up with the fastest times for nim_con when I switched back to (automatically) using Nim's built-in murmur hash, but note that's in combination with inlining hints and clang instead of gcc.
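For anyone who still wants to experiment with the hashing side, the usual hook is a hash overload for the key type; a made-up example (the Tag type is hypothetical, not from the repo):

import std/hashes

type Tag = distinct string   # hypothetical wrapper just for illustration

proc hash(t: Tag): Hash {.inline.} =
  # delegates to Nim's built-in murmur-based string hash;
  # a custom implementation would go here instead
  hash(string(t))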
Anyway, it would be nice to shave off a few more milliseconds in both the single-threaded and multi-threaded impls. The multi-threaded D impl is using SIMD, so maybe nim_con could go that route; hints and tips appreciated. Can we at least beat Julia?
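For context, the basic shape of the nim-taskpools version is something like this (a simplified sketch, not the actual related_con code):

import taskpools

proc process(chunkStart, chunkLen: int) =
  discard   # placeholder for the per-chunk similarity work

var tp = Taskpool.new(numThreads = 4)
for chunk in 0 ..< 4:
  tp.spawn process(chunk * 1250, 1250)   # e.g. splitting 5000 posts across 4 tasks
tp.syncAll()    # barrier: wait for every spawned task
tp.shutdown()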
I switched nim_con from malebolgia to nim-taskpools because malebolgia wasn't saturating the cores; not sure if I misunderstood or did something incorrectly with malebolgia or if it's a bug.
I believe that's because currently the main thread participates as a worker, so if the main thread is still working, no new work is started.
I believe that's because currently the main thread participates as a worker, so if the main thread is still working, no new work is started.
The commit is here https://github.com/jinyus/related_post_gen/commit/925e3502a47dace7b28494b4beb25d41b8ada834#diff-71e8b983955d084e323c92579bfab73f29e7b5a348a072004f3914e8b329a3b8R143
There is no difference: Taskpools also uses the main thread for work. In both libraries, all the work is enqueued and then the main thread participates once the barrier (awaitAll/syncAll) is reached.
Now I'm not sure about the timing difference between the two. There are large architectural differences (work stealing with task-local queues vs a global task queue, malloc vs the Nim memory allocator, ...), but just in case: do you have more than 8 cores? malebolgia will only use 8 by default: https://github.com/Araq/malebolgia/blob/32fd877d46478735f1dbe9c937bddcff9255b3e0/src/malebolgia.nim#L88
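For reference, the malebolgia pattern in question looks roughly like this (a simplified sketch); inside awaitAll the main thread also helps execute spawned tasks until the barrier is satisfied:

import malebolgia

proc work(i: int) =
  discard   # placeholder for the per-task work

var m = createMaster()
m.awaitAll:
  for i in 0 ..< 100:
    m.spawn work(i)   # the main thread runs some of these itself before leaving awaitAll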
When the numbers with malebolgia seemed a bit off to me, I did a larger run over 250k random posts. Activity Monitor on macOS (or htop on Linux) showed that for a pool size of N (4, 8, 16), only N - 1 vCPUs were pegged at 100%.
When I tried with taskpools, all N were pegged, and the completion times were, unsurprisingly, faster.
I've continued to tinker with the nim and nim_con implementations, trying some changes based on the D and Go impls, etc. Sometimes it works out and Nim moves up in the rankings.
However, I've run into a problem and am unsure how to proceed. If anyone has any ideas...