Hey Nimians,
Over the holidays I started a port of Tim Bray's topfew logfile utility for fun and educational purpose. with the intent of comparing Nim's relative performance against the original Go implementation.
Github link: https://github.com/aboisvert/topfew-nim
If you're not familiar with topfew, you can read more about it here and here. It's basically an equivalent to shell pipelines involving sort | uniq -c | sort -rn | head, with parallelized reading of the input file.
Overall the performance story is mixed - my Nim version performs both worse and better than the Go-lang version depending on the use-case. You can see benchmark results in the README file.
There's a lot of reasons and factors involved, and I wanted to mention a few here summarily:
Anyway, this has been a fun exercise for me thus far. If anybody wants to take a look and see how/where they could improve things, please be my guests! I'd love to see how others would approach optimizing this utility.
cheers.
This problem is basically the classic Knuth-McIlroy contrast, cast in parallel form with some generalization Re: tokenization. So, you may want to have a look at https://github.com/c-blake/adix/blob/master/tests/wf.nim It may or may not be optimal for your specific application, but there may be an idea or two of interest to you. It was also discussed somewhat in this thread.
When analyzing performance, the most important question is if your CountTable or equivalent can be crammed into a CPU private L2 cache vs. shared L3 (& maybe contended in a hot&heavy way). It's not even clear if the way that wf.nim works (not copying but just referring back to old memory) is optimal. That kind of depends how long keys are. If you can bound them as short then a copy in the hash table may be best. The importance of cache effects here can make comparative results on "small test data" and "real big data" diverge quite a bit.
Another axis is data representation. I see in your topfew-nim you are really assembling very general keys. Specialization could be a big win. E.g., if you know you are really histogramming IPv4 addresses, for example, then you will almost surely do better to have a sentinel-endowed uint32 style LPTabz even though there is some small cost to ASCII-to-binary the number. Partly the savings would be cache memory since the ASCII takes ~2X more and partly hash operation costs since hashes and compares are so much faster for integers than for strings.