https://forum.nim-lang.org/t/6920#43331
https://github.com/timotheecour/Nim/issues/82#issuecomment-625613116
I am curious whether anyone has tried to implement the Python 3.6+ dict algorithm in Nim. The new Python dict is ordered (by order of insertion) and uses less memory at the same sparsity factor. The gist of the implementation is in Raymond Hettinger's original email: https://mail.python.org/pipermail/python-dev/2012-December/123028.html
The performance regression in question came from a known, not-mysterious, very temporary fix, in place only until the weak integer hash function was replaced with a stronger one (the original problem). In that sense, a complaint of a performance regression across that commit is a red herring for anything except as an example of something automated detection might have caught (which is what I took @timothee's comment to be mostly about).
Relying on hopping around memory to build a strong hash function is a bad idea because of memory latency. It is easy to fool yourself into believing otherwise, especially if you "mis-estimate" (by up to 10x) lookup latency as the "reciprocal throughput" of some hot-loop, uncontended-cache benchmark. Instead, you can just use a strong hash in the first place and incur (almost always, for small objects) just one cache miss, which is what Nim now does and has done since shortly after the mentioned "commit regression".
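To illustrate the "strong hash, one cache miss" point, here is a minimal sketch of what a strong integer hash can look like: a splitmix64-style finalizer. This is my own illustration, not Nim's stdlib code, and `UserId`/`strongHash` are made-up names; the point is that a few multiplies and shifts scramble all the key bits without any extra memory loads, unlike schemes that fix weak hashes by hopping around memory.

```nim
import std/hashes

# Illustrative only: a splitmix64-style finalizer as a strong integer hash.
# A few ALU operations scramble all input bits, so clustered keys do not
# land in clustered table slots; no memory is touched at all.
# (Assumes a 64-bit build, where Hash is a 64-bit int.)
proc strongHash(x: uint64): Hash =
  var z = x + 0x9e3779b97f4a7c15'u64
  z = (z xor (z shr 30)) * 0xbf58476d1ce4e5b9'u64
  z = (z xor (z shr 27)) * 0x94d049bb133111eb'u64
  cast[Hash](z xor (z shr 31))

# std/tables picks up whatever `hash` is in scope for your key type:
type UserId = distinct uint64
proc hash(x: UserId): Hash = strongHash(uint64(x))
proc `==`(a, b: UserId): bool {.borrow.}
```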
While the pseudo-random probing idea is fundamentally bad from a lookup latency point of view, the "compact ordered" idea of Python's tables is fine for purposes where insert-order matters. To answer your question, that is represented in the lptabz module of https://github.com/c-blake/adix as well as https://github.com/LemonBoy/compactdict.
Oh, and in the adix git history, I do have a pseudo-random probing impl, along with one using an idea by @timothee to start with linear probing and only switch to pseudo-random after some number of cache lines. Those used to be called otset.nim and ottab.nim and are more or less exactly the Python 3.6 dict impl. I decided this whole general idea space was a distraction from better designs and just deleted it.
The Python algo may be OK for the very limited "language-symbol-table" use cases Python mostly applies it to (where it really should not be doing run-time hash lookups in the first place, IMO). It is just obviously bad for very large tables. It "doubles up" the already more-indirect-than-necessary old Python pseudo-random design: each step in the collision chain must fetch the index out of the hash-table part and then the hash code out of the real data part. A cold-cache, poorly predicted lookup with a collision search depth of 4 could take 8 DIMM load latencies of 60..120 ns each, i.e. roughly 0.5..1 microseconds.
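To make that double indirection concrete, here is a minimal Nim sketch of this kind of compact, ordered layout: a sparse index of positions plus a dense, insertion-ordered entry array, with each probe step touching both. The names are illustrative (this is neither CPython's nor adix's code) and resizing is omitted.

```nim
import std/hashes

const FREE = -1

type
  Entry[K, V] = object
    hcode: Hash                # cached hash code of the key
    key: K
    val: V
  CompactDict[K, V] = object
    index: seq[int]            # small sparse array; FREE or a position in `entries`
    entries: seq[Entry[K, V]]  # dense, insertion-ordered key/value records

proc initCompactDict[K, V](indexLen = 8): CompactDict[K, V] =
  ## `indexLen` must be a power of two for the masking below.
  result.index = newSeq[int](indexLen)
  for i in 0 ..< indexLen: result.index[i] = FREE

proc find[K, V](d: CompactDict[K, V]; key: K): int =
  ## Position of `key` in `entries`, or -1; the probe sequence mimics
  ## CPython's `j = 5*j + 1 + perturb` open addressing.  Note that each
  ## probe step loads from `index` AND from `entries` (the 2 loads/step).
  let mask = uint(d.index.len - 1)
  let h = hash(key)
  var perturb = cast[uint](h)
  var j = cast[uint](h) and mask
  while true:
    let pos = d.index[int(j)]
    if pos == FREE: return -1
    if d.entries[pos].hcode == h and d.entries[pos].key == key:
      return pos
    perturb = perturb shr 5
    j = (5'u * j + 1'u + perturb) and mask

proc put[K, V](d: var CompactDict[K, V]; key: K; val: V) =
  ## Insert or update; growing `index` is omitted to keep the sketch short.
  let pos = d.find(key)
  if pos >= 0:
    d.entries[pos].val = val
    return
  let h = hash(key)
  d.entries.add Entry[K, V](hcode: h, key: key, val: val)
  let mask = uint(d.index.len - 1)
  var perturb = cast[uint](h)
  var j = cast[uint](h) and mask
  while d.index[int(j)] != FREE:   # same probe sequence, to the first free slot
    perturb = perturb shr 5
    j = (5'u * j + 1'u + perturb) and mask
  d.index[int(j)] = d.entries.high

var d = initCompactDict[string, int]()
d.put("pots", 1)
d.put("stop", 2)
assert d.find("stop") >= 0 and d.entries[d.find("stop")].val == 2
```

Iterating in insertion order is then just a walk over `entries`, which is why the ordering comes "for free" in this design; the cost shows up in the extra `index` hop on every probe.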
A simple way to improve this by 2x, used in lptabz, is to store some of the hash code along with the index in the "hash-table-like part". Then, if that part uses linear probing, a lookup reduces to just 2 loads. This is 2x more than the 1-load "unordered" variant, but you get compactness & insertion order in exchange (as well as optimal iteration-over-elements performance).
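Pictured concretely (again just a sketch with made-up names, not lptabz's actual layout), each sparse-index slot carries a few hash bits next to the position, so a linearly probed run through the cache-resident index rejects non-matches without touching the entry array:

```nim
import std/hashes

type
  IndexSlot = object
    used: bool
    hashBits: uint16   # upper bits of the key's hash code, for cheap rejection
    pos: uint32        # position in the dense, insertion-ordered entry array

proc findSlot[K](index: seq[IndexSlot]; keys: seq[K]; key: K): int =
  ## Linear probing over a power-of-two sized `index` (64-bit build assumed);
  ## returns a position into `keys`, or -1 if absent.
  let h = cast[uint](hash(key))
  let hi = uint16(h shr 48)                  # the bits stashed in each slot
  var i = int(h and uint(index.len - 1))
  while index[i].used:
    # only on a hashBits match do we pay the second load into the entry array
    if index[i].hashBits == hi and keys[int(index[i].pos)] == key:
      return int(index[i].pos)
    i = (i + 1) and (index.len - 1)
  return -1
```

Because the probe run stays in one or two consecutive cache lines and the entry array is touched at most once, a lookup costs about the 2 loads mentioned above.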
Is there any chance that lptabz becomes the Table implementation in the std lib?
lptabz is more or less drop-in compatible with both sets & tables (both ordered & unordered), and maybe more explicitly states up front that it is a multiset/multitable, which has caused quite a bit of confusion/surprise (What!??! Dups in a hash?! The heck you say! Take it out!! They are now deprecated in std/tables).
lptabz also has the "sentinel value" optimization, which can use less than one third the memory for small integer sets (e.g. 4 bytes per slot vs. 12 or maybe 16, depending on C compiler padding, for a uint32 set element). It also adds a more "performance-guarding" resize protocol using search depth (the actual worst-case time) rather than "filled fraction" (a guess to keep average depth low), and optional Robin Hood reorganization for dense memory utilization. I added a few missing API things: editKey for the ordered table mode based on forum thread 6924, saving & loading to files (hacky right now) to optimize forum thread 5917, topByVal based on forum thread 6604, and retrieval by nth in insertion order per forum thread 4966.

The compact variant also uses my seqUint, which is a packed array of back-to-back B-bit integers. So it actually can use ~1/2 the space of the Python index, which is just 1, 2, 4, or 8 byte integers (e.g. 17 bits per entry instead of 32 bits for the 2**17-slot case). There may be a bug or three since I am the only user, but I do use it every day in a couple of non-distributed programs.
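For the curious, the packed B-bit array idea looks roughly like this (an illustrative sketch with invented names, not adix's actual seqUint API), assuming the element width stays under 64 bits:

```nim
type
  PackedInts = object
    bits: int           # B: width of each element in bits (assumed < 64)
    len: int            # number of logical elements
    data: seq[uint64]   # elements stored back-to-back across 64-bit words

proc initPackedInts(len, bits: int): PackedInts =
  PackedInts(bits: bits, len: len,
             data: newSeq[uint64]((len * bits + 63) div 64))

proc `[]`(p: PackedInts; i: int): uint64 =
  let bitPos = i * p.bits
  let w = bitPos div 64
  let off = bitPos mod 64
  let mask = (1'u64 shl p.bits) - 1
  result = (p.data[w] shr off) and mask
  if off + p.bits > 64:                       # element straddles two words
    result = result or ((p.data[w + 1] shl (64 - off)) and mask)

proc `[]=`(p: var PackedInts; i: int; v: uint64) =
  let bitPos = i * p.bits
  let w = bitPos div 64
  let off = bitPos mod 64
  let mask = (1'u64 shl p.bits) - 1
  p.data[w] = (p.data[w] and not (mask shl off)) or ((v and mask) shl off)
  if off + p.bits > 64:
    let hiMask = (1'u64 shl (off + p.bits - 64)) - 1
    p.data[w + 1] = (p.data[w + 1] and not hiMask) or ((v and mask) shr (64 - off))

# e.g. an index for 2^17 dense entries at 17 bits per entry:
var idx = initPackedInts(1 shl 17, 17)
idx[42] = 131071'u64
assert idx[42] == 131071'u64
```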
I don't know about the stdlib taking it on. I wouldn't mind, but Araq often says things like "the stdlib cannot keep up with anything". My guess is that it might be considered "too featureful". It is definitely more work to "drive" (meaning instantiate, mostly) this "omnibus implementation" of what are really 5 different impls in the stdlib (counting CountTable).
lptabz is more or less drop-in compatible with both sets & tables (both ordered & unordered), and maybe more explicitly states up front that it is a multiset/multitable, which has caused quite a bit of confusion/surprise (What!??! Dups in a hash?! The heck you say! Take it out!! They are now deprecated in std/tables).
IMHO, it does make sense to keep the Table/Set API separate from MultiTable/MultiSet in the std lib.
I wouldn't mind, but Araq often says things like "the stdlib cannot keep up with anything".
One of Nim's main selling points is performance (both CPU and memory). Together with sequences, hash tables are the most frequently used container data types. Having a state-of-the-art, cache-line-friendly hash table implementation is a must-have, IMHO.
Large/complex type diagrams do not usually carry their weight, IMO. There are many good uses of a multitable, such as anagrams. As long as the history of an instance does not use add, allItems, or dups=true, then it's pretty clear to client code that it has no dups. Clients are free to use seq[T] values when they need/want a more full-featured API.
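For instance, the anagram use case with plain std/tables and seq values looks something like this (just an illustration of the seq[T]-values route; with a multitable you would skip the explicit seq and add duplicate keys directly):

```nim
import std/[tables, algorithm]

proc canonical(w: string): string =
  ## the word's letters in sorted order: the shared key of all its anagrams
  var chars = newSeq[char](w.len)
  for i, c in w: chars[i] = c
  chars.sort()
  for c in chars: result.add c

proc anagramGroups(words: openArray[string]): Table[string, seq[string]] =
  result = initTable[string, seq[string]]()
  for w in words:
    result.mgetOrPut(canonical(w), @[]).add w

echo anagramGroups(["pots", "stop", "tops", "opts", "nim"])
# e.g. {"opst": @["pots", "stop", "tops", "opts"], "imn": @["nim"]}
```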
"State of the art" gets over-used, IMO. I don't do anything in lptabz that could not have been done in the 1980s. Fast, flexible engineering is sadly too rare. While I try my best to keep std/tables from straying into cache/branch predictor/etc.-hostile territory, your question/proposal is really for Araq. It's not up to me. I have not yet added the abilities of sharedtables to lptabz which would be one blocker. Plus many tests. I did open an RFC about this, though we closed it after adding a better integer hash.
Well, it's not that I mind PRs that change our tables implementation for the better, but we do have to maintain compatibility, and people usually depend on even little details like iteration order (if only in the unit tests) or the fact that tables are const-able (something I regret, btw). Now we can also add std/tables2 for a better implementation with a better API, but in the end we don't really know the benefits/costs tradeoff.
In the end, a goal like "the stdlib must offer the fastest hash table solution for all the unknown workloads out there" is not realistic.
Here is a better goal: "Every table keeps the insertion order so that the split into Table and OrderedTable is gone. Keeping insertion order costs 1% of performance sometimes but cuts the API surface into half and makes Nim easier to use."
The way I do ordered tables in lptabz is "only" 2x worse than the current tables for giant/out-of-cache tables, with, as mentioned, a very (optimally?) packed index level in terms of memory use. I have a feeling some people would rebel against that 2x.
Most of this goes to what is "most user-friendly" -- often a hard & subjective question. E.g., the stdlib has (deprecated) multi-tables, but not multi-sets. I tried to keep a full Cartesian product of features in lptabz. Some people think any feature/capability is "confusing"...even any unfamiliar aspect of any language/runtime/tool. Normalizing for learning impatience over the many dimensions in play is more art than science. Rather than try to do that, I just provide "all of the above"/the pile of basic features that is easy-ish to implement. That makes it "only" a training problem (not to minimize that problem...).
It's totally fine with me/understandable if Araq wants to support less. He has a lot on his plate with the compiler. Nim, like C or C++, is far less reliant upon the stdlib being "state of the art" than say, Python (without Cython or PyPy).
Here is a better goal: "Every table keeps the insertion order so that the split into Table and OrderedTable is gone. Keeping insertion order costs 1% of performance sometimes but cuts the API surface into half and makes Nim easier to use."
Actually, in the Python 3.6+ case, the new compact dictionary implementation reduced size and increased performance for small and medium-sized (<64K keys) dicts while getting insertion order for free. It has gotten a lot of exposure and beating in the wild, and both CPython and PyPy stick with it.