nimforum mirror - OrderedTable is not an ordered table

lightness1024 (orginal) [2018-03-21T10:37:14+01:00] view original

I just read in the module tables:

OrderedTable is like Table but remembers insertion order

Well.. lol, thank you very much but that's not what we expect.

This is a "sequentially stable table", I suggest a rename to "SeqTable" maybe.

In C++ a set or a map are ordered tables. In C# there are sorted collections as well, they do what you expect. And apparently this has bitten some people in the past :( https://forum.nim-lang.org/t/2308#14161

Araq (orginal) [2018-03-21T11:15:50+01:00] view original

An OrderedTable is an ordered table. But it is not a sorted table. IMO, but I'm not a native speaker.

lightness1024 (orginal) [2018-03-21T11:41:01+01:00] view original

Hey that makes sense though. Maybe adding a SortedTable will be enough to solve this confusion. The mere presence of an OrderedTable and a SortedTable will pick reader's curiosity and will naturally force us to look into why there are two concepts that sounds similar.

cblake (orginal) [2018-03-21T11:47:35+01:00] view original

"order" is just ambiguous - ordered by a key or ordered by an action like insertion. Some resolve that ambiguity easily in their heads and others require it to be spelled out. For the same reason, "KeyOrdered" would be clearer than "sorted". Similarly, "PutOrdered" might be better than the current "Ordered", but I'm not sure it's worth the perturbation.

Nim's usage of "ordered" here is probably a minority usage, but not at all unique. For example, Python uses "OrderedDict" for exactly the same idea.

"Ordered" is probably slightly clearer than "sorted" because the order is maintained dynamically as opposed to (re)created by a sort/sorting algorithm which much more typically refers to some kind of batch reorganization. OTOH, "sorted" tells the user exactly what they do not need to do - namely, sort the thing.

cblake (orginal) [2018-03-21T11:50:15+01:00] view original

@lightness1024 - good point. Another argument in favor of "KeyOrderedTable" would be that it would probably show up in searches of the more generic "OrderedTable". I don't think Nim has a key-ordered collection (EDIT - in the stdlib) right now, but Araq has said he has a B-Tree waiting in the wings. :-)

lightness1024 (orginal) [2018-03-21T12:20:59+01:00] view original

@cblake - ok perfect. also naming things search-aware is important so I concur.

The python's OrderedDict: I didn't know that, it's a good thing nim is not alone then. And in fact the presence of this is enough to not go and rename it. (less disturbance)

In C++ the set/map is usually implemented with a red black tree. It fits a category which the language lawyers call a "node base container". Sorts are log(N) and happens at each operation that must trigger a rebalancing of the tree. (insert/remove)

The undered_set/unordered_map stuff are hash tables, implemented with an index(hashes)-array of pointers to linked lists. This allows for stable memory positioning of the values during the needs for rehashings. (decoupling index from values) Which allows operations to be "non iterator invalidating".

The hashes are not bound to the table's size, which means that they pass through a modulo operator before being used as indices. The statistics say that modulo is unfair in most cases, but the most fair for prime numbers. Which means that the "buckets" (hash table reservation size) are maintained and resized by jumps to the next reasonable prime number.

A problem of this scheme is that memory gets fragmented horribly, requiring allocators for serious work. I have written an open address hash map (here https://sourceforge.net/projects/cgenericopenaddresshashmap/) that allows to keep everything in a flat buffer. This has always better performance from small types.

The nim implementation appears to require powers of two for the table size, which means that the modulo must be done by discarding the most significant bits of the hashes, which loses entropy compared to the prime numbers solution.

cblake (orginal) [2018-03-21T14:24:17+01:00] view original

@lightness1024 - fair criticism of C++ STL.

modulo prime reduction of the hash code to a table address/array index is slow because division is slow. In my tests using modulo can sometimes triple the entire lookup time for simple integer keys due to the slowness of division! I recommend you consult Knuth's Art Of Computer Programming Volume 3 (section 6.4 in my 1998 edition, but his treatment mostly dates back to the 60s). He motivates why prime may be a good choice and then immediately moves on to a multiplicative hash. If you really need to mix the bits a multiply & shift is much faster than modulo. While the upper bits of the half-product do have more entropy, masking seems about as good in my experience and the code is slightly simpler due to the close relation of the mask and table size.

Knuth goes on at some length about choosing multipliers - about 10X as much length as the module prime. For some reason (perhaps an overzealous affection for number theory by CS theorists that write algo books or maybe just simplicity?) "modulo prime" got repeated more (everywhere - in classes/lectures/textbooks/code, etc.). modulo prime sadly "stuck" in programming culture needlessly. Python even has some array of primes less than powers of two. The pedagogical situation seems to be improving in the past decade or so.

In the context of Nim's Table, for integers the absolute most trivial identity hash is used. If that is not good enough, a user should override system hash functions with their own better scrambled hash code producer. Believe it or not (I always recommend testing on your own real data..), the trivial hash is often plenty fast. While adequacy of randomization is data dependent, with linear probing and modern CPU cache prefetching even large collision clusters can be scanned quickly (especially compared to many "improved randomness" mitigations).

Here are two examples of improved randomness failing. You could use a super expensive cryptographic hash function to get super random hash code bits, but the time you spend doing that would likely be much more than the extra time from scanning bigger clusters. Another example is double hashing which hops all over the table randomly in a very cache unfriendly way. The clusters are smaller, but the number of cache misses goes way up. To avoid dealing with hardware variation, people, including smart researchers, often use machine-neutral surrogates for actual time like "number of comparisons" or "cluster sizes". This can create all sorts of confusion and cross-talk when the surrogates are not very faithful to real machines.

Also, this is slightly off-topic for Nim, but I looked at your package and you probably have a subtle bug. I don't see you counting "tombstone" or deleted markers - usually desirable to reorganize the table if there are too many. See this thread for details. You also might like to read the thread on hash functions.

lightness1024 (orginal) [2018-03-21T15:10:49+01:00] view original

@cblake omg such a concentration of goodness in this answer, thank you for the tombstone warning. If I understand what you mean, I gave it some thought at the time, and concluded it would be an idea to "garbage collect" if some metric goes red, which is the solution google dense map appears to have chosen (auto rehash at 50% deleted). In the meantime, I did a much simpler mitigation, line 832 explains: I don't mark all deletion as deleted if the condition allows for return to clean/empty state. I'm relieved to see that nim is in the hands of better experts than the feeling it gives out.

cblake (orginal) [2018-03-21T16:12:04+01:00] view original

Thanks for the compliments. You should probably learn about backward shift deletion. It's a little tricky the first time you see it, but ultimately it's better than tombstones and a bunch of ad hoc garbage collection rules. You can look at lib/pure/collections/tables.nim. There are really only a few improvements to Nim's Table that I think might be good.

We could do Robin-Hood reorganization on inserts, being sure to retain linear probing and backward shift deletion. In my tests, Robin Hood costs about 10% more time on workloads which are mostly successful lookup but saves about 50% of the time for workloads that are mostly failed lookups. (Note that every insert of a novel element starts with an unsuccessful lookup.) So, Robin Hood never adds much on average and is often a net win on "many workloads" and as a bonus squashes the variance of lookup times.

The second idea to speed up might be storing the hash codes in a separate array from the keys so that a single cache line can fit a larger collision cluster. That speeds up searches, but it also slows down inserts and deletes which need to hit twice as many memory spots. So, your mileage may vary even more than RH.

Finally, Tables could also elide saving the hash code for SomeInteger-like "cheaply hashable/comparable" keys. Unlike the prior two ideas, the speed up from this would not be workload dependent. However, to do it there needs to be either a separate occupied bit vector (two cache lines again, like a segregated hash code array) or a special reserved key value signifying "empty". The reserved value is the efficient way to go and just one is usually not too onerous (zero, -1 or -2**31, etc.), but it's not an easy thing to default. So, that expands the API to the user - for just some type classes of key - to specify the empty value. So, a specialized name like SomeIntTable analogous to the specialized histogram CountTable is probably warranted.

Mirror of forum.nim-lang.org

3671 :: OrderedTable is not an ordered table