So, the enum/mark-as-deleted approach used in collections.tables and collections.sets in the Nim standard library has a well-known deficiency. When you age the table in non-trivial ways, it can fill up with nothing but marked-as-deleted entries (or used and deleted entries, but zero truly empty, never-used slots). When this happens, the current rawGetImpl goes into an infinite loop for not-present keys: no slot is ever empty and the key never matches. If you would like I could provide a test program, but it's pretty clear just looking at the code.
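To make the failure mode concrete, here is a minimal sketch of the kind of churn that triggers it (assuming, as the current code appears to do, that the occupancy counter is decremented on deletion so the table never grows):

    import tables

    # Hypothetical stress test: each insert/delete pair leaves a tombstone
    # behind and never trips the enlarge threshold.  Eventually no slot is
    # truly empty, so any search for an absent key (including the probe done
    # on insert) has nothing to stop on.
    var t = initTable[int, int]()
    for i in 1 .. 1_000_000:
      t[i] = i
      t.del(i)
    echo t.hasKey(-1)   # unsuccessful search over the fully "aged" table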
There are a few ways around this problem. The most basic would be to prevent looping forever with a smarter loop. This has the problem that unsuccessful rawGet()s still suffer major performance degradation in aged tables, since there may still be many deleted entries to scan over. Successful gets usually work fine, though. So, that aspect of the problem can go unnoticed.
Another approach is to keep track of the population of "tombstones", aka marked-as-deleted entries, and resize the table if that gets too large. This has the virtue of being close to the existing implementation - it is basically like "enlarge", but does not need to actually grow the table space. Indeed, it might make sense to "right size" the table to be smaller when rehashing to remove tombstones. In this approach delete operations remain very fast and simple except for the rare cases where they induce an expensive whole-table rehash.
My personally favored approach is to actually do some re-hashing on each deletion. The easiest way to do this is Knuth's Volume 3 Algorithm 6.4R, though that only works with linear probing (i.e. "proc nextTry(h, maxH: THash): THash = result = (h + 1) and maxH"). [ V3 6.4R as written actually assumes (h - 1) and maxH, but it can be easily adjusted. ] This has the virtue that the table is basically "ageless": it always looks just like it would if you had built it from scratch. It also has the virtue that a "(h+1) and maxH" nextTry gives better cache locality, which makes such tables slightly faster (10..20%) in my testing.
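To make that concrete, here is a rough sketch of 6.4R adjusted for the forward (h+1) probe. The type and proc names are mine, not the stdlib's, and it assumes a power-of-two table with at least one empty slot:

    import hashes

    type
      SlotState = enum seEmpty, seFilled
      Slot[K, V] = object
        state: SlotState
        key: K
        val: V

    proc backshiftDelete[K, V](data: var seq[Slot[K, V]], index: int) =
      let mask = data.len - 1
      var i = index
      data[i].state = seEmpty              # vacate the deleted slot
      var j = i
      while true:
        j = (j + 1) and mask               # walk forward along the cluster
        if data[j].state == seEmpty:       # cluster ends: nothing left to fix
          break
        let r = hash(data[j].key) and mask # home slot of the entry at j
        # The entry at j may stay only if r lies cyclically in (i, j];
        # otherwise shift it back into the hole at i and continue from j.
        let stays = if i <= j: i < r and r <= j
                    else: i < r or r <= j
        if stays: continue
        data[i] = data[j]
        data[j].state = seEmpty
        i = j

Nothing is ever marked "deleted"; the cluster is repaired in place, so the table ends up exactly as if the remaining keys had been inserted fresh.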
I'm happy to actually do this and submit a pull request if necessary, but it seemed worth asking opinions about approaches. There are actually 6 or 7 copies of the same basic table implementation floating around in Nim. If you search for "((5 * h) + 1)" you can quickly see them all. It would be a slightly nicer showcase of genericity to have just one or two, but eh, software can evolve that way and at least one is deprecated. Most of these quasi-duplications do not support deletion, so this is not a real issue for those. It just bears noting in terms of the actual work, to the extent that "different but similar" can be confusing. E.g., if a move to a nextTry of "(h+1) and maxH" happened, it might make sense to do that in all those places. Or if any of those other places grew a need to support deletion, it would have to change.
Also, forgive me if this would be better raised as an issue/bug report on the GitHub page. This forum seems pretty active, though.
For probing, I'd actually recommend fixing the current implementation. It is based on Python's implementation, but incorrectly so: it does not include the perturbation part. As a result, it is currently just a somewhat more expensive version of linear probing. One can either use Python's approach or use the recurrence h[i] = (hash + x[i]) mod size, with x[i] = (x[i-1]*5+1) mod size (note that this differs from the current implementation, which essentially has h[i] = (h[i-1]*5+1) mod size).
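For illustration, a small sketch of the two recurrences side by side (names are mine; size is assumed to be a power of two):

    # The recurrence described above: all keys share the increment
    # sequence x, but each probe is offset from the original hash.
    proc probes(hash, size, n: int): seq[int] =
      result = @[]
      var x = 0
      for i in 0 .. n-1:
        result.add((hash + x) and (size - 1))  # h[i] = (hash + x[i]) mod size
        x = (x*5 + 1) and (size - 1)           # x[i] = (x[i-1]*5 + 1) mod size

    # What the current implementation essentially does: each probe is
    # derived only from the previous probe position.
    proc currentProbes(hash, size, n: int): seq[int] =
      result = @[]
      var h = hash and (size - 1)
      for i in 0 .. n-1:
        result.add(h)
        h = (h*5 + 1) and (size - 1)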
I would advise against linear probing because clustering makes worst-case performance much more likely.
With respect to dealing with deleted entries, the proper approach (IMO) would be to keep a count of deleted entries and rebuild the table when the deleted plus occupied entries exceed a certain percentage of the table (or if the deleted entries on their own become too many). This has the most predictable performance characteristics (insofar as hash table performance can be predictable), as it basically has the same worst-case scenario as hash tables without deletion.
Ok. Yeah, I understand about code evolution. I'll just do a PR for the h+1 linear probing for collections.sets & collections.tables (probably in a couple of days) and if you guys like it you can maybe revisit the other tables.
The performance won't be identical - deletes become about as slow as inserts instead of successful gets (and yes, there is a small gap there). Performance can't really remain "identical" while fixing the bug, though. Most prefer "less lumpy" performance anyway (less stop-the-world GC/whole-table rehashing, etc.) if the amortized cost is roughly similar - which it is in this case. The amortized cost ultimately depends on the mix of operations (how much tombstones slow searches down, how good the hash functions are, etc.). Right now the bug is truly pessimal performance under a certain workload that must just not have "come up yet" with Nim users. :)
Also, h=(h+1) can put more pressure on using a good hash function than the current h=(5*h+1), which is a kind of pseudo-double-hashing probe sequence. That is a subtle usage/performance/workload/key-space question without simple answers - avoidable in the lumpy-performance tombstone-counter approach since the probe sequence can stay the same, but not in the linear probing approach. E.g., the current "identity" hash function for integers actually works pretty well in my testing even though it barely hashes/mixes bits at all - it really just range-reduces the numbers to the table address space. If there is any "overall benchmark suite" for Nim we could throw the different impls at it to exercise a variety of cases, but it still wouldn't cover everything users might be doing now. This is just a caveat. I still think linear probing is the way to go, and I doubt anyone will complain about slight performance changes. And that's what I'll do on GitHub in the very near future.
Jehan - I believe the analysis based on clustering (a classic in the analysis of algorithms) is less relevant in today's cache sensitive environments. Even there the "worst case" is not really so bad - it's the sizes of clusters which are much smaller than the table size. With caches, people often put a lot of work into making clustering happen in other contexts. :-) With tombstones and certain kinds of aging, you can pollute the entire table so every slot is either occupied or a tombstone. Unsuccessful searches can then end up scanning the whole table - because you don't know the history of inserts/deletes. That whole-table scan really is the worst case. Which approach hits bad cases more frequently comes down to workload - rate of tombstone pollution, etc. You can tune how frequently those giant scans happen via the tombstone-density threshold, but that's yet another parameter and issue. Instead of a "total number of tombstones" you could keep track of how many you hit in a search and rehash at that point, too. There are a few ways to go. It's just not quite as clear-cut a question as I think your post makes it out to be. The property of always being the same table you would build from scratch is a kind of nice one, too, and the improved cache locality almost always more than makes up for the clustering (in my experience, though of course "it depends"). I would expect Python tables often have much more expensive comparisons than a Nim equivalent due to various expenses like method lookup/dispatch/etc. So, they may be more appropriately defensive against even nicely cached clustered comparison chains.
For what it's worth, if we are going to spend a whole 32-bit or 64-bit word per slot (the one used now for the enum just to know whether a slot is occupied - only one bit of information), we could instead save 31 or 63 bits of the hash code (the non-range-reduced hash function output) and use it as a boolean short-circuiting "comparison prefix". Before doing a possibly expensive per-entry comparison, first see if their hashes are the same. Same keys always yield the same hashes, after all. There is still a birthday paradox and it's not perfect, and you still have to compare at least once, but collisions within the maximum address space are typically very, very rare. So, this results in only one "real, non-integer comparison" almost all the time. For long string keys it can speed things up quite a bit. It probably belongs in the "general purpose" table with user-defined keys, and it goes a long way toward eliminating key-clustering concerns, since cache misses can cost as much time as quite a few integer compares. This can slightly slow down integer-like key tables, though, where you are replacing one fast compare with two. So, a specialization for those may still be desirable. I'm still kind of sussing out best practices for parametrizing aspects like this in Nim.
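A sketch of what a lookup looks like with that stored prefix (illustrative names and a power-of-two sized seq assumed, not the stdlib's layout; a real version would have to remap a genuine hash code of 0, which doubles here as the "never used" marker):

    import hashes

    type
      Entry[K, V] = object
        hcode: int   # full, un-reduced hash; 0 doubles as "empty slot"
        key: K
        val: V

    proc find[K, V](data: seq[Entry[K, V]], key: K): int =
      let mask = data.len - 1
      let hc = hash(key)
      var i = hc and mask
      while data[i].hcode != 0:              # stop at a never-used slot
        # cheap integer check first; the possibly expensive key comparison
        # only runs when the full hash codes match, which is rare for
        # distinct keys
        if data[i].hcode == hc and data[i].key == key:
          return i
        i = (i + 1) and mask                 # linear probe
      result = -1                            # not present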
One other subtle issue that I should have mentioned in my first message: once you can re-organize the table on deletion (either by tombstone-driven whole-table rehashing or by more local per-delete rehashes), it becomes problematic/complicated to delete entries inside a loop iterating over the same table. The array slots can change all around on you. I was looking through the current Nim GitHub code and didn't see any uses like that inside Nim proper. End-user code might very well do it, though. Just marking a slot deleted doesn't cause any trouble since nothing is moved. Inserting into a table while iterating over it does cause trouble, though, as the slots in the enlarged table could all be permuted.
I don't think the current implementation is robust in that "insert into table while looping over it" scenario either. So, that might better be solved by some less table-structure-oriented approach, or even just "documentation of limitations".
cblake: Jehan - I believe the analysis based on clustering (a classic in the analysis of algorithms) is less relevant in today's cache sensitive environments. Even there the "worst case" is not really so bad - it's the sizes of clusters which are much smaller than the table size.
The problem with clustering is that it greatly increases the likelihood of the worst case: as you get clustering, you are increasingly likely to get bigger clusters, until you eventually have O(n) rather than O(1) performance in the worst case. If you're using hash tables, you're implicitly accepting that there's some randomness to your performance, but driving down μ a bit at the price of a bigger increase in σ does not look like a good trade-off. The absolutely last thing you want in a hash table is an increased risk of worst-case behavior. It strikes me as the kind of micro-optimization that I would be very cautious about because it can carry a steep price.
One particular problem you have to consider is hash functions that aren't very random and that do cluster naturally. Python's string hashing function is an example, and it's part of why linear probing wasn't a good idea for Python. Such hashing functions often occur naturally and are often desirable (for example, Fibonacci hashing is very well behaved in distributing integer or pointer values evenly over a range, but its avalanche effect is poor). Linear probing very strongly depends on the hash function being very random; in order to get that randomness, you may have to employ a more expensive hash function that takes away any performance benefit you gain.
cblake: Before doing a possibly expensive per-entry comparison, first see if their hashes are the same.
While I think that this may be a good idea (Python does it, too), just in case you have a very expensive comparison function (such as calling out to Lua/Python to compare Lua/Python objects), it may in general provide less of a payoff than you think, at least for common comparison functions. Two entries that hash to the same value are likely to be very different (assuming a sane hash function), so the comparison will usually (assuming some sort of lexicographic comparison) fail after 1-2 bytes/words.
cblake: I would expect Python tables often have much more expensive comparisons than a Nim equivalent due to various expenses like method lookup/dispatch/etc.
Not really a concern. Python dictionaries were primarily optimized for the common case of string keys with the standard built-in comparison function. I was involved in that effort and can share what I remember if you want.
Those are all fine points. Hash functions are sometimes clustered "by design" (say, nearby keys like fooooooA, fooooooB map to nearby or adjacent addresses, as is sometimes recommended - make common distinct things map to distinct addresses), which works great right up until you run out of space and "unrelated" clusters interfere. So, you want both less random and more random functions at the same time. It's just all so contingent. :-)
In terms of average/variance trade-offs, the tombstone approach can suffer from a lot of rehashing if you set tombstone density thresholds low, and if you don't, things are way worse than linear probing clustering. With a lot of insertions/deletions but similar table sizes, it can be really costly if you set them high. Anyway, there's little question the best approach depends on operation mixes, object sizes, key sets, hash functions, etc., etc.
Everything in hashing has a lot of "it depends". That makes it really hard to settle on one-size-fits-all solutions, even at the table-structure level. It might be "nicest of all" to provide a couple of options, with community experience driving what becomes the "default"..maybe even options like external chaining, cuckoo, something like what PyPy does with the new insert-ordered dict sparse-dense style (dating back to similar work by Preston Briggs in the late 80s/early 90s), etc. I may well get around to stuff like that, but those seem like higher-order concerns. I'll start with something that doesn't infinite loop on me if I delete too much. :) Or if you had a strong desire to run with a tombstone thing and fixing the probe sequence, that's ok with me, too.
cblake: In terms of average/variance trade-offs, the tombstone approach can suffer from a lot of rehashing if you set tombstone density thresholds low, and if you don't, things are way worse than linear probing clustering.
You can always do rebuilding in expected amortized constant time (if you don't want amortized constant time, but expected constant time, you need bucket hashing, anyway).
First, store the hash values along with the entries.
Second, rebuild the hash table after an insertion if the number of used plus the number of deleted entries exceeds 2/3 of the table size (or whatever maximum occupancy ratio you want).
Third, and optionally, rebuild the hash table after a deletion if the number of deleted entries exceeds the number of used entries (or a fraction of the table size).
This guarantees that you will rebuild the table only after O(tablesize) insertions or deletions, making the expected amortized cost per insertion or deletion constant. Note that the third part is only necessary if you want to save space after having lots of deletions without any insertions; Python basically only does the first two things, and it's a simple and robust approach that has served them well for decades now.
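A rough sketch of that bookkeeping, using the 2/3 ratio from above (the counter and proc names here are made up, not the stdlib's):

    import math

    type
      Counts = object
        slots: int                 # current table size (power of two)
        numUsed, numDeleted: int   # live entries and tombstones

    # Step 2: after an insertion, rebuild once used + deleted slots
    # exceed 2/3 of the table.
    proc needsRebuildAfterInsert(c: Counts): bool =
      3 * (c.numUsed + c.numDeleted) > 2 * c.slots

    # Step 3 (optional): after a deletion, rebuild to reclaim space once
    # tombstones outnumber live entries.
    proc needsRebuildAfterDelete(c: Counts): bool =
      c.numDeleted > c.numUsed

    # Rebuilding sizes the new table from numUsed alone, so the fresh
    # table never starts above the desired occupancy.
    proc rebuiltSize(c: Counts): int =
      nextPowerOfTwo(max(8, 2 * c.numUsed))

Since a rebuild can only be triggered after O(table size) insertions or deletions, the amortized cost per operation stays constant, as described above.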
cblake: Or if you had a strong desire to run with a tombstone thing and fixing the probe sequence, that's ok with me, too.
I have a journal paper deadline in a couple of weeks and a workshop to attend next month, so I'm not sure if I have a whole lot of time to spare in the foreseeable future.
Yeah. I get all that. That was my 2nd approach in my first post (though I abbreviated). Doing something like you say with a giant tombstone density threshold can kill performance even just in the mean. E.g., a table 50% occupied with 50% tombstones will be 100% full for unsuccessful searches and vastly slower than a linear probing table with any but the most pessimal hash function. So, "equal deleted and occupied" as you mention would be far too high a threshold, guarding against an infinite loop but poorly guarding against performance degradation.
Even though it ignores cache effects, consulting something like Knuth v3 figure 44 is pretty informative (page 545 in the 1998 edition). You would want a pretty small tombstone density threshold - more like under 5% of the table space for double hashing or fancier probe sequences to stay ahead of LP for giant objects..maybe under 1..3% to be "safe". Then the question just becomes about "likely workload" vagaries - how often do you delete a few percent of entries before inserting stuff that replaces the tombstones, etc. That can honestly come down to "end programmer loop styles", like whether they do two passes, deleting a bunch of stuff and then adding, or interleave the two.
Then there are all the cache effects, which are strongly sensitive to how large objects are - a whole other dimension of "workload". Cache lines are 64 bytes and modern hardware will often prefetch the next line or three. So, if you can fit like 24..32 8-byte objects in 3..4 lines, then even the maximum linear-probe cluster scan over a large table will be very fast, because the hardware masks memory latency for you, and if you saved hash codes you're only doing integer compares. (In my experiments with string tables on real-world key sets, I usually see a maximum probe sequence of roughly log base 2 of the table size, though of course hash function quality drives that and there are also theoretical results for random-like hash functions.) If objects are dozens of bytes, that hardware prefetching bonus shrinks a lot or vanishes, but object size more than anything else will be the big determiner of performance.
Python hash slots are also pretty big compared to minimal Nim slots..a hash code and a couple of 8-byte pointers, or 24 bytes. Nim's could be as small as, e.g., 2 bytes for a table of int16's (with a special empty key instead of an enum state). That size ratio of 12X can make for large but still "constant factor" differences - as large as latency ratios across various levels of cache. Lines are often only 64 bytes and typical collision clusters are only 2-20 entries. Constant factors matter, and "efficiency" is one of Nim's top priorities.
Locality matters even more as tables get really large or memory competition gets great and things fall out of L3 into DIMMs. I've seen linear probing beat even (at least a naive, standard) Cuckoo dynamic semi-perfect hashing just because the latter sometimes does two accesses. Cuckoo never does more than two, but it does two often enough (and in independent memory regions, by design) to be slower than common-case linear probing. There may be fancier Cuckoo variants, but most Cuckoo hashing can also be quite sensitive to hash quality. Identical 64-bit hash codes can cause bad cycles/problems, for example. It's probably not great for a default std lib choice without a lot of care. I think LP is simpler.
There are also quite a few authors who have made this LP/caching observation in the literature over the past 25 years. I don't think I'm drawing kooky conclusions based on uncommon key sets or uncommonly great hash functions, and my conclusions aren't just theoretical but come from real timings in other contexts. There also remains a lot of "received wisdom and/or starting points" from the analyses of the 60s..80s that is, I think, decidedly slow. The example I usually use is "hash code modulo prime table size" reduction to the table range. Integer division is still so much slower than bit-masking or shifting that this can make a 2X difference in a weakly loaded integer table in L1. Last I checked, CPython did that modulo-prime stuff, too. Division is probably faster than even one C function call (like to memcmp or a hash function)..so it may be a small deficit in CPython in context, but Nim has a lot of good inlining ability. I think you could measure roughly a 2X change.
Just because CPython performance/impl ideas have worked "ok" for a long time really doesn't make them some gold standard, especially for informing a properly compiled language with various lower-overhead scenarios. It means you might not go "too wrong", I guess, but I think Nim can have higher ambitions. It actually has the fastest out-of-the-box std lib integer-keyed table I've tested against my own stuff in about 20 years of checking/trying stuff out. Kudos/thanks for that, Araq! :-)
As I mentioned, a wide variety of tables/containers with different strengths and weaknesses would be better and should be a longer-term ambition. This is just about the default not blowing up.
Well, for the record, this has been a good back and forth. I guess I have more free time than you, so I'll give the LP delete a go. I doubt anyone will notice any performance degradation..much more likely the opposite. If they did, we could always have two types of table - the old one truly identical to before but without a del/excl proc, and a new one that supports deletes via LP. It's mostly just changing one proc and simplifying a couple of others, and in some ways it's actually less invasive than tombstone counters/rehashing/cuckoo/etc/etc. Good luck with your paper/workshop!
cblake: E.g., a table 50% occupied with 50% tombstones will be 100% full for unsuccessful searches and vastly slower than a linear probing table with any but the most pessimal hash function.
Huh? I have no idea what you are saying here. If you seriously mean that you have, say, a table with 128 slots of which 64 carry data and 64 have deletion markers in them, that should not happen, period, and you're seriously misunderstanding something.
cblake: So, "equal deleted and occupied" as you mention would be far too high a threshold, guarding against an infinite loop but poorly guarding against performance degradation.
No. After an insertion, if the number of occupied plus deleted slots exceeds the maximum occupancy ratio, then you rebuild. This guarantees that the occupancy ratio (including deleted slots) will never exceed that maximum, and the worst-case performance will never be worse than for a table without deletions. You may be wasting space, but performance can never degrade to the point that you mention. If it were a problem, then Python's implementation would have crashed and burned long ago.
When I say that you rebuild when the number of deleted slots exceeds the number of occupied slots, that's only after a deletion, and only as a space-saving measure.
Rebuilding, of course, should never construct a table with a higher-than-desired occupancy ratio, but we know the number of occupied slots in advance when we do that, so we know the size that we need.
cblake: I've seen linear probing beat even (at least a naive, standard) Cuckoo dynamic semi-perfect hashing just because the latter sometimes does two accesses.
Yes, I know that. I'm worried about the case where "the statistician drowned in a lake with an average depth of three feet". If the best case can be improved slightly at an increase in risk of a worst-case situation happening, then the tradeoff is really not acceptable IMO. This is a standard library function that must work robustly with a variety of hash functions and use cases without breaking down, not a special-purpose library that you fine-tune for a specific application.
Ah..you seem to define load factor as (tombstones + occupied) over space, which is fine, while I was conceiving of separate thresholds. Guess we were talking past each other a little on that (and perhaps more).
Even so, rebuilding to remove deleted slots is definitely not "only a space-saving measure". It's more akin to why one grows at 66% rather than 99%. Even with double hashing, a 66% full table is much slower than a 33% full table - over 2X slower (the clusters are just a bit smaller than linear probing ones). With, say, 33% tombstones and 33% occupied, you would have performance similar to that 66% case when it could have been over twice as good. It's an even bigger performance change if you had 65% tombstones and 1% real keys. Feel free to consult Knuth if you don't understand. That diagram just graphs probes and doesn't separate out the cache load (or hit) of checking for empty versus comparing keys and so on. An unsuccessful search in a mostly empty, L1-cached table is usually very fast. So, it's not the end-all, but a lot of people seem to have access to Knuth's TAOCP. Or maybe that's just a foolish hope of mine. ;-)
What I'm starting to sense is that you have internalized the trade-offs/risks for the Python impl, but maybe not for many other choices and are afraid of different/fear change. E.g., you write as if there is a hyper-sensitivity/precipitous breakdown linear probing is susceptible to that double hashing is not. Not true. All open addressing (besides Cuckoo) has clustering and breaks down as the table fills; fancier probe sequences just soften that a little. Various separate chaining and other collision resolution strategies fully address it if you are really worried about it. I believe the g++ STL uses some separate chaining solution, and it's (comparatively) slow as molasses and crazy memory hungry for integer keys, last I tried. So, there are performance/memory/indirection/etc/etc trade-offs there, too, in terms of protection from three-foot drowning. There's no reason Nim couldn't grow those kinds of tables, too, and even include them in the std lib. What should be the default can always be sorted out over time. Anyway, I'm also starting to sense you won't ever feel cozy with my idea without an implementation to test. So, I should get back to that instead.
On that topic of 66% vs 99% sensitivity of performance to load, it might make sense for users to be able to pass a parameter to initTables to specify the load factor at which enlarging happens. It's not like that threshold isn't already a kind of arbitrary heuristic. Right now the library is choosing the space-speed trade-off for the user, but some users may know how big their tables are compared to memory pressure, etc., or might glean which ones matter from profiling. Then you could just say growLoad=1/2 instead of the growLoad=2/3 currently in effect. If you did that, new linear-probed tables 1.3X larger should be nigh identical to the double-hashed ones (usually faster, I think, but you seem to be taking issue without measurement or even argument beyond "Gah, Python! Gah! Different! What if some 'dumb' user sets growLoad=99%?"). It's true users can "right size" the table if they know the sizes in advance, but there's no easy way to tune the growth algorithm's parameter.
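Purely as a hypothetical sketch of the API (growLoad is not something the current initTable/initTables accepts):

    # hypothetical: the caller picks the threshold at which enlarge() fires,
    # instead of the hard-coded ~2/3
    var t = initTable[string, int](initialSize = 256, growLoad = 0.5)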
Why, if we did the special-key-value, non-hashcode-cached version for maybe another kind of table, then we could reclaim far more than that 1.3X space for tiny objects..we could probably reduce the footprint by 2X+. [ Special key values as state codes aren't exclusive to linear probing, of course, but LP+Algorithm 6.4R needs one less special value than deletion marking..so that approach is more amenable, just to the extent that it's easier to pick one special value than two. ]
Anyway, measurement/argument-wise, PyPy recently reported significant speedups (for some things, slowdowns for some workloads) relative to your favorite CPython comparison point with a very different data structure you might enjoy learning about if you don't know it. While not always faster, it smells more robust to me, statistician-drowning-wise. There are links to the PyPy blog posts in the PyPy-2.5.0 announcement. I also mentioned that earlier, but I will try to stop repeating myself, as these messages are already all far too long (or too short/not pedagogical enough), inspiring pwernersbach's "over my head" comment, I suppose.
Feel free to consult Knuth if you don't understand
relative to your favorite CPython comparison point with a very different data structure you might enjoy learning about if you don't know it.
but you seem to be taking issue without measurement or even argument beyond "Gah, Python! Gah! Different! What if some 'dumb' user sets growLoad=99%?").
Consider yourself warned. There is no reason whatsoever to start personal attacks. The huge wall of text makes these offenses harder to spot, but unfortunately for you, I was bored enough to read it all.
On that topic of 66% vs 99% sensitivity of performance to load, it might make sense for users to be able to pass a parameter to initTables to specify the load factor at which enlarging happens.
From an API design point of view we should consider an enum expectOnlyPositiveSearches, expectOnlyNegativeSearches, expectMostlyPositiveSearches etc.
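Sketching what that might look like (hypothetical names taken from the suggestion above, not anything in the stdlib):

    type
      SearchExpectation = enum   # hypothetical tuning knob
        expectOnlyPositiveSearches,   # misses are rare
        expectOnlyNegativeSearches,   # misses dominate; keep the load low
        expectMostlyPositiveSearches  # the likely default

The table constructor could then map the chosen value to a grow threshold and/or probing strategy internally.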
cblake: What I'm starting to sense is that you have internalized the trade-offs/risks for the Python impl, but maybe not for many other choices and are afraid of different/fear change.
Then you are sensing wrong. I have been using the Python implementation as a baseline because it (1) is one of the most battle-tested implementations in existence and (2) is very simple, so it's fairly easy to analyze and difficult to screw up. That does not mean that I am not keenly aware of existing and ongoing research in the field, than you, and I have read TAOCP, thank you (though I think you know as well as I that it doesn't really reflect the state of the art anymore).
I have also written my share of specialized hash table implementations, with a variety of requirements, often differing from the ones we have here. It's not as though I'm religiously opposed, but I'd like to see demonstrated rather than conjectured benefits. Most importantly, the main requirement of a standard library implementation is not that it exhibits optimal performance for a given use case, but robustness in the general case; you can always write a specialized implementation for when you have specific requirements.
cblake: E.g., you write as if there is a hyper-sensitivity/precipitous breakdown linear probing is susceptible to that double hashing is not.
No. I noted that linear probing, especially in the presence of weak hash functions, increases the risk of performance degradation, but did not make any claim about that being a binary on/off property. I was very specific there precisely to avoid giving that impression, so I'm not sure where you got that from.
cblake: I didn't mean the "Gah" stuff personally, more jokingly, and Jehan himself basically said CPython was his favored comparison point, in other words.
That would be a misunderstanding. In fact, I think there are plenty of use cases where it is suboptimal (simple example: fine-grained concurrency, which is much easier to do with bucket hashing, though I've also implemented open addressing with fine-grained concurrent access in the past).
make personal "than you" attacks as you also kinda did above
Um, Jehan's "than you" was a typo for "thank you".
my admittedly shameful text mountains
You should work on that. It's pointless to write all that when it turns people off from reading it. Focus.