https://andrewkelley.me/post/string-matching-comptime-perfect-hashing-zig.html
This post from about Zig, a language with slightly different goals than Nim, but is probably interesting in the context of Nim as well.
Perfect string hashing is useful for matching one-of-x strings, when the list of strings is known in advance; at runtime, it requires a small number of constant operations (good implementations get 1 or 2), usually a single short (often single byte) lookup, and then you have a candidate which is either the right one, or the string you hashed isn't in the list.
There are various methods to find a perfect hash; one is guaranteed to exist although the algorithms to find it are often non trivial. Most implementations eventually resort to picking a random seed, using it to compute a standard hash, and just retry until finding one with no collisions on the list given. In this context, this paragraph has an interesting suggestion:
Another way this can be improved, to reduce compile times, is to have the perfectHash function accept the RNG seed as a parameter. If the seed worked first try, great. Otherwise the function would find the seed that does work, and emit a compile error instructing the programmer to switch the seed argument to the good one, saving time in future compilations.
But he also mentions other strategies (e.g. by length or by a single byte or pair of bytes that differ) - and the good thing is, its compile time. When I have the time (any day now for the past 3 years :( oh well) I'll try to implement something like that - but I thought others would be interested in the mean time,
Perfect string hashing is useful for matching one-of-x strings, when the list of strings is known in advance
It rarely is though, at least for error detection you can also have none of X strings. And hashing is getting more complicated by every passing day thanks to security considerations. (Can you prove it's robust against every malicious input?) I'd rather spend the time on BTrees. :-)
at least for error detection you can also have none of X strings
Yes, the result of a minimal perfect hash is always "it is either this specific string, or none of them; you'll have to compare to this specific string to tell". The result of a non-minimal perfect hash is "its none of these strings; or it's this specific string (compare to verify) or none of them". Error detection is part of the standard use case.
The "list of strings is known" is a common use case, e.g. for keywords in a stream parser, or an an IFF atom (... yes, I'm old), or Windows messages, etc. Obviously it's not for every use ...
Am I understanding correctly that you want to implement BTrees for read-only compiled "case" statements or read only mapping types?
Am I understanding correctly that you want to implement BTrees for read-only compiled "case" statements or read only mapping types?
Nim's "string cases" already compile to hash lookups and could maybe use "perfect" hashing here indeed, but nobody ever complained about the performance of string case statements and the collision avoidance wouldn't avoid the final memcmp call anyway (which is the most expensive step).
But yeah, a BTree for Nim's stdlib collections Table and CountTable would be nice, or maybe a separate type. Hash tables are hard to make secure, hard to give them predictable performance (occasionally you have to resize them) and may use more space than a BTree. Doesn't seem to be a wise modern "default" implementation anymore.