nimforum mirror - Interesting idea implemented in Zig

cumulonimbus (orginal) [2018-09-16T11:27:56+02:00] view original

https://andrewkelley.me/post/string-matching-comptime-perfect-hashing-zig.html

This post from about Zig, a language with slightly different goals than Nim, but is probably interesting in the context of Nim as well.

Perfect string hashing is useful for matching one-of-x strings, when the list of strings is known in advance; at runtime, it requires a small number of constant operations (good implementations get 1 or 2), usually a single short (often single byte) lookup, and then you have a candidate which is either the right one, or the string you hashed isn't in the list.

There are various methods to find a perfect hash; one is guaranteed to exist although the algorithms to find it are often non trivial. Most implementations eventually resort to picking a random seed, using it to compute a standard hash, and just retry until finding one with no collisions on the list given. In this context, this paragraph has an interesting suggestion:

Another way this can be improved, to reduce compile times, is to have the perfectHash function accept the RNG seed as a parameter. If the seed worked first try, great. Otherwise the function would find the seed that does work, and emit a compile error instructing the programmer to switch the seed argument to the good one, saving time in future compilations.

But he also mentions other strategies (e.g. by length or by a single byte or pair of bytes that differ) - and the good thing is, its compile time. When I have the time (any day now for the past 3 years :( oh well) I'll try to implement something like that - but I thought others would be interested in the mean time,

Araq (orginal) [2018-09-16T12:13:30+02:00] view original

Perfect string hashing is useful for matching one-of-x strings, when the list of strings is known in advance

It rarely is though, at least for error detection you can also have none of X strings. And hashing is getting more complicated by every passing day thanks to security considerations. (Can you prove it's robust against every malicious input?) I'd rather spend the time on BTrees. :-)

cumulonimbus (orginal) [2018-09-17T08:41:48+02:00] view original

at least for error detection you can also have none of X strings

Yes, the result of a minimal perfect hash is always "it is either this specific string, or none of them; you'll have to compare to this specific string to tell". The result of a non-minimal perfect hash is "its none of these strings; or it's this specific string (compare to verify) or none of them". Error detection is part of the standard use case.

The "list of strings is known" is a common use case, e.g. for keywords in a stream parser, or an an IFF atom (... yes, I'm old), or Windows messages, etc. Obviously it's not for every use ...

Am I understanding correctly that you want to implement BTrees for read-only compiled "case" statements or read only mapping types?

Araq (orginal) [2018-09-17T09:06:34+02:00] view original

Am I understanding correctly that you want to implement BTrees for read-only compiled "case" statements or read only mapping types?

Nim's "string cases" already compile to hash lookups and could maybe use "perfect" hashing here indeed, but nobody ever complained about the performance of string case statements and the collision avoidance wouldn't avoid the final memcmp call anyway (which is the most expensive step).

But yeah, a BTree for Nim's stdlib collections Table and CountTable would be nice, or maybe a separate type. Hash tables are hard to make secure, hard to give them predictable performance (occasionally you have to resize them) and may use more space than a BTree. Doesn't seem to be a wise modern "default" implementation anymore.

sflennik (orginal) [2018-09-18T04:46:43+02:00] view original

I've used perfect hashing with rust. https://github.com/sfackler/rust-phf I think it is a great idea and it might be implemented in nim with a macro. If you know the entries at compile time, why not make the map as fast and small as possible. In the compile time case you don't have to worry about security issues and protect against malicious input.

Mirror of forum.nim-lang.org

4200 :: Interesting idea implemented in Zig