Hi,
while trying to optimize bitabs.nim, a bidirectional table implementation taken from the compiler, I had to group the key and hcode together, because benchmarking showed that the getKey check was faster due to cache locality (25%). Now, for that specific benchmark (which does no favors to this implementation, btw), downcasting the cached hash(val) showed another 5% speedup and a 15% reduction in size (a (int64, uint32) tuple takes 16 bytes due to padding). Is this downcasting (as you can see in the links) a good choice, or may it cause more hash collisions in other cases?
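For reference, here is a minimal sketch of the layout I'm describing, assuming an int64 key stored next to a hash cached as uint32; the type and field names are mine, not the actual bitabs.nim code:

import std/hashes

type
  KeyEntry = object
    key: int64     # the stored key
    h: uint32      # cached hash, truncated from Hash (an int) to 32 bits

proc cacheHash(key: int64): KeyEntry =
  KeyEntry(key: key, h: cast[uint32](hash(key)))

echo sizeof(KeyEntry)   # 16: 8 for key + 4 for h + 4 bytes of padding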
Thanks for the affirmation.
Maybe making LitId 64 bits as well as the hash could help?
I could do that, but how would it help?
Is this downcasting of a Hash (as you can see in the links) a good choice, or is it misguided and may it cause more hash collisions in other cases?
It seems to be a good choice and I should do the same in my implementation.
Also, FWIW, the benchmark results are:
You should also try other JSON benchmarks. But the memory consumption looks too high to me; maybe we need to use SSO-based strings inside the BiTable.
but the quality of the hash algorithm itself can be paramount.
Exactly, it depends on how good murmurHash is at producing unique hashes. I intend to find out with a benchmark that produces different combinations from the same letters and see how much truncating worsens the performance.
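Something along these lines, perhaps (a rough sketch; the proc name and letter set are made up): hash every permutation of the same letters and compare how many distinct values survive with the full Hash versus its 32-bit truncation.

import std/[hashes, sets, algorithm, sequtils, strutils]

proc collisionStats(letters: string): tuple[perms, full, truncated: int] =
  ## Hashes every permutation of `letters` and counts distinct values
  ## for the full Hash and for its 32-bit truncation.
  var chars = toSeq(letters.items)
  sort(chars)                       # nextPermutation needs ascending order to start
  var fullSet = initHashSet[Hash]()
  var truncSet = initHashSet[uint32]()
  while true:
    let h = hash(chars.join(""))
    fullSet.incl h
    truncSet.incl cast[uint32](h)
    inc result.perms
    if not nextPermutation(chars): break
  result.full = fullSet.len
  result.truncated = truncSet.len

echo collisionStats("abcdefgh")     # 8! = 40320 permutations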
But the memory consumption looks too high to me
I tried to approximate it and I came out 86 MB lower, but very close: https://play.nim-lang.org/#ix=3OXu
maybe we need to use SSO-based strings inside the BiTable.
will do that.
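For anyone unfamiliar with SSO (small string optimization), the idea is roughly this: short strings live inline with no heap allocation, long ones fall back to a regular string. A minimal sketch, where the 15-byte inline capacity and the names are just for illustration, not necessarily what the BiTable port would use:

type
  SsoString = object
    case isLong: bool
    of false:
      len: uint8
      buf: array[15, char]   # short strings stored inline, no heap allocation
    of true:
      data: string           # long strings fall back to a regular Nim string

proc toSso(s: string): SsoString =
  if s.len <= 15:
    result = SsoString(isLong: false, len: uint8(s.len))
    for i, c in s:
      result.buf[i] = c
  else:
    result = SsoString(isLong: true, data: s)

proc `$`(s: SsoString): string =
  if s.isLong:
    result = s.data
  else:
    result = newString(int(s.len))
    for i in 0 ..< int(s.len):
      result[i] = s.buf[i]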
A collision for Java's String.hashCode can easily be constructed from data[i]*31 + (data[i+1] + 31) = (data[i]+1)*31 + data[i+1].
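To make that concrete, here is the same recurrence (h = h*31 + c) re-implemented in Nim (the proc name is made up), showing the classic "Aa"/"BB" collision that the identity produces:

proc javaHash(s: string): int32 =
  ## Java's String.hashCode: h = h*31 + c for each character.
  for c in s:
    result = result * 31 + int32(ord(c))

assert javaHash("Aa") == javaHash("BB")        # both 2112
assert javaHash("AaAa") == javaHash("BBBB")    # the colliding blocks can be swapped freely
assert javaHash("AaAa") == javaHash("AaBB")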
Nim's murmurhash would be more involved but it's still 32-bit only. There are collision benches here:
For our use case, hash tables are used to store P2P message topics, and generating collisions on a 32-bit hash is relatively inexpensive.
For a generic hash table in the standard library, it's important that it has decent worst-case behavior on collisions so that network applications don't need to reimplement it.
If anyone is still interested, especially @Araq: I have ported it to use SSO strings internally. The same benchmark results are now:
packedjson2: used Mem: 386.075MiB time: 2.85s
packed sso: used Mem: 308.023MiB time: 3.35s
packed json: used Mem: 94.02MiB time: 2.0s
stdlib json: used Mem: 1.32GiB time: 3.3s
There is a bug that needs investigation, with some strings containing garbage when echoed, but it doesn't affect the bench. I had an idea of removing nodes of kind opcodeKeyValuePair in order to trim some overhead, but I don't know if it's feasible unless I try to implement it. Lastly, the remaining issue is duplicate keys stored in the tree, which I haven't found a performant way to avoid.
Just for fun, I made up a benchmark that I can win:
packedjson2: used Mem: 307.755MiB time: 5.68s
packed json: used Mem: 312.02MiB time: 7.90s
stdlib json: used Mem: 1.577GiB time: 17.1s