I have millions of records -- specifically, citation templates on Wikipedia. An example citation:
{{Akademik dergi kaynağı|başlık=Air Pacific Ltd. History|tarih=2005|çalışma=International Directory|yayıncı=St. James Press|cilt=70}}
These are to be stored in a key-value database, which requires unique keys -- preferably keys derived from the citation itself. The citations vary enormously (some 300 languages, etc.), so there is no literal way to build a key from the text of the citation itself. I came up with this idea:
import unicode, base64
import zip/zlib   # nimble "zip" package

#
# Reverse a string, preserving combining characters
#
# credit: https://github.com/def-/nim-unsorted/blob/master/reverse.nim
#
proc isComb*(r: Rune): bool =
  ## True if `r` is a Unicode combining character.
  (Rune(0x300) <=% r and r <=% Rune(0x36f)) or
    (Rune(0x1dc0) <=% r and r <=% Rune(0x1dff)) or
    (Rune(0x20d0) <=% r and r <=% Rune(0x20ff)) or
    (Rune(0xfe20) <=% r and r <=% Rune(0xfe2f))

proc uniReversedPreserving*(s: string): string =
  result = newStringOfCap(s.len)
  var tmp: seq[Rune] = @[]
  for r in runes(s):
    if isComb(r):
      tmp.insert(r, tmp.high)   # keep the combining char attached to its base
    else:
      tmp.add(r)
  for i in countdown(tmp.high, 0):
    result.add(toUtf8(tmp[i]))

var
  origtx = "{{Akademik dergi kaynağı|başlık=Air Pacific Ltd. History|tarih=2005|çalışma=International Directory|yayıncı=St. James Press|cilt=70}}"
  comptx = compress(origtx, stream = RAW_DEFLATE)
  encotx = encode(comptx)
  key = substr(uniReversedPreserving(encotx), 0, 31)
  decotx = decode(encotx)                          # round-trip check
  uncotx = uncompress(decotx, stream = RAW_DEFLATE)

echo "Original: " & origtx
echo "Encoded: " & encotx
echo "key: " & key
It compress()'s the string to binary, encode()'s it to ASCII (Base64), reverses the string (since the compressed output's headers often start with repeated characters), and takes the first 32 characters of the reversed string -- i.e. the last 32 of the encoding -- as the key.
Is there a better/easier way to create a unique key from a string that is repeatable, i.e. creating the key from the same input always generates the same key? Understood that changes in spacing or capitalization will result in a different key even if the citation is otherwise the same; this is OK. Also, using the entire citation as the key won't work, as citations can be very long.
My vision is a method that is repeatable across programming languages, i.e. any library that provides compress and encode would generate the same key. I have not tested this, though, and I suspect it would not work due to implementation differences in compress and encode across languages.
Yes, as @jackhftang said: use a hash, like MD5, SHA, xxHash, ...
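For instance, a minimal sketch with Nim's stdlib std/md5 (the proc name citeKey is mine; no truncation is needed, since an MD5 digest is already exactly 32 hex characters). Because MD5 is specified byte-for-byte, any language's implementation yields the same digest for the same bytes, which also answers the cross-language concern above:

import std/md5

# Repeatable, fixed-length key: the same citation bytes give the same
# 32-hex-char digest in Nim, Python, Go, or anything else with MD5.
proc citeKey(citation: string): string =
  getMD5(citation)

echo citeKey("{{Akademik dergi kaynağı|başlık=Air Pacific Ltd. History|tarih=2005}}")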
A cryptographic hash would be safer, because for those it is a requirement that finding two strings with the same hash be effectively impossible. For other hashes that property is desirable but not required, because hash tables can handle duplicate hashes.
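To put a number on the collision risk, here is a rough birthday-bound estimate (the 10-million-record figure is an assumption for illustration):

import math

# P(any collision) ≈ n^2 / 2^(b+1) for n distinct inputs and a b-bit digest.
let n = 1e7        # assumed: 10 million citations
let bits = 128.0   # an MD5-sized digest
echo n * n / pow(2.0, bits + 1.0)   # ~1.5e-25 -- vanishingly unlikely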
Blake2b is a good modern hash that’s supposed to be faster than SHA-1. It’s available in libSodium and Monocypher.
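In Nim, one way to get BLAKE2 is the third-party nimcrypto package (my suggestion, not the poster's; nimcrypto's blake2_512 is its BLAKE2b variant). A sketch, with the 32-character truncation an assumption to match the key length used earlier in the thread:

import nimcrypto   # third-party: nimble install nimcrypto
import std/strutils

# BLAKE2b-512 digest of the citation, hex-encoded and truncated to 32 chars.
proc citeKeyBlake2(citation: string): string =
  let d = blake2_512.digest(citation)
  result = toLowerAscii($d)[0 .. 31]

echo citeKeyBlake2("{{cite book|title=Example}}")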
I agree that Blake3 is a choice that balances @stbalbach's needs well. For large inputs it may not even be 2x slower than murmur or xxhash, and it is drastically more collision resistant, both because of its 256-bit output and because it aims to be cryptographically secure.
Really, though, @stbalbach may be worrying about hash performance too early in the game. I recommend just using the stdlib SHA1 for your initial design and making it "easy" to swap out a different hash. When you are done, do some profiling. Maybe all your mucking about with Unicode will dominate your run time more than hashing.
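A minimal sketch of that advice, using the stdlib (std/sha1; note it moved to the external checksums package in Nim 2.x). The keyLen of 32 and the proc name are my own choices, and the hash is kept behind one proc so it is easy to swap out later:

import std/sha1

const keyLen = 32   # assumed, to match the original 32-character key

# The single swap point: if profiling ever shows SHA-1 to be the
# bottleneck, only this body needs to change.
proc citeKey(citation: string): string =
  ($secureHash(citation))[0 ..< keyLen]

let a = citeKey("{{Akademik dergi kaynağı|başlık=Air Pacific Ltd. History}}")
let b = citeKey("{{Akademik dergi kaynağı|başlık=Air Pacific Ltd. History}}")
assert a == b   # repeatable: the same citation always produces the same key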