I have millions of records -- specifically, citation templates on Wikipedia. An example citation:
{{Akademik dergi kaynağı|başlık=Air Pacific Ltd. History|tarih=2005|çalışma=International Directory|yayıncı=St. James Press|cilt=70}}
These are to be stored in a key-value database, which requires unique keys -- preferably keys derived from the citation itself. The citations vary enormously (some 300 languages, etc.), so there is no literal way to build a key from the text of the citation itself. I came up with this idea:
import unicode, base64
import zip/zlib   # nimble "zip" package

#
# Reverse a string, preserving combining characters
#
# credit: https://github.com/def-/nim-unsorted/blob/master/reverse.nim
#
proc isComb*(r: Rune): bool =
  ## True if `r` is a Unicode combining character.
  (Rune(0x300) <=% r and r <=% Rune(0x36f)) or
    (Rune(0x1dc0) <=% r and r <=% Rune(0x1dff)) or
    (Rune(0x20d0) <=% r and r <=% Rune(0x20ff)) or
    (Rune(0xfe20) <=% r and r <=% Rune(0xfe2f))

proc uniReversedPreserving*(s: string): string =
  result = newStringOfCap(s.len)
  var tmp: seq[Rune] = @[]
  for r in runes(s):
    if isComb(r):
      tmp.insert(r, tmp.high)   # keep the combining char attached to its base
    else:
      tmp.add(r)
  for i in countdown(tmp.high, 0):
    result.add(toUtf8(tmp[i]))

var
  origtx = "{{Akademik dergi kaynağı|başlık=Air Pacific Ltd. History|tarih=2005|çalışma=International Directory|yayıncı=St. James Press|cilt=70}}"
  comptx = compress(origtx, stream = RAW_DEFLATE)
  encotx = encode(comptx)
  key = substr(uniReversedPreserving(encotx), 0, 31)
  decotx = decode(encotx)                          # round-trip check
  uncotx = uncompress(decotx, stream = RAW_DEFLATE)

echo "Original: " & origtx
echo "Encoded: " & encotx
echo "key: " & key
It compress()'s the string to binary, encode()'s it to ASCII (Base64), reverses the string (since the compressed output's headers often start with repeated characters), and takes the first 32 characters of the reversed string -- i.e. the last 32 of the encoding -- as the key.
Is there a better/easier way to create a unique key from a string that is repeatable, i.e. creating the key from the same input always generates the same key? Understood that changes in spacing or capitalization will result in a different key even if the citation is otherwise the same; this is OK. Also, using the entire citation as the key won't work, as citations can be very long.
My vision is a method that is repeatable across programming languages, i.e. any library that provides compress and encode would generate the same key. I have not tested this, though, and I suspect it would not work due to implementation differences in compress and encode across languages.
Yes, as @jackhftang said: use a hash, like MD5, SHA, xxHash, ...
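For instance, a minimal sketch with Nim's stdlib std/md5 (the proc name citeKey is mine; no truncation is needed, since an MD5 digest is already exactly 32 hex characters). Because MD5 is specified byte-for-byte, any language's implementation yields the same digest for the same bytes, which also answers the cross-language concern above:

import std/md5

# Repeatable, fixed-length key: the same citation bytes give the same
# 32-hex-char digest in Nim, Python, Go, or anything else with MD5.
proc citeKey(citation: string): string =
  getMD5(citation)

echo citeKey("{{Akademik dergi kaynağı|başlık=Air Pacific Ltd. History|tarih=2005}}")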
A cryptographic hash would be safer, because for those it is a requirement that finding two strings with the same hash be effectively impossible. For other hashes that property is desirable but not required, because hash tables can handle duplicate hashes.
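To put a number on the collision risk, here is a rough birthday-bound estimate (the 10-million-record figure is an assumption for illustration):

import math

# P(any collision) ≈ n^2 / 2^(b+1) for n distinct inputs and a b-bit digest.
let n = 1e7        # assumed: 10 million citations
let bits = 128.0   # an MD5-sized digest
echo n * n / pow(2.0, bits + 1.0)   # ~1.5e-25 -- vanishingly unlikely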
Blake2b is a good modern hash that’s supposed to be faster than SHA-1. It’s available in libSodium and Monocypher.
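In Nim, one way to get BLAKE2 is the third-party nimcrypto package (my suggestion, not the poster's; nimcrypto's blake2_512 is its BLAKE2b variant). A sketch, with the 32-character truncation an assumption to match the key length used earlier in the thread:

import nimcrypto   # third-party: nimble install nimcrypto
import std/strutils

# BLAKE2b-512 digest of the citation, hex-encoded and truncated to 32 chars.
proc citeKeyBlake2(citation: string): string =
  let d = blake2_512.digest(citation)
  result = toLowerAscii($d)[0 .. 31]

echo citeKeyBlake2("{{cite book|title=Example}}")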
I agree that Blake3 is a choice that balances @stbalbach's needs well. For large inputs it may not even be 2x slower than murmur or xxhash, and it is drastically more collision resistant, both because of its 256-bit output and because it aims to be cryptographically secure.
Really, though, @stbalbach may be worrying about hash performance too early in the game. I recommend just using the stdlib SHA1 for your initial design and making it "easy" to swap out a different hash. When you are done, do some profiling. Maybe all your mucking about with Unicode will dominate your run time more than hashing.
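A minimal sketch of that advice, using the stdlib (std/sha1; note it moved to the external checksums package in Nim 2.x). The keyLen of 32 and the proc name are my own choices, and the hash is kept behind one proc so it is easy to swap out later:

import std/sha1

const keyLen = 32   # assumed, to match the original 32-character key

# The single swap point: if profiling ever shows SHA-1 to be the
# bottleneck, only this body needs to change.
proc citeKey(citation: string): string =
  ($secureHash(citation))[0 ..< keyLen]

let a = citeKey("{{Akademik dergi kaynağı|başlık=Air Pacific Ltd. History}}")
let b = citeKey("{{Akademik dergi kaynağı|başlık=Air Pacific Ltd. History}}")
assert a == b   # repeatable: the same citation always produces the same key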