this isn't Nim specific, of course, but how do i design a mapping from pointers in a single process's address space to int32 to minimize the likelihood of collision?
To start with, i know that only the lower 48bits are used.
and if i know the alignment of the pointer type i might be able to eliminate a couple of the low bits
is there anything else about this domain i can use to maximise the collision resistance of this mapping? say, a way to weight the lower bits more than the higher (if that's even appropriate)
assume that it's not important that the hash function have any other properties other than collision resistance, i.e. uniform distribution etc are all subsumed by this goal.
It is important here to distinguish between collision resistance in terms of like a "self avoiding random walk" - i.e. minimizing collision, and just "being close to random".
Assuming you mean the latter, the regular hash(uint64) in the stdlib should be quite resistant when given 32-bit inputs. I tested it with this large-scale pseudo-random number generator test suite. If you have counter examples not just "worry" that would be interesting.
If you want to guard it from an adversary you could put some hard to guess salt in there (e.g. hash(adr xor salt)). If you are in a table context the salt could be derived from the VM addr of each instance of a data structure like I think I do in adix by default.
You could use CPU/OS physical-based RNG to be even more secure at some cost.
Anyway, to help you it might help for you to give slightly more context on how you will use these hashes..
If instead you took the lower bits of the full product of 42*42 of 86 bits (easy to do - just multiply in a uint64 and then mask to get the lower 32), then you would get less bit mixing, but more resistance in the small..like (I think) guaranteed non-collision of hashes from all or almost all input values a certain distance apart.
Usually you want some bit mixing or you would not call it a "hash" (which comes from a kind of hash browns/chopping up and mixing together metaphor). Anyway, I think these posts of mine may at least help in making your request/ask more precise.
is there anything else about this domain i can use to maximize the collision resistance of this mapping? say, a way to weight the lower bits more than the higher
I don't think you want any used bits to be weighted higher than other. In any case you'd be better off with testing for your specific input and looking at bitmaps for patterns and clustering + collision counter (or heatmap). Collect a representative sample of your universe and start with something simple: hash(p), hash(p shl 8) or some other basic transformation first, before trying other hashing functions.
Your problem can be viewed as noisy-channel encoding. Though in practice we map a block to larger space, your case is mapping to a smaller space, but the mathematical structure is the same. There is a family of coding call linear code . Given you do know the collision probability of each bit, you can actually calculate the collision probability of the whole codeword and there must be a minimum one.
Some notations:
let x[i] = the i-th bit of input, i = 1..48
let y[j] = the j-th bit of output, j = 1..32
let a[i,j] = the value at (i,j) in linear code, either 0 or 1
x and y have a relationship that y[j] is the sum of product of a[i,j]*x[i] over field-2.
denote as y[j] = sum[i=1..48](a[i,j] * x[i])
The probabilities:
Given p(i) = the collision probability of at i-th bit.
(i.e. given any two input x1, x2, they have the same bit at i-th)
let q(i) = 1 - p(i)
let p(i,j) = the collsion probability of x[i] + x[j] over field-2, i != j
= p(i) * p(j) + q(i) * q(j)
let q(i,j) = 1 - p(i,j)
let p(i,j,k) = the collsion probability of x[i] + x[j] + x[k]
= p(i) * p(j,k) + q(i) * q(j,k)
let q(i,j,k) = 1 - p(i,j,k)
in general,
let p(i[1],i[2],...,i[n]) = p(i[1]) * p(i[2],..,i[n]) + q(i[1]) * q(i[2],...,i[n])
let Py(j) = the collision probability of of y[j]
= P(i[1], i[2], ... i[n]) where a[i[k], j] = 1 for k = 1..n
For two distint inputs x1, x2 to have the same output y, y[i] has to be collide for all i and so the collision prob. of y is
A = product[j=0..32](Py(j))
Our goal is to find a[i,j] such that A is minimum, which is equivalent to finding the 32 vectors v with the smallest p(v[1],v[2]...)
I do not see any special structure to explicit here, finding the smallest 32 vectors is a discrete optimization problem, which in general only have brute force solution. With DP, the space requirement is 2^48*float with time complexity O(2^N), N = 48. And I stop here for now =]
As another example, if you had 48-bit addrs with FF's up top and 12-bit-aligned (4096) say virtual mem page-aligned, then you only have 36 bits. So, if you literally mask off the top 4 then instead of 64 Gigispots you'd have 4 GiSpots. If you are doing DLL/shared library load addresses you may not even have diversity in those top 4 bits.
The OS may have quadrants or octants. If it has hexigants (not sure that's a word) then the lower 32-bits of that 36-bit possibility are all unique and perfection is trivial (x shr 12) and 0xFFFFFFFF'u32. If it doesn't then you want to distribute the entropy from those 4 bits over "some amount" of the lower bits..Depending on what you are doing you might want to bias towards lower bits or upper bits. E.g., if you think the minimum size of a shared lib/DLL is 16 pages then you really get another 4 bits of alignment and you could just shift more.
With hashing what is best almost always "depends". Just the nature of the beast, unfortunately. An interesting modern variant is "perceptual hashing" where you try to create non-colliding hash summaries of one or more images/jpegs. When I did this in the past I'd use the discrete cosine weights of a small size-standardized black & white-ized image. There what is best could literally depend upon the person looking at images. :-)
Sample code of the above idea. Not verified =]
import heapqueue
import strutils
import strformat
import bitops
import algorithm
proc findMinCollision*(nVec: int, xs: seq[float]): (float, seq[int]) =
var n = xs.len
# ps[i] = probability of collision of bit-vector i
var ps = newSeq[float](1 shl n)
# initialize ps
for i in 0 ..< ps.len:
ps[i] = -1
for i, x in xs:
ps[1 shl i] = x
ps[0] = 1
# DP
proc run(i: int): float =
if ps[i] >= 0: return ps[i]
let n = firstSetBit(i) - 1
let m = 1 shl n
let p1 = ps[m]
let p2 = run(i xor m)
ps[i] = p1*p2 + (1-p1)*(1-p2)
return ps[i]
# brute force
var heap = initHeapQueue[(float, int)]()
for i in 0 .. ps.high:
let p = run(i)
if heap.len < nVec:
heap.push (-p, i)
elif p < -heap[0][0]:
heap.push (-p, i)
discard heap.pop()
# result
var py = 1.0
var ans: seq[int]
while heap.len > 0:
let (p,n) = heap.pop()
py *= -p
ans.add n
# NOTE: ans do not necessarily be linearly independent
result = (py, ans)
# show collision prob. of each vector
# echo ps
let probOfEachBits = @[0.6, 0.5, 0.4, 0.3]
let (p, ans) = findMinCollision(7, probOfEachBits.reversed)
echo "collision probability = ", p
for n in ans:
echo fmt" {toBin(n, probOfEachBits.len)} ({n})"