After discussing fuzzy string finding in https://forum.nim-lang.org/t/10221 we came upon the topic of Bloom filters. To give them a whirl I checked out the two Bloom filter implementations on Nimble, nimrod-bloom and flower (ignoring eth-bloom since it's deprecated and now part of a much bigger project). For nimrod-bloom I went through and updated it to more modern Nim, but nothing that should've changed the speed considerably. Flower was used as is.
The settings I used were 1 million entries and, initially, a target error rate of 0.001.
For 1M insertions: nimrod-bloom: 89ms flower: 39ms
For 1M lookups: nimrod-bloom: 91ms flower: 45ms
However, the nimrod-bloom version got a false positive rate of 0.0007 while flower got 0.0012, slightly higher than the requested 0.001. By changing flower's error rate argument to 0.00048 I managed to get an actual rate of 0.0007 (718 false positives for nimrod-bloom and 712 for flower). The timings for flower then change to:
For 1M insertions: flower: 40ms
For 1M lookups: flower: 52ms
So only slightly slower, and still about twice as fast as nimrod-bloom.
The code to test each was this for bloom, and this for flower if you want to run your own tests.
As an aside I can also mention that since flower is pure Nim it is likely to work on more platforms, and since it uses the built-in hash type you can put anything which implements hash into it, and not only strings which is the case for nimrod-bloom.
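To make that last point concrete (my own illustration, not flower's documented API): a custom type becomes usable as a key anywhere the built-in `Hash` is accepted by overloading `hash` with the standard std/hashes idiom. The `Point` type here is purely hypothetical; std `HashSet` stands in for flower's container.

```nim
import std/[hashes, sets]

type Point = object
  x, y: int

# Standard std/hashes idiom: mix each field with `!&`, finalize with `!$`.
proc hash(p: Point): Hash =
  var h: Hash = 0
  h = h !& hash(p.x)
  h = h !& hash(p.y)
  result = !$h

# Any container keyed on the built-in hash now accepts Point:
var seen = initHashSet[Point]()
seen.incl Point(x: 1, y: 2)
assert Point(x: 1, y: 2) in seen
```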
While this misunderstanding happens a lot, Bloom filters have always been about space optimization with little consideration of caching/blocking, even in the original 5-page paper by Bloom (as footnote 2). My bl.tab test elaborates a bit more.
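For concreteness (my aside, not from the paper): the textbook sizing formulas m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions quantify what that space optimization buys. A quick sketch:

```nim
import std/math

# Textbook optimal Bloom filter parameters for n keys at
# target false-positive rate p: m bits and k hash functions.
proc bloomParams(n: int, p: float): tuple[mBits, kHashes: int] =
  let m = ceil(-n.float * ln(p) / (ln(2.0) * ln(2.0)))
  (m.int, max(1, round(m / n.float * ln(2.0)).int))

let (m, k) = bloomParams(1_000_000, 0.001)
# For the thread's settings this works out to roughly 14.4 Mbit
# (~1.7 MiB) and 10 hash functions, i.e. ~14.4 bits/key versus
# 32 bits/key for a set of uint32 truncated hashes.
echo m, " bits, ", k, " hashes"
```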
Just piggybacking on the @Pmunch flower code example, though, because some prefer the ultra-concrete - if I just tack to the end of your flower test program this snippet:
import std/hashes, std/sets  # getMonoTime, rand, kTestElements etc. come from the host test program
let t0 = getMonoTime()
var hs = initHashSet[uint32](1_000_000)
for k in kTestElements: hs.incl cast[uint32](hash(k) or 1)
let dt = getMonoTime() - t0
var missings: seq[string]
for i in 0 ..< nElementsToTest:
  var missing = ""
  for j in 0..8: missing.add(sampleChars[rand(51)])
  missings.add missing
var nFP = 0
let t02 = getMonoTime()
for m in missings:
  if cast[uint32](hash(m) or 1) in hs: inc nFP
let dt2 = getMonoTime() - t02
echo dt, " sec to populate; ", nFP, "/1e6 false positives\n", dt2, " sec"
and then run it, I get this output:
Took (seconds: 0, nanosecond: 70817380) to insert 1000000 items.
N false positives (of 1000000 lookups): 488
False positive rate 0.0005
Took (seconds: 0, nanosecond: 77113991) seconds to lookup 1000000 items.
N lookup errors (should be 0): 0
(seconds: 0, nanosecond: 53354267) sec to populate; 0/1e6 false positives
(seconds: 0, nanosecond: 43402444) sec
I.e., regular old std/sets builds about 1.33X faster and queries about 1.78X faster than flower, with zero false positives... and this is not even surprising. If you want to compact the space even more you can use adix/sequint, as adix/bltab does, to use just the number of bits needed for your desired false positive target. (Also, adix/lptabz, in particular initLPSetz[uint32,uint32,0], builds 1.8X and queries 2.9X faster than flower.)
In the comparison above, Bloom filters, while bad, are not SO awful because the whole thing fits in my L3 CPU cache. Were the scale much larger, so that those hashes actually had to hit DIMMs (i.e., if the N independent hashes blew out the speculative work-ahead/fetch-ahead budget, and/or it was not a query-after-query hot loop where successive DIMM latencies can be successfully hidden/pipelined), then Bloom filters might be 10X slower for a small (2-4X) space savings.
> storing all trigrams of a file in a normal hash set
The comparison is not between an exact set & Bloom, but between Bloom & another approximate idea: hash sets of B-bit numbers (sometimes called fingerprints, truncated hashes, or just B bits of hash), just as in Bloom's 1970 paper.
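A minimal sketch of that second idea (names are mine, not from any package): keep only B bits of each key's hash in an ordinary hash set. False negatives are impossible; an absent key collides with probability roughly n/2^B for n stored fingerprints.

```nim
import std/[hashes, sets]

# B-bit fingerprint set: store truncated hashes, not keys.
# Queries may be false positives at rate ~ n / 2^B (here B = 16).
proc fp(s: string): uint16 = uint16(cast[uint64](hash(s)) and 0xFFFF)

var fps = initHashSet[uint16]()
for w in ["apple", "banana", "cherry"]:
  fps.incl fp(w)

assert fp("banana") in fps  # members always hit: no false negatives
# An absent key such as "durian" hits only on a ~3/65536 chance collision.
```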
It is true that my auto-growing std/sets.HashSet above does not make for the best comparison (among other issues with your test - I did that out of expediency / expected familiarity). Apologies if that confused, but I did already link to a better way to measure, with a more complete analysis. I'm repeating myself, though.
The two ideas have distinct discreteness (bits of hash vs. number of hashes, discrete vs. gradual saturation) & Bloom can be smaller. An intermediate idea (using ~1.5 rather than 1 memory accesses) is a Cuckoo filter. I have found Cuckoo tables to be fragile in practice, needing high quality hashes or else looping infinitely and/or using up all memory. So, I advise Robin-Hood Linear Probing, as in adix/[bltab, lptabz].
A while back I implemented this; maybe you can test it and see if it fits your needs: