nimforum mirror - Bloom filter comparisson

PMunch (orginal) [2023-05-26T13:39:05+02:00] view original

After discussing fuzzy string finding in https://forum.nim-lang.org/t/10221 we came upon the topic of Bloom filters. In order to give them a whirl I checked out the two bloom filter implementations in Nimble, nimrod-bloom and flower (ignoring eth-bloom since it's deprecated and now part of a much bigger project). For nimrod-bloom I went through and updated it to more modern Nim, but nothing which should've changed the speed considerably. Flower was used as is.

The settings I used was 1 million entries and an error factor of 0.001 initially

For 1M insertions: nimrod-bloom: 89ms flower: 39ms

For 1M lookups: nirod-bloom: 91ms flower: 45ms

However the nimrod-bloom version got a false positive rate of 0.0007 while flower only got 0.0012, slightly higher than the requested 0.001. By changing flowers error rate argument to 0.00048 I managed to get an actual rate of 0.0007 (with 718 false positives for nimrod-bloom and 712 for flower). The timings for flower now changes to:

For 1M insertions: flower: 40ms

For 1M lookups: flower: 52ms

So only slightly slower, and still about twice as fast as nimrod-bloom.

The code to test each was this for bloom, and this for flower if you want to run your own tests.

As an aside I can also mention that since flower is pure Nim it is likely to work on more platforms, and since it uses the built-in hash type you can put anything which implements hash into it, and not only strings which is the case for nimrod-bloom.

cblake (orginal) [2023-05-26T16:09:17+02:00] view original

While this misunderstanding happens a lot, Bloom Filters have always been about space optimization with little consideration of caching/blocking, even in the aboriginal 5 page paper by Bloom (as footnote 2). My bl.tab test elaborates a bit more.

Just piggybacking on the @Pmunch flower code example, though, because some prefer the ultra-concrete - if I just tack to the end of your flower test program this snippet:

import std/hashes, std/sets
let t0 = getMonoTime()
var hs = initHashSet[uint32](1_000_000)
for k in kTestElements: hs.incl cast[uint32](hash(k) or 1)
let dt = getMonoTime() - t0

var missings: seq[string]
for i in 0..(nElementsToTest - 1):
  var missing = ""
  for j in 0..8: missing.add(sampleChars[rand(51)])
  missings.add missing

var nFP = 0
let t02 = getMonoTime()
for m in missings:
  if cast[uint32](hash(m) or 1) in hs: inc falsePositives
let dt2 = getMonoTime() - t02

echo dt, " sec to populate; ", nFP, "/1e6 false positives\n", dt2, " sec"

and then run it, I get this output:


Took (seconds: 0, nanosecond: 70817380) to insert 1000000 items.
N false positives (of 1000000 lookups): 488
False positive rate 0.0005
Took (seconds: 0, nanosecond: 77113991) seconds to lookup 1000000 items.
N lookup errors (should be 0): 0
(seconds: 0, nanosecond: 53354267) sec to populate; 0/1e6 false positives
(seconds: 0, nanosecond: 43402444) sec

I.e., regular old std/sets builds about 1.33X faster and queries about 1.78X faster than flower with zero false positives.. and this is not even surprising. If you want to compact the space even more you can use adix/sequint as adix/bltab does to use just the number of bits for your desired false positive target. (Also, adix/lptabz, in particular initLPSetz[uint32,uint32,0] builds 1.8X queries 2.9X faster than flower).

In the comparison above, Bloom Filters, while bad, are not SO awful because the whole thing fits in my L3 CPU cache. Were the scale much larger and if those hashes had had to actually hit DIMMs (i.e., if the N independent hashes blew out the speculative work ahead/fetch ahead budget &| it was not in query-after-query hot loop where successive DIMM latencies can be successfully hidden/pipelined), then Bloom Filters might be 10X slower for a small space savings (2-4X).

PMunch (orginal) [2023-05-27T14:43:19+02:00] view original

Ah yes, I should've been more clear here, the reason I'm comparing their speeds here is just to find the fastest bloom filter if you've already decided that you do in fact need a bloom filter. The reason for choosing a bloom filter for the task of optimising string search is simply that storing all trigrams of a file in a normal hash set would mean the hash set would be considerably larger than the file you're indexing. And if you're indexing an entire harddrive you definitely dont want to store more data it your index than on the drive itself.

PMunch (orginal) [2023-05-27T16:17:15+02:00] view original

That said adix looks super interesting, I'll definitely peruse the algorithms and data structures there

cblake (orginal) [2023-05-27T21:08:51+02:00] view original

storing all trigrams of a file in a normal hash set

The comparison is not between an exact set & Bloom, but between Bloom & another approx idea: hash sets of B-bit numbers (sometimes called fingerprints or truncated hashes or just B-bits of hash), just as what Bloom 1970 does.

It is true that my auto-growing std/sets.HashSet above does not make for the best cmp (among other issues with your test - I did that out of expediency / expected familiarity). Apologies if that confused, but I did already link to a better way to measure) with a more complete analysis. I'm repeating myself, though.

The two ideas have distinct discreteness (bits of hash v. num of hashes, discrete v.gradual saturation) & Bloom can be smaller. An intermediate idea (using 1.5ish not 1 mem accesses) is a Cuckoo filter. I have found Cuckoo tables to be fragile in practice.. needing high quality hashes or else infinite looping &| using up all mem. So, I advise Robin-Hood Linear Probing, as in adix/[bltab, lptabz].

alexeypetrushin (orginal) [2023-05-27T21:22:44+02:00] view original

For interesting ideas about probabilistic data strucutres, there's OnlineStats Julia package https://github.com/joshday/OnlineStats.jl

PMunch (orginal) [2023-05-30T12:51:24+02:00] view original

Tried another approach, similar to what you're proposing here. Essentially I used a Rabin-Karp rolling hash to calculate the hashes of each of the trigraphs. The implementation let me easily select how many bits the hash should be, so I matched it to a 4Kb block and simply set a single bit for presence. It worked fine, and was 4x faster than the bloom filter in my test. I didn't check the actual hit-miss rate though, but increasing the size of the set didn't seem to impact the speed much.

marcazar (orginal) [2023-06-28T09:11:39+02:00] view original

A while back I implemented this, maybe you can test and see if it fits your needs:

https://marcazar.github.io/BitVector/

Mirror of forum.nim-lang.org

10232 :: Bloom filter comparisson