Wondering if there's a fuzzy or trigram substring search library in Nim. It should be fast, so it should use some sort of index; an in-memory index would be fine.
let doc1 = "therearemultiplereasonstoaboutforexeverysingleofthoseforextrendisbigenoughto"
let doc2 = "someothertext"
let se = SearchEngine()
se.add "doc1", doc1
se.add "doc2", doc2
echo se.fuzzy_search("foextrend") # => @[("doc1", "forextrend")]
The brute force implementation for trigram search is simple: a) make 3 sliding windows of the query's length; b) move the sliding windows through the text; c) for every substring in a sliding window, calculate its trigram vector, compare it to the query's trigram vector and get a similarity score; d) sort the substrings by similarity score and take the first N items.
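For reference, a minimal sketch of that brute-force approach (my own code, not an existing library; it assumes cosine similarity over CountTable trigram counts and a single window size equal to the query's length, rather than the three window sizes mentioned above) could look like this:

import std/[tables, math, algorithm]

proc trigramVector(s: string): CountTable[string] =
  ## Count the trigrams of `s`.
  result = initCountTable[string]()
  for i in 0 .. s.len - 3:
    result.inc s[i .. i+2]

proc cosineSimilarity(a, b: CountTable[string]): float =
  var dot, na, nb = 0.0
  for tri, ca in a:
    na += float(ca * ca)
    dot += float(ca * b[tri])   # CountTable returns 0 for missing keys
  for _, cb in b:
    nb += float(cb * cb)
  if na == 0 or nb == 0: return 0.0
  dot / (sqrt(na) * sqrt(nb))

proc bruteForceSearch(text, query: string, n = 3): seq[(string, float)] =
  ## Slide a window of the query's length over `text` and rank the windows
  ## by trigram cosine similarity to the query.
  let qVec = trigramVector(query)
  var scored: seq[(string, float)]
  for start in 0 .. text.len - query.len:
    let window = text[start ..< start + query.len]
    scored.add((window, cosineSimilarity(trigramVector(window), qVec)))
  scored.sort(proc (x, y: (string, float)): int = cmp(y[1], x[1]))
  scored[0 ..< min(n, scored.len)]

echo bruteForceSearch("therearemultiplereasonstoaboutforexeverysingleofthoseforextrendisbigenoughto",
                      "foextrend", 1)   # best-scoring window from the example text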
P.S. By the way, ChatGPT was able to build that algo. I told it to build trigram search: a) it did, but didn't use a sliding window; b) I told it to use a sliding window and it successfully updated the code; c) but it didn't use a proper similarity vector, so I told it to use cosine similarity, and again it successfully updated the code. Really impressive... sadly, it wasn't able to build search with an index.
Does this help?
https://en.wikipedia.org/wiki/Approximate_string_matching
n-grams? Adjust the window size according to your content.
Do you need something faster for your fuzzy search?
One alternative is the bitap algorithm: you give it two strings and the number of errors you accept. It creates a bit pattern which it then uses to scan through the text pretty fast. Here is a sample implementation adapted from here:
import sequtils

proc searchString(text: string, pattern: string, k: int): int =
  ## Bitap search allowing up to `k` substitution errors.
  ## Returns the start index of the first match, or -1 if none is found.
  result = -1
  if pattern.len == 0: return 0
  if pattern.len > 63: return -1 # The pattern must fit in a 64-bit mask

  var
    R = newSeqWith(k + 1, not 1.uint64)   # one bit state per allowed error count
    patternMask: array[char.high.int + 1, uint64]

  # Build a bitmask per character: bit i is cleared if pattern[i] is that character
  for i in 0..char.high.int:
    patternMask[i] = not 0.uint64
  for i in 0..<pattern.len:
    patternMask[pattern[i].int] = patternMask[pattern[i].int] and not (1.uint64 shl i)

  for i in 0..text.high:
    # Update the bit state for each allowed error count
    var oldRd1 = R[0]
    R[0] = R[0] or patternMask[text[i].int]
    R[0] = R[0] shl 1
    for d in 1..k:
      let tmp = R[d]
      R[d] = (oldRd1 and (R[d] or patternMask[text[i].int])) shl 1
      oldRd1 = tmp
    if 0 == (R[k] and (1.uint64 shl pattern.len)):
      result = (i - pattern.len) + 1
      break

echo searchString("The quick brown foax jumps over the lazy dog", "fox", 1)
echo searchString("world hello", "hllo", 1)
Pros: It uses bitwise operations, so it should be pretty fast.
Cons: It doesn't index anything about the document, so it still needs to scan through the text each time. It also requires you to set the "k" factor, i.e. how many errors you accept. This means that if you want to start strict and ease up the k factor as you go along, you might need to run many iterations. This can be somewhat alleviated by doing multiple scans at the same time or applying similar tricks to the algorithm.
You can also have a look at other fuzzy finder tools around, for example fzf has pretty readable sources: https://github.com/junegunn/fzf/blob/master/src/algo/algo.go. I implemented a simplified version of their V1 algorithm:
proc fuzzyMatchV1(text: string, pattern: string): tuple[start: int, stop: int] =
  var
    pos = 0
    found = 0
    first = -1
  # Forward pass: find each pattern character in order (subsequence match)
  while pos < text.len:
    if text[pos] == pattern[found]:
      inc found
      if first == -1: first = pos
      if found == pattern.len: break
    inc pos
  if found != pattern.len: return (start: -1, stop: -1)
  result = (start: first, stop: pos)
  if pos - first == pattern.high: return   # already as tight as possible
  let last = pos
  dec found
  # Backward pass: walk back from the end of the match to find a tighter span
  while pos > first:
    if text[pos] == pattern[found]:
      if found == 0:
        return (start: pos, stop: last)
      dec found
    dec pos
It still passes through the entire text, and since it also goes backwards it might end up doing so twice. But as with the previous algorithm it gives up as soon as it finds a match, so it can be pretty fast.
These two algorithms work quite differently from one another. The first one I shared essentially allows k errors, but the length is always the same. So matching the input string "happy birthday" with "hack" and an error count of two matches the first characters, because "haXX" is two errors away from "happ". The second one requires all the characters to be present, but allows any number of insertions between the characters. This means that something like "a_____b__bc__" matched against "abc" would succeed and match from the first "a" to the last "c". With the first algorithm you would need to match "abc" with 1 error and it would hit "Xbc", ignoring the missing "a".
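To make the contrast concrete, assuming the two procs above as listed, they behave like this on that kind of input:

# "abc" as a spread-out subsequence vs. as a contiguous approximate match
echo searchString("a_____b__bc__", "abc", 1)   # bitap: matches the contiguous "_bc" (one substitution)
echo fuzzyMatchV1("a_____b__bc__", "abc")      # subsequence match: spans from the first "a" to the last "c"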
I haven't tested the speed of these against each other, but both should be fairly fast. On your sample my first algorithm finds it with k set to two, and fuzzyMatchV1 finds it, as expected, only in the first string.
Just did some tiny benchmarking: generated a 10MB file with cat <(tr -dc A-Za-gi-z' ' </dev/urandom | head -c 10000000) <(echo "hello") and then searched for the word "hello". For the first algorithm I used a k value of 0 and 1, which both yielded about the same time.
The first algorithm takes about 30ms, and the second takes about 3ms. This is of course with searching all the way to the end for both searches.
Then I used this command cat <(echo -n "h") <(tr -dc A-Za-df-gi-km-np-z' ' </dev/urandom | head -c 10000000) <(echo "ello") to generate a similar file, but this time starting with an "h" and ending in "ello" with none of the characters in "hello" in the generated text. This means that the second algorithm scans all the way to the end, then all the way back to the start in pursuit of a shorter match. I set k to be 1 so that it would match the "Xello" at the end of the file.
The first algorithm now takes about 30ms, same as before. And the second algorithm takes 14ms. The first algorithm now matches the "Xello" at the end of the file as mentioned, but the second algorithm just matches the entire file.
The second algorithm could of course be made to return the positions of the individually matched characters instead of a start and stop which would make it more useful in this last case. This could also be used to create some kind of weighting based on how close together the characters are.
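As a rough illustration of that idea (my own sketch, not something taken from fzf), the backward pass could collect one position per pattern character instead of just a start and stop:

proc fuzzyMatchPositions(text: string, pattern: string): seq[int] =
  ## Variant of fuzzyMatchV1 that returns the position of each matched
  ## pattern character, or an empty seq if there is no match.
  if pattern.len == 0: return @[]
  var
    pos = 0
    found = 0
  # Forward pass: same as before, find where the subsequence match ends
  while pos < text.len:
    if text[pos] == pattern[found]:
      inc found
      if found == pattern.len: break
    inc pos
  if found != pattern.len: return @[]
  # Backward pass: walk back from the end, recording one position per character
  result = newSeq[int](pattern.len)
  dec found
  while found >= 0:
    if text[pos] == pattern[found]:
      result[found] = pos
      dec found
    dec pos

echo fuzzyMatchPositions("a_____b__bc__", "abc")   # positions of "a", "b" and "c"

Having the individual positions would also make it easy to weight matches by how close together the characters are.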
One alternative is the bitap algorithm
You can also have a look at other fuzzy finder tools around, for example fzf
Thanks for the links. I checked agrep with the tre algo, but didn't know about these two ^ you mentioned.
The first algorithm takes about 30ms, and the second takes about 3ms.
Yes, I also had the thought that maybe I'm worrying about speed too early and should just use brute force for a start.
bloom filter over the N-grams of the document ... you would need at least N consecutive correct characters
Interesting idea. Maybe it would be possible to put docs into buckets of N, create a bloom filter for the whole bucket, and then check for potential ngram presence in all N docs at once.
Also, maybe it would be possible to use encodings for ngrams, and store/compare them not as strings but as ints.
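For example (just a sketch of the idea; the exact packing scheme is my assumption), a trigram of 8-bit characters fits into a single small integer:

proc encodeTrigram(a, b, c: char): int32 =
  ## Pack three bytes into one int, so trigrams can be stored and compared as ints.
  (a.int32 shl 16) or (b.int32 shl 8) or c.int32

proc decodeTrigram(t: int32): string =
  result = newString(3)
  result[0] = char((t shr 16) and 0xFF)
  result[1] = char((t shr 8) and 0xFF)
  result[2] = char(t and 0xFF)

let t1 = encodeTrigram('f', 'o', 'r')
assert decodeTrigram(t1) == "for"
assert t1 != encodeTrigram('o', 'r', 'e')   # distinct trigrams get distinct ints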
Going to explore it later, for now going with brute force :)
Interesting idea. Maybe it would be possible to put docs into buckets of N, create a bloom filter for the whole bucket, and then check for potential ngram presence in all N docs at once.
Bloom filters are easily combinable, so it would be fairly trivial to make a tree structure of bloom filters, kinda similar in concept to how you would use a Merkle tree. Check the root node filter to see if any of your files can contain your trigrams, then check each half set (or even just do it based on the folder structure) to see if their filters contain your trigrams, and so on until you have narrowed it down to a set of files to actually search through. Combining the bloom filters is, as I said, pretty trivial, so once you've calculated one for each file it would probably be computationally worth it to build the tree.
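Here is a rough sketch of that tree idea. It uses a tiny hand-rolled bloom filter (a plain bit vector with salted hashes), not the flower library's API, just to show that combining two filters built with the same parameters is a bitwise OR of their bit arrays:

import std/hashes

type
  Bloom = object
    bits: seq[uint64]   # the filter's m bits, packed into 64-bit words
    k: int              # number of hash probes per item

proc initBloom(m, k: int): Bloom =
  Bloom(bits: newSeq[uint64]((m + 63) div 64), k: k)

proc bitFor(bf: Bloom, item: string, i: int): (int, uint64) =
  # Salting the item with the probe index approximates k independent hashes.
  let h = uint64(hash(item & $i) and high(int))
  let bit = h mod (bf.bits.len.uint64 * 64)
  (int(bit div 64), 1'u64 shl int(bit mod 64))

proc add(bf: var Bloom, item: string) =
  for i in 0 ..< bf.k:
    let (word, mask) = bf.bitFor(item, i)
    bf.bits[word] = bf.bits[word] or mask

proc contains(bf: Bloom, item: string): bool =
  for i in 0 ..< bf.k:
    let (word, mask) = bf.bitFor(item, i)
    if (bf.bits[word] and mask) == 0: return false
  true

proc combine(a, b: Bloom): Bloom =
  # A parent node in the tree: the union of its two children's filters.
  assert a.bits.len == b.bits.len and a.k == b.k
  result = initBloom(a.bits.len * 64, a.k)
  for i in 0 ..< a.bits.len:
    result.bits[i] = a.bits[i] or b.bits[i]

# Leaf filters for two "files", plus their parent node:
var fileA = initBloom(1 shl 20, 4)
var fileB = initBloom(1 shl 20, 4)
fileA.add "the"
fileB.add "fox"
let parent = combine(fileA, fileB)
assert "the" in parent and "fox" in parent
assert "the" in fileA and "the" notin fileB   # then narrow down to the right leaf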
Just a little update, did some initial testing with bloom filters:
import flower, std/[monotimes, sets, strutils]

var
  file = readFile("/tmp/test2.txt")
  cache: HashSet[array[3, char]]      # avoid re-adding trigrams we've already seen
  bf = newBloom(255*255*255, 0.001)

# Build the trigram bloom filter for the whole file
let start = getMonoTime()
for i in 0..file.high - 2:
  let trigram = [file[i], file[i+1], file[i+2]]
  if trigram notin cache:
    bf.add trigram
    cache.incl trigram
echo "Indexing time took: ", getMonoTime() - start

var pattern = "gtk_source_space_location_flags_get_type"

# Check every trigram of the pattern against the filter
block check:
  let start = getMonoTime()
  for i in 0..pattern.high - 2:
    if pattern[i..i+2] notin bf:
      echo "Document can't contain pattern"
      break check
  echo "Check took: ", getMonoTime() - start
  echo "Document can contain pattern"

# Compare against a naive substring search over the whole file
let start2 = getMonoTime()
echo file.find(pattern)
echo "Naive check took: ", getMonoTime() - start2
The /tmp/test2.txt file contains my 8.9MB Futhark'd Gtk wrapper. The pattern occurs near the bottom of the file.
Building the bloom filter trigram index takes 94ms in release mode, searching the index for each trigram in the pattern takes 3105ns (or 0.003ms), while just running a text search with find takes 0.8ms. So if you have a lot of files and can eat the penalty of indexing them, it might be worth it. Of course this hides a very important detail: reading the file into memory takes 3ms on this fast SSD, so a read+search of the file takes ~4ms. The bloom filter can be stored (and, as we talked about, structured as a tree), so spending those 0.003ms to figure out if you even need to touch the file can make a big difference.
Just a side note.
I did what I always do when I have an idea to reinvent a bicycle: stop, have a good rest, lie down on the couch, take a walk in the forest, and reject the idea to build my own casino with blackjack and hookers. I already have my own web framework for toy plays and that's enough :).
Yet, I think an in-memory Search Engine could be a very good showcase for Nim, as it has a bunch of interesting tasks: a) IO for high requests/sec, ideally CSP instead of async; b) threads/multicore to use all CPU cores to scan in-memory search indexes in parallel; c) multicore, as you share the search index and process it in parallel; d) memory and algorithms, since the index structure should be both efficient in memory and colocated (value types) for the fastest CPU (maybe even GPU) scan; e) the code size should be relatively small if you limit it to in-memory only, so it would still be accessible for people to examine and learn from.
It could be a blueprint example of what Nim can do and how to use it. And it should be good enough for real usage as a small and lightweight private search engine. It could even be compiled to JS and used for in-browser search.
It could be implemented in 2 ways:
The tokens could be classical tokenisers, trigrams etc.
There is no built-in fuzzy or trigram substring search library in Nim. However, it is possible to implement one using the following steps:
1. Create a data structure to store the documents. This could be a simple array, a hash table, or a more complex data structure such as a trie.
2. For each document, create a trigram vector: a vector of integers that represents the frequency of each trigram in the document.
3. Create an index for the trigram vectors. The index can be a simple hash table or a more complex data structure such as a kd-tree.
4. To search for a substring, first calculate the trigram vector for the substring. Then use the index to find all documents that contain those trigrams. Finally, sort the documents by their similarity to the substring and return the top N documents.
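A minimal sketch along those lines (my own code, using a hash-table inverted index from trigram to document ids instead of a kd-tree, and a simple shared-trigram count instead of a full similarity metric) could look like this:

import std/[tables, sets, algorithm]

type
  SearchIndex = object
    docs: Table[string, string]              # id -> full document text
    index: Table[string, HashSet[string]]    # trigram -> ids of documents containing it

proc initSearchIndex(): SearchIndex =
  SearchIndex(docs: initTable[string, string](),
              index: initTable[string, HashSet[string]]())

proc trigrams(s: string): HashSet[string] =
  result = initHashSet[string]()
  for i in 0 .. s.len - 3:
    result.incl s[i .. i+2]

proc add(se: var SearchIndex, id, text: string) =
  se.docs[id] = text
  for tri in trigrams(text):
    se.index.mgetOrPut(tri, initHashSet[string]()).incl id

proc search(se: SearchIndex, query: string, n = 3): seq[(string, int)] =
  # Count how many of the query's trigrams each document contains,
  # then return the n best-scoring document ids.
  var scores = initCountTable[string]()
  for tri in trigrams(query):
    if tri in se.index:
      for id in se.index[tri]:
        scores.inc id
  var ranked: seq[(string, int)]
  for id, score in scores:
    ranked.add((id, score))
  ranked.sort(proc (x, y: (string, int)): int = cmp(y[1], x[1]))
  ranked[0 ..< min(n, ranked.len)]

var se = initSearchIndex()
se.add "doc1", "therearemultiplereasonstoaboutforexeverysingleofthoseforextrendisbigenoughto"
se.add "doc2", "someothertext"
echo se.search("foextrend")   # doc1 should come out on top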