import std/[tables, hashes, strutils]
type SSlice = object
s: ptr string
i, l: int
proc sslice*(s: ptr string, i: int, l: int): SSlice =
SSlice(s: s, i: i, l: l)
proc `$`*(s: SSlice): string =
for j in 0..(s.l - 1): result.add s.s[s.i + j]
proc `==`(a, b: SSlice): bool =
if a.l != b.l: return false
for j in 0.int..(a.l - 1):
if a.s[a.i + j] != b.s[b.i + j]: return false
true
proc hash*(s: SSlice): Hash =
result = 0
for j in 0..(s.l - 1):
result = result !& s.s[s.i + j].hash
result = !$result
var codes: Table[SSlice, uint16]
template encode_trigram*(s: SSlice): uint16 =
var r: uint16
with_value(codes, s, value):
r = value[]
do:
codes[s] = codes.len.uint16
r = codes[s]
r
proc to_trigrams*(text: string): seq[uint16] =
if text.len < 3: return text.align_left(3).to_trigrams
for i in 0..(text.len - 3):
result.add sslice(unsafe_addr text, i, 3).encode_trigram
let last_two = text[^2..^1] & " "
result.add sslice(unsafe_addr last_two, 0, 3).encode_trigram
let last_one = text[^1..^1] & " "
result.add sslice(unsafe_addr last_one, 0, 3).encode_trigram
echo "some text".to_trigrams
echo "some".to_trigrams
P.S.
type SSlice = object
s: string
i, l: int
template sslice*(s: string, i: int, l: int): SSlice =
SSlice(s: s, i: i, l: l)
echo sslice("some", 0, 3)
.nim(10, 13) Error: identifier expected, but found '"some"'
There is no reason why it should work as your ptr values can refer to memory that has long been freed:
proc to_trigrams*(text: string): seq[uint16] =
if text.len < 3: return text.align_left(3).to_trigrams
# broken, you store pointers into text.align_left(3) which is a temporary string that isn't kept alive
The easiest way which should be fast enough seems to be:
type
Trigram = object
a: array[3, char] # fill with space if length of substring < 3
proc `==`(x, y: Trigram): bool {.inline.} = x.a == y.a
proc hash(x: Trigram): Hash {.inline.} = ...
var tab: CountTable[Trigram]
...
I'm doing simple full text fuzzy search for note taking app, couple of thousands docs, each a page or couple of text. The simple optimised version should be good enough.
Thanks, for fasttext, will keep that in mind if need to make it faster.
Nim won't allow to use addr and suggest to use unsafe_addr instead, why?
unsafeAddr is on its way out for 2.0 iirc but, moving beyond reorienting around arrays: parameters are constant in Nim (unless you use something like var) through which whether a copy or an implicit ref is passed becomes an implementation detail (barring certain pragma specifiers like {.byref/bycopy.}). Unsafe addressing enables addressing of runtime constant data.
or ap pended if you ever want to just cast[cstring](trigm[0].addr) :) The speed-up is mostly about the impl of hash, I expect. So, if that 25% storage vs array[3] is not paid for by better memory alignment you could also re-do hash(Trigram) to load a uint32 and then hash() that..
There is an old article from Russ Cox, a soft-spoken in person but smart fellow, about how to trigram index a larger set of docs with a repo in Go and there is a nice much faster replacement for locate on Unix called plocate in C++. More material Russ mentions from Managing Mega ne GigaBytes can also be helpful. I also started an experiment in pedagogy along these lines over at nimsearch if you are interested in persisting, but did not get very far. There is also Cello, and I wrote a pattern matching trie to support a fancy auto-abbreviation system used mostly by lc and probably many other "close but not quite" Nim things.
Thanks for the advices, the nimsearch library looks interesting.
I finished the simple search version, documents are partially pre-indexed, and there are some simple optimisations. Hopefully it should be enough for quick search couple of hundreds of short documents.
let db = Db(docs: [
"this is smme text message",
"this is some text message",
"another message"
].mapit(Doc.init(it)))
let score_fn: ScoreFn[Doc] = build_score[Doc]("some te")
var found: seq[(Match, Doc)]
for doc in db.docs: score_fn(doc, found)
p found
.sortit(-it[0].score) # Sorting by score
.mapit(match(it[0], it[1].text)) # => @["some te", "smme te"]
Going to plug it into note taking app, and see how it goes, and then take look at resources and improve it.