nimforum mirror - Why `unsafe_addr` had to be used in StringSlice?

alexeypetrushin (orginal) [2023-06-15T14:16:02+02:00] view original

Hi, I'm building trigrams for text. And not sure how to optimise it.

Nim won't allow to use addr and suggest to use unsafe_addr instead, why?

The code fails with unsafe_addr

It seems that the whole idea to store StringSlice in HashTable is wrong. The trigram string should be stored in HashTable, and Slice should be used only to get the code from the table, and not be stored there. But Table[string, int] won't allow to be queried by table[key_of_string_slice_type].

import std/[tables, hashes, strutils]

type SSlice = object
  s:    ptr string
  i, l: int

proc sslice*(s: ptr string, i: int, l: int): SSlice =
  SSlice(s: s, i: i, l: l)

proc `$`*(s: SSlice): string =
  for j in 0..(s.l - 1): result.add s.s[s.i + j]

proc `==`(a, b: SSlice): bool =
  if a.l != b.l: return false
  for j in 0.int..(a.l - 1):
    if a.s[a.i + j] != b.s[b.i + j]: return false
  true

proc hash*(s: SSlice): Hash =
  result = 0
  for j in 0..(s.l - 1):
    result = result !& s.s[s.i + j].hash
  result = !$result

var codes: Table[SSlice, uint16]

template encode_trigram*(s: SSlice): uint16 =
  var r: uint16
  with_value(codes, s, value):
    r = value[]
  do:
    codes[s] = codes.len.uint16
    r = codes[s]
  r

proc to_trigrams*(text: string): seq[uint16] =
  if text.len < 3: return text.align_left(3).to_trigrams
  for i in 0..(text.len - 3):
    result.add sslice(unsafe_addr text, i, 3).encode_trigram
  let last_two = text[^2..^1] & " "
  result.add sslice(unsafe_addr last_two,  0, 3).encode_trigram
  let last_one = text[^1..^1] & "  "
  result.add sslice(unsafe_addr last_one, 0, 3).encode_trigram

echo "some text".to_trigrams
echo "some".to_trigrams

P.S.

Also, not sure why this code fails?

type SSlice = object
  s:    string
  i, l: int

template sslice*(s: string, i: int, l: int): SSlice =
  SSlice(s: s, i: i, l: l)

echo sslice("some", 0, 3)

.nim(10, 13) Error: identifier expected, but found '"some"'

PMunch (orginal) [2023-06-15T14:28:49+02:00] view original

Haven't looked at the rest of your code, but your P.S. questions fails because you use the same identifier on both sides of the colon. The result of calling that template would be: SSlice("some": "some", 0: 0, 3: 3) which is obviously not valid code.

alexeypetrushin (orginal) [2023-06-15T14:31:30+02:00] view original

Thanks, didn't realised it, I updated the question and removed the P.S.

Araq (orginal) [2023-06-15T14:50:57+02:00] view original

There is no reason why it should work as your ptr values can refer to memory that has long been freed:

proc to_trigrams*(text: string): seq[uint16] =
  if text.len < 3: return text.align_left(3).to_trigrams
  # broken, you store pointers into text.align_left(3) which is a temporary string that isn't kept alive

Araq (orginal) [2023-06-15T15:08:01+02:00] view original

The easiest way which should be fast enough seems to be:

type
  Trigram = object
    a: array[3, char] # fill with space if length of substring < 3

proc `==`(x, y: Trigram): bool {.inline.} = x.a == y.a
proc hash(x: Trigram): Hash {.inline.} = ...

var tab: CountTable[Trigram]

...

alexeypetrushin (orginal) [2023-06-15T15:17:26+02:00] view original

Thanks, I got the idea.

mratsim (orginal) [2023-06-15T15:40:41+02:00] view original

I see tri-grams and hahstable, I don't know what you're doing however just in case look at Facebook's fasttext. They need to store 3 to 5-grams and it's pretty optimized and short: https://github.com/facebookresearch/fastText/tree/48171ac/src

alexeypetrushin (orginal) [2023-06-15T15:56:02+02:00] view original

I'm doing simple full text fuzzy search for note taking app, couple of thousands docs, each a page or couple of text. The simple optimised version should be good enough.

Thanks, for fasttext, will keep that in mind if need to make it faster.

awr1 (orginal) [2023-06-15T19:33:08+02:00] view original

Nim won't allow to use addr and suggest to use unsafe_addr instead, why?

unsafeAddr is on its way out for 2.0 iirc but, moving beyond reorienting around arrays: parameters are constant in Nim (unless you use something like var) through which whether a copy or an implicit ref is passed becomes an implementation detail (barring certain pragma specifiers like {.byref/bycopy.}). Unsafe addressing enables addressing of runtime constant data.

Zoom (orginal) [2023-06-15T20:23:04+02:00] view original

Store them as 4-byte arrays with a '0' prepended and cast as uint32 for hashing for a ~ 15% speed boost.

cblake (orginal) [2023-06-15T21:22:44+02:00] view original

or ap pended if you ever want to just cast[cstring](trigm[0].addr) :) The speed-up is mostly about the impl of hash, I expect. So, if that 25% storage vs array[3] is not paid for by better memory alignment you could also re-do hash(Trigram) to load a uint32 and then hash() that..

There is an old article from Russ Cox, a soft-spoken in person but smart fellow, about how to trigram index a larger set of docs with a repo in Go and there is a nice much faster replacement for locate on Unix called plocate in C++. More material Russ mentions from Managing Mega ne GigaBytes can also be helpful. I also started an experiment in pedagogy along these lines over at nimsearch if you are interested in persisting, but did not get very far. There is also Cello, and I wrote a pattern matching trie to support a fancy auto-abbreviation system used mostly by lc and probably many other "close but not quite" Nim things.

alexeypetrushin (orginal) [2023-06-16T01:25:39+02:00] view original

Thanks for the advices, the nimsearch library looks interesting.

I finished the simple search version, documents are partially pre-indexed, and there are some simple optimisations. Hopefully it should be enough for quick search couple of hundreds of short documents.

let db = Db(docs: [
    "this is smme text message",
    "this is some text message",
    "another message"
  ].mapit(Doc.init(it)))
  
  let score_fn: ScoreFn[Doc] = build_score[Doc]("some te")
  var found: seq[(Match, Doc)]
  for doc in db.docs: score_fn(doc, found)
  p found
    .sortit(-it[0].score) # Sorting by score
    .mapit(match(it[0], it[1].text)) # => @["some te", "smme te"]

Going to plug it into note taking app, and see how it goes, and then take look at resources and improve it.

Mirror of forum.nim-lang.org

10274 :: Why `unsafe_addr` had to be used in StringSlice?