Hi Nim community, is the following code safe or does it create a memory leak (i.e. will the values stored in the table be garbage collected or not)?
import tables

var strTable = initTable[string, ptr string]()

proc internString(s: string): ptr string =
  if s in strTable:
    return strTable[s]
  else:
    strTable[s] = unsafeAddr(s)
    return strTable[s]

var a = "hello"
var b = "hello"
a = internString(a)[]
b = internString(b)[]
echo "a = ", a
echo "b = ", b
echo (a == b)
echo (addr(a) == addr(b))
strTable[s] = unsafeAddr(s)

Such code can easily generate crashes. It may work when you pass internString() a string that is a global variable, but when you pass a proc-local variable, that variable may no longer exist once the proc that owned it has terminated, so the stored pointer dangles.
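A minimal sketch of that failure mode (my illustration, reusing the strTable/internString definitions from the question):

proc demo() =
  var local = "only lives in demo"  # proc-local string
  discard internString(local)       # table now stores an address tied to this call

demo()
# The entry for "only lives in demo" now points into a dead stack frame /
# freed string; dereferencing strTable["only lives in demo"] afterwards
# is undefined behavior.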
Why would you want to write such strange code in Nim?
In that case: in order to halve the size of the lexicon holding the set of strings to be « interned ». The aim of this table lookup is to make sure that if the words « the » or « a » appear a million times in a large corpus of text, they all point to the same value.
In that case one really should know what one is doing. And Nim is not Python.
"string pooling" is enabled for the gcc backend, and the question is how big the overhead by the hash table is? A pointer is already 4 or 8 byte in size. No idea from me, sorry.
"string pooling" does not appear to be the default in Nim:
echo (unsafeAddr(a) == unsafeAddr(b))
It is the default for the gcc backend. But string pooling works on the string constants stored in the executable. Your vars a and b are different entities, of course: you do not want var a to change when you modify var b. And a Nim string is a complex entity, containing length, capacity, and a pointer to the actual data. (It is more complex still: with string version 2 the size of an empty string is really small, since length and capacity are stored in the data buffer once data is allocated, but I always forget the details.) Maybe you are more interested in Nim's cstring data type, or maybe you want to access the actual data via addr(mystring[0])?
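A tiny sketch of those two accesses (my illustration; valid only while the string stays alive and is not reallocated):

var s = "hello"
echo cast[uint](addr s[0])  # address of the actual character data
let c = cstring(s)          # zero-copy, NUL-terminated view of the same buffer
echo c                      # hello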
You may explain in words what you really need or intend; then I assume that someone can tell you more.
In words: Say, one wants to do some NLP on a large corpus of text, but one knows that the number of distinct words occurring in said corpus is far less than the total number of words appearing in it.
In that case, it is far more efficient to store the corpus in memory as a list of integers representing the words, combined with a correspondence table (hash table) which ties the words to the integers representing the respective words.
Basically, it is an indirection of sorts, with the correspondence: word -> integer rather than the usual address -> value.
In my proposed implementation, above, I simply use the memory address of the first occurrence of every unique word [of the corpus] to create the word->integer correspondence table.
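To make that concrete, here is a minimal sketch (my code, not part of the proposal above; names hypothetical) of the pointer-free form of this design, where the integer is simply an index into a seq of first occurrences:

import tables

var ids = initTable[string, int32]()   # word -> integer id
var words: seq[string]                 # id -> first-seen spelling

proc wordId(w: string): int32 =
  if w in ids: return ids[w]
  result = int32(words.len)             # next free id
  ids[w] = result
  words.add w

var corpus: seq[int32]
for w in ["the", "cat", "sat", "the", "a", "the"]:
  corpus.add wordId(w)
echo corpus            # @[0, 1, 2, 0, 3, 0]
echo words[corpus[3]]  # the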
Interesting problem. Bear in mind that each Table entry contains a hash along with the key and value, so that more than doubles the size of your "halved" lexicon.

Irrespective of whether a table is the right data structure to solve this problem, I see no benefit to using pointers; a Table[string, string] would work just as well without the headaches.
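For concreteness, a hedged sketch of that pointer-free variant (my code): the table owns the one canonical copy, and getOrPut-style access hands it back.

import tables

var pool = initTable[string, string]()

proc intern(s: string): string =
  ## first caller's copy becomes the canonical one
  pool.mgetOrPut(s, s)

echo intern("hello") == intern("hello")   # true; one stored copy

Note the caller still receives value copies, so the space win is confined to the table itself; mapping words to small integer ids, as in the sketch above, avoids that.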
Here is a minimally optimized module to do interning: basically 30 lines of a specialized hash table, 30 of the main logic, and a little test block (more useful if you change the XXX 1024 to 2). Only minimal error checking is done, interning "" may not work, and several things (like lenBits) should probably be made into compile-time parameters, but it seems mostly OK.
Time/space measuring it on the above linked SOWPODS file as input suggests it is about 2.5x faster and uses < 1/4 the space according to cgmemtime. { As a witty fella named @Araq once observed, "optimization is specialization". ;-) }:
import hashes

type InTab*[Q, N] = object
  data: seq[N]   # could use adix/seqUint to mem optimize
  count: int

proc find[Q, N](t: InTab[Q, N], q: Q|N, h: Hash): int =
  let mask = t.data.len - 1
  var i = h and mask
  while t.data[i] != N(0):        # linear probe until empty slot
    if t.data[i] == q: return i
    i = (i + 1) and mask
  -i - 1                          # missing; return -(insert point) - 1

proc grow[Q, N](t: var InTab[Q, N]) =
  var alt = newSeq[N](2 * t.data.len)
  swap t.data, alt
  for n in alt:
    if n != N(0): t.data[-t.find(n, n.hash) - 1] = n

proc getOrPut*[Q, N](t: var InTab[Q, N], q: Q, n: N): N =
  if t.data.len == 0: t.data.setLen 1024   #XXX 2 to test
  let h = q.hash
  var i = t.find(q, h)
  if i < 0:                       # missing: maybe grow, then insert
    if (t.count + 1) * 4 > t.data.len * 3:
      t.grow; i = t.find(q, h)
    i = -i - 1
    t.data[i] = n
    t.count.inc
  t.data[i]

type InStr* = distinct uint32        # spc =~ 11..15*nInStr()
const lenBits = 6                    # max len 2^6 - 1 = 63 is v.long
const lenMask = (1 shl lenBits) - 1  # leaves 2^26 B for unique data
var tab: InTab[string, InStr]
var dat = newStringOfCap(6000)

proc off*(n: InStr): int = n.int shr lenBits
proc len*(n: InStr): int = n.int and lenMask
proc `$`*(n: InStr): string = dat[n.off ..< n.off + n.len]
proc `==`*(a, b: InStr): bool {.borrow.}   # the fast one
proc space*(): int = dat.len + tab.data.len * InStr.sizeof
proc nInStr*(): int = tab.count

proc `==`*(a: InStr, b: string): bool =
  proc mcmp(a, b: pointer; n: csize): cint
    {.importc: "memcmp", header: "<string.h>".}
  a.len == b.len and
    mcmp(dat[a.off].addr, b[0].unsafeAddr, b.len) == 0

proc `==`*(a: string, b: InStr): bool = b == a   # swap args

proc hash(n: InStr): Hash =   # MUST MATCH hash(string)!
  let p = cast[ptr UncheckedArray[byte]](dat[n.off].addr)
  hash(toOpenArray[byte](p, 0, n.len - 1))

proc intern*(s: string): InStr =
  if s.len > lenMask:
    raise newException(ValueError, "s too long")
  if dat.len + s.len >= 1 shl (InStr.sizeof * 8 - lenBits):
    raise newException(ValueError, "too much unique")
  let n = InStr((dat.len.InStr.int shl lenBits) or s.len)
  result = tab.getOrPut(s, n)
  if result == n: dat.add(s)   # newly inserted: append the bytes

when isMainModule:
  import strutils
  for itr in 1..2:
    for w in "hello more interned world".split():
      let a = intern(w); echo a.int, " ", a, " ", tab.data, " ", dat
  echo intern("hello") == "hello", intern("there") == "world"
  echo tab.data, " ", dat
Oh, and for completeness, here is my test program for that SOWPODS file; pretty trivial, but @Serge seems new to Nim:
import os, strutils

when defined(trivialIntern):
  import intern0
else:
  import intern

for w in open(paramStr(1)).readAll.split(): discard intern(w)
where intern0.nim is just my above 10-line Table & seq variant with Str -> InStr (for "interned string").
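For reference, a hedged guess at the build/run lines (the perf output later in the thread shows a binary named sowint; -d:danger per the benchmarking advice below; file names are mine):

nim c -d:danger -o:sowint sowtest.nim
./sowint sowpods.txt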
Exact space will, of course, always depend upon the average word length in your corpus. It's about 9.11 bytes in SOWPODS but about 7.46 bytes in (cd /usr/src/linux-5.6.12; cat **/*.txt) for purely alphabetic words. But I added the nInStr() and space() calls so you can assess that if you like.
Anyway, in spite of the many authors, there are only about 21965 distinct words in the Linux in-kernel doc corpus. So, 128 KiB for the hash table part and 142067 bytes for the data part, a total of about 266 KiB, means almost fully L2 resident (L2 is usually 256 KiB, but it has been growing).
Someone can correct me if I am wrong, but from lib/system/strs_v2.nim it seems growth will be doubling until 64 KiB and 1.5x-ing afterward. So there will be a little allocation time, but mostly just the standard amortized-constant-from-exponential-growth deal, plus checking there is space for each word and doing a memcpy of the word. memcpy is pretty optimized (built into gcc these days).
I'm not sure why people think this part should be slow; perhaps they are extrapolating from the behavior of some other prog.lang runtime, but you are the second person to say this, so I thought a response was warranted. Maybe folks are thinking strings are gigantic and not 5..16 byte words? I suppose they could be sometimes, but @Serge did say "Say, one wants to do some NLP". Even German words top out around 84 chars, IIRC, but that is a maximum, not an average, which is what matters for estimating performance here.
Also, you seem new around here: when you benchmark, be sure to compile the Nim code with -d:danger. Anyway, I don't think that 60-line version is the end-all/be-all, but it seems like a better initial starting point than memory-hoppy linked lists whose costs are so often misunderstood. I'd be curious to see credible evidence to the contrary.
Oh, I didn't really mention it, but I use Nim's sum types and overloading to really have 2 versions of find: one taking a string for querying and one taking the packed (off,len) numbers for table growth. Otherwise, you could probably port these ideas to many prog.langs.
Also, I am aware of the Nimble package "names", but it looks very memory inefficient, probably more so than the 10-line variant above: 2 pointers, 2 ints, plus an indirect string makes for more than 40..50 bytes per word|name on average. Also, that package is unmaintained, needs --nilseqs:on to compile, at least 3 tests fail even with that, and it looks link-hoppy, though I haven't tried to time it. Anyway, I strongly suspect my above code is the best starting point of what I've seen so far, but have at it.
what's your model, @rahulmittal79?
appending to an array is, well, you know, as fast as it can be. Nim's allocator isn't advertised much, but it's honest-to-goodness amortized O(1), unlike some.
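To put a rough number on that claim, a quick hedged micro-measurement of raw append cost (my sketch; compile with -d:danger, numbers obviously machine-dependent):

import times

var s = newStringOfCap(16)
let t0 = epochTime()
for i in 1 .. 10_000_000:
  s.add 'x'                     # amortized O(1) append
echo (epochTime() - t0) * 1e9 / 10_000_000.0, " ns/append"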
So, loading up that SOWPODS takes about 25 ms on my computer. That is a pessimal workload for interning since literally every single word is novel and induces an append. Running it under Linux perf record -F25000 I get a perf report the top of which is:
47.90% sowint sowint [.] NimMainInner
17.12% sowint sowint [.] memhash_wy.constprop.0
5.98% sowint sowint [.] rawAlloc__mE4QEVyMvGRVliD..
4.97% sowint libc-2.33.so [.] __memcmp_avx2_movbe
2.84% sowint libc-2.33.so [.] __memmove_avx_unaligned_erms
1.73% sowint sowint [.] resizeString
1.64% sowint libc-2.33.so [.] __memset_avx2_erms
So, you know, the alloc/resize/memmove/memset might all be part of the array append. In total that is 5.98+2.84+1.73+1.64 = 12.19%. The hash function alone is 17.12% (Wang Yi string hash in this case, a bit faster than murmur; so your times should be even more tilted toward hashing).
Looking into that NimMainInner, at least 10% total time is in memory stalls on a mov from the hash lookup.
So, at a minimum, in the worst possible case for appending, hashing costs >2.25x the append. I suspect it is more like 3..4x. Then in a real case, interning a corpus, it could be 30x after warm-up, given the way word frequency distributions work. So, this array append folks are worrying about is probably a ~3% impact against an actual corpus, by this admittedly preliminary guesstimating analysis.
Since 2 people are talking about benchmarking, in the interests of reproducible data and also to know what to expect from an organic corpus, I fetched the 762055 byte Tale Of Two Cities book into /dev/shm/tale and then ran this program on it (intern.nim is my 71 line code snippet above):
import os, times, intern

iterator lowCaseWords(f: File): string =
  ## yeah, yeah...I do not handle EOL hyphenation
  var wd = ""
  for line in f.lines:
    for ch in line:
      if ch in {'a'..'z'}: wd.add ch                  # in-word; extend
      elif ch in {'A'..'Z'}: wd.add char(ord(ch) + 32)
      elif wd.len > 0:                                # non-word char
        yield wd; wd.setLen 0                         # yield & reset
    if wd.len > 0:
      yield wd; wd.setLen 0                           # word at end of line

var t0 = epochTime()
var words: seq[string]
for w in paramStr(1).open.lowCaseWords:
  words.add w
echo epochTime() - t0, " to build ", words.len, " words"
t0 = epochTime()
for w in words: discard intern(w)
echo epochTime() - t0, " to intern ", nInStr(), " unique"
To remove re-hashing growth time, I changed the initial hash table size to be 16384 and the initial string size to 16000 (it has to grow to an eventual 71055 for this book). Then I did a PGO build and did best of 5 runs to get:
0.0089173316.. to build 138142 words
0.0012156963.. to get sum 62825179
0.0025112628.. to intern 9718 unique
2.5113 - 1.2157 = 1.30 ms / 138142 words = 9.4 nanosec/word above just summing the chars (which was the fastest thing I could think of that still trips over the data); that seems pretty good to me.
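Concretely, the two pre-sizing tweaks mentioned before the output amount to something like this against the intern.nim snippet above (hedged: my paraphrase, the exact lines changed are guesses):

if t.data.len == 0: t.data.setLen 16384   # in getOrPut; was 1024
var dat = newStringOfCap(16000)           # at module scope; was 6000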
Anyway, that book is a publicly available, reproducible test case for @rahulmittal79 or @Serge to try their approaches on. { It's not quite a random selection of source material, since that particular book starts off "It was the best of times, it was the worst of times..". Very apropos for benchmarking, re: comparing times. :-) } Even more realistic would be a collection of books from Project Gutenberg.
Also, interned words are 4 bytes (off,len). The original corpus has word lengths averaging 5.22 bytes (including a separator byte). So, not much space savings here, @Serge, but working with ints is naturally much faster for some things. One could trade off transformed-document space for interning speed here by using a 14..18-bit word number in the hash table (with adix/seqUint) which is the index into an array of the (off,len) reps. Since TOTC has 9718 unique words, 14 bits is enough, and you could get a 5.22*8/14 =~ 3x reduction while keeping everything pretty "online & updatable" efficient. (gzip -9 only gets a 2.62:1 ratio, but to be fair I wasn't counting table overhead & the proposed new (off,len) array in my 3x.)
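For the record, the back-of-envelope behind that ~3x (inputs as stated above):

const avgWordBits = 5.22 * 8   # avg bytes/word incl. separator, as bits
const idBits = 14.0            # 2^14 = 16384 >= 9718 unique words
echo avgWordBits / idBits      # =~ 2.98, i.e. about 3x smaller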
Have a nice weekend!