I have seen the following module on GitHub:
https://github.com/pragmagic/names
"names" proposes a solution to the problem of "interning" strings (avoiding unnecessary memory allocation for duplicate identical strings when doing NPL, for instance).
Has anybody tested it?
NPL?
NPL - What does NPL stand for? (acronyms.thefreedictionary.com)
> Acronym, Definition. NPL, National Priorities List (US EPA). NPL, National Physical Laboratory (UK). NPL, Nepal (ISO Country code). NPL, Nigeria Premier ...
please help
I'm planning to do so as soon as I have managed to master Nim well enough. What's slowing me down is the lack of teaching material (which is normal, I guess, for a relatively new language)
:-)
Often "interned string" databases are "append only". I.e., no deletes. If you don't mind 2 copies instead of 1, you can roll your own in like 10 lines of code:
import tables

type
  Strings = object
    ix2str*: seq[string]
    str2ix*: Table[string, int]

proc put*(s: var Strings, x: string): int =
  let n = s.ix2str.len
  result = s.str2ix.mgetOrPut(x, n)
  if result == n:
    s.ix2str.add x

when isMainModule:
  var s: Strings
  discard s.put "two"
  discard s.put "does"
  discard s.put "use"
  discard s.put "two"
  discard s.put "copies"
  discard s.put "two"
  echo s.ix2str
Usage would be to just always pass around the index you get out of put. Whenever you want the "full string" for an i: int you just use s.ix2str[i] (you could define a distinct int type and a `$` for it, of course). Non-mutating operations on the embedded seq and Table might all be useful, etc.
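For what a minimal version of that distinct-index idea might look like, here is a sketch (WordIx and str are made-up names; the Strings type is reused from the snippet above):

```nim
import tables

type
  Strings = object
    ix2str*: seq[string]
    str2ix*: Table[string, int]
  WordIx* = distinct int    # hypothetical distinct index type

proc put*(s: var Strings, x: string): WordIx =
  let n = s.ix2str.len
  let r = s.str2ix.mgetOrPut(x, n)
  if r == n: s.ix2str.add x
  WordIx(r)

# Recover the full string for a distinct index.
proc str*(s: Strings, i: WordIx): string = s.ix2str[int(i)]

when isMainModule:
  var s: Strings
  let two = s.put "two"
  discard s.put "does"
  assert s.str(two) == "two"
```

The distinct type keeps interned indices from being accidentally mixed up with ordinary ints elsewhere in your program.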
You might reduce things to just one copy per unique string with some type IString = distinct int and a Table[IString, int]. That may be hard to make the types work out, because Table expects to do ==(a, b: K) with a and b of the same type. With this IString idea, you are looking up a string but comparing against a stored IString, which needs an indirection. So you might need a custom hash table here that "knows" about the ix2str part, but I'm not sure the approach of names is the way to go for that.
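One way to get that single copy is a small open-addressed table whose slots store only indices, with comparisons indirecting through ix2str. A hedged sketch (all names here, Interner, putIx, etc., are made up for illustration):

```nim
import hashes

type
  Interner = object
    ix2str: seq[string]     # the single copy of each unique string
    slots: seq[int]         # open-addressed slots; -1 means empty

proc initInterner(cap = 16): Interner =
  result.slots = newSeq[int](cap)
  for s in result.slots.mitems: s = -1

proc putIx(t: var Interner, x: string): int =
  if t.ix2str.len * 4 >= t.slots.len * 3:   # grow at 75% load
    var bigger = Interner(ix2str: t.ix2str,
                          slots: newSeq[int](t.slots.len * 2))
    for s in bigger.slots.mitems: s = -1
    for ix in 0 ..< t.ix2str.len:           # rehash existing entries
      var j = hash(t.ix2str[ix]) and (bigger.slots.len - 1)
      while bigger.slots[j] != -1: j = (j + 1) and (bigger.slots.len - 1)
      bigger.slots[j] = ix
    t = bigger
  var j = hash(x) and (t.slots.len - 1)     # slots.len is a power of two
  while t.slots[j] != -1:
    if t.ix2str[t.slots[j]] == x:           # compare via indirection
      return t.slots[j]
    j = (j + 1) and (t.slots.len - 1)       # linear probing
  t.slots[j] = t.ix2str.len
  t.ix2str.add x
  result = t.ix2str.len - 1

when isMainModule:
  var t = initInterner()
  assert t.putIx("two") == 0
  assert t.putIx("does") == 1
  assert t.putIx("two") == 0
```

Only the int indices live in the table, so each unique string is stored exactly once, in ix2str.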
I've also played with CharRNN in Arraymancer, though it's very, very raw at the moment in terms of usage.
Obviously it pales compared to HuggingFace / GPT-3 / BERT at the moment, but lack of time ¯\_(ツ)_/¯
For what it's worth, that names module has pretty large objects for unique names - two 8 byte pointers and an 8 byte refcount as well as the string itself. So, probably 32B*nUnique + string data. Natural language words tend to average around 8B. names does use only an array of pointers for its hash table slots, which saves table space.
As a ball park estimate, consider a 100,000 word natural language dictionary with words averaging 8B long. Nim strings are 24B + amount of data. The names way (which uses 75% max load factor rehashing) will take about 256Ki*8B = 2 MiB for its hash table plus (32+24)*100e3 = 5.6e6 bytes for the data, or about 7.7 MB total. The "paired seq+index" way I mentioned will be (24+8)*100e3*2 = 6.4 MB for the seq and 256Ki*(8+24+8)B = 10.5 MB for the Table, or about 16.9 MB, over 2x bigger.
Were you to do your own custom hash table (as names already does), you could cut the seq part in half to 3.2 MB and get that Table part down to just 256Ki*8B = 2 MiB (or maybe *16 if you want to keep the hash code for speed, which may only matter if your "dictionary" is dynamic) for a total of about 5.3 MB (or about 7.4 MB if you keep hash codes). Without saving hash codes, the sort of structure I mention is smaller (and probably faster) than the names approach.
If you are considering doing your own nimble package (which I encourage!!), you can of course take things further by storing the strings "back to back" as NUL-byte terminated entries, saving about 16B per unique string (1.6 MB in our running example). The integer offset is then "bytes", not "words/strings", and a 4B uint32 is likely enough. With these two optimizations you get 900 kB + 1024 kB, or only 1.9 MB.
At some complexity, that offset could also be shrunk from 32 bits to 20 bits (since the string pool is ≈ 900e3 bytes by assumption) using techniques like sequint in https://github.com/c-blake/adix in the table part. That would take the table to 256Ki*20 bits = 640 kB for a total of 1.5 MB, more than 4x smaller than my estimate for names. I don't know if any of the libs @mratsim mentioned do any of this. Maybe cello?
If you happen to have a very dynamic dictionary, or otherwise care about "build time" more than space, then you could save 8..20 bits of hash code as a prefix compare to speed up failed lookups/inserts, for another 256..640 kB. (8 bits would still bypass almost all string comparisons on lookups, but 20 are needed to avoid hash() calls on table growth.)
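The arithmetic above can be redone in a few lines (a sketch under the same assumptions: 100,000 unique words averaging 8B, 24B Nim string headers, and 256Ki = 262,144 table slots to stay under 75% load; exact byte counts will differ a little from the rounded MB figures):

```nim
const
  nWords = 100_000
  avgLen = 8          # average word length in bytes
  strHdr = 24         # per-string Nim header overhead
  slots  = 262_144    # 256Ki slots, power of two >= 100e3/0.75

# names-style: 32B node per unique word + a Nim string, 8B pointer per slot
let namesData = (32 + strHdr) * nWords
let namesTab  = slots * 8
echo "names total:         ", namesData + namesTab, " bytes"

# back-to-back NUL-terminated pool + 4B uint32 offsets in the table
let packedData = nWords * (avgLen + 1)
let packedTab  = slots * 4
echo "packed total:        ", packedData + packedTab, " bytes"

# same pool, but 20-bit offsets bit-packed sequint-style
let packedTab20 = slots * 20 div 8
echo "20-bit-offset total: ", packedData + packedTab20, " bytes"
```

Running this reproduces the ballpark ordering: the packed layouts come out several times smaller than the names-style layout.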
This is sort of a first step along the lines I was suggesting:
import tables

type
  Strings = object
    strings*: string               # all unique strings, back to back, NUL-terminated
    str2ix*: Table[string, int32]  # string -> byte offset into `strings`

proc put*(s: var Strings, x: string) =
  let n = s.strings.len.int32
  if s.str2ix.mgetOrPut(x, n) == n:   # only append if `x` was new
    s.strings.add x & "\0"

proc c_strlen(s: cstring): csize_t {.
  importc: "strlen", header: "<string.h>" .}

proc str*(s: var Strings, i: int32): string =
  let n = c_strlen(cast[cstring](s.strings[i].addr)).int
  result.setLen n                  # Nim strings keep the zero byte for us
  copyMem result[0].addr, s.strings[i].addr, n

when isMainModule:
  var s: Strings
  s.put "two"; s.put "does"; s.put "use"; s.put "two"
  echo s.str2ix
  echo s.str(9)
Then
$ nim r intern.nim
{"use": 9, "does": 4, "two": 0}
use
Of course, you get sparse indices (0, 4, 9), not a dense (0, 1, 2) word numbering this way, but that was unspecified in @Serge's question. Sparse indices are already "fast like integers" and so may be all you really need. If you need dense numbers, you could keep more metadata, such as another seq mapping word number to offset { also usable in the customized table to avoid storing the string, etc. }
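That extra-seq idea for dense numbering might look like this (a hedged sketch; ix2off is a made-up name, building on the NUL-terminated pool above):

```nim
import tables

type
  Strings = object
    strings*: string              # back-to-back NUL-terminated pool
    str2ix*: Table[string, int32] # string -> dense word number
    ix2off*: seq[int32]           # dense word number -> byte offset

proc put*(s: var Strings, x: string): int32 =
  let k = s.ix2off.len.int32      # candidate dense number
  result = s.str2ix.mgetOrPut(x, k)
  if result == k:                 # new word: record its pool offset
    s.ix2off.add s.strings.len.int32
    s.strings.add x & "\0"

when isMainModule:
  var s: Strings
  assert s.put("two") == 0
  assert s.put("does") == 1
  assert s.put("use") == 2
  assert s.put("two") == 0
  assert s.ix2off[2] == 9     # "use" starts at byte offset 9
```

Here put hands back dense numbers 0, 1, 2, ..., and ix2off recovers the sparse byte offset whenever you need the actual characters.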
There are many succinct representations like tries, Nim's own critbits, etc., but I doubt they are much smaller (they may even be bigger), they are likely slower in practice than the simple idea above, and they are often substantially more complex to code.