nimforum mirror - high memory usage with large number of HashSets. 3X more memory than Python

Gtriangle (orginal) [2021-12-24T05:48:07+01:00] view original

Hi,

I'm really loving Nim but ran into a strange issue today when porting some Python code.

I have some simple code that builds a table, mapping an int to a HashSet[int].

The table has about 2.5 million entries.

Each HashSet contains 10 random integers (max integer value being 115000)

I was hoping someone could help me understand why the Nim version uses over 10GB of memory. But the same code in Python uses only 3GB of memory.

Why does the Nim version use 3X more memory than Python ?

Nim:

import tables, random, sets

var mem = initTable[int,HashSet[int]]()

for docid in 0..2554380:
    if not mem.contains(docid):
        mem[docid] = initHashSet[int]()
    for y in 0..10:
        mem[docid].incl(rand(115000))

Python

from random import randint

mem = {}
for docid in range(2554380):
    if docid not in mem:
        mem[docid] = set()
    for y in range(10):
        mem[docid].add(randint(0,115000))

I compiled the Nim code with the default, 'arc', 'orc' and 'boehm' garbage collectors. In all cases the memory grew to over 10GB. In the case of boehm, the application crashes with this error:

Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS

I'm running Nim 1.6.0:

git hash: 727c6378d2464090564dbcd9bc8b9ac648467e38 active boot switches: -d:release

Araq (orginal) [2021-12-24T07:19:37+01:00] view original

Your port has a bug, you need to use 0..<10 and not 0..10 etc.

As for the memory consumption ... I don't know, every hash table implementation has its problems. Probably all you need to do is to override the hash for integers, but beware, you need to use distinct int for the type.

Since memory consumption is an issue, I would use fusion / BTree instead.

ElegantBeef (orginal) [2021-12-24T07:26:02+01:00] view original

initHashSet takes in parameter of the default size as can be seen here by default it's 64 so just by doing initHashSet[int](10) we get the memory usage down to 1.6GB and it gets much faster thanks to not allocating as much(atleast on my machine with -d:release). Python probably has a much smaller default than 64 or starts at 0 and grows as it goes.

Araq (orginal) [2021-12-24T07:43:53+01:00] view original

Ah, don't listen to me, listen to @ElegantBeef.

ynfle (orginal) [2021-12-24T10:19:40+01:00] view original

Besides for @ElegantBeef's suggestions, you could also try std/intsets

xbello (orginal) [2021-12-24T10:49:40+01:00] view original

intsets uses 2.7Gb. Slighly faster than HashSet(10) in my machine: 3.10 seconds vs 3.4.

cblake (orginal) [2021-12-24T11:58:08+01:00] view original

You can test/implement @Araq's idea more simply by defining nimIntHash1 which does the identity hash - perhaps useful in combination with @ElegantBeef's suggestion. (Just nim c -d:nimIntHash1) or put the define in a foo.nim.cfg or foo.nims file.)

Also, there is a somewhat more full featured hash table in adix , i.e. (adix/LPTabz) that lets you better control memory usage. In particular you can tune table resize policy as well as eliding hash code saving/comparison prefixing (which is not a space-speed win for ints - only more complex objects with expensive comparisons). It also has a compact-ordered mode which is what Python uses (but with number of bits-tunable "hash part" which can use ~1/2 the memory for that part, though most memory is in the seq part).

LPTabz is not documented as well as it could/should be, though. You may have to know/learn what you are doing to fiddle with all its "knobs & levers". E.g., it also allows monitoring and logging messages in the case of a poorly distributing hash functions. So, you can redefine the hash to identity (like nimIntHash1, say), but then trap distributional collapse issues.

cblake (orginal) [2021-12-24T13:09:19+01:00] view original

Follow-up: also you should use --gc:arc if you are not already (sounds like not). Just with stdlib Table & HashSet (with init size hint of 3), I get 2.57 sec, 1313MB with the default gc, 1.73 sec, 843MB with gc:arc, then adding -d:nimIntHash1 this falls to 1.08 sec, 780 MB. Then with gcc PGO this falls to 0.95 sec, 895MB. So, it does seem like you should listen to both Araq & ElegantBeef after all. :-)

Then with adix & identity hash & hash code elision (code shown later), I get 1.00sec, 480MB and with PGO 0.82 sec, 563MB.

import adix/[lptabz, althash], random
proc hash(x: int): Hash = hashIdentity(x)
var mem = initLPTabz[int, LPSetz[int,int,0], int, 0](0, 16, 1)
for docid in 1..2554381: # 0 used as hash key sentinel!
    if not mem.contains(docid):
        mem[docid] = initLPSetz[int,int,0](0, 16, 1)
    for y in 0..10:
        mem[docid].incl(rand(115000))

The 16 controls table resize - once a linear probe search depth counter hits 16 deep, the table resizes (unless it is already below a sparsity threshold). I did try the compact mode but it was quite a bit slower. I didn't play with having the table be one mode and the sets another.

Parenthetically, on your broader context, it seems like you are working out a search engine style inverted index. A hash set, while conceptually spot on, is both a space & time inefficient representation for the "posting list" of documents containing a term (the value in your Table). You can also do union & intersection efficiently on ordered lists. Ensuring order to the list is not so hard if your always process your documents in docId order in the first place. This is usually easy since you can arrange for docId to be the order number of the document. In fact, because posting lists can be large, fancy search engines may even do "delta encodings" of ordered docId lists (i.e. encoding the difference to the next docId, not the docId itself) as a kind of fast online/context specific data compression. To start I would recommend a seq[tuple[docId, freq: uint32]] and writing a couple of tiny union/intersection procs on those.

There is a pretty good book on all this originally called Managing Megabytes { that became Gigabytes and were it updated today would surely be called Terabytes or maybe Petabytes :-) }.

Gtriangle (orginal) [2021-12-24T14:43:04+01:00] view original

wow, amazing. Thank you @Araq, @ElegantBeef and @cblake.

With Nim it feels like I got the keys to a Lamborghini :)

@cblake: You're right, this part of the code is used to generate an inverted index.

I then calculate the Jaccard-Index between all pairs of 'docs' to find similar documents for a recommender system.

Users of my applications are really feeling the difference after I ported the backend to Nim.

Happy holidays to all and thank you again for the amazing work !

cblake (orginal) [2021-12-24T19:31:47+01:00] view original

FWIW, there was an article early this year whose poor engineering annoyed me. That annoyance inspired me to re-do the impl in Nim (in 114 lines and already faster/more memory conservative as well as briefer) and then make a series of adaptations & modifications towards efficiency...basically all as a nascent pedagogical idea.

I got to the point where doing a search index on the full Wikipedia (not just the abstracts) seemed feasible on modest home computer/laptops (well, for 1 language like English). I ran out of steam writing a Nim parser for the full Wiki syntax (which is not insubstantial), though. But if there is enough interest here, I could toss it up on some github repo.

Pyautogui (orginal) [2021-12-24T21:17:43+01:00] view original

I would be interested!

cblake (orginal) [2021-12-24T22:58:56+01:00] view original

Ok..Here you go: https://github.com/c-blake/nimsearch

Gtriangle (orginal) [2021-12-25T14:07:09+01:00] view original

@Araq's warning was definitely justified, when I compiled with the nimIntHash1 flag the code was many times faster, but I ran into KeyErrors when looking for document ids in the reverse index tables.

I'll try changing all my ints to distinct ints and see what happens !

FWIW, there was an article early this year whose poor engineering annoyed me. That annoyance inspired me to re-do the impl in Nim (in 70 lines and already faster/more memory conservative as well as briefer) and then make a series of adaptations & modifications towards efficiency...basically all as a nascent pedagogical idea.

faster, less memory, and less code ? the holy trinity .. Nim rocks !

Thanks for the sharing the GitHub link of your implementation, definitely going to read it today !

cblake (orginal) [2021-12-25T14:55:19+01:00] view original

With an inverted index your keys generally will be string or similar, not int since a document contains generally contains search terms or whatnot which are strings not numbers. So, with seq rather than HashSet for posting lists there should be no need for an integer keyed hash at all. I don't know - maybe you are only indexing numbers, but I expect it's a demo/trial code artifact. { And Araq's warnings are almost always good..ignore at your peril! :-) }

Also, your example/demo/trial code tests unnecessarily. The main table can never contains the docId for the new loop...So, just

for docid in 0..2554380:
    mem[docid] = initHashSet[int]()
    for y in 0..10:
        mem[docid].incl(rand(115000))

should work and be a bit faster in both Nim & Python).

A related subtle gotcha in such trial/example code is looking up keys densely in the same order as Table order. This is easy to do without thinking with nimIntHash1..just two loops from 1..N - one to []=, one to []. This lets memory hardware prefetching work well giving a kind of "false" acceleration relative to the default int hash. (It may not be strictly false if your use cases do a lot of "in int/key"-order lookup loops, but it's not what many would call a "common" case.)

Mirror of forum.nim-lang.org

8732 :: high memory usage with large number of HashSets. 3X more memory than Python