At work I have made a Python script that parses a Simulink file into a tree representation and then performs queries on that representation. We use it to generate various text files. The problem is that the Simulink file, with its associated libraries, is huge, and constructing the representation takes a lot of time. Just out of curiosity I spent some spare time coding only the reading of the file in C. Loading takes significantly less time, and the memory consumption is about the same as in Python (around 1.1 GB).
Now, again out of curiosity, I converted the loading routine to D and Nimrod using their string, hash table, and list data structures (I roll my own in the C version). The loading is faster than Python, but much slower than C. I noticed the following using Nimrod.
If I did not turn off the GC, it was really slow. I guess that is due to all the objects that are created.
Unlikely. The GC is quite happy with many allocations. The compiler itself allocates like crazy. You can also try --gc:boehm or --gc:markAndSweep to see if it makes a difference. (You do know about -d:release, right?!)
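For reference, those are all compile-time switches, so a typical invocation would look something like this (the file name is just an example):

nimrod c -d:release --gc:markAndSweep loader.nim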
More troublesome is that the whole Simulink file could not be read in: the process ran out of memory (i.e., hit 2 GB) well before the file was fully read.
Certainly worrying but without seeing your code I can't say much. If your loading doesn't match C speeds, you're doing it wrong. :P My guess is you don't know how to deal with Nimrod's strings, seqs and hash tables. Their copying semantics can have serious downsides if you're not careful but there are also lots of optimizations still left to be done in the compiler/stdlib. Of course you can also optimize yourself. ;-)
I don't think the speed of Nimrod's hash tables matches Python's yet but then idiomatic Nimrod code barely uses hash tables anyway.
A possible reason for the high memory consumption is that hash tables in Nimrod are initialized to hold 64 key/value pairs by default, which can waste a lot of space if you have many small hash tables. Use initTable[A, B] with a smaller initial size if you want to minimize space usage. Otherwise, Nimrod hash tables use essentially the same implementation (memory-layout-wise) as Python.
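For example (the element types and the initial size of 4 are just for illustration; stick to a power of two for the size):

import tables

# many small per-block tables: start them at 4 slots instead of the default 64
var blockParams = initTable[string, string](4)
blockParams["Name"] = "Gain1"
blockParams["Gain"] = "2.5"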
Strings and sequences shouldn't incur space overhead compared to Python (in fact, they should consume less space). As noted by Araq, strings and sequences have copy semantics for assignment by default (this can be avoided using shallowCopy). While this may incur speed overhead, a positive side effect of having copy semantics is that strings and sequences are automatically compacted when they are stored.
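A small illustration of the difference (the variable names are made up):

var original = "a fairly long value string read from the file"
var copied = original          # assignment: allocates and copies the bytes
var shared: string
shallowCopy(shared, original)  # no copy: shared refers to the same buffer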
Also, as Araq wrote, use -d:release for speed. Or see what I wrote in this thread for more selective optimizations.
Araq: I don't think the speed of Nimrod's hash tables matches Python's yet but then idiomatic Nimrod code barely uses hash tables anyway.
As the person who wrote the original version of the current Python dictionary lookup code, I'm probably in a pretty good position to explain, and I don't think it's a big problem. The short answer is that performance shouldn't be (measurably) worse (and given the usual interpreter overhead in CPython, generally better), and that even where it is, you probably don't want a literal copy of the Python version.
Nimrod's hash tables are very similar to Python hash tables, but there are three differences that one should be aware of.
First of all, Python caches the hash values of strings inside the string objects. This means that hash functions never have to be computed more than once. Obviously, this creates an additional word of overhead per string, and can be emulated in Nimrod (if desired) by wrapping string keys inside simple objects.
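A sketch of such a wrapper, assuming the Nimrod-era hashes module (THash is its hash result type); the type and proc names are made up:

import hashes, tables

type
  CachedStr = object
    s: string
    h: THash                 # computed once, when the key is built

proc cached(s: string): CachedStr =
  result.s = s
  result.h = hash(s)

proc hash(k: CachedStr): THash = k.h            # reuse the cached value
proc `==`(a, b: CachedStr): bool = a.s == b.s

var t = initTable[CachedStr, int]()
t[cached("BlockType")] = 1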
Second, Python stores the hash value of the key alongside the key in the hash table. This is because comparison functions may call Python code and can be very expensive; by first comparing the hash values, you get a very fast common path until you get around to doing the actual comparison. This behavior could be rolled into tables.nim by replacing the slot enum with an int that encodes most of the bits of the hash value along with the slot enum value; it could also be done by caching the hash value for an object as above, without changing the actual tables implementation.
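A possible slot layout for the first variant (made-up names, not the actual tables.nim source): one int per slot doubles as the slot state and the cached hash bits.

type
  TKeyValuePair[A, B] = object
    hcode: int    # 0 = empty, 1 = deleted, anything else = filled + hash bits
    key: A
    val: B

# a real implementation would remap hash values that collide with 0 or 1
proc isFilled(hcode: int): bool = hcode > 1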
Third, Nimrod uses what is essentially a linear probing variant, which CAN blow up with a bad hash function or if clustering occurs naturally. Explaining how Python avoids that requires some background, though.
Originally, Python had a simple hash table with open addressing and prime sizes. This was relatively straightforward, but had very poor performance on RISC architectures without a built-in integer modulo instruction.
The replacement implementation that I came up with allowed for tables whose sizes were a power of 2 (allowing the use of and rather than mod). The table index was calculated as (hash + offset) mod 2^k, where offset pseudo-randomly cycled through the values in the set {1..2^k-1} (exploiting a property of GF(2^k)). The choice of the initial value of offset (derived from hash) broke up clustering from bad hash functions; additionally, each hash value resulted in a different probing pattern, minimizing overlaps between probing patterns and the clustering resulting therefrom.
This implementation had the significant downside that you had to find a generating polynomial for GF(2^k) for each possible k and store it in a table. Tim Peters then observed that you could cycle through all elements of the set {0..2^k-1} via the recurrence x = 5*x+1 (mod 2^k), per the usual rules for linear congruential PRNGs. However, this simple recurrence -- note that it is applied to the hash value, not the offset as above -- had the disadvantage that its behavior was essentially the same as linear probing (though it helped a bit in breaking up clustering from bad hash functions). In order to fix that problem, noise from bits of the original hash value was added to randomize the recurrence, specifically along the following lines:
hash = (hash * 5 + perturb + 1)
perturb = perturb shr 5
with perturb initialized to hash. Eventually, perturb would become zero, but only after having fed several iterations of noise into the recurrence. This broke up clustering from bad hash functions (including very bad ones) and randomized the probing pattern for the first n/5 iterations (n being the number of significant bits of the hash value). That it essentially became linear probing afterwards did not matter in practice; if you had run into that many successive collisions, you were already having bad luck.
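As a small illustration (just the recurrence from above, not the CPython or tables.nim code), this prints the first slots probed for a given hash in a 1024-slot table:

proc showProbes(hash, mask: int) =
  var slot = hash and mask        # initial slot
  var perturb = hash
  for i in 0 .. 7:
    echo slot
    slot = (slot * 5 + perturb + 1) and mask
    perturb = perturb shr 5

showProbes(0x12345678, 1023)      # mask = table size - 1, a power of two minus one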
This means that the expected behavior of Nimrod hash tables is nearly indistinguishable from Python hash tables; however, Nimrod's implementation can have problems with bad hash functions, e.g. the one for integers, where worst-case behavior is much more likely. The following code, for instance, exhibits O(n^2) behavior:
import tables, hashes
const N = 50000
var table = initTable[int,int]()
for i in 1..N:
  table[i shl 16] = i
var s = 0
for i in 1..N:
  s += table[i shl 16]
Picking a reasonably good hash function should largely sidestep that problem (e.g. by feeding the hash function output through Doug Lea's smear() function). In the long term, adding the perturb functionality to tables.nim might be a good idea.
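For instance, the quadratic behavior above goes away if the integer keys get a hash that mixes the high bits down. The shifts below are generic bit mixing in the spirit of smear(), not Doug Lea's actual constants, and THash is the Nimrod-era hash type:

import tables, hashes

type BlockId = distinct int

proc hash(x: BlockId): THash =
  var h = int(x)
  h = h xor (h shr 16)        # fold the high bits into the low bits
  h = h xor (h shr 8)
  result = THash(h)

proc `==`(a, b: BlockId): bool = int(a) == int(b)

var t = initTable[BlockId, int]()
for i in 1 .. 50000:
  t[BlockId(i shl 16)] = i    # no longer degenerates into long linear scans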
I have done some more investigations and found out the following.
A key-value pair in Simulink looks like this:
whitespace key-string whitespace value-string
or the value can span several lines. Then there are curly brackets indicating which "objects" are children. Pseudocode, without handling spanning value strings and children, boils down to reading the file line by line and splitting each line into a key token and a value token (see the sketch below).
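A minimal Nimrod sketch of that loop (children and multi-line values are ignored; the file name and the brace handling are illustrative assumptions):

import tables, strutils

var params = initTable[string, string](64)
for line in lines("model.mdl"):
  let stripped = line.strip()
  # skip blank lines and the lines that open or close child objects
  if stripped.len == 0 or '{' in stripped or stripped[0] == '}':
    continue
  let sep = stripped.find(' ')
  if sep > 0:
    params[stripped.substr(0, sep - 1)] = stripped.substr(sep + 1).strip()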
I converted the reading to Java, and it was slow as well when using String objects. By using a StringBuilder object, and reusing the same object for each key-value pair read, it became much faster, but still took about twice as long as C. I then tried to find StringBuilder-equivalent methods in Nimrod, and they were there. Now the Nimrod version is slightly faster than the C version without turning off the GC, which is probably due to a much better hash table implementation than my naïve one.
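The reuse pattern, roughly, assuming setLen and add as the "string builder" operations (the file name is again just an example):

import strutils

var token = ""
for line in lines("model.mdl"):
  token.setLen(0)               # reset the length, keep the allocated capacity
  for c in line:
    if c in Whitespace: break
    token.add(c)                # append a char without temporary strings
  # ... token now holds the first (key) token of the line ...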
My first try at string handling was key &= c, where c is a character. This didn't work, as &= is not overloaded for these types. One way to get compilable code was to write key &= "" & c instead, in the hope that the compiler would optimize away the empty string. This resulted in lots of temporary objects that the GC had to administer.
Don't know if & for chars is an omission or deliberate, but you can always use the usual add proc:
var t: string = ""
t.add('a')
t.add('b')
echo t