nimforum mirror - Looking for a set that sorts and deduplicates

cdome (orginal) [2017-11-22T21:13:18+01:00] view original

I have bunch of ints that I need to insert into set and read in increasing order avoiding duplicates. They are produced at a different time so I don't have them in a seq.

I can probably knock-up a quick SortedSeq as a distinct seq in my code. I have a good capacity estimate for my sue case, so performance would be ok. However, it sounds as rather common problem so I assume Nim has solution for it, I just can't find it.

Any ideas?

Thank you

cblake (orginal) [2017-11-22T23:43:37+01:00] view original

collections.heapqueue does mostly what you want. You would have to remove duplicates yourself, like below, but probably with an iterator called something like unique.

import collections.heapqueue

var heap = newHeapQueue[int]()
let data = [1, 3, 5, 1, 7, 9, 2, 4, 6, 6, 8, 0]
for item in data:
  push(heap, item)

var last = heap.pop() #NOTE: assumes heap.len >= 1
echo last
while heap.len > 0:
  let next = heap.pop()
  if next != last:
    echo next
  last = next

Note that this does actually store duplicates until the final iteration because heaps do not possess a fast { i.e. O(log-set-size) } test for membership. So, if you had a great many copies and/or memory constraints that could be an issue. You didn't mention how high the duplication rate would be, but I bet something like the above is fast enough.

A more memory optimal structure for this is probably a simple B-Tree or some kind of balanced binary tree (less cache friendly than a B-Tree, but typically more available/implemented more often) or possibly an adaptation of collections.critbits or some other digital search tree. I think there is a red-black tree in Nim floating around.

jlp765 (orginal) [2017-11-23T01:31:42+01:00] view original

CountTable from the tables library can also do this. I don't know how it compares speed-wise with the previous solution

The keys() iterator will retrieve the sorted numbers (it sorts on insert??), and the value is the number of occurrences.

If you already have a sequence of data, you can use the toCountTable() proc to convert to a CountTable

cblake (orginal) [2017-11-23T01:58:55+01:00] view original

CountTable is just a histogram and the sorting is by the bin counts not by the keys. Keys are visited in "hash order". hashes does use an identity hash for integers which could create some confusion from a simple test program if the modulo the table size post hash transform doesn't change the order (though it surely can).

Of course, the larger idea embedded in this suggestion is quite legitimate - you could use a HashSet[int] (from sets) to manage the set until you are ready to create a seq[int] via HashSet.toSeq and then sort that seq at once. This is probably the approach that most would use and is probably faster for most access patterns. The heap/tree approach would likely win if you had dynamism -- insert a bunch, sweep in key order, insert some more, sweep in key order again, etc. This complex dynamic pattern wasn't really specified in your question, though.

Araq (orginal) [2017-11-23T02:46:29+01:00] view original

I have a B-Tree implementation. Remind me to release it.

mratsim (orginal) [2017-11-23T09:07:58+01:00] view original

insets dedups for sure and seems extremely fast. I don't know if it's items is sorted though.

cblake (orginal) [2017-11-23T15:21:33+01:00] view original

@Araq - fabulous. I think a B-Tree in the stdlib would be great.

@mratsim - I think you meant collections.intsets with a 't' which does not yield items() in key order.

The built-in set[up to int16] type does (implicitly) order its keys, though. @cdome also did not mention how large the integers he needed to support were, but he said "int" which I took to be the standard Nim int which is too big for a set[]. Also, while the set incl operations of set[int16] are unbeatable performance-wise, iterating over the set can be relatively slow if the set is not populated -- e.g., if you do int16 and only have 10 unique numbers, you still have to loop over 65536 entries to convert to a seq[int].

You can play around with this little program to see the various effects.

import sets, intsets, times, random, sequtils, algorithm
let trials = 10000
let numberRange = 1000 # 10000
let doPrint = false

randomize()
var Is = initIntSet()
let tI0 = epochTime()
for i in 1..trials: Is.incl(random(numberRange))
var Iss = toSeq(items(Is))
Iss.sort(cmp[int])
echo "IntSet time:       ", epochTime() - tI0
if doPrint: echo Iss

var Hs = initSet[int]()
let tH0 = epochTime()
for i in 1..trials: Hs.incl(random(numberRange))
var Hss = toSeq(items(Hs))
Hss.sort(cmp[int])
echo "HashSet[int] time: ", epochTime() - tH0
if doPrint: echo Hss

var Ss: set[int16]
let tS0 = epochTime()
for i in 1..trials: Ss.incl(int16(random(numberRange)))
var Sss = toSeq(items(Ss))
echo "set[int] time:     ", epochTime() - tS0
if doPrint: echo Sss

Udiknedormin (orginal) [2017-11-24T18:08:37+01:00] view original

@cblake I entirely agree, B-trees are pretty awesome. When learning about Rust iterators I discovered that it's often faster to turn a container into a B-tree and then collect it into the starting container type than to operate directly on the initial container.

cblake (orginal) [2017-11-24T20:22:02+01:00] view original

Memory hierarchies since the 1990s have broadened the relevance of 1970s practical wisdom for disk storage. A solid hash table with locality and solid B-Tree often win today. A great stdlib should have both.

A more obscure bonus of a B-Tree (when compared to, say, balanced binary trees) is cheaper amortized cost to maintain simultaneous access by rank (e.g. quickly find 95th percentile, top 10, etc. as well as query by key). With B-Trees you need only one rank counter per big node of O(M) elements instrumentation, not per element as with binary trees. So, B-Trees afford O(M)-times less space overhead and O(log(N)/log(M))-times less time overhead than a rank-instrumented binary tree of N elements. This can be useful for, e.g. tracking quantiles over moving data windows.

alfrednewman (orginal) [2017-11-25T23:08:07+01:00] view original

@Araq - B-Tree in the stdlib will be cool !! Thanks in advance !

Mirror of forum.nim-lang.org

3358 :: Looking for a set that sorts and deduplicates