I wonder whether sequtils.deduplicate() should be renamed to deduplicated(), since it returns a new seq, just like sorted()? And maybe it should be documented that it is O(n^2), which is fine for small collections but not so nice for large ones.
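For reference, here is a minimal sketch of the quadratic approach (my assumption about what sequtils.deduplicate roughly does, not its actual source): every new element is checked with a linear scan against everything kept so far.

proc dedupQuadratic[T](s: openArray[T]): seq[T] =
  # keep the first occurrence of each element, preserving order
  result = @[]
  for itm in s:
    if not result.contains(itm): # linear scan over result -> O(n^2) overall
      result.add(itm)

echo dedupQuadratic([1, 3, 1, 2, 2, 5, 2]) # @[1, 3, 2, 5]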
And maybe we should add an O(n) deduplicate to the stdlib -- the sets module is really fast, as shown below.
import sets
import random, times
import sequtils # deduplicate
# O(n)
proc uniqCopy[T](s: openArray[T]): seq[T] =
  let h = s.toOrderedSet
  newSeq(result, h.len) # can we avoid the unnecessary initialization?
  var i = 0 # is there no items with index for sets?
  for el in h:
    result[i] = el
    inc(i)
# O(n)
proc uniq[T](s: var seq[T]) =
  let h = s.toOrderedSet
  var i = 0
  for el in h:
    s[i] = el
    inc(i)
  s.setLen(h.len)
# O(n^2)
proc xdeduplicate[T](s: var seq[T]) =
  var i, j: int
  while i < s.len:
    s[j] = s[i]
    var k = j
    while k > 0:
      dec k
      if s[k] == s[j]:
        dec(j)
        break
    inc(j)
    inc(i)
  s.setLen(j)
var a = @[1, 3, 1, 2, 2, 5, 2]
var b = a
echo a.uniqCopy
a.xdeduplicate
b.uniq
echo a
echo b
var t: float # cpuTime()
var x: array[1000, int]
for el in mitems(x): el = random(1000)
var z = @x
t = cpuTime()
z = z.deduplicate
echo "sequtils.dedublicate: ", cpuTime() - t
z = @x
t = cpuTime()
z.xdeduplicate
echo "inplace dedublicate: ", cpuTime() - t
z = @x
t = cpuTime()
z = z.uniqCopy
echo "uniqCopy: ", cpuTime() - t
z = @x
t = cpuTime()
z.uniq
echo "uniq: ", cpuTime() - t
# nim c -d:release t.nim
# Output
# @[1, 3, 2, 5]
# @[1, 3, 2, 5]
# @[1, 3, 2, 5]
# sequtils.deduplicate: 0.0001410000000000001
# inplace deduplicate: 0.0001329999999999999
# uniqCopy: 3.099999999999999e-05
# uniq: 2.700000000000011e-05
There is currently another point that confuses me:
I tested with the initial array increased in size by a factor of 10:
var x: array[10000, int]
Compiled with -d:release and the default -O3 for gcc, gcc takes 28 seconds to generate the binary, and the binary size is 650k. With -Os or -O2 the results are similar. A global array of 10k ints should only result in an 80k BSS section when the int size is 8 bytes. So what may be going on? This is with gcc 5.4.
[EDIT]
Well, the Nim compiler generates 10000 assignments like:
z_130420_512892843->data[9997] = x_130366_512892843[9997];
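So the array-to-seq conversion var z = @x seems to get unrolled into one assignment per element in the generated C. A possible workaround is a sketch under my assumption that an explicit runtime loop is not unrolled (toSeqLoop is a hypothetical helper, not a stdlib proc):

proc toSeqLoop[T](x: openArray[T]): seq[T] =
  # copy via a runtime loop instead of `@x`, so the C backend
  # should emit a loop rather than thousands of assignments
  newSeq(result, x.len)
  for i in 0 ..< x.len:
    result[i] = x[i]

# usage: var z = toSeqLoop(x)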
@Krux: I still have to think about your comment some more. But sorting may indeed be a problem, I guess: when the array elements are objects, we have to provide a cmp() proc to sort, and that cmp() proc has to compare all the fields of the object and is called often. So that should not be very fast.
And of course the deduplicate operation should preserve the order of the seq and always keep the first occurrence.
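To make that cost concrete, here is a small sketch of the kind of field-by-field comparison the sort then calls O(n log n) times (the Point type and its cmp are my own hypothetical example, not from the post):

import algorithm

type
  Point = object # hypothetical multi-field element type
    x, y, z: int
    label: string

proc cmp(a, b: Point): int =
  # compares field by field; executed for every comparison the sort makes
  result = cmp(a.x, b.x)
  if result != 0: return
  result = cmp(a.y, b.y)
  if result != 0: return
  result = cmp(a.z, b.z)
  if result != 0: return
  result = cmp(a.label, b.label)

var pts = @[Point(x: 2, y: 1, z: 0, label: "b"),
            Point(x: 1, y: 5, z: 3, label: "a")]
pts.sort(cmp)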
Yes, of course sorting is slow when comparison is slow, but most of the time comparison is not too complicated. For the following use case there is a way to reduce the calculation costs (warning: all untested pseudocode, I have not even checked whether I use sort correctly). It transforms this:
var someData: seq[A] = ...
proc cmp(lhs, rhs: A): int = cmp(someComplexCalculation(lhs), someComplexCalculation(rhs))
result = someData.sorted(cmp)
to something like this:
var someData: seq[A] = ...
var tmp = newSeq[tuple[index: int, value: B]](someData.len)
for i, val in someData:
  tmp[i] = (index: i, value: someComplexCalculation(val))
proc cmp(lhs, rhs: tuple[index: int, value: B]): int = cmp(lhs.value, rhs.value)
tmp.sort(cmp)
for i, val in tmp:
  result[i] = someData[val.index]
Now someComplexCalculation is only called N times instead of N log N times.
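As a concrete, runnable illustration of this decorate-sort-undecorate idea (the names expensiveKey and sortedByKey and the sqrt-based key are mine, not from the post above):

import algorithm, math

proc expensiveKey(x: int): float = sqrt(float(x)) # stand-in for a costly calculation

proc sortedByKey(data: seq[int]): seq[int] =
  # decorate: compute the key once per element
  var tmp = newSeq[tuple[index: int, value: float]](data.len)
  for i, val in data:
    tmp[i] = (index: i, value: expensiveKey(val))
  # sort by the precomputed key: still N log N comparisons, but cheap ones
  tmp.sort(proc (a, b: tuple[index: int, value: float]): int = cmp(a.value, b.value))
  # undecorate: rebuild the result in the new order
  newSeq(result, data.len)
  for i, t in tmp:
    result[i] = data[t.index]

echo sortedByKey(@[9, 1, 16, 4]) # @[1, 4, 9, 16]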
@Stefan - this is a classic application of containsOrIncl in the set APIs (or mgetOrPut / hasKeyOrPut in the table APIs), which are very close to the core "search or insert" hash table algorithm.
proc uniqCp[T](s: openArray[T]): seq[T] =
  newSeq(result, s.len)
  var h = initSet[T](rightSize(s.len))
  var i = 0
  for el in s:
    if not h.containsOrIncl(el):
      result[i] = el
      inc(i)
  result.setLen(i) # shrink to the number of unique elements actually kept
proc uniqInPlace[T](s: var seq[T]) =
  var h = initSet[T](rightSize(s.len))
  var i = 0
  for el in s:
    if not h.containsOrIncl(el):
      s[i] = el
      inc(i)
  s.setLen(h.len)
These were also about 20% faster on my system than their OrderedSet counterparts.
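For completeness, a tiny sketch of the table-side analogue mentioned above (the word-counting example is mine, not from the post): mgetOrPut returns a mutable reference to the value, inserting the supplied default first if the key is missing, so one lookup does the "search or insert".

import tables

var counts = initTable[string, int]()
for word in ["a", "b", "a", "c", "a"]:
  inc counts.mgetOrPut(word, 0) # insert 0 if missing, then increment in place

echo counts # e.g. {"a": 3, "b": 1, "c": 1}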