I wonder whether sequtils.deduplicate() should be renamed to deduplicated(), since it returns a new seq, just like sorted()? And maybe it should be documented that it is O(n^2), which is fine for small collections but not so nice for large ones.
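For reference, here is a minimal sketch of the quadratic approach (my assumption about what sequtils.deduplicate roughly does, not its actual source): every new element is checked with a linear scan against everything kept so far.

proc dedupQuadratic[T](s: openArray[T]): seq[T] =
  # keep the first occurrence of each element, preserving order
  result = @[]
  for itm in s:
    if not result.contains(itm): # linear scan over result -> O(n^2) overall
      result.add(itm)

echo dedupQuadratic([1, 3, 1, 2, 2, 5, 2]) # @[1, 3, 2, 5]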
And maybe we should add an O(n) deduplicate to the stdlib -- the sets module is really fast, as shown below.
import sets
import random, times
import sequtils # deduplicate
# O(n)
proc uniqCopy[T](s: openArray[T]): seq[T] =
  let h = s.toOrderedSet
  newSeq(result, h.len) # can we avoid the unnecessary initialization?
  var i = 0 # is there no items with index for sets?
  for el in h:
    result[i] = el
    inc(i)
# O(n)
proc uniq[T](s: var seq[T]) =
  let h = s.toOrderedSet
  var i = 0
  for el in h:
    s[i] = el
    inc(i)
  s.setLen(h.len)
# O(n^2)
proc xdeduplicate[T](s: var seq[T]) =
  var i, j: int
  while i < s.len:
    s[j] = s[i]
    var k = j
    while k > 0:
      dec k
      if s[k] == s[j]:
        dec(j)
        break
    inc(j)
    inc(i)
  s.setLen(j)
var a = @[1, 3, 1, 2, 2, 5, 2]
var b = a
echo a.uniqCopy
a.xdeduplicate
b.uniq
echo a
echo b
var t: float # cpuTime()
var x: array[1000, int]
for el in mitems(x): el = random(1000)
var z = @x
t = cpuTime()
z = z.deduplicate
echo "sequtils.dedublicate: ", cpuTime() - t
z = @x
t = cpuTime()
z.xdeduplicate
echo "inplace dedublicate: ", cpuTime() - t
z = @x
t = cpuTime()
z = z.uniqCopy
echo "uniqCopy: ", cpuTime() - t
z = @x
t = cpuTime()
z.uniq
echo "uniq: ", cpuTime() - t
# nim c -d:release t.nim
# Output
# @[1, 3, 2, 5]
# @[1, 3, 2, 5]
# @[1, 3, 2, 5]
# sequtils.deduplicate: 0.0001410000000000001
# inplace deduplicate: 0.0001329999999999999
# uniqCopy: 3.099999999999999e-05
# uniq: 2.700000000000011e-05
There is currently another point that confuses me:
I tested with the initial array increased in size by a factor of 10:
var x: array[10000, int]
Compiled with -d:release and the default -O3 for gcc, gcc takes 28 seconds to generate the binary, and the binary size is 650k. With -Os or -O2 the results are similar. A global array of 10k ints should only result in an 80k BSS section when the int size is 8 bytes. So what may be going on? This is with gcc 5.4.
[EDIT]
Well, the Nim compiler generates 10000 assignments like:
z_130420_512892843->data[9997] = x_130366_512892843[9997];
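So the array-to-seq conversion var z = @x seems to get unrolled into one assignment per element in the generated C. A possible workaround is a sketch under my assumption that an explicit runtime loop is not unrolled (toSeqLoop is a hypothetical helper, not a stdlib proc):

proc toSeqLoop[T](x: openArray[T]): seq[T] =
  # copy via a runtime loop instead of `@x`, so the C backend
  # should emit a loop rather than thousands of assignments
  newSeq(result, x.len)
  for i in 0 ..< x.len:
    result[i] = x[i]

# usage: var z = toSeqLoop(x)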
@Krux: I still have to think about your comment some more. But sorting may indeed be a problem, I guess: when the array elements are objects, we have to provide a cmp() proc to sort, and that cmp() proc has to compare all the fields of the object and is called often. So that should not be very fast.
And of course the deduplicate operation should preserve the order of the seq and always keep the first occurrence.
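To make that cost concrete, here is a small sketch of the kind of field-by-field comparison the sort then calls O(n log n) times (the Point type and its cmp are my own hypothetical example, not from the post):

import algorithm

type
  Point = object # hypothetical multi-field element type
    x, y, z: int
    label: string

proc cmp(a, b: Point): int =
  # compares field by field; executed for every comparison the sort makes
  result = cmp(a.x, b.x)
  if result != 0: return
  result = cmp(a.y, b.y)
  if result != 0: return
  result = cmp(a.z, b.z)
  if result != 0: return
  result = cmp(a.label, b.label)

var pts = @[Point(x: 2, y: 1, z: 0, label: "b"),
            Point(x: 1, y: 5, z: 3, label: "a")]
pts.sort(cmp)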
Yes, of course sorting is slow when comparison is slow, but most of the time comparison is not too complicated. For the following use case there is a way to reduce the calculation costs (warning: all untested pseudocode, I have not even checked whether I use sort correctly). It transforms this:
var someData: seq[A] = ...
proc cmp(lhs, rhs: A): int = cmp(someComplexCalculation(lhs), someComplexCalculation(rhs))
result = someData.sorted(cmp)
to something like this:
var someData: seq[A] = ...
var tmp = newSeq[tuple[index: int, value: B]](someData.len)
for i, val in someData:
  tmp[i] = (index: i, value: someComplexCalculation(val))
proc cmp(lhs, rhs: tuple[index: int, value: B]): int = cmp(lhs.value, rhs.value)
tmp.sort(cmp)
for i, val in tmp:
  result[i] = someData[val.index]
Now someComplexCalculation is only called N times instead of N log N times.
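As a concrete, runnable illustration of this decorate-sort-undecorate idea (the names expensiveKey and sortedByKey and the sqrt-based key are mine, not from the post above):

import algorithm, math

proc expensiveKey(x: int): float = sqrt(float(x)) # stand-in for a costly calculation

proc sortedByKey(data: seq[int]): seq[int] =
  # decorate: compute the key once per element
  var tmp = newSeq[tuple[index: int, value: float]](data.len)
  for i, val in data:
    tmp[i] = (index: i, value: expensiveKey(val))
  # sort by the precomputed key: still N log N comparisons, but cheap ones
  tmp.sort(proc (a, b: tuple[index: int, value: float]): int = cmp(a.value, b.value))
  # undecorate: rebuild the result in the new order
  newSeq(result, data.len)
  for i, t in tmp:
    result[i] = data[t.index]

echo sortedByKey(@[9, 1, 16, 4]) # @[1, 4, 9, 16]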
@Stefan - this is a classic application of containsOrIncl in the set APIs (or mgetOrPut / hasKeyOrPut in the table APIs), which are very close to the core "search or insert" hash table algorithm.
proc uniqCp[T](s: openArray[T]): seq[T] =
  newSeq(result, s.len)
  var h = initSet[T](rightSize(s.len))
  var i = 0
  for el in s:
    if not h.containsOrIncl(el):
      result[i] = el
      inc(i)
  result.setLen(i) # shrink to the number of unique elements actually kept
proc uniqInPlace[T](s: var seq[T]) =
  var h = initSet[T](rightSize(s.len))
  var i = 0
  for el in s:
    if not h.containsOrIncl(el):
      s[i] = el
      inc(i)
  s.setLen(h.len)
These were also about 20% faster on my system than their OrderedSet counterparts.
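For completeness, a tiny sketch of the table-side analogue mentioned above (the word-counting example is mine, not from the post): mgetOrPut returns a mutable reference to the value, inserting the supplied default first if the key is missing, so one lookup does the "search or insert".

import tables

var counts = initTable[string, int]()
for word in ["a", "b", "a", "c", "a"]:
  inc counts.mgetOrPut(word, 0) # insert 0 if missing, then increment in place

echo counts # e.g. {"a": 3, "b": 1, "c": 1}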