nimforum mirror - quickest way to take set difference of two sequences?

mreiland (orginal) [2015-07-06T22:24:14+02:00] view original

I'm sure the answer is simple, I just can't seem to find it.

Basically I have two sequences that contain enumerations and I'd like to take the set difference of them, but I'm having trouble finding how to do this quickly and easily.

Related questions (sorry, new to the language):

easily create seq from enums?

toSeq({low(myEnum)..high(myEnum)}) # is this the simplest way?

convert from seq[OrdinalType] to set (and vice versa)?

is there an array version of toSeq? I couldn't find an 'arrayutils' module, and wasn't able to find the answer on my own.

Varriount (orginal) [2015-07-07T05:11:24+02:00] view original

You might be better off using ordinal sets, since they support efficient storage of enumeration values. They support what you want out-of-the-box.

I'd like to take the set difference of them


type MyEnum = enum
  meA, meB, meC

let
  d = @[meA, meB, meC]
  e = @[meA]

var f = newSeq[MyEnum]()

for dItem in d:
  var found = false
  for eItem in e:
    if d == e:
      found = true
      break
  if not found:
    f.add(d)

Anything more complicated is probably going to require a bigger memory/time tradeoff.

easily create seq from enums?

You can use @[low(myEnum)..high(myEnum)]

convert from seq[OrdinalType] to set (and vice versa)?


import sequtils

type MyEnum = enum
  meA, meB, meC

let enumSeq = toSeq(low(MyEnum)..high(MyEnum))

var enumSet: set[MyEnum]
for i in MyEnum:
  enumSet.incl(for i in MyEnum: i)

is there an array version of toSeq? I couldn't find an 'arrayutils' module, and wasn't able to find the answer on my own.

No, there isn't. Arrays in Nim are used relatively rarely, as their length must normally be known at compile-time. For situations where we don't know the length at compile-time, we use sequences (which are safer, and just as efficient as arrays for nearly all cases). This is because sequences are merely dynamically resized/reallocated arrays, with a bit of header information added on to the beginning. If you really need to create run-time sized arrays, You'll either need to use one of the lower-level memory allocation procedures in system.nim (I recommend unsafeNew), or use alloca to allocate the array on the stack.

jibal (orginal) [2015-07-07T06:50:13+02:00] view original

mreiland: "Basically I have two sequences that contain enumerations and I'd like to take the set difference of them, but I'm having trouble finding how to do this quickly and easily."

Maybe you're overthinking it.

var diff: set[YourEnumType]

for v in a:
  diff.incl(v)
for v in b:
  diff.excl(v)

P.S. It's not common to take the set difference of two lists or bags ... maybe they should have been sets in the first place? Or maybe you want the list or bag difference? (e.g., @[a, a, b] - @[a, b] -> @[a])

mreiland (orginal) [2015-07-07T08:32:55+02:00] view original

sets are the first thing I reached for, but I was unsure how to convert from a seq to a set. That syntax you use for it, could you explain that. It looks like nim has list comprehension? I never came across anything mentioning that and the lambda syntax I saw was a bit more wordy. Do you have a resource to explain that syntax? The only reference I can find to list comprehensions for nim didn't have that syntax. Also, why is it in a looping construct?

@jibal: I know I can do it manually, I'm more using this as an exercise to learn the language better, I've already learned several new things in this thread so it was a success :)

Jehan (orginal) [2015-07-07T15:49:55+02:00] view original

If you want to do it with seqs:

import sequtils

type E = enum A, B, C, D

let s1 = @[A, B, C]
let s2 = @[B, D]

echo s1.filterIt(it notin s2)

To create a seq from an enum:

import sequtils

type E = enum A, B, C, D

let s = toSeq(E.low .. E.high)

There is no array version of toSeq, because arrays are fixed size. One could write a procedure that populates an array with an iterator (even at compile-time), but you'd still have to define bespoke behavior for when array and iterator length do not match.

That said, I'd recommend working with sets directly. E.g.:

import sequtils

type E = enum A, B, C, D

let s1 = {A, B, C}
let s2 = {B, D}

echo s1-s2

A set can be created from an enum as follows:

type E = enum A, B, C, D

let s = {E.low .. E.high}

mreiland (orginal) [2015-07-07T20:02:05+02:00] view original

Alright, thanks guys, I went forward using sets (which is what I wanted originally, so that's good).

I came across the following: http://nim-lang.org/docs/manual.html#statements-and-expressions-statement-list-expression

So I have a better idea of that syntax.

two followup questions.

I'm seeing a preference for let vs var, I'm guessing this allows the compiler to optimize similar to const in C++?

in varriount's last example:

var enumSet: set[MyEnum]
for i in MyEnum:
  enumSet.incl(for i in MyEnum: i)

I see the variable enumSet not being assigned to, but having a value nonetheless. I ran into issues trying this with a normal 'class' type. What are the rules for when something gets allocated on the stack and the heap (I'm assuming that was the difference)?

I apologize if this seems like an obvious thing, when learning I have a tendency to just build something small and skip around learning what I need to get it done so I can sometimes miss things.

Jehan (orginal) [2015-07-07T20:14:25+02:00] view original

If a type is a ref, you have a reference to a heap value, and the reference defaults to nil; you'll have to initialize it with new() or an object constructor (for object types). Otherwise, it's on the stack or in static memory and has a default value. For sets of ordinals, this is the empty set (because they're represented as bit vectors, and the default value is to have all bits cleared).

The difference between a ref T and a T variable is largely the same as the one between T* and T in C/C++, except that a ref T is also garbage-collected.

By the way, I don't think that enumSet.incl(for i in MyEnum: i) works; even if, it would be an extremely inefficient way (O(n)) to build the set.

mreiland (orginal) [2015-07-07T21:12:29+02:00] view original

It sounds like this is the dichotomy of reference vs value type you find in the .Net world.

Under the tuple entry in that table I see the following:

(default(A), default(B), ...) (analogous for objects)

Does this mean object fields get initialized with default(T) for any value which isn't initialized explicitly? Or does it mean they always get initialized with default(T) and then the constructor re-assigns the values?

I'm guessing the {.noinit.} pragma is mostly useful for complex types?

I'm curious what happens if you create an object with 3 fields but only initialize 2 of them.

var obj:MyObj = MyObj(a:1,b:2)

Jehan said: By the way, I don't think that enumSet.incl(for i in MyEnum: i) works; even if, it would be an extremely inefficient way (O(n)) to build the set.

I'm glad to hear that, my initial confusion was not understanding why you would need a looping construct in a looping construct, I thought perhaps my mental model of what was happening was off (it probably still is to be fair).

Jehan (orginal) [2015-07-07T21:33:06+02:00] view original

mreiland: Does this mean object fields get initialized with default(T) for any value which isn't initialized explicitly? Or does it mean they always get initialized with default(T) and then the constructor re-assigns the values?

For the most part, there is no observable difference between the two, but basically, the latter (though the compiler can optimize that away if it can prove that the default value isn't used).

The {.noinit.} pragma is for close-to-the-metal code where you worry about every single clock cycle and know that initialization with default values is not needed. Do not use it for ref fields (because that will break the garbage collector). It is an unsafe feature and should be treated as such. Most of the time, such microoptimizations aren't necessary.

If you create an object with three fields and initialize two of them, the third will have its default value.

mreiland (orginal) [2015-07-07T21:37:22+02:00] view original

Ok, that gives me a bit more info about how the construction of type objects works.

Thanks for the help Jehan, I really appreciate it.

jibal (orginal) [2015-07-07T22:20:32+02:00] view original

mreiland: "I know I can do it manually"

All coding is done "manually" unless you're generating it programmatically. You wrote "I just can't seem to find it" ... I don't know what that means if it's not what it says. You wanted "quickest way to take set difference of two sequences" -- that's what I gave you. It's certainly quicker than Varriount's doubly nested loop or Jehan's filterIt one-liner. You spent a lot of time puzzling over Varriount's for i in MyEnum: enumSet.incl(for i in MyEnum: i) ... which is just a mistaken version of my first loop. What you seem to have wanted is seq1.toSet() - seq2.toSet(), but that would be slower than the two loops I gave (and the first loop is the implementation of seq.toSet()). If there were a "set difference of two seqs" library function, it would be best written using the code I gave. So don't be too quick to dismiss what I wrote if learning is your goal.

Jehan: "If you want to do it with seqs:"

And if you don't care about performance, since that's O(N*N).

mreiland: I'm seeing a preference for let vs var, I'm guessing this allows the compiler to optimize similar to const in C++?

Possibly, but it's just good practice.

mreiland: in varriount's last example:

That was almost certainly a typo. He meant

for i in MyEnum:
  enumSet.incl(i)

mreiland: It sounds like this is the dichotomy of reference vs value type you find in the .Net world.

They use it, but the terminology predates and is broader than .NET (or rather, C# and VB ... .NET supports many other languages that don't necessarily use these terms).

mreiland (orginal) [2015-07-07T23:14:16+02:00] view original

As someone looking to pick up the language I could always write FORTRAN in nim, part of me asking questions is to find out if there's a more 'nim-ish' way to do things and I feel like I made that point clear.

As for your response about reference vs value type, I lose that no matter which example I choose. The .Net example is just particularly relevant since this dichotomy becomes an optimization in GC'd languages.

But, I honestly don't like the argumentative way you've chosen to express yourself Jibal so I'm going to bow out of this conversation. I welcome your input, but I have no pride in this, I'm here to learn, not to argue.

jibal (orginal) [2015-07-08T00:28:13+02:00] view original

Rude projection. I'll try to remember never to offer you help again.

Jehan (orginal) [2015-07-08T01:21:31+02:00] view original

jibal: And if you don't care about performance, since that's O(N*N).

Depends on how big the actual sets are. If they are small relative to the number of alternatives in the underlying ordinal type, then it can be faster this way and the representation can also be more compact. (Sorting them and using binary search may or may not pay off.)

mreiland (orginal) [2015-07-08T01:40:27+02:00] view original

The #1 rule is to measure measure measure :)

But, it being an enum it tends to be understood the set is relatively small (in this case < 10 for the full set). nothing I've posted in here is anywhere near being resource intensive.

jibal (orginal) [2015-07-08T02:03:13+02:00] view original

@Jehan

I agree that, with the small N involved, performance is not much of an issue ... but the overhead of the two loops is still lower than the filterIt line.

Depends on how big the actual sets are. If they are small relative to the number of alternatives in the underlying ordinal type, then it can be faster this way

No, the incl/excl loops I gave are over the seq elements, not the enum values ... N is the same for both algorithms. (Actually it's O(N+M) vs. O(N*M))

Sorting them and using binary search may or may not pay off

If you have sorted lists, then find non-dups with a linear merge-like algorithm, not binary search.

the representation can also be more compact

I suppose, if you have an extremely sparse non-ordinal enum.

mreiland (orginal) [2015-07-08T03:37:01+02:00] view original

I believe Jehan was referring to runtime performance, not algorithmic complexity.

Many times a flat search through an array will be quicker than looking through a binary tree (or even a binary search over that same array). Things like cache friendliness for the CPU and branch prediction comes into account, as does things like being able to fit it into the L1/L2 cache of your CPU.

As N grows large the complexity of the algorithm will start to dominate, but if you're really trying to eke out that much performance it's better to measure and make decisions than to blindly trust algorithmic complexity.

Jehan (orginal) [2015-07-08T03:43:17+02:00] view original

jibal: No, the incl/excl loops I gave are over the seq elements, not the enum values ... N is the same for both algorithms. (Actually it's O(N+M) vs. O(N*M))

You forget that the set needs to be initialized, which is an O(EnumType.high) operation (plus, if you want to extract the result into a seq type, that also has O(EnumType.high) complexity).

jibal: If you have sorted lists, then find non-dups with a linear merge-like algorithm, not binary search.

For set union, yes, not necessarily for set intersection or difference. O(m+n) vs. O(min(m, n)*log(max(m, n)) and O(m*log(n))). Often depends on cache locality and branch prediction which one wins out, in addition to the value of m and n and the constant overhead.

mreiland: I believe Jehan was referring to runtime performance, not algorithmic complexity.

Both, actually. I was being terse since I didn't think the topic was so important as to merit a dissertation. :)

mreiland (orginal) [2015-07-08T04:47:34+02:00] view original

haha, touche :)

Mirror of forum.nim-lang.org

1415 :: quickest way to take set difference of two sequences?