nimforum mirror - Changes in sequtils

mora (orginal) [2015-06-18T00:21:55+02:00] view original

Hi,

I have some proposals for changing the sequtils library. Please write things like "yes to A1, A2, but no to A3" or "go to hell". If you want to propose something different, please change the letter of the items (from A1, A2, to e.g. B1, B2, etc.) in our list, so the future votes' target will be clear.

Please first let's discuss what would be ideal, and then talk about what deprecated functions should be added. I'm not using the syntax sugar => for simplicity in this thread.

Thanks, Peter

A1: The parameter order of each function should be: any list, any function, any starting element, any extra parameters. It is important that the list is the first parameter because of chaining:

x.map(f).filter(g).repeat(5)

A2: Any functions should accept "openArray" as list parameter (except inplace map, because that needs "var seq").

A3: The inplace "map" should be renamed to "apply". Examples:

# no change in regular map:
assert map(@[1,2,3], proc(x: int): int = 10*x) == @[10,20,30] # as before

# inplace map:
var x = @[4,5,6]
x.apply(proc(x: var int): int = x = 10*x)
assert x == @[40,50,60]

A4: Let's introduce "nest" and "nestList", they re-recall a function for the given times (following Mathematica's notation):

assert nest(proc(x: int):int = 10*x, 1, 4) == 10000
assert nestList(proc(x: int):int = 10*x, 1, 4) == @[1, 10, 100, 1000, 10000] # careful, it has 5 elements

A5 [feedback: fold/foldl and foldr]: "fold" should do the reduce operation from the left, "foldRight" is from the right. They should throw an exception if the list is empty. Example:

assert fold(@[1,2,3,4,5], proc(x, y: int): int = x+y) == 15

A6: "fold" and "foldRight" should have a version with an extra starting parameter. This versions should not throw an exception even if the list is empty (following the Clojure's "reduce" function):

assert fold(@[1,2,3,4,5], proc(x, y: int): int = x+y, 1000) == 1015
assert fold(@[], proc(x, y: int): int = x+y, 1000) == 1000
assert fold(@[1,2,3,4,5], proc(x: string, y: int): string = x & $y, "") == "12345"
assert foldRight(@[1,2,3,4,5], proc(x: string, y: int): string = x & $y, "") == "54321"

A7: "foldList" is giving back the list which "fold" saw during its task. It extends both versions of "fold" (fold(list, f) AND fold(list, f, startingPoint)). For example the cumulative sum is:

assert foldList(@[1,2,3,4,5], proc(x, y: int): int = x+y) == @[1,3,6,10,15] # this throws exception if the list is empty
assert foldList(@[1,2,3,4,5], proc(x, y: int): int = x+y, 0) == @[1,3,6,10,15] # this doesn't throw exception if the list is empty

A8: Let's introduce "cycle", which does what "repeat" does now (following Clojure's notation):

assert cycle(@[1,2,3], 2) == @[1,2,3,1,2,3]

A9: "repeat" should repeat one element (Clojure's notation), for example:

assert repeat(10, 5) == @[10, 10, 10, 10, 10]
assert repeat(@[1,2,3], 2) == @[@[1,2,3], @[1,2,3]]

A10: There should be an iterator version of all the procedures (not the templates), which returns a list. These are: "map", "filter", "nestList", "foldList", "cycle" and "repeat". All of these should be lazy, for example:

var shift = 10
for x in map(@[1,2,3], proc(t: int): int = t+shift):
  echo x
  shift = 100
# 11
# 102
# 103

A11 [feedback: add ``zip`` too]: "map", "apply", "filter" functions and their iterated versions should be in system.nim, everything else in sequtils.nim

A12 [feedback: use concepts]: The templates should accept "openArray" type as list parameter. This is a strict: currently "mapIt" accepts any type which implements "items".

template mapIt*[T,S](seq1: openArray[T], op: expr): seq[S] =

A13: Let's have "mapIt", "applyIt", "filterIt", "nestIt", "nestListIt" template versions. These should have exactly the same call parameters as the orginial procedures', except the function parameter is an expression of "it". "mapIt" should not ask for the output type parameter.

assert nestIt(it & "a", "", 5) == "aaaaa"

A14: "applyIt" is the template version of "apply", so it needs to modify "it" as in "apply" (consequence of following the previous rule):


applyIt(x, it = it*10) # should work
applyIt(x, it*10) # error message: value of type 'int' has to be discarded

A15: "foldAB" and "foldListAB" template versions of both "fold" functions and both "foldList" functions. The AB refers to the injected variables "a" and "b". For example:

assert foldAB(@[1,2,3,4,5], a+b) == 15

A16: "mapKV", "filterKV", "nestKV" (K stands for 'key', V stands for 'value') can be called on lists of 2 dimensional tuples, where the injected variables are "k" and "v". For example:

assert mapKV(@[(1,2), (3,4)], (v,k)) == @[(2,1), (4,3)] # replacing key with value
assert mapKV(@[(1,2), (3,4)], k) == @[1, 3]
assert filterKV(@[(1,"apple"), (2,"banana")], k == 1) == @[(1,"apple")]
assert nestKV((v,k+v), (1,1), 10).mapKV(k) == @[1,1,2,3,5,8,13,21,34,55] # first 10 Fibonacci numbers

A17 [feedback: create an iterator version of ``zip`` if possible]: "zip" is a special template+macro combo, it takes at least two lists as parameters, and zips them. The macro part is needed because it creates the type at compile time from the types of the input lists. For example:

assert zip(@[1,2,3], @["one", "two"]) == @[(1, "one"), (2, "two")]
assert zip(@[1,2,3], @["one", "two"], @[1.0, 2.0, 3.0]) == @[(1, "one", 1.0), (2, "two", 2.0)]

A18: The remaining functions: "toSeq", "concat", "contains", "deduplicate", "distribute", "delete", "insert" should be the same; "keepIf" shall have template versions "keepItIf" and "keepKVIf"; "nestListIt" is better than "newSeqWith" (because you can use the previous item in the expression). We should have a "reverse" function, I didn't find it.

A19: Future plans: we should have "groupBy", "partition" (same way as in Clojure).

Araq (orginal) [2015-06-18T10:22:55+02:00] view original

I like everything except A12, I think we need to finally have an 'iterable' concept instead.

Having said that, I don't like this whole approach that sequtils takes: Just look at the number of trivial functions/templates you need to create and test. (Though I guess you can use macros and templates to create them.)

Instead of x.map(f).filter(g).repeat(5) we should consider something like this: linq(x, map(it+90), filter(g), repeat(5)) where linq is a macro that gives us a fullblown DSL which can special case it variants. linq also gets enough contexts so that it can optimize intermediate seqs away and merge filter expressions and eliminate lambdas...

mora (orginal) [2015-06-18T11:55:56+02:00] view original

I don't have a strong opinion about A12, I'm happy if mapIt eats anything. Just let's agree on something and stick to it. The "iterable" concept would be great, I'll read more about concepts. I can imagine poor man's solutions like allocate the right size in advance in mapIt if compiles(len(x)) is true.

For the code repeating: I'm using a mapP template to create a map??? type templates in this file https://github.com/petermora/nimLazy/blob/master/lazy.nim , so the definitions of mapIt and mapKV are just one lines.

The linQ idea is available in many languages (for example -> and ->> macros in Clojure), but the smart macro logic which eliminates the inner lists is amazing! However, finding out whether it is needed looks troubling for me, this whole expression can be within a mapIt function. I think it would be safer to write the extra "It" at the end of the function names.

com (orginal) [2015-06-22T21:18:47+02:00] view original

Yes to all except the following.

A5: No. I actually like the name being more informative and would prefer foldLeft or foldl vs fold.

A11: Yes, but I would suggest also including zip in system.nim.

A12: Haven't thought enough about this. Probably defer to those with better knowledge of the language on this one.

A few other suggestions:

zip that can produces an iterator similar to izip in Pythons itertools

zip that can take iterators as parameters.

An unzip procedure similar to Pythons zip(*listOfTuples) or zip(*listOfLists)

I also agree with Araq that an iterable concept would be good.

mora (orginal) [2015-06-29T08:47:54+02:00] view original

@com, thank you for the feedback. Let me reply to your points in the same order.

A5: I'm ok with foldLeft or foldl.

A11: In order to implement zip the library macros is required. When I started to use the macros library, it increased my compilation time a little bit. I'm afraid to increase all the compilation times, not just mine. I don't have enough view on this topic.

A12: We are on the same page.

There should be an iterator version of zip, I don't know how I missed that.

You suggested that the zip shall take iterators as parameters. My plan is that once we all agree on the names and functionality of these functions, then we should have an iterator version of them (which eats iterators and produces iterators, like in my library nimLazy).

I love the unzip approach.

mora (orginal) [2015-06-29T08:54:18+02:00] view original

Update:

Probably the parameter order in the lambda function of foldRight is wrong in A6, it should be

assert foldRight(@[1,2,3,4,5], proc(y: int, x: string): string = x & $y, "") == "54321"

We haven't talked about parallel versions of these functions. I have something like pMap in my mind, which would produce a list using more than one thread.

I am really busy in these days. Within a week I'll try to create a table showing all these functions/templates with example codes, then I'll create a codebase for testing.

Thanks, Peter

ozra (orginal) [2015-06-30T20:44:58+02:00] view original

A1. Great

A3. Great renaming

A5. Prefer fold/foldr, or foldl/foldr, or even fold and foldl aliased - but the shorter names in any event.

A6. Languages (stdlibs...) making a distinction in names usually have fold for initval, reduce for no initval. I however agree on the optional init and changing demand behaviour.

A8. Great, cycle is a much better name.

A13. It would be awesome if the it and explicit proc signature versions could be auomatically picked, don't know if that's possible? (if expr: do the it-style, if proc: do regular style). This is the way you'd do it in languages supporting it natively. It'd be real neat if possible.

A15. Same here as A13, if auto is possible, much greater. Else, foldAb, foldxy, or something not using two capitals

A16. Same regarding naming: mapKv, mapkv or even map2 (for "two vars"/pair, bit ugly though)

A18. Not really important, but, deduplicate is often named unique or uniq, otherwise, dedupe? (yeah, I like shorter names in functional, they get mind share)

All others: Yep!

Meta question: Why is this not in an issue on github instead?

jibal (orginal) [2015-06-30T23:44:00+02:00] view original

usually have fold for initval, reduce for no initval.

According to https://en.wikipedia.org/wiki/Fold_(higher-order_function)

there are considerably more languages that overload reduce than that use fold and reduce. Overloading fold would put Nim in the Haskell camp. I favor it because it's the technical name of the operation. And I strongly dislike using two names because there's no principled distinction ... it's just one more arbitrary thing to try to remember.

ozra (orginal) [2015-06-30T23:52:50+02:00] view original

For the record: Strongly agree on using one name only - if that wasn't obvious from my previous perhaps badly worded comment (regarding A6)

dom96 (orginal) [2015-07-01T00:26:13+02:00] view original

I love it all, thank you for taking the time to write this and ask for our opinions!

I'm not sure about why A14 needs to assign it, but I didn't bother looking at what apply does exactly so meh.

I agree with @Araq about A12. In regards to creating a linq macro, we can do that too, let's let people have their Functional Programming fun too though.

I've wanted to see more FP features in Nim for a while now so thank you again for moving us forward. Clojure is a good model to take inspiration from for this, please take a look at Haskell too though (I can see people suggesting things be done the Haskell way already). I unfortunately don't have time to argue too much and I like what I am seeing anyway.

jibal (orginal) [2015-07-18T00:52:01+02:00] view original

For me that's just laziness, it has nothing to do with "infinity".

Actually there is a close connection because laziness is how infinite abstractions are implemented on finite devices. e.g., in Perl 6 you can say my @nums = 23 .. * or my @fib = 1, 1, -> $a, $b { $a + $b } ... *; You'll never be able to extract an infinite number of elements from those arrays, but you can extract any subset that time and space allow. and you can zip them together, multiply them, and do all sorts of other functiony things to them before ever extracting a value. Essentially you're doing symbolic algebra on infinite sequences by composing functions.

Note that the fact that they are infinite sequences has consequences consistent with infinity. For instance, you can't do a binary search on an infinite sequence, and you can't ask for the entry that's halfway through the sequence of objects produced by the above-mentioned file parser, even though the underlying file is finite ... you would have to consume entries until encountering an EOF exception, thereby converting the abstractly infinite sequence into an actually finite sequence.

andrea (orginal) [2015-08-18T14:40:37+02:00] view original

@mora Any update on this? Did you try to put these functions into sequtils?

mora (orginal) [2015-08-19T00:41:56+02:00] view original

Not much update. I'm learning the type matching in the compiler, especially how the generics and auto-s are working. The motivation is simple: if we had myList.map(x => x+10), then we wouldn't need mapIt, which would make all the things here much easier. It is very hard to implement, according to Araq. Well, after debugging the compiler for weeks, I can't say that he is wrong. Now I'm searching for the simplest way to extend the compiler, not to increase its complexity. Once this is done, the rest of the coding work listed here seems trivial. I assume that there will be some discussions about what and how to change, what to break, etc.

Peter

andrea (orginal) [2015-08-19T09:17:38+02:00] view original

I think it would be a minor inconvenience to write myList.map(x: int => x+10) at least for now. The rest of the changes you propose seem more substantial than being able to remove that single type annotation

mora (orginal) [2015-08-23T21:46:53+02:00] view original

Update: I have a pull request https://github.com/nim-lang/Nim/pull/3234 which is a big step to have myList.map(x => x+10). I hope that it will be accepted or can be corrected if needed.

Let's assume that x=>x+10 will work soon. How should we continue? Shall we have mySeq.mapIt(it+10) as well? I don't think that mapIt is needed anymore. mySeq.map(x=>x+10) is exactly the same number of characters and easier to understand to any newcomer. I guess then we should drop mapKV as well. There is a thread on github https://github.com/nim-lang/Nim/issues/2179 with other ideas. What would you like to have?

I continue to work on the changes in sequtils, it's very likely that I'll have a github repo with a library for testing and to get more feedback.

Thanks, Peter

Araq (orginal) [2015-08-23T22:05:56+02:00] view original

This PR is pure gold but I haven't looked at it in detail yet. :-)

We still need the it versions for efficiency unless you also want to implement a frontend inliner for trivial lambdas to optimize the overhead away. In fact, now that you know the compiler, it's a good next step. Keep up the great work!

mora (orginal) [2015-08-24T01:17:32+02:00] view original

I did some investigation whether map or mapIt is faster. I had a quick look of the generated code. The mapIt is very simple, there is no function call, the formula is inlined. In the map(mySeq, proc(x:int):int = x+10 call we have

//  for i in 0..data.len-1: result[i] = op(data[i])
result->data[i_92079] = op.ClEnv? op.ClPrc(data[i_92079], op.ClEnv):((TMP16)(op.ClPrc))(data[i_92079]);

I removed the extra check here (op.ClEnv?..) for a benchmark, it had no effect. But we have an extra function call for each item. That's an overhead. Also mapIt is inlined (just one call, I know).

I also did a quick benchmark (see below), and mapIt was faster: 2.9 sec vs. 4.2 sec. I guess (I haven't written compiler optimizer) that C can't inline our function since that is a parameter in map function. To test this, I created a myMap template below. It is accepting a function (instead of an expression), and the running time is the same as for mapIt. I guess the compiler can inline our function in this case because the function is known at compile time (for map it is not known, because the map can be called from multiple places). I'm just guessing.

Benchmark (after fixing that mapIt should prealloc the result instead of adding elements one by one):


nim c -d:release test.nim; time ./test

import sequtils

template myMap*(seq1, typ, op: expr): expr =
  var result {.gensym.}: seq[typ] = newSeq[typ](seq1.len)
  var i {.gensym.} = 0
  for it {.inject.} in items(seq1):
    result[i] = op(it)
    i += 1
  result

var x = toSeq(1..1000)
var s = 0
for i in 0..<1_000_000:
  s += map(x, proc(x: int): int = x + 10)[9]       # with this line enabled, it's 4.2 sec
  #s += mapIt(x, int, it + 10)[9]                  # with this line enabled, it's 2.9 sec
  #s += myMap(x, int, proc(x: int): int = x+10)[9] # with this line enabled, it's 2.9 sec
echo s  # print s, so the compiler is forced to do the calc

There is no final conclusion, still playing. Maybe everything should be a template which accepts (and calls) a function to speed it up.

gokr (orginal) [2015-08-24T10:03:40+02:00] view original

I just want to say awesome work. A very strong seq with a really good set of base functionality is a killer to have. One thing that really makes Smalltalk (my home lang) shine is its strong Collection hierarchy, so super thanks for working on this.

jlp765 (orginal) [2015-08-24T13:39:38+02:00] view original

Adding


import times

and putting it in a main proc


proc main() =
  var x = toSeq(1..1000)
  var s = 0
  var t = epochTime()
  for i in 0..<1_000_000:
    s += map(x, proc(x: int): int = x + 10)[9]       # with this line enabled, it's 4.2 sec
    #s += mapIt(x, int, it + 10)[9]                  # with this line enabled, it's 2.9 sec
    #s += myMap(x, int, proc(x: int): int = x+10)[9] # with this line enabled, it's 2.9 sec
  echo s, " :: ",epochTime()-t  # print s, so the compiler is forced to do the calc

main()

and compiling with

-d:release

I got on windows 7


3.085 sec
6.580 sec
2.199 sec

Mirror of forum.nim-lang.org

1333 :: Changes in sequtils