Hi,
I have some proposals for changing the sequtils library. Please write things like "yes to A1, A2, but no to A3" or "go to hell". If you want to propose something different, please change the letter of the items (from A1, A2, to e.g. B1, B2, etc.) in our list, so the future votes' target will be clear.
Please first let's discuss what would be ideal, and then talk about what deprecated functions should be added. I'm not using the syntax sugar => for simplicity in this thread.
Thanks, Peter
A1: The parameter order of each function should be: any list, any function, any starting element, any extra parameters. It is important that the list is the first parameter because of chaining:
x.map(f).filter(g).repeat(5)
A2: Any functions should accept "openArray" as list parameter (except inplace map, because that needs "var seq").
A3: The inplace "map" should be renamed to "apply". Examples:
# no change in regular map:
assert map(@[1,2,3], proc(x: int): int = 10*x) == @[10,20,30] # as before
# inplace map:
var x = @[4,5,6]
x.apply(proc(x: var int): int = x = 10*x)
assert x == @[40,50,60]
A4: Let's introduce "nest" and "nestList", they re-recall a function for the given times (following Mathematica's notation):
assert nest(proc(x: int):int = 10*x, 1, 4) == 10000
assert nestList(proc(x: int):int = 10*x, 1, 4) == @[1, 10, 100, 1000, 10000] # careful, it has 5 elements
A5 [feedback: fold/foldl and foldr]: "fold" should do the reduce operation from the left, "foldRight" is from the right. They should throw an exception if the list is empty. Example:
assert fold(@[1,2,3,4,5], proc(x, y: int): int = x+y) == 15
A6: "fold" and "foldRight" should have a version with an extra starting parameter. This versions should not throw an exception even if the list is empty (following the Clojure's "reduce" function):
assert fold(@[1,2,3,4,5], proc(x, y: int): int = x+y, 1000) == 1015
assert fold(@[], proc(x, y: int): int = x+y, 1000) == 1000
assert fold(@[1,2,3,4,5], proc(x: string, y: int): string = x & $y, "") == "12345"
assert foldRight(@[1,2,3,4,5], proc(x: string, y: int): string = x & $y, "") == "54321"
A7: "foldList" is giving back the list which "fold" saw during its task. It extends both versions of "fold" (fold(list, f) AND fold(list, f, startingPoint)). For example the cumulative sum is:
assert foldList(@[1,2,3,4,5], proc(x, y: int): int = x+y) == @[1,3,6,10,15] # this throws exception if the list is empty
assert foldList(@[1,2,3,4,5], proc(x, y: int): int = x+y, 0) == @[1,3,6,10,15] # this doesn't throw exception if the list is empty
A8: Let's introduce "cycle", which does what "repeat" does now (following Clojure's notation):
assert cycle(@[1,2,3], 2) == @[1,2,3,1,2,3]
A9: "repeat" should repeat one element (Clojure's notation), for example:
assert repeat(10, 5) == @[10, 10, 10, 10, 10]
assert repeat(@[1,2,3], 2) == @[@[1,2,3], @[1,2,3]]
A10: There should be an iterator version of all the procedures (not the templates), which returns a list. These are: "map", "filter", "nestList", "foldList", "cycle" and "repeat". All of these should be lazy, for example:
var shift = 10
for x in map(@[1,2,3], proc(t: int): int = t+shift):
echo x
shift = 100
# 11
# 102
# 103
A11 [feedback: add ``zip`` too]: "map", "apply", "filter" functions and their iterated versions should be in system.nim, everything else in sequtils.nim
A12 [feedback: use concepts]: The templates should accept "openArray" type as list parameter. This is a strict: currently "mapIt" accepts any type which implements "items".
template mapIt*[T,S](seq1: openArray[T], op: expr): seq[S] =
A13: Let's have "mapIt", "applyIt", "filterIt", "nestIt", "nestListIt" template versions. These should have exactly the same call parameters as the orginial procedures', except the function parameter is an expression of "it". "mapIt" should not ask for the output type parameter.
assert nestIt(it & "a", "", 5) == "aaaaa"
A14: "applyIt" is the template version of "apply", so it needs to modify "it" as in "apply" (consequence of following the previous rule):
applyIt(x, it = it*10) # should work
applyIt(x, it*10) # error message: value of type 'int' has to be discarded
A15: "foldAB" and "foldListAB" template versions of both "fold" functions and both "foldList" functions. The AB refers to the injected variables "a" and "b". For example:
assert foldAB(@[1,2,3,4,5], a+b) == 15
A16: "mapKV", "filterKV", "nestKV" (K stands for 'key', V stands for 'value') can be called on lists of 2 dimensional tuples, where the injected variables are "k" and "v". For example:
assert mapKV(@[(1,2), (3,4)], (v,k)) == @[(2,1), (4,3)] # replacing key with value
assert mapKV(@[(1,2), (3,4)], k) == @[1, 3]
assert filterKV(@[(1,"apple"), (2,"banana")], k == 1) == @[(1,"apple")]
assert nestKV((v,k+v), (1,1), 10).mapKV(k) == @[1,1,2,3,5,8,13,21,34,55] # first 10 Fibonacci numbers
A17 [feedback: create an iterator version of ``zip`` if possible]: "zip" is a special template+macro combo, it takes at least two lists as parameters, and zips them. The macro part is needed because it creates the type at compile time from the types of the input lists. For example:
assert zip(@[1,2,3], @["one", "two"]) == @[(1, "one"), (2, "two")]
assert zip(@[1,2,3], @["one", "two"], @[1.0, 2.0, 3.0]) == @[(1, "one", 1.0), (2, "two", 2.0)]
A18: The remaining functions: "toSeq", "concat", "contains", "deduplicate", "distribute", "delete", "insert" should be the same; "keepIf" shall have template versions "keepItIf" and "keepKVIf"; "nestListIt" is better than "newSeqWith" (because you can use the previous item in the expression). We should have a "reverse" function, I didn't find it.
A19: Future plans: we should have "groupBy", "partition" (same way as in Clojure).
I like everything except A12, I think we need to finally have an 'iterable' concept instead.
Having said that, I don't like this whole approach that sequtils takes: Just look at the number of trivial functions/templates you need to create and test. (Though I guess you can use macros and templates to create them.)
Instead of x.map(f).filter(g).repeat(5) we should consider something like this: linq(x, map(it+90), filter(g), repeat(5)) where linq is a macro that gives us a fullblown DSL which can special case it variants. linq also gets enough contexts so that it can optimize intermediate seqs away and merge filter expressions and eliminate lambdas...
I don't have a strong opinion about A12, I'm happy if mapIt eats anything. Just let's agree on something and stick to it. The "iterable" concept would be great, I'll read more about concepts. I can imagine poor man's solutions like allocate the right size in advance in mapIt if compiles(len(x)) is true.
For the code repeating: I'm using a mapP template to create a map??? type templates in this file https://github.com/petermora/nimLazy/blob/master/lazy.nim , so the definitions of mapIt and mapKV are just one lines.
The linQ idea is available in many languages (for example -> and ->> macros in Clojure), but the smart macro logic which eliminates the inner lists is amazing! However, finding out whether it is needed looks troubling for me, this whole expression can be within a mapIt function. I think it would be safer to write the extra "It" at the end of the function names.
Yes to all except the following.
A5: No. I actually like the name being more informative and would prefer foldLeft or foldl vs fold.
A11: Yes, but I would suggest also including zip in system.nim.
A12: Haven't thought enough about this. Probably defer to those with better knowledge of the language on this one.
A few other suggestions:
zip that can produces an iterator similar to izip in Pythons itertools
zip that can take iterators as parameters.
An unzip procedure similar to Pythons zip(*listOfTuples) or zip(*listOfLists)
I also agree with Araq that an iterable concept would be good.
@com, thank you for the feedback. Let me reply to your points in the same order.
A5: I'm ok with foldLeft or foldl.
A11: In order to implement zip the library macros is required. When I started to use the macros library, it increased my compilation time a little bit. I'm afraid to increase all the compilation times, not just mine. I don't have enough view on this topic.
A12: We are on the same page.
There should be an iterator version of zip, I don't know how I missed that.
You suggested that the zip shall take iterators as parameters. My plan is that once we all agree on the names and functionality of these functions, then we should have an iterator version of them (which eats iterators and produces iterators, like in my library nimLazy).
I love the unzip approach.
Update:
Probably the parameter order in the lambda function of foldRight is wrong in A6, it should be
assert foldRight(@[1,2,3,4,5], proc(y: int, x: string): string = x & $y, "") == "54321"
We haven't talked about parallel versions of these functions. I have something like pMap in my mind, which would produce a list using more than one thread.
I am really busy in these days. Within a week I'll try to create a table showing all these functions/templates with example codes, then I'll create a codebase for testing.
Thanks, Peter
Meta question: Why is this not in an issue on github instead?
usually have fold for initval, reduce for no initval.
According to https://en.wikipedia.org/wiki/Fold_(higher-order_function)
there are considerably more languages that overload reduce than that use fold and reduce. Overloading fold would put Nim in the Haskell camp. I favor it because it's the technical name of the operation. And I strongly dislike using two names because there's no principled distinction ... it's just one more arbitrary thing to try to remember.
I love it all, thank you for taking the time to write this and ask for our opinions!
I'm not sure about why A14 needs to assign it, but I didn't bother looking at what apply does exactly so meh.
I agree with @Araq about A12. In regards to creating a linq macro, we can do that too, let's let people have their Functional Programming fun too though.
I've wanted to see more FP features in Nim for a while now so thank you again for moving us forward. Clojure is a good model to take inspiration from for this, please take a look at Haskell too though (I can see people suggesting things be done the Haskell way already). I unfortunately don't have time to argue too much and I like what I am seeing anyway.
For me that's just laziness, it has nothing to do with "infinity".
Actually there is a close connection because laziness is how infinite abstractions are implemented on finite devices. e.g., in Perl 6 you can say my @nums = 23 .. * or my @fib = 1, 1, -> $a, $b { $a + $b } ... *; You'll never be able to extract an infinite number of elements from those arrays, but you can extract any subset that time and space allow. and you can zip them together, multiply them, and do all sorts of other functiony things to them before ever extracting a value. Essentially you're doing symbolic algebra on infinite sequences by composing functions.
Note that the fact that they are infinite sequences has consequences consistent with infinity. For instance, you can't do a binary search on an infinite sequence, and you can't ask for the entry that's halfway through the sequence of objects produced by the above-mentioned file parser, even though the underlying file is finite ... you would have to consume entries until encountering an EOF exception, thereby converting the abstractly infinite sequence into an actually finite sequence.
Not much update. I'm learning the type matching in the compiler, especially how the generics and auto-s are working. The motivation is simple: if we had myList.map(x => x+10), then we wouldn't need mapIt, which would make all the things here much easier. It is very hard to implement, according to Araq. Well, after debugging the compiler for weeks, I can't say that he is wrong. Now I'm searching for the simplest way to extend the compiler, not to increase its complexity. Once this is done, the rest of the coding work listed here seems trivial. I assume that there will be some discussions about what and how to change, what to break, etc.
Peter
Update: I have a pull request https://github.com/nim-lang/Nim/pull/3234 which is a big step to have myList.map(x => x+10). I hope that it will be accepted or can be corrected if needed.
Let's assume that x=>x+10 will work soon. How should we continue? Shall we have mySeq.mapIt(it+10) as well? I don't think that mapIt is needed anymore. mySeq.map(x=>x+10) is exactly the same number of characters and easier to understand to any newcomer. I guess then we should drop mapKV as well. There is a thread on github https://github.com/nim-lang/Nim/issues/2179 with other ideas. What would you like to have?
I continue to work on the changes in sequtils, it's very likely that I'll have a github repo with a library for testing and to get more feedback.
Thanks, Peter
This PR is pure gold but I haven't looked at it in detail yet. :-)
We still need the it versions for efficiency unless you also want to implement a frontend inliner for trivial lambdas to optimize the overhead away. In fact, now that you know the compiler, it's a good next step. Keep up the great work!
I did some investigation whether map or mapIt is faster. I had a quick look of the generated code. The mapIt is very simple, there is no function call, the formula is inlined. In the map(mySeq, proc(x:int):int = x+10 call we have
// for i in 0..data.len-1: result[i] = op(data[i])
result->data[i_92079] = op.ClEnv? op.ClPrc(data[i_92079], op.ClEnv):((TMP16)(op.ClPrc))(data[i_92079]);
I removed the extra check here (op.ClEnv?..) for a benchmark, it had no effect. But we have an extra function call for each item. That's an overhead. Also mapIt is inlined (just one call, I know).
I also did a quick benchmark (see below), and mapIt was faster: 2.9 sec vs. 4.2 sec. I guess (I haven't written compiler optimizer) that C can't inline our function since that is a parameter in map function. To test this, I created a myMap template below. It is accepting a function (instead of an expression), and the running time is the same as for mapIt. I guess the compiler can inline our function in this case because the function is known at compile time (for map it is not known, because the map can be called from multiple places). I'm just guessing.
Benchmark (after fixing that mapIt should prealloc the result instead of adding elements one by one):
nim c -d:release test.nim; time ./test
import sequtils
template myMap*(seq1, typ, op: expr): expr =
var result {.gensym.}: seq[typ] = newSeq[typ](seq1.len)
var i {.gensym.} = 0
for it {.inject.} in items(seq1):
result[i] = op(it)
i += 1
result
var x = toSeq(1..1000)
var s = 0
for i in 0..<1_000_000:
s += map(x, proc(x: int): int = x + 10)[9] # with this line enabled, it's 4.2 sec
#s += mapIt(x, int, it + 10)[9] # with this line enabled, it's 2.9 sec
#s += myMap(x, int, proc(x: int): int = x+10)[9] # with this line enabled, it's 2.9 sec
echo s # print s, so the compiler is forced to do the calc
There is no final conclusion, still playing. Maybe everything should be a template which accepts (and calls) a function to speed it up.
Adding
import times
and putting it in a main proc
proc main() =
var x = toSeq(1..1000)
var s = 0
var t = epochTime()
for i in 0..<1_000_000:
s += map(x, proc(x: int): int = x + 10)[9] # with this line enabled, it's 4.2 sec
#s += mapIt(x, int, it + 10)[9] # with this line enabled, it's 2.9 sec
#s += myMap(x, int, proc(x: int): int = x+10)[9] # with this line enabled, it's 2.9 sec
echo s, " :: ",epochTime()-t # print s, so the compiler is forced to do the calc
main()
and compiling with -d:release
I got on windows 7
3.085 sec
6.580 sec
2.199 sec