nimforum mirror - [RFC] Project Picasso - A multithreading runtime for Nim

mratsim (orginal) [2019-08-09T13:50:03+02:00] view original

"Good artists borrow, great artists steal." -- Pablo Picasso

I have put together a document that lay out the various blocks needed for an efficient multithreading runtime. You can have a (long) read here: https://github.com/nim-lang/RFCs/issues/160.

GordonBGood (orginal) [2019-08-13T13:34:03+02:00] view original

@mratsim:

We were in the midst of a discussion about multi-threading over on the Fastest Primes thread, but I see why you haven't replied to that discussion - you've been busy with this Picasso RFC project. I, too, have been busy investigating multi-threading ideas, but it is perhaps time to bring ideas together and perhaps this thread is a better place to do it. I won't be commenting on your Picasso RFC as, in my mind for everyday users and me, it is too complex, so I'll let those that perhaps have more sophisticated needs contribute to it and possibly implement it if it seems to fill their needs.

Our last exchange of ideas on that other thread culminated in the following, with me replying to your quoted comments as follows:

As to exploring different alternative multi-threading:
approaching the limits of OpenMP (nested parallelism especially)
raw Threads is a too low-level abstraction
Nim's FlowVars have contention issues and no load balancing
I rejected OpenMP early as it is too C'ish and therefore somewhat cludgy although I have used it before with C and see how its use can be expanded beyond the basics available "out-of-the-box".
At first I thought using the low-level Thread's would be the answer as in easily emulating GoLang's green threads and that channels would work like GoLang channels; however, it kept getting more complex due to the limitations of not having GoLang's multi-threaded GC and therefore all the Nim limitations on the use of Channel's, meaning that one has to stoop to the use of low level features as in use of Lock and Cond and essentially building a Nim version of Haskell's "MVar" to pass through the channels. This has caused me to explore @Araq's recommended threadpool.
Funny you should say that about FlowVar's as your link to the "pibench2" project doesn't use theadpool and therefore doesn't use FlowVar's. I agree that the linked project using the raw thread's isn't very efficient and should be re-written to use threadpool and spawn as per the following...

My provided spawn/FlowVar code that doesn't seem to have "contention issues and no load balancing" problem can be run on this Wandbox link if you care to follow that up.

Since I see that threadpool/spawn is a beautiful modern concept for multi-threading, I decided to follow it up by applying it to my use case; however, after some days of work I found that while it is still a beautiful concept, the enforced limitations being implemented within Nim's limited memory management constraints make it very difficult to use. I see that these have come about due to Nim's history as follows:

"In the beginning" it was decided that "value"/struct types on the stack rather than "reference" types on the heap were the desired implementation for default data structures in Nim; thus we have the string, seq, and others implemented this way as object's containing heap references when necessary to contain "big" data. It was considered that using value types is faster (which can be true) as copying doesn't have to go through memory management of a limited Garbage Collection (GC), which Nim's GC of the time would have certainly been limiting as to performance.

Due to that design decision, transfer of ownership often requires `deepCopy`'ing of the reference heap data in order to avoid nominal independently owned copies pointing to the same heap data. This can often cause performance problems, so compiler data flow analysis avoids this where possible, but it can't always be avoided. However, the solutions of being able to "just work" and there being ways of overloading/selecting shallowCopy'ing "when one knows what they are doing (remember the problems of unintentionally shared mutable data) work well enough for the everyday use in non-threaded applications.

It turned out that developing a multi-threading GC is too big a project for the limited available Nim DEV team, and that has made multi-threading difficult other than within some fairly severe limitations.

The problem is that every GC managed ref can be prematurely collected when passed to another thread hidden behind a pointer, and in order to prevent premature collection, one needs to use a provided `protect`/`dispose` mechanism to use GC_ref/GC_unref in a threadsafe way to increment/decrement the GC reference count. Since these must be matched, it becomes a form of manual memory management except that we can now use Nim's destructor/copy/move facilities to implement this using wrappers. Closures are included as GC managed since their captured bindings are stored in a ref to a heap allocation.

Since low-level threading is just a (quite thin) wrapper around platform provided capabilities and since those facilities are written with C/C++ in mind to expect UNSAFE pointers, these work well enough either through deepCopy'ing or through use of these wrappers hidden behind pointers. However, they aren't easy to use for the everyday programmer with all kinds of details on synchronizing, avoiding data races and blocks, etc. to consider.

`Channel`'s are built quite closely on the platform provided primitives, so they don't have too many limitations other than deepCopy`ing `ref T, string`s, and `seq's by default unless hidden, just as for Thread's.

For a complex application written by an everyday programmer, these aren't easy to use and it was realized that Nim needed to provide something better: thus, the modern ideas of `threadpool`, `spawn`, and `FlowVar`'s came about. These are beautiful concepts, but the limitation of this library means that it is only easy to use in practice in a limited range of applications due to the above Nim limitations. These are that it must deepCopy almost everything and can't handle any data type that contains GC'ed ref's or return closures. Due to the fairly extreme compiler magic, templates, and macros used to make it work at all, it seems to be a support nightmare and very difficult to impossible to make it work outside its limited use case where it works well. So if one requires better performance than deepCopy'ing, multiple nested levels of heap reference, and/or use within nested levels of genenerics, it seems to be very difficult to make it work.

I saw the beauty of the `threadpool`/`spawn`/`FlowVar` concept and thought that I would be able to make it work through the use of the discussed wrappers preventing premature garbage collection across threads and through the use of these ptr based data data types, but after days of work with being able to get it working in simple sample applications without generics while being unable to work around the problem of spawn not recognizing a normal non-closure proc with parameters defined within the stated limitations as a call site when the "call" `proc` and/or some of the parameters were generic and possibly nested generic, I gave up.

In my programming/life, I have learned some basic precepts to control whether something is worth pursuing as follows (I think I've seen similar lists with possibly more items posted elsewhere):

Simple is better than Complex; when something is getting too Complex, it's time to go back to looking for Simple.

When there are too many special cases to keep making something appear Simple, then it is getting Complex, probably wrong, and it's time to go back to rule one.

When getting "Something" done appears Complex, then the method to accomplish the said "Something" is likely wrong and it's time to look for a Simple way to accomplish "Something" so as to return to rule one.

It appears that Nim's threading model fails at several if not all of the above precepts; this is similar to @Araq just recognizing that Nim's GC based memory model is likely wrong in that it is taking more and more support time and still doesn't make managing memory Simple. In this case, I don't think the threadpool library is the wrong approach; it's the limitations that it is forced to assume by everything else including Nim's data and memory managment models that has made it difficult/impossible to use for more general use.

So in a few hours I took my wrapper implementations and re-wrote my required parts of `threadpool` to not consider GC'ed data structures in about 150 lines (including the wrapper implementations), and that "just worked" without compiler magic or generic code limitations. In fact, this implementation is almost twice as fast as the original even when it worked and that is in spite of using atomically reference counted wrappers as a general memory management destructor control system; the newruntime's ownership on `ref`'s model will be even faster and I think will work in this applications's use case as ownership doesn't need to be shared beyond the creating entities. I can post this code somewhere if anyone is interested.

As to "cycle stealing"/"load balancing", the `threadpool`/`spawn` model actually takes care of that as the underlying platform implementation of threads will have a scheduler that automatically "time-slices" across all the currently running threads from the threadpool; it is only necessary to provide tasks/work to these threads to be in big enough time units so as to be significantly larger than the overheads of doing the multi-threading. A threadpool has an advantage here over use of primitive Thread's as since the threads in the pool are already "spun-up", they avoid thread "spin-up"/some context switch overhead per spawn and thus have less overhead by perhaps a factor of fifty to a hundred times than by the use of primitive threads; they also somewhat automatically "load balance" as stated.

Too many programmers try to feed work to a multi-threading system in tiny micro-slices which may result in worse multi than single threading performance; a (horrible) example of that is the following, taken from Nim's experimental documentation:

# Compute PI in an inefficient way
import strutils, math, threadpool
{.experimental: "parallel".}

proc term(k: float): float = 4 * math.pow(-1, k) / (2*k + 1)

proc pi(n: int): float =
  var ch = newSeq[float](n+1)
  parallel:
    for k in 0..ch.high:
      ch[k] = spawn term(float(k))
  for k in 0..ch.high:
    result += ch[k]

echo formatFloat(pi(5000))

As said, this 343-year-old algorithm to find "PI" is very slowly converging so isn't very efficient (although great in use as a benchmark), but the assignment of the calculation of a single simple polynomial term per thread at about a hundred CPU cycles per term is ludicrous when there is an overhead of thousands or tens of thousands of cycles overhead per thread invocation!

That isn't improper "load balancing" and doesn't need "cycle stealing", it just needs a simple improvement to the algorithm to feed work in larger units by computing many terms per invocation!

Note: this is not a recommended solution, this Wandbox link is a much better solution within the limitations of the Leibniz polynomial series.

Araq (orginal) [2019-08-13T15:28:50+02:00] view original

It appears that Nim's threading model fails at several if not all of the above precepts; this is similar to @Araq just recognizing that Nim's GC based memory model is likely wrong in that it is taking more and more support time and still doesn't make managing memory Simple.

I take that as a compliment, thank you. :-)

I can post this code somewhere if anyone is interested.

Definitely!

GordonBGood (orginal) [2019-08-13T21:45:57+02:00] view original

@Araq: Good of you to indicate you are following this! Sorry to inflict my long posts on you, but they are also written for new users who don't know the history and background of Nim, and I bold the main points so those who are "in-the-know" such as yourself can skim.

It appears that Nim's threading model fails at several if not all of the above precepts; this is similar to @Araq just recognizing that Nim's GC based memory model is likely wrong in that it is taking more and more support time and still doesn't make managing memory Simple.
I take that as a compliment, thank you. :-)

It was intended as a compliment, but also a lament: @gradha was right those many years ago and we've lost years using GC and trying to adapt single threaded GC to cross threads; as an aside to this thread's subject, perhaps he was also right in his lament on the lack of a standard GUI, but I see that you are open to someone developing something, as in exploring the possibility of making the new wNim package cross platform. For this, perhaps once our community get more used to the newruntime (once it works), it will also make that development easier.

On the current subject of a cleaner threading experience, it isn't as much a problem as the `threadpool` library but for the `channels` library, the storeAux proc that is related to supporting the deepCopy'ing of send parameters is almost half the total file size using the metric of the number of code lines used! I suppose that it was developed before system.deepCopy became available, as it seems that if it were still necessary (hopefully not when newruntime is fully deployed) it could avoid the redundancy of seemingly doing much the same thing. But again, I digress, as current channels works "well enough" and it is easy enough to work around this complexity. I think we have already talked somewhere else about making Channel send parameters to be sink and now perhaps owned if they are ref to make this library much cleaner and more flexible as to being more usable with (a proposed new) threadpool/spawn.

I can post this code somewhere if anyone is interested.
Definitely!

I am looking at my code, and it is as beautiful as the whole concept of threadpool/spawn/FlowVar (which names it uses/borrows/steals as it attempts to provide the identical functionality); however, in order to make it more readily generally adaptable to the current non-newruntime environment, it includes its own implementation of reference counted pointers (30 lines), and a closure wrapper to prevent premature garbage collection of closures and their captured bindings (another 30 lines). That works well enough as I am always careful not to use cyclic data and performance at the level of invocation of threads isn't a problem since it is of negligible cost in CPU cycles compared to the cost of invoking even pooled threads, but it is a significant part of the overall Proof Of Concept (POC) file (currently 160 lines including performance testing code).

I see that in most cases, one will be able to use the newruntime's `owned ref` instead of the "RCRef" emulation and won't need to wrap closures since they will use `owned ref` internally, which would make the code even more beautiful and be a good selling point for the use of the newruntime. However, the newruntime doesn't currently work with threading on the stable release and only maybe (untested as of yet as to the latest commits) with the latest DEVEL build, so it can't really be done properly yet. I have considered emulating `owned ref` rather than "RCRef` on top of the current runtime as a POC so that it would be more easily converted when the newruntime works, but emulating closure captures isn't easy for general closures without some compiler magic.

So what I propose is this: I'll first post a bit of code using the current `threadpool` to establish some performance benchmark, post another bit of code that shows the problems using the current `threadpool` with generic `proc`'s using closures and bypassing `deepCopy` (this could be posted as an issue against threadpool but what's the point when we can avoid all that complexity with newruntime), then the current "RCRef"/"Closure" wrapper version showing the concept works both for performance and solves the problem across generic proc's, and finally, either a different emulation using `owned ref` or (if newruntime works with threading soon) using the actual new runtime to show we gain both performance and code elegance.

What do you think?

At least I'm not proposing "throwing the baby out along with the bathwater" as per @gradha and others ;-)

Araq (orginal) [2019-08-13T23:49:15+02:00] view original

Sure, sounds good. And yes, with --newruntime we shouldn't have threads or channels or deepcopy in system anymore, all this cruft can be removed and we can write libraries without compiler magic that do these things better.

GordonBGood (orginal) [2019-08-14T07:46:58+02:00] view original

@Araq: ???

with --newruntime we shouldn't have threads or channels or deepcopy in system anymore

What? threads and channels are their own libraries and aren't currently in the system library. I think you mean that they won't need to use system.deepCopy anymore?

We still need the threads library if one has some real need for very low level multi-threading and the channels library for convenient inter-thread communication, and in fact these are what makes my new implementation of the threadpool library work just as the current threadpool library depends on them. What is nice when these are updated to use newruntime is that `channels` will be perfectly compatible with `spawn` without the problems as now, once we remove much of the "cruft and compiler magic" from them.

However, there still is a need for `system.deepCopy` as other than wrapping in a "RCRef", `deepCopy` is the only way of getting a new freshly owned non-shared/non-dangling version of an `owned` and we need it's handling of cyclic data structures. I think that deepCopy must still be automatically used when the type of a passed sink parameter to the channel send or a spawn is a dangling ref or contains one such as when ownership is retained by not being "lastReadOf" in order for the thread to get an owned copy; if the programmer desires to avoid this behaviour such as to avoid a high execution overhead of a full deepCopy, then they need to wrap the parameter so that it doesn't appear as a dangling ref such as by hiding it behind a manually managed raw pointer or by using a "RCRef" wrapper with an override for deepCopy that bypasses the full deepCopy just as suggested in the experimental documentation for spawn.

As per the experimental documentation on spawn about the limitation that the `FlowVar` returned from a spawn not being able to contain a type containing a `ref` type (including a closure?), I see the reason for that being that it currently returns a FlowVar containing a closure that defers the nested production of the actual FlowVar containing the type perhaps in order to support the experimental parallel optimizations through "crud and compiler magic" and I think we need to find a better way to implement parallel than this if that is what causes this limitation. Good multi-threading code uses the ability to pass closures to and fro as a convenient way to pass owned data and also cleanly defer execution without "smoke and mirrors".

In fact, to fully implement B/D as to avoiding data races on destruction, we need to implement a `deepDestroy` as is part of the suggested extension at the end of that document, and that also must avoid data races on cyclic data through ref's, but that is a bit easier. This was what you called my "crazy talk" but it is in the spec and needed unless you come up with a different simpler way of handling cyclic data?

There may be a need for `shallowCopy` to handle the "special case handling" for `sting`'s and `seq`'s (with which I disagree as per my KISS precept 2). However, (due to this precept), I highly recommend you consider making these two date types and any others that persist in appearing as value types but containing `ref` pointers be converted to `ref object` reference types if possible without breaking everything. I think that this can possibly be done for newruntime for any code that uses them through their by keeping their API's the same and even for code that obtains raw pointers to their data as it should work in the same way. I hate special cases, and the libraries are full of them due to handling this!

Araq (orginal) [2019-08-14T08:00:26+02:00] view original

What? threads and channels are their own libraries and aren't currently in the system library. I think you mean that they won't need to use system.deepCopy anymore?

No, I mean they are currently "submodules" of system.nim but there is no reason for that, it should be a separate, explicit import.

GordonBGood (orginal) [2019-08-14T08:23:24+02:00] view original

@Araq:

(the theads and channels libraries) should be a separate, explicit import.

Okay, I see that.

(about deepCopy) the answer is "no" because C++ and Rust and C etc. # all lack it. ;-)

C/C++ lack deepCopy but then, in spite of their bulk, they aren't very complete languages are they ;-)

Rust has this thing called Clone and I was under the impression that it does a deep copy, as how else could one consider the copy to be a "clone". But we are alright without it as with Nim's overloaded "hooks" and operators/proc's/func's, we could implement a custom "clone" to do what we needed if it were necessary. In that case, it would be nice not to need shallowCopy either, but that would likely only be possible by eliminating the "special cases" by making standard data types that contain ref's be ref object's?

Don't we still need a deepDestroy to be sure we have handled possibly cyclic ref's correctly?

mratsim (orginal) [2019-08-14T11:28:17+02:00] view original

Couldn't render post #32013.

GordonBGood (orginal) [2019-08-14T19:49:09+02:00] view original

@mratsim: Thanks for the reply and your correction of my misunderstandings.

Also the current threadpool/spawn/^ does not work on the "Hello World!" of multithreading which for better or worse is fibonacci...

Ah, I understand. So the order of business is still likely correct: get threading working with "newruntime" as seems to actively being worked on by the DEVS, then get the current concept of channels and threadpool at least working with "newruntime" as I am discussing with @Araq here, then improve the "load balancing" of those implementations as per your RFC.

Sounds like a good plan to me!

GordonBGood (orginal) [2019-08-15T14:43:46+02:00] view original

@Araq, @mratsim: As promised, I'm posting some code that shows the current difficulty of bypassing the overly simplistic (and performance costly) default behaviour for the current channels and threadpool libraries of deepCopy'ing all GC'ed ref's so they won't be prematurely destroyed if they go out of scope in the thread where they were created.

I was wrong previously in thinking that my difficulties in making this work were due to generic processes and thus nested templates; in fact, I think the problems were due to all the "cruft and compiler magic" not considering recursive algorithms where there may be threads spawned from within threads ad nauseum, which is also likely the problem that @mratsim has had in running his benchmarks. Accordingly, this benchmark wraps absolutely everything that has to do with GC (just in case) and should be able to handle just about an reasonable level of recursion, although I'm not sure about what nesting levels of "toDispose" lists generated by the protect/dispose pairs will do to the stack - there isn't much I can do about those without computer magic anyway, and we certainly don't want to add more of that if it isn't necessary.

The code linked below implements a little benchmark to cycle through spawning 10,000 (trivial) tasks from the threadpool using a manual closure implemented customizable iterator through closures and includes a "polymorphic" converter function closure parameter. It is close to what I require to cleanly implement my version of the "Ultimate Sieve of Eratosthens in Nim" algorithm, which does require the ability to nest and recursively spawn threads. I've divided the code into modules by functionality (the file tabs across the top of the source code section) for ready reference and so you can see that the actual benchmark is fairly trivial; most of the code is there to make deepCopy unnecessary by preserving the GC'ed ref's in the current Nim provided ways. I've tried to make the code concise and elegant, but the need to do this is UGLY.

However, it is likely the easiest to implement "Plan B" if the "newruntime" doesn't work out rather than the huge project of implementing a multi-threaded GC - just make the extra support modules available as one or more libraries. This link on Wandbox is the runnable code. It is run in full release mode at about 400 milliseconds on aIntel Xeon Sandy Bridge CPU at 2.5 GHz for which we are given the use of three threads, two of which likely share a core (Hyper Threaded). Thus, there are about a 2.5 billion total cycles used across all available threads, which means that there are about 250 thousand cycles or about 100 microseconds per thread spawn including overheads. This sounds like a lot but actually isn't bad considering it takes something like ten to a hundred times as long to do this by "spinning up a new thread" for every task.

Now, if "newruntime" does work out and for algorithms such as mine where single ownership is adequate, much of this code would just "go away": owned ref's would replace "RCRef", no wrappers would be required for closures or for ref's because they would be owned, and the channels and threadpool libraries could be re-written to be much simpler without the "cruft and compiler magic", not depending on using global resources, and thus written to be completely recursive if necessary.

All that would be left would be the benchmark itself, and even that would be much simpler, more concise, and more elegant through not having to call into the extra wrappers. It should also be a little (perhaps twice) as fast through not having the GC fighting us in the background and the more direct forms of code.

I had started work on converting this to use my own emulation of owned ref's (since release builds won't run with threading and newruntime on at the same time), but don't think I'll pursue it as emulating the new closures is quite hard without compiler help and there is little point if we are soon to have newruntime run for threads, which looks to be imminent. I'll reserve the effort for when that happens, as I think it is reasonably clear from this work how much easier that could make working across threads!

GordonBGood (orginal) [2019-08-22T06:21:48+02:00] view original

@Araq: Regarding the `channels` and `threads` libraries use with `newruntime`...

No, I mean they are currently "submodules" of system.nim but there is no reason for that, it should be a separate, explicit import.

As I said before, I can see the sense of making these separate modules and not auto imported when "--threads:on" is used.

However, currently in the latest builds with "--newruntime", there seems to be no `channels` library at all, neither implicitly imported nor available for explicit import. I see that the old library is full of "cruft" to do with deep copying and isn't suitable, nor did it use system.deepCopy so as to automatically do whatever threadpool did in the case that deepCopy disappeared.

Does this mean that channels is being re-written to be an explicitly importable module without the "cruft"? I can see that it should be, but the lack of `channels` makes any code that depends on `channels` break so needs some immediate attention!

The threads module is still implicitly imported with "--threads:on" whether "--newruntime" is used or not, but then it isn't full of "cruft", being a fairly thin wrapper over the OS threading capabilities.

Mirror of forum.nim-lang.org

5083 :: [RFC] Project Picasso - A multithreading runtime for Nim