nimforum mirror - Comparisons of Nim to the Chapel computer programming language...

GordonBGood (orginal) [2020-06-12T15:51:29+02:00] view original

You might be interested in my observations about the Chapel computing language, which is a Cray (now owned by HP) off shoot that is in the Open Source domain. The reason for my interest is that it is a language about roughly the same age as Nim but specializes toward parallel computing which I am interesting in exploring and do explore including using Nim:

Management of the project...

Chapel seems to have as Benevolent Dictator For Life (BDFL) a supportive person named Brad Chamberlain in place of our @araq, and a group of contributors which I would judge to be smaller than our group.

There is probably not as much information available about Chapel than about Nim on-line, but StackOverflow does have some Questions/Answers, many answered by Brad Chamberlain himself or out of the other contributors.

Ease of installation and use...

Chapel is not so easy to install other than on Xnix platforms (including OSX) for which there are pre-compiled packages available but can be used on Windows using cygwin although one needs to compile the compiler and libraries from source.

The compiler is written in C/C++, not Chapel (at least yet), and can generate very fast code as all the usual compiler optimizations are available, but the compilation process of compiling source Chapel code itself (currently) is sloooow, taking about a minute to compile about 200 lines of code on my older Sandy Bridge class desktop with full optimizations enabled.

The compiler emits C code for further processing by a C compiler just as is one of the options Nim has, and the generated C code is fairly easy to read and understand as Nim's generated code is.

There aren't nearly as many CPU targets such as all the ARM variants although it does support ARM64, doesn't directly support a JavaScript backend (again it might be made to work through Emscripten but isn't directly supported) and all the various C/C++ compilers that Nim supports, although it does support a LLVM backend.

I would judge that the Chapel language has many fewer available libraries than Nim does, probably reflecting the smaller number of contributors (AFAIKT) and a smaller and more specialized user base.

Nature of the language...

Chapel syntax is similar to Python as is Nim's, but Chapel brought back compulsory semicolon statement terminators and curly bracket block delimiters in the manner of C/C++, it seems to aim to appease users of those languages who refuse to learn how to effectively use white space/off side rule languages.

Chapel does have Generics for all of functions, classes, and records, although the syntax to use them isn't quite as clean as that of Nim.

Chapel is very much an Object Oriented Programming (OOP) language and depends on that paradigm whereas Nim can kind of support that paradigm but it probably isn't the preferred way to use it.

Along that line, Chapel has First Class Functions (FCF's) and lambdas (anonymous functions) but neither of these are closures (capable of capturing bindings external to their function blocks and/or input parameters). This puts some severe limitations on things one would commonly do using closures as one is forced to emulate closures using classes, which isn't always possible as in the following discussion.

Currently, Chapel doesn't have reference fields for classes (which use heap storage) and records (which are instantiated as value storage, likely on the stack) and thus any captured values in a class or record are copy-on-capture and can't follow any external mutations to their state.

Thus, the only way to have a function/method follow external state is by calling it with a ref formal parameter, making it somewhat restrictive if one needs to implement a function using recursive variables such as for the simpler recursive implementation of a Y-combinator.

As for system-level features, Chapel doesn't really directly support a very low level of code although pointers and some low level functions are available in order to make the (mostly C/C++) Foreign Function Interface (FFI) work. These can be used to do such things as call into the operating system for cpymem/movmem and using a different bitedness for views into Array's, etc. by casting c_ptr's. That said, Chapel is actually using efficient implementations using pointer references internally, so they rarely are needed for actual user code other that for when squeezing every last CPU cycle out of particular types of code that must run fast.

Use for parallelism/multi-threading...

Chapel has a wealth of what they consider low level task oriented parallelism using sync/single variables (more-or-less equivalent to Nim's FlowVar's; Haskell's MVar's, etc.) and also many ways of doing high level "data parallelism" (which might be considered to be somewhat equivalent to Nim's Threadpool spawn).

The real reason for Chapel's existence is its support for multi "locale" parallelism over many computing cores over many locations linked by a specified communication protocol, which is well beyond Nim's goals. Accordingly, Chapel has ways of specifying Array domain maps to where the actual Array's or parts/slices of Array's are actually located across possibly many "locales". Arrays are the main mechanism supported for implementing "data parallelism". This use of multi-"locales" is beyond my direct interest, although I can see that it could be of interest if one had access to such a machine or group of machines, which most of us probably never will.

One seeming disadvantage of the powerful possibility of distributed computing as described above is that task/thread context switching/process starting are slower than for the direct use of pthread (which are part of the means of multi-threading used under-the-covers) and it takes something in the order of 10 to 12 milliseconds per "tasK" at minimum. As what I was trying to accomplish only took about one millisecond for the task work, the simple way of implementing multi-threading meant that it was taking ten times as long to execute as multi-threaded than not. The solution for me was to implement a simple thread pool using the built-in sync variables (or I could have used channels) which worked very well as the implementation of sync variables is very efficient, taking an average of about 2.5 microseconds to process in and out sync'ed queues.

Stability of the language...

Chapel is no more and almost certainly less stable than Nim even though it is up to version 1.22; it recently introduced many back breaking changes such as zero-based built-in indexes (tuple/strings) instead of one-based, different initialization means for some types of fields, etc.

For Hash tables, Chapel uses what they call Associative Array's, but those are currently horrendously slow, likely because it seems to be using full crypto hashing algorithms which are overkill, and also because the ability to do the processing is built to be shared across "locales" and every when the "parallel safe" mode is disabled, there still seems to remain some overhead. In order to really test this ability, I had to do a quick write of a Python type hashing algorithm, which worked well enough and wasn't hard to do.

Features that are quite well developed as compared to Nim...

Chapel has a few things that I like compared to the current state of Nim: it has no Garbage Collection but does have four modifiers for ref's as in owned, borrowed (something as our B/D implementation was/is planned to be), shared (which is currently reference counted as our --gc:arc is), and unmanaged (which means manual deallocation); however, these can be mixed and matched and converted between them in the same code. For multi-threading, having shared ref's available at any time makes things easy and yet reasonably safe, but allowing one to use owned/borrowed when sharing isn't necessary is even safer and as each can be used in the some code, one can use whichever is most appropriate at any time.

Chapel's sync variables as first class built-in's are a great feature.

So will I continue using Chapel?

I probably won't be using Chapel any further as I hate going back to semicolons and curly brackets and don't really like OOP programming, which although Chapel isn't an "everything is an object/class" language, has enough restrictions on the use of functions that often OOP is the simplest if not the only way to go.

What I think may be learned from Chapel...

I like the idea of sync variables (or FlowVar's) being at the top of the available features and the idea of building the channel library on top of such a feature would seem to be a good one as it hides all the complexities of mutexes and conditions behind it.

If any of you have not already investigated Chapel or its ideas, you may find it interesting...

kcvinu (orginal) [2020-06-12T16:09:43+02:00] view original

// the generated C code is fairly easy to read and understand as Nim's generated code is.// Really ? The C code generated by Nim looks Chinese to me.

mratsim (orginal) [2020-06-12T16:35:20+02:00] view original

Very nice analysis.

A couple of points:

I like the idea of sync variables (or FlowVar's) being at the top of the available features and the idea of building the channel library on top of such a feature would seem to be a good one as it hides all the complexities of mutexes and conditions behind it.

Nim threadpool actually does this. And it should be the other way around, channels are the building blocks and you build Flowvars on top. And you also need a Channel type that is aware of async frameworks to bridge the async world and the sync/multithreaded world. In Weave, Flowvars are just a ptr Channel (https://github.com/mratsim/weave/blob/9f0c384f/weave/datatypes/flowvars.nim#L30-L50)

Along that line, Chapel has First Class Functions (FCF's) and lambdas (anonymous functions) but neither of these are closures (capable of capturing bindings external to their function blocks and/or input parameters). This puts some severe limitations on things one would commonly do using closures as one is forced to emulate closures using classes, which isn't always possible as in the following discussion.

Did you look into how they implement data parallelism? If they use something similar to OpenMP they have to pack the inner body of the parallel for into a closure that can be distributed to other threads.

The real reason for Chapel's existence is its support for multi "locale" parallelism over many computing cores over many locations linked by a specified communication protocol, which is well beyond Nim's goals. Accordingly, Chapel has ways of specifying Array domain maps to where the actual Array's or parts/slices of Array's are actually located across possibly many "locales". Arrays are the main mechanism supported for implementing "data parallelism". This use of multi-"locales" is beyond my direct interest, although I can see that it could be of interest if one had access to such a machine or group of machines, which most of us probably never will.

I expect this is similar to CoArray Fortran: https://en.wikipedia.org/wiki/Coarray_Fortran, cmacmackin wanted to implement something similar for Nim, https://github.com/mratsim/Arraymancer/issues/419

I'm also interested in expanding Weave for clusters but I don't have access to such machines so I can't work on that.

One seeming disadvantage of the powerful possibility of distributed computing as described above is that task/thread context switching/process starting are slower than for the direct use of pthread (which are part of the means of multi-threading used under-the-covers) and it takes something in the order of 10 to 12 milliseconds per "tasK" at minimum. As what I was trying to accomplish only took about one millisecond for the task work, the simple way of implementing multi-threading meant that it was taking ten times as long to execute as multi-threaded than not. The solution for me was to implement a simple thread pool using the built-in sync variables (or I could have used channels) which worked very well as the implementation of sync variables is very efficient, taking an average of about 2.5 microseconds to process in and out sync'ed queues.

There is no reason to have such a high overhead especially in high performance computing. OpenMP is way way way lower.

As n example it takes 0.33 ns to do 4 additions on a vectorized 3GHz CPU, i.e. you can do 12 additions per ns, so 12M addition per milliseconds, i.e. with your figures, you would need to work on matrices of size 3000x4000 for parallelization to start being useful, that seems way too large.

Intel TBB also target tasks in the 10000 cycles range which would translate to 3.33 µs.

Araq (orginal) [2020-06-12T23:40:29+02:00] view original

So far I found "ParaSail" to be most inspiring. It keeps Ada's (and Nim's) parameter passing modes, throws away pointers completely, adds some mild aliasing checks and ends up with a system where g and h can and are run in parallel within f(g(args), h(otherArgs)).

GordonBGood (orginal) [2020-06-13T03:04:30+02:00] view original

@kcvinu, as to readability of generated C, there is always going to be some name mangling (changing of names to be sure they are unique) and changing of forms of code into simple standard forms. In this regard, Chapel generated C code is maybe easier to read than Mom's, as Chapel is closer to C in syntax.

Or perhaps you find C code difficult to read from the outset, with which I agree if one doesn't have experience with it...

GordonBGood (orginal) [2020-06-13T04:19:38+02:00] view original

I've edit/added a short lament on the lack of any template/macro capability in Chapel in the OP above.

GordonBGood (orginal) [2020-06-13T05:30:22+02:00] view original

@mratsim:

Very nice analysis.

Thank you.

Nim threadpool actually does this. And it should be the other way around, channels are the building blocks and you build Flowvars on top. And you also need a Channel type that is aware of async frameworks to bridge the async world and the sync/multithreaded world. In Weave, Flowvars are just a ptr Channel.

Difference of opinion, Golang seems to do as is your opinion, Haskel builds everything on MVar's, and F# has both async and task parallelism with the Task type it gets from DotNet including the FlowVar concept, but with those Task's able to work together with Async. F# doesn't have built-in channels, so if required on would have to build them from primitives.

My opinion could be changed if the version of FlowVar's I propose truly can't work with a version of async, but I'd like to see the implementation as by F# researched first.

Did you look into how they implement data parallelism? If they use something similar to OpenMP they have to pack the inner body of the parallel for into a closure that can be distributed to other threads.

I spent about a week fighting task parallelism as looking into why it was so slow but didn't have an immediate need for data parallelism so didn't look much into it. However, I see that much of the compilation time is spent doing various data flow analysis such that the final ways of doing parallelism is dependent on the particular case. RE: data parallelism, the documentation states that it isn't guaranteed that each inner element will be performed by a separate thread so it seems that some grouping may be taking place and use of some internal thread pools may also be taking place, else the feature would be almost useless. Documentation for data parallelism shows parallel operations being carried out on individual Array element basis, so it isn't taking just the naive approach.

All parallelism seems to be implemented by using the pthread library directly and not through the use of OpenMP; however, they do manually internally generate some sort of closure or closure emulation (remember neither Chapel nor low level C has real closures) where the required parameters/captured values are passed in as parameters to a function which is the body of the background task. It seems to be reasonably intelligent in how it does that for data parallelism so that the overheads are greatly reduced.

My interest was more in low level task parallelism as I had an application in which I wanted to control execution order and the reasonably simple way I found to do it was in assigning tasks to a thread pool and controlling their order with sync variable queues.

There is no reason to have such a high overhead especially in high performance computing. OpenMP is way way way lower.

I agree, and I'm still not 100% sure where all the time is going, just that the loss of time went away when I implemented thread pools. Creating individual pthreads directly is in the order of about ten times faster. That said, in their fancy data flow analysis, Chapel sometimes surprises in that when I did a little toy low level test with extremely simple tasks, it was able to give the expected boost in multi-threaded performance (boosted by the multiplier of the number of real cores) for doing almost no work per task and I think that it somehow saw the work was negligible/trivial and automatically made thread pool groups out of the tasks.

My actual application was much more complex so that it couldn't see that an automatically generated thread pool was what was necessary; thus my fix of manually generating thread pools.

As an example it takes 0.33 ns to do 4 additions on a vectorized 3GHz CPU, i.e. you can do 12 additions per ns, so 12M addition per milliseconds, i.e. with your figures, you would need to work on matrices of size 3000x4000 for parallelization to start being useful, that seems way too large.

As per my discussion above and the documentation examples, Chapel is very much specialized for doing Array/Matrix parallel manipulations so it almost certainly does something in data parallelism so one would never experience these delays even for much smaller matrices than you describe.

Intel TBB also target tasks in the 10000 cycles range which would translate to 3.33 µs.

Chapel uses pthread mutexes and monitors to implement sync variables (as does Nim) and so much of the delay for task parallelism using a thread pool when a work queue/channel and result queue/channel are being used will be whatever execution delays those impose, which is likely to be the same as for Intel TBB or any language using these mechanisms. It is more than adequate when the actual task "Chunk" takes about one milliseconds (or much less down to say 100 microseconds), and I could increase the Chunk to up to about 50 milliseconds if I wanted to so the overhead time would be miniscule.

My main lament in this regard as to using Chapel is that it has neither a "channels" library (it seems it used to have one as I see references to it in on-line searches) nor a "threadpool" library, which would likely have made my coding work a little easier.

mratsim (orginal) [2020-06-15T13:55:59+02:00] view original

Difference of opinion, Golang seems to do as is your opinion, Haskel builds everything on MVar's, and F# has both async and task parallelism with the Task type it gets from DotNet including the FlowVar concept, but with those Task's able to work together with Async. F# doesn't have built-in channels, so if required on would have to build them from primitives.

My opinion could be changed if the version of FlowVar's I propose truly can't work with a version of async, but I'd like to see the implementation as by F# researched first.

The main reason why I think channels should be a primitive building block is that it can be implemented in hardware on message-passing-based hardware or cluster-on-chips:

Intel SCC had hardware channels https://en.wikipedia.org/wiki/Single-chip_Cloud_Computer

I assume Tile CPUs from Tilera/EzChip/Mellanox/Nvidia (successive acquisitions by a bigger fish) also have hardware channels http://web.mit.edu/6.173/www/currentsemester/handouts/L18-tilera-multicore.pdf

Similarly for Adapteva 1024 cores RISC network-on-chip CPU: https://www.parallella.org/docs/e5_1024core_soc.pdf

You can reuse Infiniband or MPI channels as primitives for distributed computing.

Adding async support on top of a channels only requires adding a way for the channel to return control the event loop. If channels use tryRecv and trySend it's easy to add a poll() or equivalent callback to the event loop.

In all languages (Go, Nim, ...) channels are built either on top of Lock+Condition variables or Atomics (reusing queue designs) but I expect we will reach the limits of memory coherency at the CPU level and that in the future NUMA will become increasingly important and we might see hardware channels being more popular (rather than restricted to stuff like Infiniband)

Araq (orginal) [2020-06-15T15:24:46+02:00] view original

In all languages (Go, Nim, ...) channels are built either on top of Lock+Condition variables or Atomics (reusing queue designs) but I expect we will reach the limits of memory coherency at the CPU level and that in the future NUMA will become increasingly important and we might see hardware channels being more popular (rather than restricted to stuff like Infiniband)

Well this has been claimed for over ten years now, it's one reason why Nim did bet on thread local heaps + message passing and I don't see it becoming the reality anytime soon. I can be wrong of course. However, it is insightful to look at what GPUs do. The classical view is that they are based on shared memory. But when you look at it from a different angle what happens is that inside a GPU a program fragment is passed over to the data. Maybe I'm nuts and is has nothing to do with GPUs, but the idea is a good one. Don't transport data, transport programs.

bpr (orginal) [2020-06-15T18:18:21+02:00] view original

Don't transport data, transport programs.

This is the motto of Hadoop, slightly rephrased. What does it mean practically for programmers?

I think @mratsim has a point (from experience at a bio startup) that we'll see increased need for NUMA aware APIs. I could be wrong as well, but it seems that it's one natural avenue for getting better performance post Moore's law.

Great stuff, thanks for the writeup on Chapel @GordonBGood! It's a fascinating language.

GordonBGood (orginal) [2020-06-17T08:06:45+02:00] view original

@Brad, Thanks for joining the forum and your hints as to Chapel threading performance:

Are you able to share the benchmark you were using with us (as well as information like back-end compiler and version, processor, etc)? In our experiments, Chapel has generally shown to be competitive with OpenMP, so it would be interesting for us to understand better what you were doing (prior to resorting to a homegrown thread pool) in order to make sure nothing's going horribly awry. I'd also be curious whether you were using CHPL_TASKS=qthreads or CHPL_TASKS=fifo. Thanks.
( = For example, these benchmarks :... demonstrate Chapel creating 12,000,000 tasks (500,000 x 24 cores) in 4.4 seconds, compared to 4.1 seconds for GNU OpenMP). (about 8.8/~0.2 microseconds per thread, machine unspecified - GBG)

Thanks for your prompting as to checking the CHPL_TASKS environment variable; due to this prompting I re-read the documentation section on Using Tasks to find that due to my using cygwin on Windows this defaults to "fifo" which is pthreads and and was somewhat confused with some of my benchmarks run on "tio.run" (an online IDE using Linux and therefore "qthreads") and some on my own machine using cygwin and therefore defaulting to "fifo".

Your results as posted above are in line with what I was seeing on "tio.run"/Linux in just a few microseconds overhead per task switch using the benchmark program as follows:


// testing simple threading...

use Time;

config const NUMLOOPS = 1000;
assert(NUMLOOPS >= 0, "NUMLOOPS must be zero or above!");
config const NUMTHRDS = here.maxTaskPar;
assert(NUMTHRDS > 0, "NUMTHRDS must be at least one!");
writeln("Number of loops:  ", NUMLOOPS, "; Number of threads:  ", NUMTHRDS);

var rsltsq:  [ 0 .. NUMTHRDS - 1 ] sync int;

proc dowrk(qi: int, v: int) {
  rsltsq[qi] = v; // thread work that doesn't do much of anything!
}

var timer: Timer; timer.start();
// pre load the thread pool work
for i in 0 .. NUMTHRDS - 1 {
  rsltsq[i].reset(); begin with (const in i) dowrk(i, i); }
var qi = 0; var sum = 0;
for i in NUMTHRDS .. NUMLOOPS + NUMTHRDS {
  sum += rsltsq[qi];
  if i <= NUMLOOPS then begin with (const in i) dowrk(qi, i);
  qi = if qi >= NUMTHRDS - 1 then 0 else qi + 1;
}
timer.stop();
write("The sum is ", sum, " in ");
writeln(timer.elapsed(TimeUnits.milliseconds), " milliseconds!");

When run on tio.run (qthreads), there are no real surprises as it runs a million loops on two threads (the maximum available) on a Sandy Bridge class processor at 2.9 GHZ in about 675 milliseconds and with only one thread in a little over a 1000 milliseconds; this isn't surprising as with such trivial work (almost none), most of the time expended will by in just running the sync'ing.

However when run on my cygwin system (Windows Intel Sandy Bridge class i3-2100 CPU with 2 cores/4 threads at 3.1 GHz, compiled with Chapel version 1.22 and gcc version 9.3.0) using default 4 threads, it runs much much slower than the above at about 3000 to 5000 milliseconds for only 10000 loops with full four threads, which is slow but perhaps can be attributed to pthreads being slower than qthreads (although the variable results are worrisome).

Furthermore, the shocker is when the number of threads is dropped to say only two, it can only run 1000 loops in about 2500 milliseconds (highly variable upwards from this), and with 1000 loops run on only one thread (argument of --NUMTHRDS=1), it takes 16,000 milliseconds (and sometimes more). Obviously, taking over ten milliseconds per task start prevented me from being able to effectively run tasks which take about one millisecond each as the overhead per task is over ten times higher than the actual work, and explains why I had to implement the thread pool.

This seems to indicate the you do indeed have a problem with your pthreads implementation as I suspected before (even before I understood the difference between "qthreads" and "fifo" and which are default where).

Why is "fifo" default with cygwin? Are there problems compiling with "qthreads" on cygwin?

While we are on the subject of Chapel problems I encountered (and acknowledging your quick fix of my reported issue related to initializing a shared nilable sync field for a generic record/class, thank you very much), the other distressing problem I encountered was in the slowness of Hash Table's/Associative Array's as seen in this submission to the Sieve of Eratosthenes on RosettaCode (the first version, fixed with a roll-your-own Hash table implementation in the second version). Perhaps you would like to investigate to see why that was so slow and not O(1) amortized access as expected (and as my roll-your-own Hash Table version has). The same test machine and environment was used but the same slowness and non O(1) performance seems to be common to the "tio.run"/Linux environment. Perhaps it is an error on my part?

mratsim (orginal) [2020-06-17T12:11:29+02:00] view original

(* = For example, these benchmarks:

https://chapel-lang.org/perf/chapcs/?startdate=2019/12/14&enddate=2020/06/15&graphs=emptytaskspawntimings500000xmaxtaskpar

demonstrate Chapel creating 12,000,000 tasks (500,000 x 24 cores) in 4.4 seconds, compared to 4.1 seconds for GNU OpenMP).

GNU OpenMP shouldn't be used as a comparison for runtime overhead.

For example on fibonacci(40) which is for all intent and purposes a recursive spawn of 2^40 empty tasks, I have the following results on my i9-9980XE 18 cores:

GNU OpenMP chokes for at least a hour

Clang/Intel OpenMP finishes in 2 seconds (they use the same kaapi backend)

Intel TBB finishes in 1 second or 0.6 second depending on if you use the class or closure API

LLVM-Cilk uses 0.4~0.5 seconds

Julia 1.3 uses 4 seconds on fib(31) and too many on fib(40) (https://github.com/JuliaLang/julia/issues/33762)

HPX uses 8 seconds on fib(22) and segfaults on more (https://github.com/STEllAR-GROUP/hpx/issues/4227)

For less established runtimes:

Weave, in pure Nim, uses 0.35 seconds with heap futures or 0.2s with alloca futures

Fibril is at 0.12s (from Yang and Mellor-Crummey, A Practical Solution to the Cactus Stack Problem, https://dl.acm.org/doi/pdf/10.1145/2935764.2935787, https://github.com/chaoran/fibril)

Staccato is at 0.18s (from https://github.com/rkuchumov/staccato)

12000000 is in the order of 2^23, using TBB as a base that would mean that Chapel and GNU OpenMP have 10 times the overhead while creating 2^17x (131072x) less tasks.

This overhead was a critic already in 2012 of both HPX and GCC OpenMP by Sandia National Laboratory: https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2012/1210594.pdf (who write the Kokkos runtime)

My own overhead benchmarks with various runtime are here: https://github.com/mratsim/weave/tree/9f0c384f/benchmarks/fibonacci

The depth first search benchmark is a similar overhead-bound benchmark: https://github.com/mratsim/weave/tree/9f0c384f/benchmarks/dfs

I've also categorized benchmarks depending on what kind of difficulties they impose on the multithreading runtime: https://github.com/mratsim/weave/tree/master/benchmarks#weave-parallel-benchmark-suite

Overhead & Extreme overhead (fibonacci, i.e. creating thousands of trillions of tasks in milliseconds)

Load balancing & Extreme load balancing (bouncing producer-consumer with cores alternating in producing and consuming tasks leading to stressing the task scheduler and also the memory allocator as free-ing memory allocated from another thread is costly and usually each thread have a thread-local memory arena, producers end up with depleted arenas and the memory allocator becomes mmap with overhead)

Nested parallelism

Conditional parallelism

Brad (orginal) [2020-06-17T20:27:50+02:00] view original

Hi Gordon / @GordonBGood —

Thanks for clarifying that your timings were taken using a Cygwin-based installation and CHPL_TASKS=fifo. Though I'd seen the note about Cygwin being an option in the original article, it hadn't occurred to me that it was the platform for your performance studies. Knowing that now, the main message I'd like to convey to anyone introduced to Chapel through this article is that running Chapel using fifo + cygwin is not representative of the performance Chapel is capable of delivering (where qthreads + native Unix-like platforms has been my team's primary performance focus).

In more detail:

CHPL_TASKS=`fifo` will almost certainly result in performance overheads relative to CHPL_TASKS=`qthreads`. The fifo option is intended primarily for portability and simplicity, essentially taking the approach of using a POSIX thread per Chapel task. It also has not received much attention from us in terms of performance tuning and optimization. In contrast, qthreads multiplexes tasks across user-level threads and has received a lot more attention in terms of performance analysis and tuning (both by the Qthreads team and by our team in terms of how Chapel maps down to it). By nature, it is also more complex / less portable. The following issue details some of the challenges we've encountered when trying to use Qthreads with Cygwin: https://github.com/Qthreads/qthreads/issues/49

That said, in our experience, Cygwin itself also incurs a lot of overhead relative to a native installation of Chapel, so it's hard to say how much of the poor performance you observed was due to fifo as opposed to Cygwin overheads. Like fifo, we've similarly thought of Cygwin as being a portability option rather than a performance-oriented one. And, since Windows 10 introduced Linux Bash Shell / Windows Subsystem for Linux, that's been the preferred mode for using Chapel on Windows (though admittedly, it also hasn't received performance-related attention from our team; given our focus on scalable and scientific computing, Unix-like systems have received the vast majority of our attention).

This exchange has made me realize that our Cygwin documentation doesn't really include performance disclaimers, nor a mention of WSL being the current preferred alternative for Windows users, so I've opened an issue on our repo to capture that task: https://github.com/chapel-lang/chapel/issues/15864

Every few years, someone expresses interest in a native Windows port of Chapel (i.e., one that doesn't even require using WSL), but to date, there hasn't been sufficient interest to justify our putting in the effort, arguably reflecting Cray's supercomputing-focused markets and user bases. As an open-source project, we'd be happy to help provide support for someone external interested in pursuing such a port.

Again, thanks again for clarifying your experimental setup for us.

With respect to the performance of our hash tables / associative domains and arrays, this is admittedly another area that has not received much performance focus in our work to date compared to dense rectangular arrays. But unlike fifo and cygwin, it is an area where we intend to invest more effort. For example, members of the team have recently been working on refactoring the existing code with an eye toward improving it (e.g., https://github.com/chapel-lang/chapel/pull/15739), and open-source developers have also been developing improved hash table implementations (e.g., https://github.com/chapel-lang/chapel/pull/14409).

Best wishes,

-Brad

GordonBGood (orginal) [2020-06-17T23:06:32+02:00] view original

@Brad, just a few pertinent points:

That said, in our experience, Cygwin itself also incurs a lot of overhead relative to a native installation of Chapel, so it's hard to say how much of the poor performance you observed was due to fifo as opposed to Cygwin overheads.

Cygwin may introduce some additional overheads, but Windows is what I have and I have barely enough memory to be able to run WSL on top so the Linux option is somewhat difficult (although likely possible). That said, other than even slower than usual compile times, the actual Chapel-produced executable files seem to run about the same speed on my cygwin setup as they do on "tio.run" other than this "fifo" issue.

Also, other than this issue (which I've worked around with my implementation of a thread pool), Chapel produced code, whether run on my cygwin system or on "tio.run"/Linux is producing as fast or faster executable code as compared to any compiler out there, maybe even including Nim. I am just putting the finishing touches on a Page-Segmented Sieve of Eratosthenes implementation in Chapel that should be at least as fast (and it's looking like it may be faster) than Kim Walisch's "primesieve" written in C++, although the reason it's faster isn't really the language (other than bad logic habits C/C++ programmers often fall into) but my algorithm, which simplifies the code.

Eventually I'll make a GitHub repo for this Chapel project. Do you have an "awesome Chapel" page somewhere to post a link to this eventual repo?

I'll also be updating my implementation of the same in Nim so that a comparison of the code and performance can be readily made.

Let's just chalk Chapel/cygwin's limitations as being up to the "fifo" implementation and leave it at that.

With respect to the performance of our hash tables / associative domains and arrays, this is admittedly another area that has not received much performance focus in our work to date compared to dense rectangular arrays. But unlike fifo and cygwin, it is an area where we intend to invest more effort. For example, members of the team have recently been working on refactoring the existing code with an eye toward improving it...

Looks like I should file an issue on the performance I'm finding for this one then, if your team is actively working on it.

I won't bother filing an issue on the fifo performance since it looks like you are unlikely to do anything about it anytime soon.

Best wishes

And mine to you, Gordon

Brad (orginal) [2020-06-18T01:40:43+02:00] view original

That said, other than even slower than usual compile times, the actual Chapel-produced executable files seem to run about the same speed on my cygwin setup as they do on "tio.run" other than this "fifo" issue.

That's a good point. I think most of the overhead associated with Cygwin comes about when doing system calls, so I probably shouldn't suggest it's inherently bad as I did. By nature, the fifo code is fairly system call heavy, so it makes sense that it would be a significant pain point for task-parallel Chapel code on Cygwin. I'm personally never sure what to expect from TIO performance due to it being a potentially oversubscribed (and as I understand it, somewhat virtuallized?) system, so I don't know how good a point of comparison it is to a native UNIX/Linux run, say, though I'd guess at its best it's not terrible (and it is running qthreads, so that's a plus).

Do you have an "awesome Chapel" page somewhere to post a link to this eventual repo?

We don't have an official one run by the project, though you're reminding me that a community member created one some time ago, and it looks like it's been updated as recently as January (which is more recent than I would've guessed): https://github.com/marcoscleison/awesome-chapel We tend to try and announce awesome stuff on our social media, so that'd be another possibility.

Thanks,

-Brad

Brad (orginal) [2020-06-18T03:25:34+02:00] view original

Hi @mratsim —

GNU OpenMP shouldn't be used as a comparison for runtime overhead.

I think GNU OpenMP can be a reasonable basis of comparison for some computational styles, if not others. That's not to say that it should be treated as the gold standard, but it does have the virtues of being a readily available and commonly used technology (which makes it a good candidate for our portable performance test suite).

I should clarify that the tasking performance numbers I pointed to above aren't of the deeply-recursive, fibonacci-like style that's amenable to cactus-stack-like implementations, but rather a simple, repeated creation and destruction of ~#cores tasks, like:

config const numTrials = 500000,
       numCores = here.numPUs();

for i in 1..numTrials do
  coforall t in 1..numCores do  // coforall creates a task per loop iteration
    /*no-op*/;

This is an important pattern for us because it's the most common style of task creation used in Chapel, particularly by our data-parallel features such as forall loops and whole-array operations.

Chapel's tasks differ fairly significantly from many traditional task-based programming models (e.g., Cilk) in that our tasks can synchronize with one other dynamically and access/share arbitrary data. So where many task-based runtimes can serialize tasks completely without breaking program behavior (e.g., the Cilk serial elision property) or know precisely what values a task will read and write at task creation/scheduling time, the same isn't generally true for Chapel programs (though a compiler could potentially identify some of these characteristics in specific Chapel tasks; that hasn't been a focus of our efforts, however).

For example, a favorite toy Chapel program creates a fixed number of tasks that compute a producer-consumer pattern on a bounded buffer of synchronization variables with full/empty semantics. In my experience, this is a challenging pattern to express in many task-based systems due to their restricted task semantics. In this sense, our tasks might be considered to be more like "threads", though we reserve that term for the system resources used to execute the tasks.

For this reason, Chapel's implementation of tasks doesn't really follow the cactus-tree style of approach. When the user says to create a task, as an imperative language that can't generally know what the task will do, we literally create that task. So, when writing your Fibonacci pattern in the simple/obvious way that requests an exponential number of tasks, we create the exponential number of tasks you requested, and performance tanks as you'd expect (e.g., I stopped timing at fib(32) which took ~80 seconds on my Macbook).

At times, we've discussed supporting a distinct tasking option in the runtime from fifo or qthreads that would support more of a traditional cooperative-task approach to scheduling, squashing, and stealing tasks. But frankly, (a) we haven't run into many (any?) practical code patterns from users that would benefit greatly from such techniques and (b) we haven't wanted to struggle with either trying to adapt those approaches to Chapel semantics or with having the compiler distinguish between tasks that can be cooperatively scheduled/serialized vs. those that can't. In saying this, my intention isn't to throw that line of work under the bus, just to state that it hasn't been a priority for the codes and users we've worked with, so it remains on our list of rainy day projects. If a grad student was interested in researching the application of cooperative task scheduling approaches to Chapel while preserving our semantics, I'd be all for it.

Anyway, circling back around, in sharing those Chapel vs. GNU OpenMP results, I didn't mean to give it any sort of special "4-minute mile"/"this is the thing to beat" status, I just wanted to defend our performance when it comes to creating and destroying tasks in contrast to Gordon's observations.

I'd also be curious whether you think there's any reason to doubt GNU OpenMP would do a reasonable job with the task creation pattern that we're actually measuring.

Thanks,

-Brad

mratsim (orginal) [2020-06-18T19:20:29+02:00] view original

For example, a favorite toy Chapel program creates a fixed number of tasks that compute a producer-consumer pattern on a bounded buffer of synchronization variables with full/empty semantics. In my experience, this is a challenging pattern to express in many task-based systems due to their restricted task semantics. In this sense, our tasks might be considered to be more like "threads", though we reserve that term for the system resources used to execute the tasks.

Is there reference code and/or benchmark code I can use? I would be interested to add that to Weave benchmark suite and see how the runtime cope with this scenario. This would also help me to evaluate GCC OpenMP weaknesses if any for that type of workload.

But frankly, (a) we haven't run into many (any?) practical code patterns from users that would benefit greatly from such techniques and (b) we haven't wanted to struggle with either trying to adapt those approaches to Chapel semantics or with having the compiler distinguish between tasks that can be cooperatively scheduled/serialized vs. those that can't. In saying this, my intention isn't to throw that line of work under the bus, just to state that it hasn't been a priority for the codes and users we've worked with, so it remains on our list of rainy day projects. If a grad student was interested in researching the application of cooperative task scheduling approaches to Chapel while preserving our semantics, I'd be all for it.

One things that come to mind are tree algorithms.

Monte-Carlo Tree Search, is the go-to algorithm for reinforcement learning and has been used to great effect since 2006 to build go engines. Robot competition winners with tasks like put many balls in a basket have also been using MCTS for a decade now.

While the media focused on the advanced of deep learning from Google Deepmind, the truth is that the deep learning engine is feeding the Monte Carlo Tree Search engine (though it can run without and still be strong, it will misplay a lot). Monte Carlo Tree Search is described here: https://en.wikipedia.org/wiki/Monte_Carlo_tree_search and is used to solve problems with imperfect information or a too large search space in an optimal way, including Poker. Those problems being called "Multi-Armed bandits" (as in the bandit slot machines).

This website https://banditalgs.com/ and the accompanying book https://tor-lattimore.com/downloads/book/book.pdf are an excellent resource

Parallelizing MCTS was a very active area of research at one point with 3 different techniques being explored:

tree parallelism where threads duplicated the tree locally and there was some merging going on

leaf parallelism where threads where assigned a leaf of the (game/decision) tree to explore

branch parallelism

I'm of the mind that a good multithreading runtime would trivialize those concerns.

Large clustering algorithms (kNN, DBSCAN, ...) are recursive as well with some clusters being quite compute intensive and others being empty which means that load balancing is critical.

Raytracing is very compute intensive and as mentioned in my thread here: https://forum.nim-lang.org/t/6367#39266 it's also recursive with a random recursion depth (terminated by an algorithm aptly called the Russian Roulette).

From this picture:

Clang/Intel OpenMP and Intel TBB have the load balancing at B, as they do "eager task splitting", while Weave does load balancing C with "lazy task splitting" the tasks being generated by 2 parallel for loop over the image bounds.

How to parallelize raytracing is a quite active area of research, especially on GPU that have poor to no support to recursive function calls with indeterminate recursion depth.

Beam Search: for voice recognition and image captioning, neural networks often have to explore and backtrack between multiple candidates, backtracking imposes quite a lot of constraint on a load balancer.

Brad (orginal) [2020-06-19T03:57:09+02:00] view original

Hi @mratsim —

I may have been unclear in my last response: I didn't mean to imply that I didn't believe there are realistic workloads that would result in recursive tasks or benefit from task serialization/squashing/stealing techniques. Just that the computations and users that have found their way to Chapel typically haven't exercised that style, so it hasn't been an area of focus for our team (combined with our choice to make Chapel tasks more arbitrary/unpredictable than most task-based programming systems).

Also, I should note that it's still possible to write recursive or dynamically balanced tasking programs in Chapel, simply that we typically express those patterns within the language or library rather than the implementation. For example, our libraries have iterators that can dynamically distribute iterations either locally across a compute node's cores or globally across multiple compute nodes and their cores. As another example, if I rewrite my naive Fibonacci computation in a way that manually throttles the creation of tasks rather than naively creating exponentially many of them, I can reduce the fib(40) case to just over a second on my 4-core laptop (I'll include my fib codes at the bottom of this for reference).

My colleage pointed out something from your earlier post that I didn't pick up on (and I imagine other readers may not have either): the Nemesis and Sherwood lines in the graphs you showed above are scheduler options in the Sandia Qthreads library that Chapel targets by default, so are representative of what's possible in Chapel today. Qthreads defaults to using the Sherwood scheduler, but Chapel chooses to use the Nemesis scheduler by default instead (though the user can override this default). We found that while the Sherwood scheduler was highly tuned for UTS (Unbalanced Tree Search) and great for highly unbalanced workloads with many many tasks, it had too much overhead for well-balanced operations, which is what we tend to care most about. In contrast, the Nemesis scheduler avoids those overheads and has much better performance for balanced workloads, but gives up work-stealing.

Ultimately we hope to have the best of both worlds. We had started some work with the Qthreads team to create a distrib scheduler that would do that (https://github.com/Qthreads/qthreads/issues/39 ), but the effort is currently stalled out.

Wrapping up, here's my naive Fibonacci in Chapel:

config const n = 10;

writeln("fib(",n,") = ", fib(n));

proc fib(n): int {
  if n < 2 then
    return n;
  else {
    var i, j: int;
    sync {
      begin with (ref i) i = fib(n-1);
      j = fib(n-2);
    }
    return i+j;
  }
}

Here's my manually throttled version:

config const n = 10;

writeln("fib(",n,") = ", fib(n));

proc fib(n): int {
  if n < 2 then
    return n;
  else {
    if (here.runningTasks() >= here.numPUs()) then
      return fib(n-1) + fib(n-2);
    else {
      var i, j: int;
      sync {
        begin with (ref i) i = fib(n-1);
        j = fib(n-2);
      }
      return i+j;
    }
  }
}

And here's a very simple example of the bounded buffer pattern I was mentioning. We've never really treated this as a benchmark, more of a demonstration of how easy task parallelism can be in Chapel, where the key points are (a) task creation is simple and (b) sync variables avoid the need to worry about the typical overflow/underflow conditions:

config const buffsize = 10,
             n = 101;

var A: [0..#n] sync int;

cobegin	{
  producer();
  consumer();
}

proc producer() {
  for i in 0..#n do          // write n values to the buffer
    A[i%buffsize] = i;
  A[n%buffsize] = -1;   // write a "done" sentinel
}

proc consumer() {
  var pos = 0;
  do {
    const val = A[pos];     // read value
    if val != -1 then
      writeln("got: ", val);
    
    pos += 1;               // advance cursor
    pos %= buffsize;
  } while (val != -1);
}

A more full-blown version of this pattern that we use as an exercise for students in hands-on exercises (where their goal is to add support for multiple producers and consumers on the same buffer) can be found here (with sample solutions in the subdirectory): https://github.com/chapel-lang/chapel/tree/master/test/exercises/boundedBuffer

stefantalpalaru (orginal) [2020-06-20T16:37:10+02:00] view original

Chapel uses this interesting C library for its green threads: https://github.com/Qthreads/qthreads

Note that a couple of the available schedulers support MxN threading, allowing you to use all CPU cores, just like with Go.

I hope we'll be able to use it in Nim, at some point.

Mirror of forum.nim-lang.org

6426 :: Comparisons of Nim to the Chapel computer programming language...