It's been a while since the RFC for the Picasso multithreading runtime (https://github.com/nim-lang/RFCs/issues/160 / https://forum.nim-lang.org/t/5083).
The project now lives at https://github.com/mratsim/weave
It's well tested on Linux with 32-bit and 64-bit CI, and also on ARM64, where Travis offers a whopping 32-core undisclosed ARM CPU. Windows is not supported yet; I'm only lacking a low-level wrapper for Synchronization Barriers. OSX should work, but somehow it trips some assertions on Travis, so your mileage may vary.
It offers both task parallelism and data parallelism. The task parallelism API is similar to async/await on Futures, except that you call spawn/sync on a Flowvar. The data parallelism API is similar to OpenMP.
One important caveat: it doesn't support GC-ed types; you need to pass a pointer (see the seq example in the README) or use Nim channels.
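To make the spawn/sync/Flowvar API concrete, here is a minimal sketch of the classic Fibonacci task-parallelism example, modeled on the Weave README (the `init(Weave)`/`exit(Weave)` setup and `spawn`/`sync` calls are assumed from that README):

```nim
import weave

proc fib(n: int): int =
  # Tiny tasks like this mostly stress scheduler overhead;
  # real workloads should use coarser-grained tasks.
  if n < 2:
    return n
  let x = spawn fib(n - 1)  # returns a Flowvar[int], akin to a Future
  let y = fib(n - 2)
  result = sync(x) + y      # blocks until the Flowvar is resolved

init(Weave)
echo fib(20)
exit(Weave)
```

Note how this mirrors async/await: `spawn` plays the role of starting an async computation and `sync` the role of `await`, but on plain threads rather than an event loop.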
There are a couple of low-level routines that may be of interest:
There are 10 benchmarks available to stress several aspects of the runtime, 8 of them being as fast as or much faster than established runtimes like Intel TBB or GCC/Clang/Intel OpenMP (the 2 slow ones being parallel reductions):
Name | Parallelism | Notable for stressing
---|---|---
Black & Scholes (Finance) | Data Parallelism |
DFS (Depth-First Search) | Task Parallelism | Scheduler Overhead
Fibonacci | Task Parallelism | Scheduler Overhead
Heat diffusion (Physics) | Task Parallelism |
Matrix Multiplication (Cache-Oblivious) | Task Parallelism |
Matrix Transposition | Nested Data Parallelism | Nested loops
Nqueens | Task Parallelism | Conditional parallelism
SPC (Single Task Producer) | Task Parallelism | Load Balancing
Does this aim to be part of Nim? Part of the standard library? Does it have an accessible API that feels like idiomatic Nim code?
Thanks!
Does this aim to be part of Nim?
No
Part of the standard library?
It's probably too big, though some of the underlying code like the memory subsystem could be in the standard library.
Does it have an accessible api?
spawn/sync/Flowvar are directly taken from https://nim-lang.org/docs/threadpool.html. The parallelFor is just a for-loop.
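To show how close parallelFor stays to an ordinary for-loop, here is a sketch based on the README's loop syntax (the `captures:` block and the use of a raw pointer to work around the no-GC-types restriction are assumptions drawn from the README examples):

```nim
import weave

init(Weave)

var squares: array[100, int]
let buf = squares.addr  # raw pointer: tasks can't capture GC-ed types

# Reads almost exactly like `for i in 0 ..< 100`, but iterations
# are split into tasks and load-balanced across worker threads.
parallelFor i in 0 ..< 100:
  captures: {buf}
  buf[][i] = i * i

exit(Weave)
```

The only additions over a plain for-loop are the explicit capture list and the runtime init/exit, which keeps the API close to idiomatic Nim.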
It's similar to the Rayon or Tokio libraries from Rust.
Weave provides a set of fundamental building blocks to create safe and high performance multi-threaded programs.
Anyone who wants to write multi-threaded programs in Nim should be very excited about this.
Like Rust, Nim itself provides a minimal runtime (as it should be for a systems language), and allows libraries to provide fundamental features without modification of the compiler.
This is actually a core philosophy of Nim. It has a small core but allows for large extensions (mainly through macros and a few other features).
A runtime means something that operates at runtime, with extra overhead compared to anything done at compile-time by the compiler. It also potentially means interoperability issues.
For example, a garbage collector or a reference counting scheme is also a runtime: extra overhead, not done at compile-time. Systems languages need to keep a runtime very lightweight in most cases and nonexistent in certain critical cases (operating systems, interoperability/embeddability in other languages).
Now a runtime is also supposed to bring benefits: abstracting away the programmer's worries about manual memory management or manual thread management.
I would say the goal is not to replace threadpool, but to provide an advanced version. The standard library threadpool gives you a simple way to use threads, however those are not load-balanced, which is a critical issue in many cases.
Now users have 2 choices: either they have a common use-case (process all my tasks as fast as possible) and they can use Weave, or they have very specific constraints like real-time scheduling/latency/fairness or priority jobs and they need to build their own scheduler on top of the threadpool or raw Nim threads. For example, if you process audio or video in real-time, the goal is not to process the whole video as fast as possible but to get the next frame processed before the deadline. Weave would guarantee the former, but maybe it would schedule the first frame as the very last processed (a guarantee of throughput but not of latency).
@mratsim, Sorry. I should have been more specific.
I didn't mean to compare Weave to those Rust libraries in terms of features. I was comparing them in terms of library size and level of abstraction (for lack of a better phrase.)
What I meant to say is both Tokio and Rayon are "runtime" libraries for Rust in the same way that Weave is a runtime library for Nim.
I only support trivial types, checked at compile-time via supportsCopyMem(T).
What I do is here: https://github.com/mratsim/weave/blob/v0.1.0/weave/parallel_tasks.nim#L125-L150. From the function call spawn foo(a, b, c), I check the return type (void, or whether I need a future).
I package the following in a task:
The task is allocated on a shared-memory heap via a memory pool (or via malloc). It is load-balanced between threads if needed, and when execution time comes:
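The compile-time check mentioned above can be sketched as follows. This is a hypothetical helper, not Weave's actual code: it only illustrates how `supportsCopyMem(T)` rejects GC-ed argument types at compile-time, the way `spawn` does:

```nim
import std/typetraits

proc checkSpawnable[T](arg: T) =
  ## Hypothetical stand-in for the check spawn performs on each argument:
  ## only trivially-copyable (non-GC-ed) types may cross thread boundaries.
  static:
    doAssert supportsCopyMem(T),
      "spawn arguments must be trivially copyable (no ref/seq/string)"

checkSpawnable(3.14)          # float: passes the compile-time check
checkSpawnable((1, 2, 3))     # tuple of ints: passes
# checkSpawnable(@[1, 2, 3])  # seq: would fail to compile
```

Because the check runs in a `static:` block, violations are caught at compile-time rather than causing memory corruption at runtime.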
I can't edit my title, but I'm happy to announce the release of Weave 0.2.0, codenamed "Overture".
This is the result of a fight of over 8 hours to reconcile setting $PATH in Azure Pipelines with nimble/findExe.
Weave now supports Windows in addition to Linux, macOS and all platforms with Pthreads.
Furthermore, Weave's backoff system has been reworked and formally verified to be deadlock-free. It is now enabled by default, with no noticeable performance impact. It allows Weave to park idle threads to save power.
Side-story: in the process, a critical bug in the glibc and musl implementations of condition variables was found: signal does not always wake up a waiting thread. This does not happen with macOS and Windows condition variables.
And for the new year 0.3.0: https://github.com/mratsim/weave/releases/tag/v0.3.0
Next developments will probably take a while, as the "low-hanging" fruit is done (i.e. the items from my PoC in July/August). If someone wants to add something like Graphviz output to Synthesis, that would be helpful for displaying Weave's internals/control-flow visually.
Changelog
One thing of note: measuring performance on a busy system is approximate at best; you need a lot of runs to get a ballpark figure. Furthermore, in a multithreading runtime, workers often "yield" or "sleep" when they fail to steal work. In that case, the OS might give the timeslice to other processes (and not to other threads in the runtime). If a process like nimsuggest hogs a core at 100%, it will receive a lot of those yielded and slept timeslices even though your other 17 threads would have made useful progress. The result is that while nimsuggest (or any other application) is stuck at 100%, Weave gets worse than sequential performance, and I don't think I can do anything about it.