Hello everybody,
Afaik, Martin Odersky once described an app as a flow of type-safe transformations. These days one might like to add: efficient, multithreaded transformations in a shared-memory environment.
Recently I've spent some time looking at a couple of research papers (and some lecture videos) concerned with concurrent, a.k.a. "lock-free", data structures. The earliest papers were published around the year 2000 and describe a concept named 'Read-Copy-Update' (RCU), which was implemented in the Linux kernel around ~2004 and released as a user-space library under the name 'Userspace-RCU' four years after that. RCU is a brainchild of Paul E. McKenney, who is currently concerned with the memory model of the Linux kernel. Later, Hazard Pointers (HP) were invented and coined by Maged Michael. In 2017 both concepts were accepted by the C++ Committee into a Technical Specification (TS) to be worked out in the coming years. Early adoption can be seen & tested in Facebook's "Folly" library. Another interesting 'flavour' of RCU has been published by DPDK and was presented by Nathan Brown in February at FOSDEM 2022 [ https://fosdem.org/2022/schedule/event/comparing_dpdk_rcu_and_user_space_rcu_library/ ].

The above-mentioned concepts deal with the problem of safe memory reclamation. Authors group them into two families - epoch-based (RCU) and pointer-based (Hazard Pointers) - plus hybrids of both (e.g. Hazard Eras). I leave out (Software) Transactional Memory (STM) here - as long as it remains a hacker's wet dream - and instead try to focus on the concepts accepted by the C++ Committee, which will probably get finalized into the C++ standard in 2026.

Looking at Nim's std-lib and the nimble packages, I wonder why there are so few attempts to introduce concurrent data structures? Many have expressed the wish for better concurrency support, since threads:on will become the default in 2.0. The lock-based list/table implementations from std/shared* are nice to have, but probably not too important for future projects. Don't get me wrong here - locks will remain important in the future, but please not in concurrent data structures that see reads & updates from many threads.

I have collected a series of papers describing breakthrough inventions & implementations of (b)lock-free data structures, e.g. Harris' linked lists (2002) and Shavit/Shalev's lock-free extensible hash tables (~2006) based on 'split-ordered lists'. Those have been tested, and the (amazing) gains in performance have been reported and confirmed in many later publications by different authors. I regard these as building blocks for any concurrent DS to come. So, following a quote by G. van Rossum:
"When you approach a new language, first look and measure the quality of the provided data-structures".
I've done exactly this and learned that all of Nim's DS are well-crafted and very performant - esp. std/tables is a performance beast :) BUT as I type along on my Intel i5 (2014) with 4 HW threads, I'm sure my next machine will be either an M-something, a RISC-V or a modern Intel/AMD thingy, and any of these will provide 20 cores or more. In that respect I'd like to see Nim 2.0 being able to provide data structures that can be safely and efficiently used in a lock-free manner by multiple threads.

So the pole position on my wish-list for 2.0 is a new lib/module filled with concurrent DS for Seq/Table/Set/List that are (as much as possible) API-compatible with the current std Seq/Set/Table implementations. Current projects might then only need to exchange one import line to start benefiting from better throughput, or could swap out their std/shared* types. New projects would naturally base their code on the concurrent DS types, without worrying about memory-reclamation strategies. I'm sure this will provoke mentions on HN, along the lines of "Look Mama, Nim adopts lock-free DS."

I'd like to publish my tiny collection of 'milestone' papers and make them available to everybody interested - maybe at the Nim wiki, or linked from my GitHub, or nice & shiny within [ http://Markdown.com ] or elsewhere; I'm open to suggestions.

Many times I came across warnings saying:

"In multithreading, and esp. in threadsafe DS, even the most experienced hackers have a very good chance of getting it wrong."

Besides the fact of life that "failure is always an option": yes, that might be true for the crafting of multithreaded lock-based DS like trees. Looking at Harris' linked list or Shavit/Shalev's (amazingly brilliant) concepts, I'd say these stand out because of their simple design. Being a rather mediocre hacker, I regard these as doable.

The problem of safe memory reclamation is still a hot - or at least warm - topic in research. More recent works like Interval-Based Reclamation (IBR, 2018) or Version-Based Reclamation (VBR, 2021) are RCU flavours with sophisticated additions. They seem very promising, and my impression is that the implementations are strictly NOT rocket science. Again, things can be studied and tried publicly using FB's Folly or DPDK-RCU.

Araq once mentioned in one of his musings that he believes either STM or memory pools might be a good fit for Nim in the future - further taking workload away from Nim's GCs. The VBR paper proposes exactly that: they test the use of type-preserving memory pools, and their performance results are quite good. Having said that, I don't expect that any form of language support is needed here - rather the opposite. On the other hand, maybe some day in the future it turns out that the Nim runtime gets more oversight of an app's usage of type-preserving pools. Maybe one lets the VM decide or suggest efficient dimensions for such pools, or the VM detects the use of a non-threadsafe DS during compilation and aborts or warns the user. The basic node structures of e.g. Harris' linked lists and the hash maps/sets proposed by Shavit/Shalev are identical, and thus can be consumed and recycled from/to the same pool. Maybe the VM can handle/reason differently about concurrent DS on the shared heap?

For new users of Nim such an introduction might be an attractive feature, and I find that a version 2.0 deserves better than (retiring stuff and cleaning up the std-lib): a feature that increases efficiency, that will clearly go mainstream soon (within the next 2-4 yrs.), and that surprisingly 'the other crews' (Odin/Zig/Hare/Go/V (sry, i had to)) do not provide yet - can't say if they debate it.

As a side note: the more pseudo-code I study from these papers, the more I assume one could easily provide real Nim code that is close to or maybe identical to the provided code samples. So finally, part of this is about style and ease of use. A good example is Nim's withLock template: it prevents errors (acquired but release forgotten), it marks and indents the critical section, and it looks good. In my RCU wrap you mark & indent the critical section exactly so, and understanding what happens behind the scenes on entering and leaving is not required. But just in case one attempts to call synchronize_rcu within a critical section, that will not compile and instead aborts with an understandable error explanation. I believe this reads well and provides for ease of use and safety.
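For illustration, a minimal sketch of the shape of such a wrapper, in the spirit of withLock - the rcuReadLock/rcuReadUnlock procs below are mere placeholders for the real liburcu/Folly bindings, and the compile-time guard against synchronize_rcu is left out:

```nim
# Placeholders so the sketch compiles on its own; in the real wrapper
# these would call rcu_read_lock()/rcu_read_unlock() from liburcu.
proc rcuReadLock() = discard
proc rcuReadUnlock() = discard

# Mirrors std/locks' withLock: the read-side critical section is marked
# and indented, and the matching unlock can never be forgotten.
template withRcuRead(body: untyped) =
  rcuReadLock()
  try:
    body
  finally:
    rcuReadUnlock()

withRcuRead:
  echo "traversing a shared structure inside the read-side critical section"
```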
I considered writing an RFC on this topic, but since it does not and shall not demand changes to the language, I decided to ask for some feedback upfront.
So far I've tried:
So I'd love to hear your ideas, critique or warnings - maybe regarding possible conflicts with other planned features like CSP or Weave.
beatz & greetz, Andreas
Looking at Nim's std-lib and the nimble packages, I wonder why there are so few attempts to introduce concurrent data structures?
Because correct, performant concurrent data structures are extremely hard to get right and an absolute pain to debug, even for the seemingly simplest ones that exist (say a lockless multi-producer single-consumer queue).

And for the most complex ones, like MPMC queues or hash maps, you add multithreaded memory-reclamation challenges on top, with issues of lifetimes and pointer invalidation.

And then you need to test and fuzz the hell out of them or, even better, formally verify them to avoid heisenbugs.

I would not attempt writing a concurrent data structure without a model checker to formally verify the absence of race conditions, deadlocks or livelocks. ThreadSanitizer, AddressSanitizer or a borrow checker are not enough.
With --threads:on becoming the default, I hope there will be more types that can be used for synchronization, such as Semaphore or RWLock.
For Linux, I often end up using std/posix with a thin "quality of life" wrapper on top of it.
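For instance, for a semaphore such a wrapper boils down to something like this (a rough sketch over std/posix's sem_* bindings; Linux/POSIX only, return values left unchecked):

```nim
import std/posix

type Semaphore = object
  sem: Sem

proc init(s: var Semaphore; value = 0) =
  # pshared = 0: shared between the threads of this process only
  discard sem_init(addr s.sem, 0, value.cint)

proc acquire(s: var Semaphore) =
  discard sem_wait(addr s.sem)     # blocks until the count is > 0

proc release(s: var Semaphore) =
  discard sem_post(addr s.sem)     # wakes one waiter

proc destroy(s: var Semaphore) =
  discard sem_destroy(addr s.sem)
```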
Lastly, shared memory is a last resort; contention makes things slow.
This cannot be said often enough. Take even the innocent-looking tests/parallel/tpi.nim from 2014 (propagated even to nim-taskpools). While, yes, it was always a bad way to compute pi, and yes, the Nim compiler correctly analyzes the "safe/disjoint access" - from a software/"example code" POV, tpi runs 500 times slower than 1-core/serial (for me on an i7-6700k, with n boosted to 1 shl 24: 90 vs 0.178 seconds for the obvious simple serial sum implementation). Pretty sure the issue is cache contention of the writes to the shared output array. Maybe I am the umpteenth person in 8 years to point out this footgun, but as per my opening sentence, it cannot be said enough. I only discovered it porting the example to cligen/examples/piPar.nim.
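The usual fix is for each thread to accumulate into a local variable (or at least a cache-line-padded slot) and touch shared memory only once per thread. A rough sketch with plain threads and arbitrary constants - not the original parallel/spawn version:

```nim
# Compile with --threads:on (the default in Nim 2.0); Thread, createThread
# and joinThreads are available via system (std/typedthreads in newer Nim).
import std/math

const
  NumThreads = 4
  N = 1 shl 24

type
  # One partial sum per thread, padded to a full cache line so the
  # threads' final writes never land on the same line (no false sharing).
  PaddedSum = object
    val: float
    pad: array[64 - sizeof(float), byte]

var partial: array[NumThreads, PaddedSum]

proc term(k: float): float = 4 * math.pow(-1, k) / (2*k + 1)

proc worker(id: int) {.thread.} =
  var acc = 0.0                 # accumulate in a thread-local variable ...
  var k = id
  while k < N:
    acc += term(float(k))
    k += NumThreads
  partial[id].val = acc         # ... and write to shared memory exactly once

proc piPar(): float =
  var threads: array[NumThreads, Thread[int]]
  for i in 0 ..< NumThreads:
    createThread(threads[i], worker, i)
  joinThreads(threads)
  for p in partial: result += p.val

when isMainModule:
  echo piPar()
```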
Because correct, performant concurrent data structures are extremely hard to get right and an absolute pain to debug, even for the seemingly simplest ones that exist (say a lockless multi-producer single-consumer queue).
So true, and yet they're so fun... but gosh, it's easy to run into "well crap, that'll result in memory corruption". I'd love to see more DrNim stuff.

Recently I had a need/desire for an MPMC ring buffer. At first I thought it'd be easy to do a lockless ring buffer, but crap, it gets hard. You need double-wide CAS atomics with most algorithms, and you always seem to run into ABA issues. Ugh, and in my use case a single- or dual-core MCU doesn't benefit from lockless.
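For what it's worth, when the payload is just a small index, one escape hatch is to pack a generation tag next to it into a single 64-bit word, so an ordinary-width CAS covers both and a recycled index no longer compares equal. A rough sketch of the idea, not a full ring buffer:

```nim
import std/atomics

# 32-bit slot index plus 32-bit generation tag packed into one word.
proc pack(idx, tag: uint32): uint64 = (uint64(tag) shl 32) or uint64(idx)
proc idxOf(word: uint64): uint32 = uint32(word and 0xFFFF_FFFF'u64)
proc tagOf(word: uint64): uint32 = uint32(word shr 32)

var head: Atomic[uint64]

proc tryAdvance(newIdx: uint32): bool =
  # Bump the tag on every update, so an index that was freed and reused
  # in the meantime can never pass the comparison (the ABA problem).
  var cur = head.load(moAcquire)
  let next = pack(newIdx, tagOf(cur) + 1)
  head.compareExchange(cur, next, moAcquireRelease, moAcquire)
```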
Instead I added an "overwrite" mode to threading/channels. It uses locking, so it was trivial.
Yay! Now I can use channels like "lossy" channels. When reading sensor data it's often pretty useful to dump old, stale data in favor of new data. Though I really need to PR my extra unit tests upstream; they're currently pretty svelte.
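Roughly the effect, emulated here with the public Chan API of the threading package rather than the actual overwrite-mode change:

```nim
import threading/channels   # nimble install threading

var sensor = newChan[int](4)

proc lossySend(ch: Chan[int]; sample: int) =
  # Emulated "overwrite" behaviour: if the buffer is full,
  # throw away the oldest element until the new one fits.
  var stale: int
  while not ch.trySend(sample):
    discard ch.tryRecv(stale)

for sample in 0 ..< 100:
  sensor.lossySend(sample)    # the producer never blocks on a slow consumer
```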
However, then I got emboldened to try a peek or view operation and... well, there are good reasons you can't do a peek operation with ARC/ORC. It probably could work with non-ref types, but isolate and friends can't really tell that yet.
I've done exactly this and learned that all of Nim's DS are well-crafted and very performant - esp. std/tables is a performance beast :)
Too true! Nim is a dream for basic core algorithms. :heart: Even better, I can actually read the implementation of MPSC queues like @mratsim's without spending half a week deciphering the C++- or Rust-isms everywhere. Though, granted, people are doing some impressive work in both. It just... takes so much effort to read the syntax that it leaves less room for thinking about the algorithm.
So the pole position on my wish-list for 2.0 is a new lib/module filled with concurrent DS for Seq/Table/Set/List that are (as much as possible) API-compatible with the current std Seq/Set/Table implementations.
To speak to this point, Nim 2.0 (or really, per @treeform's post the other day, basically current Nim + flags) provides the basis for the Nim ecosystem to begin creating good multi-threaded data structures. IMHO, threading with the previous GCs prohibited many of the areas of exploration you're talking about. It's actually pretty exciting to me.
I've already seen a few good options pop up, including threading/channels and nim-taskpools. Though it's tricky, because ARC/ORC have a different performance profile and in some cases will always be slower, so it takes time to adapt the ecosystem.
Wow, thanks for sharing those numbers. 500x slower is just crazy.
Looking at nim-taskpools and doing Elixir programming really keeps me thinking about how to do an "actor" framework in Nim. Everything is there. The performance could probably be comparable to or better than many of the more shared-memory styles of programming.
BUT as I type along on my Intel i5 (2014) with 4 HW threads, I'm sure my next machine will be either an M-something, a RISC-V or a modern Intel/AMD thingy, and any of these will provide 20 cores or more. And in that respect I'd like to see Nim 2.0 being able to provide data structures that can be safely and efficiently used in a lock-free manner by multiple threads.
Building on @mratsim's and @cblake's points about memory contention, most of these truly multi-core machines don't have true shared memory - or not a truly unified shared memory. It's an illusion, as modern processors build on multiple layers of caches, with transparent memory transfers done in hardware to make it look seamless.
Really, any N+-core high-performance processor has more in common with message passing than with shared memory. It's just message passing done by the hardware. Still, understanding details like cache lines (i.e. the size of the messages) in a process is key to lots of really high-performance numerical code. As we get processors with more cores, this problem actually increases. Lockless algorithms, while often faster than locked ones, are way slower than not sharing memory at all, if that's possible. Though I've wondered why CPU designers haven't tried introducing smaller, more specialized, non-cached blocks of memory just for shared atomics. They'd be expensive to wire per byte but would seem to produce a huge speed-up for anything using atomics.
What would be really interesting at the programming-language level would be being able to embrace and adapt to the reality of "message passing" memory systems in modern computers. We've already got some of this with GPUs now, but I've only dabbled with GPU stuff.
Shared-everything is both dangerous and hard to use right. Even if you tame correctness, performance danger lurks. Static help on that is hard, since estimating the time some code takes relative to the cache-coherency protocol is both A) hard & B) deployment-sensitive - 128-core will be very different from 4-core. Message passing is easier to get right, but if the messages get large you probably want to "fail" over to shared memory for transport. That's what X11 did in 1991 with the SHM extension.
Anyway, a mental model of a 64-core CPU (or multi-socket board with hundreds of cores) as a "distributed system" is probably more apt even without the hardware having the exact same mapping to reality (like switch/bus failures/high latency diversity not being a thing).
For formal verification of concurrent data structures, DrNim or SAT solvers in general are probably not the way to go. It's probably easier to use a model checker, which will deterministically check all the states that 2, 3, 4, ... threads can be in when accessing the data structure concurrently. It will detect data races, deadlocks, livelocks, and race conditions (note that a borrow checker only deals with data races and cannot help with deadlocks, livelocks or race conditions).
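A toy sketch of that last point: the snippet below contains no data race at all (every access is atomic), yet it is still racy - exactly the class of bug a model checker enumerating interleavings finds and a borrow checker does not:

```nim
import std/atomics

var balance: Atomic[int]
balance.store(100)

proc tryWithdraw(amount: int): bool =
  # Check-then-act: two threads can both pass the check before either
  # one subtracts, driving the balance negative. No data race, but a
  # race condition nonetheless.
  if balance.load(moAcquire) >= amount:
    balance.atomicDec(amount)
    result = true
```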
Good points. I like the idea of DrNim, but it's tedious at best to write inductive proofs. Plus, even in university I was trained to think like a physicist and never enjoyed delta-epsilon proofs. It never felt like I'd really proved anything. Give it to me as a linear system and let me see if I can diagonalize it. ;)
What I ponder would be genetic-algorithm-style solvers that try to create proofs for you and check them deterministically against all the states. Then you let that run. That'd be more for, say, figuring out whether you can elide runtime bounds checking, not really threading safety. Though it sounds more in line with what you're describing, at a very loose level.
Though for threads, isn't what you're describing - deterministically checking all the states - what a SAT solver is for? From what I understand about TLA+, for example, it encodes the "time" portion of distributed algorithms into discrete states, then uses a SAT solver to deterministically check whether all the states are valid. Though the trick still seems to be correctly writing the proof part to avoid deadlocks.
Note I've only dabbled in this area so I'm just trying to understand what you mean. Terminology is tricky.
Personally, in general I like Nim's approach to these problems of just using code. Like using concepts, where the programmer writes a blob of straightforward compile-time code that specifies what they want, rather than turning the type system into a SAT problem and then making the programmer play a reverse game of Sudoku to encode what they want. ;)
I've gathered extensive research on proving the correctness of concurrent data structures here: https://github.com/mratsim/weave/issues/18 and started to implement CDSChecker (http://demsky.eecs.uci.edu/publications/c11modelcheck.pdf), as it is one of the only papers with actual implementation code.
That's awesome! I'd love to play around more with these.
@cblake I think we're on the same page. Unless it's been swapped out. ;)
Message passing is easier to get right, but if the messages get large you probably want to "fail" over to shared memory for transport.
Yeah, I was trying to point out that, somewhat ironically, at the hardware level shared memory nowadays is often just "message passing" - it's just that the messages are pages or cache lines, etc. Though trying to program compute systems at that level directly generally doesn't pan out.