nimforum mirror - Atomic ARC option?

RodSteward (orginal) [2022-11-24T13:32:20+01:00] view original

Right now is ARC and ORC memory management thread sensitive and you have to move references to the thread in order to use them. Since Nim is all about options, it would be nice to have an atomic reference counted option. The reason is that the thread ownership of garbage collected memory isn't always a good match how many contemporary systems use memory and threads.

Often threads are in thread pools and the usage is often, thread starts, obtains a resource (ref object) from some kind of accelerator structure, do some work one it and potentially let it be freed by the GC. Each a different thread will perform the operations. A thread is not an owner, just a worker.

Also, it is often the case is that having the cycle detection (with tracing) is not that important. Often cycles are between objects that are in close relation to each other (like parent pointers) and these can be dealt with manually like it is suggested in the Lobster language (http://aardappel.github.io/lobster/language_reference.html).


class rec:
    r:rec?

var x = nil
x = rec { nil }
x.r = x
x = nil

The programmer can then easily fix the leak by setting the reference causing the cycle to nil, like clearing the enemy field when a unit dies, or by writing x.r = nil in the above simplified example.

This type of memory management is typically that the systems and games programmer crowd would be interested in.

An experimental atomic ARC option would be beneficial. I know that there is a SharedPtr[T] but this type is explicit. The question is how this type can optimize inc/dec counting away. Also, using an explicit type makes more difficult to just change the memory management type. Often it is interesting to compare the types by just using the --mm: command line option.

Yepoleb (orginal) [2022-11-24T14:07:10+01:00] view original

I had the same idea three days ago and @ringabout pointed me towards this patch which I ported to an unmodified Nim 1.6 and published it here for you (removed the VCC compatibility though).

I was long mislead by the memory management documentation which claims "The reference counting operations (= "RC ops") do not use atomic instructions and do not have to -- instead entire subgraphs are moved between threads.", but this apparently does not happen automatically and atomics are still necessary for concurrent access, which happens often enought that it should not simply be ignored. Knowing which parts memory will get touched by a procedure is often incredibly hard to determine before running the code, so moving it all beforehand is not realistic.

I am in strong support of your idea to add an option for this. Even if some consider atomic refcounts too slow to use for performance critical application, it's a great tool for prototyping without having to worry about manual memory management.

Araq (orginal) [2022-11-24T15:25:29+01:00] view original

It's an important experiment we should do after v2.0 has been released.

The programmer can then easily fix the leak by setting the reference causing the cycle to nil, like clearing the enemy field when a unit dies, or by writing x.r = nil in the above simplified example.

No, actually that can be incredibly hard to do in real-world code. Nim's async comes to mind. ;-)

RodSteward (orginal) [2022-11-24T16:15:31+01:00] view original

I am in strong support of your idea to add an option for this. Even if some consider atomic refcounts > too slow to use for performance critical application, it's a great tool for prototyping without having > to worry about manual memory management.

The atomic reference counts are indeed more expensive, however with compile time optimizations and and lent/sink can help removing unnecessary inc/dec. Lobster claim a roughly 95% reduction of these and with such reduction the performance impact between atomic and non atomic becomes less important. One problem with concurrent access to references is that any borrow/checker will stop working as there might be another thread accessing the same resource. This also ties to how to make the concurrent primitives.

We have Lobster but Swift will have a ownership model together with atomic reference counts. I think Swift will demand exclusivity to a resource from one thread. I'm not sure how that will look like right now.

Memory management is a moving target today and we will probably see more development when it comes to that. What Nim should provide is a platform so that we can investigate memory management ideas. Nim has historically done well in this department.

Araq (orginal) [2022-11-24T17:08:21+01:00] view original

The atomic reference counts are indeed more expensive, however with compile time optimizations and and lent/sink can help removing unnecessary inc/dec.

Most of these optimizations are unsound in a multi-threaded setting.

But even if it all worked out, there is nothing "convenient" about shared mutable state with concurrent access, it just doesn't work. Decades of heavy C# and Java use have demonstrated that. Sending isolated subgraphs around is the better idea and should be our focus.

elcritch (orginal) [2022-11-24T17:23:31+01:00] view original

Also the threading package has atomic shared pointers that play nice with ARC.

Yepoleb (orginal) [2022-11-24T21:49:22+01:00] view original

What do you mean by shared mutable state - writable memory that is actively written to by multiple threads or any writable memory, even if it is not modified in the threaded task? I would agree that the first case is much better handled by doing all the writes in a dedicated thread, but I don't think dynamically allocated memory that for the entire duration of the threaded task remains unmodified should be discouraged.

Some applications, I am thinking of games specifically, have a lot of internal state that is written once at startup - config files, model data, terrain data, texture information - and never again modified. It is often hard to predict when one of these objects will be needed, so they can't be moved or copied to the thread beforehand. Only giving each thread a full copy of the application state would be possible, but that's bad for both memory and memory bandwidth usage. I know there's also the possibility of passing pointers around, but that's incredibly messy, inconvenient and I wouldn't even know how to do it with objects that can only be accessed with procs that internally create references.

Araq (orginal) [2022-11-25T07:34:02+01:00] view original

Only giving each thread a full copy of the application state would be possible, but that's bad for both memory and memory bandwidth usage.

Agreed but there is SharedPtr for that.

Roughly it is like this:

Nim version 1: Use the --gc:x that suits your program best. (And hope all your dependencies still work with it.)

Nim version 2: There is only mm:orc but you can create custom pointers and containers that are optimal for your code. (And all your dependencies will work because they too have been developed against mm:orc).

With mm:atomicrc we would go back to version 1. I'm reluctant.

ringabout (orginal) [2022-11-25T08:08:48+01:00] view original

Python has a fork using "biased atomic RC" to diminish the overhead of atomic operation of a ref for its owner thread => https://lwn.net/Articles/872869/ . But perhaps it requires some kind of manual management and is not fit for language compiling to native code.

ringabout (orginal) [2022-11-25T08:24:48+01:00] view original

Araq (orginal) [2022-11-25T12:01:48+01:00] view original

I don't burn briges and arc/orc are much less hardcoded into the compiler than the old GCs are. In fact, much of the existing codegen logic only exists in order to keep the older GCs alive.

But there is a world of difference to me between "mm:atomics -- not officially supported" and "mm:atomics -- alternative to mm:orc".

Let me clear on this: Experiments and PRs are accepted and the new memory management is easier to tinker with than before. But my long term vision is to eliminate language dialects, not to embrace them.

Yepoleb (orginal) [2022-11-25T13:24:50+01:00] view original

SharedPtr is the right technical concept for this problem, but highly inconvenient for large data structures. I quickly realized I would need to eliminate all references in any data structure that could be accessed from a thread and rewrite all procedures to work with SharedPtr instead of refs. This would have to spread out beyond my code to code other people have written with the assumption that refs are the only relevant pointer type (this is reasonable, there is no generics system in place for AnyRef).

The only reasonable option was to modify the mm to automatically slap a SharedPtr on everything. And this turned out to be much simpler than expected, just a handful of touches in ARC to use atomic instructions and it works like a charm so far. This "new ARC" is still 99% the "old ARC", there is no need for another mm switch.

Araq (orginal) [2022-11-25T14:03:08+01:00] view original

And this turned out to be much simpler than expected, just a handful of touches in ARC to use atomic instructions and it works like a charm so far.

Glad to hear it and I'm not surprised. You probably want to combine it with --cursorInference:off though as the optimizer doesn't understand the excessive sharing.

Can you provide some benchmark results?

Yepoleb (orginal) [2022-11-25T15:56:48+01:00] view original

I will eventually when I make a proper proposal for this to be included. I just came here to provide a quick solution and argue against the notion that "sending isolated subgraphs around" is always a viable alternative to atomic reference counting. But I have not had the time to properly test and benchmark it 🙁.

Araq (orginal) [2022-11-25T16:08:28+01:00] view original

No problem. And I'm looking forward to your proposal.

boia01 (orginal) [2022-11-26T22:34:20+01:00] view original

The concurrency model that I'd like to see better supported in Nim is something along the lines of persistent data structures that support multiple-readers and a single-writer:

Imagine an isolated graph of objects (refs).

This graph is be accessed through something like SharedPtr[ref T], which would be the root of the graph.

The root can be safely shared between threads, and is the only object that participates in reference-counting for this graph.

The graph data structure is immutable/persistent, in the sense that it preserves the previous version of itself when it is modified, i.e., copy-on-write semantics, to avoid data races.

The graph can be modified safely by any number of threads concurrently, however concurrent updates result in different graph versions. A graph version is just another another root 'SharedPtr[ref T]`. (To be clear, a graph version is not a full copy of the graph -- it's merely a copy of the nodes that have been updated)

Writes can be batched so as to avoid generating excessive versions/garbage during graph updates.

Graphs can be connected to other graphs through other roots.

All graph updates have to ensure that the graph remains isolated, and require copying non-root refs from other graphs if needed.

This model would be a great mid-point in the concurrency spectrum between being able to pass around (but not share) isolated graphs and having to SharedPtr[T] everywhere. It would be more convenient, preserve safety, and yet still be relatively efficient -- the additional object copies inherent to the persistent data structure model would be mitigated by avoiding the cost of fine-grained GC and concurrency control within the object graphs.

My sense is this model is also compatible with the general direction of lent types, in that traversing an object graph in a read-only fashion would yield only lent types, and that updating a graph would involve hooks, similar in spirit to the lifetime-tracking hooks we already have, that would deal with updating (read: making copies of) the objects referencing the object being mutated (read: copied-on-write).

I'm curious if the community would find this model compelling to have...

(In the meantime, I'm using a poor-man's version of this model but using full mutability and big-giant-lock for readers/writers around the entire graph, which works okay but is neither very concurrency-friendly nor very elegant.)

Araq (orginal) [2022-11-26T23:50:14+01:00] view original

Reminds me of something that is called "ballon types" but maybe I misremember and I cannot find the paper on it anymore either.

cblake (orginal) [2022-11-27T12:34:14+01:00] view original

Some applications, I am thinking of games specifically, have a lot of internal state that is written once at startup - ... - and never again modified.

I realize the conversation has moved on, but I did not see this question answered. This is just an informational post - I know all the senior sys programmers here like @Araq & @boia01 know it all already. What @Yepoleb describes would be "read-only after creation state" not what most people mean by "shared mutable state". For writable memory pages, state almost always starts as all zeros and must get populated somehow, but if your application logic ensures it is not written to (say with Nim's let) then there is little need to "protect" it from multiple concurrent writes - only to defend against buggy logic. For simple enough memory page structures, things like mprotect or whatever the Windows equivalent is can activate CPU MMU protection after the fact.

But if you are willing to assume your CPU has an MMU (usually but not universally true - e.g. some embedded CPU manufacturers scrimp on that) then a nicer design for the "much static data" scenario can be pre-compute binary file formats and use std/memfiles to just memory map them READ-ONLY. This enforces the non-modified invariant with hardware (as well as being essentially zero cost "parsing" at program start-up - really the CPU understands and an "parse" the native format). In a sense, writing these binary formats is a bit like writing a "compiler for data" while re-parsing it all the time is "staying interpreted".

A simple fully worked out example of this is thes. A more complex one is suggest where one pre-computes 100s of MiBs of data. A more "general purpose" one kind of targeted at "data analysis" of "FileArrays" is nio. None of those is really "multi-threaded" in the sense of this conversation. There is also ftab which is a single-writer multi-reader "persistent" data structure { not the same kind of "persistence" @boia01 means - but "external file persistence" }. That last is more a WIP - still needs checksums so parallel readers can verify the single writer did not alter what they wanted under the covers (as well as being nice for crash recovery), but it may inspire/serve as food for thought { and has important bugfixes to std/memfiles on Windows I have been meaning to upstream as a Nim PR }.

A final example that does not involve such a "data compiler", if that feels intimidating/complex, is grl.nim <https://github.com/c-blake/cligen/blob/master/examples/grl.nim> where you have a Nim seq[string] pre-fork and then procpool forks a bunch of kids to do work. Each kid gets a copy-on-write version of the Nim paths: seq[string] and just accesses the memory "as usual". This grl program can be faster (or slower!) than a threaded device IO equivalent because memfiles IO can be for reasons, but maybe only faster for processes which have private page tables. Threads sharing page tables must synchronize that shared mutable state of the file mappings in-OS-kernel. For many small files mostly in the buffer cache (highly dynamic file mappings), all that sync will definitely slow down a similar threads version.

In short, as always, also consider processes for parallelism/multi-core no matter how good Nim support for threads sharing general object graphs becomes. { I sure hope it becomes second to none! :-) } Depending upon your application structure/needs, it may be simpler "in the large" with added bonuses of avoiding re-computation or (possibly) performance.

elcritch (orginal) [2022-11-29T04:43:41+01:00] view original

The graph can be modified safely by any number of threads concurrently, however concurrent updates result in different graph versions. A graph version is just another another root 'SharedPtr[ref T]`. (To be clear, a graph version is not a full copy of the graph -- it's merely a copy of the nodes that have been updated)

Reminds me of Clojure's STM system. I never used it myself but its the only other plausible shared multi-threading paradigm that I know about. Here's a longer reference: https://clojure.org/reference/refs

I think it be great to implement something like it with ORC and lent's.

Mirror of forum.nim-lang.org

9646 :: Atomic ARC option?