zevv [2023-05-03T12:15:22+02:00]

(I intended to post the message below as a reply to the recent "Is ORC considered production ready" thread, but it grew much longer than I expected and I think it might deserve a separate thread. Note that I often consider myself not smart enough to write proper multi-threaded code, so the text below might contain wrong or biased information.)

Some time ago I spent several weeks pushing multi-threaded Nim 2.0 with shared data to its limits, mostly looking at ARC only.

My adventure started with this post, where I asked a similar question: what is the state of threading in Nim 2.0, and how to make the best use of it: https://forum.nim-lang.org/t/9617

My plan was to create a minimal actor-based system in which I can write multi-threaded programs without ever having to think about the threading; sharing data across threads typically requires the user to take care of the synchronization primitives, which I usually find too cumbersome and error-prone for daily work. My end goal was something like Erlang's "Process": a very lightweight 'fiber'-like flow of control, allowing millions of those to run on a handful of threads, with all communication done by message passing through the processes' mailboxes. The processes are built on top of disruptek's fine CPS library, which offers processes at the cost of tens of bytes of memory each, with very low scheduling overhead.

This is where the shared memory + ARC comes in: for effective message passing of "large" data, you typically want to avoid deep copies and instead move the data from thread to thread. This requires a few steps to happen in the proper order:

1. The sending thread needs to make sure it has zero other references to the object it is sending out (including any other objects that are referenced by that object itself: recursively, potentially cyclic!), because you do not want to have the same data referenced by two threads.

2. ARC needs to let go of the object and promise to never again touch the RC from the sending thread.

3. Proper synchronization needs to take place to make sure the RC and the moved data are safely handed off to the receiving thread - this also needs to work properly on architectures with weaker memory ordering of course.

4. The receiving thread now takes ownership of the RC.
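The hand-off steps above can be sketched roughly as follows. This is only an illustration, not code from the Actors project: it assumes Nim 2.0 and the `threading` package used later in this thread, and `Message`, `producer`, `consumer` and `total` are made-up names. The payload type is deliberately acyclic.

```nim
# Sketch only: assumes Nim 2.0 and the `threading` package.
import std/isolation
import threading/channels

type Message = ref object      # hypothetical payload type
  payload: seq[int]

var chan = newChan[Message]()
var thr: Thread[void]
var total: int                 # written by the consumer, read after joinThread

proc consumer() {.thread.} =
  var msg: Message
  chan.recv(msg)               # last step: this thread now owns the ref and its RC
  for v in msg.payload:
    total += v

proc producer() =
  var msg = Message(payload: @[1, 2, 3])
  # First step: `msg` is the only reference to the object; nothing aliases it.
  # Second step: `move` resets `msg` to nil, so the sending side never touches
  # the RC again (no destructor work is left for `msg` at scope exit).
  # Third step: synchronization is left to the channel implementation.
  chan.send unsafeIsolate(move msg)
  doAssert msg.isNil

createThread thr, consumer
producer()
joinThread thr
doAssert total == 6
```

Note that `unsafeIsolate` performs no checking at all; the claim that `msg` is unaliased rests entirely on the programmer.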

The Actors project is more or less complete, and is in a works-for-me state, but I must admit that I have not actually used it for very much after I got it to work; For those interested, take a peek at https://github.com/zevv/actors

A short summary of my conclusions, not complete and in random order:

It can be done, but it is fragile, error-prone and feels like walking through a minefield. Sharing unmanaged data (aka raw pointers) is of course no problem as long as you take care of proper synchronization, but playing well with ARC-managed data (refs) makes things a lot harder; I still have bad nights dreaming of backtraces filled with calls to nimDecRefIsLast() and __eqdestroy__XXX()

ARC will increase and decrease reference counters just about everywhere you access or mutate a ref, and it is not easy to make this play well with synchronization primitives because ARC is usually doing its work "outside" of your code (e.g. at the end of a function, after your last line of Nim)
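To make that "outside of your code" work visible, here is a small sketch. It uses a plain object with a custom `=destroy` hook rather than a `ref`, because the hook can log when it runs; for a `ref`, the compiler-injected work at the same spot would be the reference count decrement. It assumes Nim 2.0 (the non-`var` `=destroy` signature); `Tracked`, `events` and `work` are made-up names.

```nim
# Sketch only: assumes Nim 2.0 with ARC/ORC.
type Tracked = object
  id: int

var events: seq[string]

proc `=destroy`(t: Tracked) =
  # Invoked by compiler-injected code; never called explicitly below.
  events.add "destroy " & $t.id

proc work() =
  let t = Tracked(id: 1)
  events.add "last line of work, using " & $t.id
  # The compiler inserts the `=destroy` call for `t` here, at scope exit,
  # after the last statement that was actually written.

work()
doAssert events == @["last line of work, using 1", "destroy 1"]
```

If the last written statement were a lock release, the injected cleanup would run after it, which is why wrapping the variable in an inner `block` (so it goes out of scope before the release) is a common way to control the ordering.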

One of the unsolved problems is proper isolation. To safely pass data between threads I had to write some nasty code that recursively peeks at the RC headers before Nim refs to inspect the RC counter value, and only allows moving data when the RC is 0. There is no way to effectively assure this at compile time for generic ARC-managed data (isolate[T] is cumbersome), so this also requires error handling to do the right thing when moved data happens to be not isolated.

I'm not sure if this can be properly made to play well with ORC, for now. The problem is that ORC manages some of its data in thread-local variables, which makes it impossible to safely move it to another thread. It seems that Nim needs some additional infrastructure for this to "reroot" an ORC-managed ref when moving. GC_RunORC() can be used as a workaround to make ORC clean up before moving data around, but it comes at a steep price, performance-wise.

My single most important takeaway of this little adventure is: this problem is still hard, and Nim will not hold your hand - it will happily shoot you in the back of your head when you are not looking. Getting a SIGSEGV right away is usually the best result you can hope for, because these are obvious and traceable. The problem is of course that a lot of bugs of this class can be very, very subtle and can show up in a million different ways, not causing crashes but all kinds of other undefined behavior. Not something I want in my production code.

If you decide to go play with shared ARC-managed memory, do yourself a huge favour: use and trust memory sanitizers like asan/tsan and Valgrind/Helgrind/DRD, and take their output very seriously. I have talked to some people telling me that they knew what they were doing and that Valgrind was just generating false positives. I beg to differ: Valgrind has been right 99% of the time. If Valgrind ever generates a false positive, something in your code is usually doing "funny stuff" and IMHO deserves proper annotation to make it shut up, and to inform readers of the code that funny stuff is happening here.

My final conclusion would be that ARC simply does not play well with threading in the current state unless you really, really know what you are doing. Having atomic RC types in the language would take most of these headaches away.

jasonfi [2023-05-03T12:35:12+02:00]

I agree that Valgrind can help a lot. But the code you write that can cause memory errors should be minimal. Not only that, but I recommend writing a minimal prototype program that tests the concepts on their own. Make sure this minimal program works fine without errors or leaks, then transfer the code to any larger program you're working on, if such a program already exists.

PMunch [2023-05-03T13:55:13+02:00]

Very interesting writeup! The actor system you describe is exactly the kind of threading story which would really help Nim. A programming language which in 2023 doesn't have anything easier than manual locking and such isn't exactly a great look. Unfortunately it sounds like Nim currently fights your attempts at getting this done fairly hard..

But your approach is still very interesting. If Nim could just be a bit friendlier about handing off one tree of ref objects to another thread then that would be a very nice way of dealing with threading.

Araq [2023-05-03T15:15:24+02:00]

Moving a tree around has never been easier:

import std / [json, isolation]
import threading / channels

var chan = newChan[JsonNode]()
var thr: Thread[void]

proc worker() {.thread.} =
  var x: JsonNode
  chan.recv(x) # somebody should fix this API...
  echo "received ", x

createThread thr, worker
#chan.send unsafeIsolate(%* {"key": 2, "keyB": "value"})
chan.send isolate(%* {"key": 2, "keyB": "value"}) # JSON nodes do form a tree
joinThread thr

No need for "manual locking in 2023", but locking is still awesome anyway IMHO. ;-) Much easier to reason about than any message passing system or "actor model" that I've seen so far. But that's off-topic.

PMunch [2023-05-03T15:27:12+02:00]

Is the channels module actually thread safe though? https://github.com/nim-lang/threading/issues/24. If this does indeed work without any leaks or other subtle issues then it's great news!

Araq [2023-05-03T15:55:47+02:00]

So fix https://github.com/nim-lang/threading/issues/24 by using a single lock; it doesn't imply anything for ARC/ORC's usability in a multi-threaded setting.

PMunch [2023-05-03T16:10:18+02:00]

Just played around a bit with your example and unless I create the object entirely within the isolate call it doesn't work. This really limits what you're able to do with this since I can't create stuff outside the isolate call and then isolate them after the fact, and I'm not able to isolate something first and then edit it afterwards. E.g. something like this is not possible:

import std / [json, isolation]
import threading / channels

type
  Test = ref object
    data: string
  Tree = ref object
    left, right: Test

var chan = newChan[Tree]()
var thr: Thread[void]

proc worker() {.thread.} =
  var x: Tree
  chan.recv(x) # somebody should fix this API...
  echo "received ", x.left.data, " ", x.right.data

createThread thr, worker
let hello = Test(data: "Hello")
chan.send isolate(Tree(left: hello, right: Test(data: "world")))
joinThread thr

because it can't isolate that let hello. I fail to see how this would be useful for anything more complicated than an example like this, if you have anything which does actual work with this I'm super curious to see how it's supposed to work.

On another note running it through Valgrind/Helgrind I get 3 errors, one of which is the aforementioned issue and two others about possible data races. I'm running this command to test it: nim c --passC:-g --passL:-g -d:useMAlloc araqtree.nim && valgrind --tool=helgrind ./araqtree so it seems like it still doesn't work quite as well as we'd like it to..

zevv [2023-05-03T16:15:03+02:00]

@araq: Your example demonstrates that moving a constant tree around has never been easier.

It is showing the happy path because isolate() is able to take your const JSON tree and isolate it; when trying to pass anything else, isolate() is no longer able to do the job and will tell you expression cannot be isolated.

Unfortunately, my data is usually not constant.

You already mentioned unsafeIsolate() in your snippet, which is just casting the value to Isolated[T], without actually checking if this is the case. But now you're on your own - your code might work today but fail in interesting ways ten months from now. The programmer now has the responsibility to make sure the tree is isolated, but if it is not you run into undefined behavior - or an early crash if you are lucky.

Araq [2023-05-03T16:19:54+02:00]

Thanks @zevv for the nice summary! I'm sure you will excuse my more positive, totally biased take on your work:

The Actors project is more or less complete, and is in a works-for-me state, but I must admit that I have not actually used it for very much after I got it to work; For those interested, take a peek at https://github.com/zevv/actors

So, from what I understand, that is a runtime that combines "micro" processes with an async event loop while being able to use all of your CPU cores and client code is easy to write and "not blocking". Sounds amazing! Plenty of people have been waiting for this thing!

Was it hard to write? I bet. Are you burned out now that it finally begins to work? I can imagine. So let others join the party. ;-)

Would it have been easier with an "atomic ARC" mode? Sure. Yet you managed to do without.

The real question is how hard it is for client code to avoid triggering (non atomic) ARC problems when using your actors runtime.

Araq [2023-05-03T16:30:56+02:00]

You already mentioned unsafeIsolate() in your snippet, which is just casting the value to isolated[T], without actually checking if this is the case. But now you're on your own - your code might work today but fail in interesting ways ten months from now. The programmer now has the responsibility to make sure the tree is isolated, but if it is not you run into undefined behavior - or an early crash if you are lucky.

Actually, it's not undefined behavior, it's simply always wrong, it's just that the tooling cannot detect it. I claim that it's not hard to ensure isolation for a programmer, but it's hard for Nim's type system. We need real usability data on these things and if you think that your experiments with "actors" are a valid data point I have to say that I don't agree:

You managed to get it to work regardless.

Non-blocking, multi-threaded runtimes on top of event loops are expert-only territory. The ordinary Nim programmer doesn't write these systems.

Araq [2023-05-07T16:41:20+02:00]

Maybe what you say is true, maybe not. In your example

proc createTree(hello: Test): Tree =
  Tree(left: Branch(data: "Hello"), right: Branch(data: "world"))

The parameter hello is unused. This means it's not a realistic example. I keep asking for realistic examples. Alias analysis depending on the involved types has proven to be hard to reason about and bites with generic algorithms which is why "strict funcs" evolved to use a mechanism based on the involved expressions only.

I can imagine the same will happen for isolate -- type-based alias analysis is too fragile and a rule like "cannot use local variables" is easier to understand. Or maybe a rule like "every local variable involved in isolate must not be used afterwards".

Araq [2023-05-07T17:01:28+02:00]

The result of the query borrows from the n parameter, and in theory the lifetime dependency can avoid refcounting activity since it's assumed that the lending parameter is reachable in the first place and the lent value lifetime won't extend its lender's. Does this already prevent refcount updates?

You misuse lent in your example which makes it harder to understand.

Or is this something that's planned?

You seem to describe an optimization that is "well known":

"There is an important special case in which it is possible to avoid incrementing and decrementing reference counts. Suppose that the program has a declaration

type C = counted collection of ...

We say that a scope S is C-conservative if it contains no assignments to variables of type ^C that are not local to S, it contains no uses of v.refCount for variables in C, and all procedures that it calls are also C-conservative. Within S it is not necessary to update reference counts for variables in C, since no variable in C can be freed in S and every such variable will have the same reference count on exit from S that it had on entry to S."

Nim doesn't do this optimization, instead Nim does "cursor" inference. Given cursor inference, it is not clear if the optimization is worth it. But it is interesting and reasonably easy to understand and implement.

Assuming no concurrent mutation (as for persistent data structures), or a coarse-grained read/write lock over the root Node, the query function above would be race-free, refcount-update-free, and thus thread-safe. It would be a big enabler for multithreaded ARC/ORC.

Maybe, maybe not. How can you assume a lock on the "root" Node? The compiler has no idea about a "root" node, the nodes are all of the same type and the hard part of analysing multi-threaded programs is that you don't know what the other threads may do, it's fundamentally a nonlocal analysis.

Last, I'm wondering if we could have lent T from X syntax similar to what's planned for var T from container syntax? It would allow for more flexibility on parameter position and also potentially borrowing from multiple parameters.

More syntax doesn't help when we still try to figure out the important idioms we need to support.

PMunch [2023-05-07T17:02:30+02:00]

It's true, that example isn't a real-world example. This was simply me trying different things to figure out where it broke, and then being surprised when it broke even if I didn't use the argument. The point is that right now isolate is so strict it's not useful at all. Since you can't pass in data to work with I can't think of a scenario where what you put inside isolate couldn't just have been moved to the receiving thread. Of course we can use anything that doesn't have refs, but since refs are considered safe anywhere else in the language they are pretty much everywhere.

I'm not saying that alias analysis is easy, I'm just saying that without it isolate isn't really all that useful. Indeed having a rule like "every local variable involved in isolate must not be used afterwards" would vastly improve the system. But you still have to know that the local variable can't be an alias and that it can't alias anything which is used afterwards. And then it seems like we've come full circle. The rule would really be "could this entire tree be garbage collected right now, if it weren't for the single reference we're trying to isolate". If that is the case then it should be safe to pass that single reference on to another thread, because without it the tree would be collected.

I could whip up a realistic example, but without a working isolate system it's hard to make sure the entire thing is correct. So I wrote you up two scenarios that I've thought about using such a system for, but apparently those aren't good enough? Would it help if I wrote them out in code so it would be more explicit what I would try to do? I'd have to invent some kind of work to be done though since I don't have anything specific I'm working on right now.

Araq [2023-05-07T17:56:27+02:00]

I could whip up a realistic example, but without a working isolate system it's hard to make sure the entire thing is correct.

Yes, please do that. And it's not hard to do: Instead of isolate use unsafeIsolate to make the compiler shut up. And use valgrind/some sanitizer to make sure it's correct. You might need to use tricks like zeroMem(addr local, sizeof(local)) or wasMoved(local) or move(local) etc so that the thread-unsafe destructor is not run on the local variable (which has been moved anyway).
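As an illustration of that suggestion, the earlier `Tree`/`Test` example can be made to compile by moving the pre-existing local into the tree and then moving the tree into `unsafeIsolate`. This is a sketch under assumptions, not a verified-safe recipe: it assumes Nim 2.0 and the `threading` package, `sendTree` and `receivedLen` are made-up names, and the result should still be checked with Valgrind or a sanitizer as described above.

```nim
# Sketch only: isolation is asserted by the programmer, not checked by Nim.
import std/isolation
import threading/channels

type
  Test = ref object
    data: string
  Tree = ref object
    left, right: Test

var chan = newChan[Tree]()
var thr: Thread[void]
var receivedLen: int           # written by the worker, read after joinThread

proc worker() {.thread.} =
  var x: Tree
  chan.recv(x)
  receivedLen = x.left.data.len + x.right.data.len

proc sendTree() =
  var hello = Test(data: "Hello")          # created outside the isolate call
  # `move` leaves `hello` nil, so the tree holds the only reference to it.
  var tree = Tree(left: move hello, right: Test(data: "world"))
  # `move` also leaves `tree` nil, so no destructor touches the sent data
  # from this thread when `sendTree` returns.
  chan.send unsafeIsolate(move tree)
  doAssert hello.isNil and tree.isNil

createThread thr, worker
sendTree()
joinThread thr
doAssert receivedLen == 10
```

The moved-from locals are nil afterwards, which matches the proposed rule that every local variable involved in the isolation must not be used again.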

sky_khan [2023-05-07T18:25:36+02:00]

Most probably I'm being stupid here, but can't you invent a new return type which cannot be assigned to a variable and can't be changed after being returned from a proc? Then, if threads:on, wouldn't it be possible to prove in an extra compiler pass that it is really not touched after it's built?

proc buildTree(inputs): "untouchable" Tree = ...

var data = buildTree(..) # error

let data = buildTree(..) # send data to some thread

Just a thought. Ignore it if its indeed a stupid idea.

Araq [2023-05-07T19:22:57+02:00]

Just a thought. Ignore it if its indeed a stupid idea.

You're describing Isolated[T] and isolate and we're trying to figure out how exactly it can work.

sky_khan [2023-05-07T20:26:33+02:00]

Sorry, it was maybe 8-9 years ago when I last looked at the Nim compiler source. I'm not exactly sure what I'm talking about :) but I was thinking that you have to track and store information about every assignment and every access to that data in every module to make isolated work. The compiler would then need to re-analyze the whole source for this tracking in an extra pass, and it wouldn't be practical to do this for everything. That's why I thought "Could it be easier if you mark some variable as a kind of "runtime const" after it's returned from a proc?" Combined with atomicArc this could have made things easier, you know. I'm not claiming to understand the difficulties of it. That was just my train of thought.

sky_khan [2023-05-08T04:32:53+02:00]

My English is not good. Still, I will try to explain what I had in my mind a bit more before I give up.

I was answering these:

Araq: How can you assume a lock on the "root" Node? The compiler has no idea about a "root" node
PMunch: Indeed having a rule like "every local variable involved in isolate must not be used afterwards" would vastly improve the system

In short, my idea was confining graph creation to a single proc. I guess it's still not an easy job at all, if it is even possible, but I thought it might make the analysis easier somehow while providing a way to do what OP and PMunch have asked for.

import std / [json, isolation]
import threading / channels

var chan = newChan[JsonNode]()
var thr: Thread[void]

proc worker() {.thread.} =
  let x = chan.recv()
  echo "Welcome ", x["user"], x["msg"], " from ", x["uri"]

# this should be magic in compiler
template isolate(b: untyped): untyped =
    b

# this is not an ordinary procedure, its a graph generator
proc buildTree(uri, msg, user : string) : JsonNode {.isolate.} =
# I'm telling the compiler, this proc and only this proc is where I'll create my graph.
# Consider the return value as my graph root. You may ignore refcount of children
# but root should have atomic refcount. IDK if extra rules about proc arguments is needed.
# also not sure how but children must be destroyed too when root is destroyed
    var root = newJObject()
    root["uri"] = %uri
    root["msg"] = %msg
    root["user"] = %user
# real return value should be isolated JsonNode, dont let me break the rules outside of this proc
    return root

echo "Name?"
let user = stdin.readLine()
createThread thr, worker

# this would be an error:
#    var x = buildTree("example.com","Hello",user)
#    someProcess(x)
# but is it possible to make this work ?
let t = buildTree("example.com","Hello",user)
# chan.send t
# instead of this:
chan.send unsafeIsolate(t)
joinThread thr

alexeypetrushin [2023-05-08T05:02:14+02:00]

Just an idea about proving that a variable graph is isolated. Maybe it could be done the same way as the out-of-bounds check is done: at runtime. In dev mode each ref variable has a field with a thread id, and if that id is different from the current thread id when you touch that variable, an exception will be thrown. In a production build this flag is removed, so there will be no slowdown or overhead. Also, if in the future someone comes up with a clever idea for how to do that check at compile time, this flag could be removed without anybody noticing.
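A library-level approximation of this idea might look like the sketch below. It is not a compiler feature and does not cover refs reachable from the wrapped value. `Owned`, `initOwned`, `adopt`, `get` and `OwnershipError` are hypothetical names, and it assumes `getThreadId` from `system` is available, which requires threads to be enabled (the default in Nim 2.0).

```nim
# Sketch only: a debug-build ownership tag, compiled out with -d:release.
type
  OwnershipError = object of CatchableError
  Owned[T] = object
    value: T
    when not defined(release):
      owner: int               # id of the thread that may touch `value`

proc initOwned[T](value: sink T): Owned[T] =
  result.value = value
  when not defined(release):
    result.owner = getThreadId()

proc adopt[T](o: var Owned[T]) =
  ## Meant to be called by the receiving thread after a hand-off.
  when not defined(release):
    o.owner = getThreadId()

proc get[T](o: Owned[T]): T =
  when not defined(release):
    if o.owner != getThreadId():
      raise newException(OwnershipError, "value touched from a foreign thread")
  o.value

var box = initOwned(42)        # owned by the main thread
var violations: int

proc intruder() {.thread.} =
  try:
    discard box.get()          # no adopt() first: flagged in debug builds
  except OwnershipError:
    inc violations

var thr: Thread[void]
createThread thr, intruder
joinThread thr
doAssert box.get() == 42       # the owning thread may still read the value
when not defined(release):
  doAssert violations == 1
```

In a release build the `owner` field and the check disappear entirely, so the wrapper then costs nothing beyond the value itself.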

P.S. I don't know much about compilers so feel free to ignore it if it sounds stupid.

Mirror of forum.nim-lang.org

10161 :: Usability of ARC/ORC in multi threaded code.