| mm-switch | Thread-safe | Can handle cycles |
|---|---|---|
| ARC | No | No |
| Atomic ARC | Yes | No |
| ORC | No | Yes |
| YRC | Yes | Yes |
ORC can only handle cycles that stay inside one thread. YRC lifts that: cycles can span threads. For acyclic data it uses atomic reference counting; for cyclic data it uses a write barrier that only runs at pointer assignments — no stop-the-world, no arbitrary safepoints. Any thread can run the cycle collector when needed (no dedicated GC thread), which helps locality and suits embedded systems where every thread counts.
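As a toy illustration of "a write barrier that only runs at pointer assignments" (this is not YRC's actual code; `Cell`, `setChild`, and `rootCandidates` are made-up names), the only cost a mutator ever pays is at the assignment itself, where the counts are adjusted and a potential cycle root is recorded:

```nim
type
  Cell = ref object
    rc: int       # simulated reference count
    child: Cell   # the only pointer field in this toy model

var rootCandidates: seq[Cell]   # objects to examine for cycles later

proc simIncRef(c: Cell) =
  if c != nil: inc c.rc

proc simDecRef(c: Cell) =
  if c != nil:
    dec c.rc
    if c.rc > 0:
      rootCandidates.add c   # still referenced; might belong to a dead cycle

proc setChild(parent, newChild: Cell) =
  ## The "write barrier": only pointer assignments pay any cost.
  simIncRef newChild
  simDecRef parent.child
  parent.child = newChild

var a = Cell(rc: 1)   # one reference: the local `a`
var b = Cell(rc: 1)
a.setChild b          # b.rc becomes 2
b.setChild a          # closes the cycle a -> b -> a; a.rc becomes 2
echo a.rc, " ", b.rc  # → 2 2
```

No reads are instrumented and nothing stops the world; a cycle collector can later inspect `rootCandidates` at its leisure.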
Nothing is free. In our orcbench, YRC is about 1.6× slower than ORC. We think that’s a good trade for full thread-safe cycles.
Questions?
In fact, for orcbench I got the overhead down to 1.3× slower than ORC by using atomics instead of locks (`-d:yrcAtomics`), but this variant hasn't been verified, so better not to enable it for now...

YRC is the simplest thread-safe garbage collector by far: 550 lines for a provably correct, concurrent, thread-safe cycle collector with real-time opt-outs.
Compare to JVM's G1 collector — you need a flowchart collection just to understand the phases.
It works so well because all my previous attempts simply would not work: different collectors that overlap in execution but must ensure only one of them calls free (which one?); subtle interactions like "its count is zero, so we can free it immediately! But what if it's already registered as a root candidate?" I had to throw away many ideas.
What YRC doesn't need (insane)
❌ No arbitrary STW phases — collector triggered by actual pressure
❌ No per-thread stack scans — RC roots are explicit, merged once
❌ No global sweep phase — `for h in heap: if not marked(h): free(h)` is not simple; it is a multicore nightmare (multi-threaded deletion during iteration)
❌ No read barriers — mutators read freely, collector catches up
❌ No tricolor invariant write barriers — physical topology always current
❌ No pointer remapping — no copying, no forwarding tables
❌ No remembered sets/card tables — RC buffering is the "barrier"
Each eliminated complexity feeds the next:
Immediate pointer writes → no other write barriers needed
Stripe-buffered RC → no inline zero-ref races
Single merge point → no overlapping collectors
Physical graph tracing → no snapshots needed
Explicit RC roots → no stack scanning
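A minimal sketch of the "stripe-buffered RC → single merge point" idea (the names `stripes`, `buffered`, and `merge` are mine, not YRC's): callers only append events to a stripe — in real code, each thread owns its stripe — and the counts only become visible at one merge point:

```nim
import std/tables

type RcEvent = tuple[obj, delta: int]   # `obj` is just an id in this toy model

const NumStripes = 4
var stripes: array[NumStripes, seq[RcEvent]]

proc buffered(stripe, obj, delta: int) =
  ## In real code each thread writes only to its own stripe,
  ## so there is no contention on a shared counter.
  stripes[stripe].add (obj, delta)

proc merge(): Table[int, int] =
  ## The single merge point: only here do the buffered deltas
  ## turn into globally visible reference counts.
  for s in stripes.mitems:
    for (obj, delta) in s:
      result.mgetOrPut(obj, 0) += delta
    s.setLen 0

buffered(0, 42, +1)
buffered(1, 42, +1)
buffered(2, 42, -1)
let counts = merge()
echo counts[42]   # → 1: net count change for object 42
```

Because a count can only hit zero at the merge point, there is no inline zero-ref race and no second collector can overlap with the first.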
Another key insight is that tracing GC algorithms throw away information, whereas RC (and everything built on top of it, such as YRC) receives complete incRef/decRef information. So tracing GCs need to recompute what RC already knows. Sure, disregarding this information by not storing it helps throughput, but everything else suffers.
YRC knows EXACTLY:
✅ Every incRef/decRef event (buffered in stripes)
✅ Exact reference count for every object
✅ Precise root set (merged roots with inRootsFlag)
✅ Exact cycle membership (rcSum == edges check)
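My reading of the `rcSum == edges` check, as a toy sketch (the types and the proc are mine, not the YRC source): if the sum of the candidates' reference counts equals the number of edges internal to the candidate set, every reference comes from inside the group, so the whole group is an unreachable cycle:

```nim
type Obj = ref object
  rc: int
  refs: seq[Obj]

proc isDeadCycle(members: seq[Obj]): bool =
  ## True when no reference into `members` originates outside of it.
  var rcSum, internalEdges = 0
  for m in members:
    rcSum += m.rc
    for r in m.refs:
      if r in members: inc internalEdges
  rcSum == internalEdges

# a <-> b with no external references: each rc is 1, held by the other
var a = Obj(rc: 1)
var b = Obj(rc: 1)
a.refs.add b
b.refs.add a
echo isDeadCycle(@[a, b])   # → true: safe to free both

a.rc = 2                    # simulate one extra reference from outside
echo isDeadCycle(@[a, b])   # → false: something still points in
```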
Tracing GCs must REBUILD all this:
❌ Throw away ref count history → rediscover roots via stack scan
❌ Throw away cycle info → full graph traversal
❌ Throw away locality → generational heuristics/card tables
❌ Throw away incrementalism → snapshot or write barriers
The per-object inc/dec operations that tracing-GC fans call "overhead" are actually priceless metadata!
Sorry for the marketing blurb, but I'm really proud of my discovery...
Excellent work! 550 lines is impressive 👍
When will it be ready for testing in Nim 2?
What does the "Y" in YRC stand for?
> When will it be ready for testing in Nim 2?
It is in Nim devel. No idea when the next release will come.
> What does the "Y" in YRC stand for?
Y: "almost the last RC based cycle collector" (Y comes right before Z). You can also interpret the Y as two threads joining.
In German it's pronounced "ürk", the sound I made when I came up with it ("slow shit...")
> In German it's pronounced "ürk", the sound I made when I came up with it
🤣
And, just like that, you also discovered a universal solution to the naming problem!
On the question about using LLMs and sharing progress: I think this is truly more insightful than the other way around, in that we get to see the thinking behind the heuristics and the tradeoffs discussed rather informally, without huge corporate formalism.
I didn't mean to criticize Araq for sharing his progress. Obviously, while working on something new (and from my novice POV, this is cutting-edge research) the researcher may run into unforeseen roadblocks, potentially even unresolvable ones.
My concern was about something different. Maybe I'm just projecting my own lack of discipline, but if I was coding with LLMs, and they were working so well that I was actually enthusiastic about it, I'm pretty sure I would get lulled into a false sense of security and have my vigilance dulled. That's exactly the state of mind when I would expect the LLM to sneak mistakes past me. The situation in this thread—Araq first excitedly missing showstopper flaws, and then flaws which seem to have been pretty obvious to Zerbina—sort of evoked this worry in me.
Cool. I am pretty chilled about it, as this is still early-stage and iterative; it will take a while to be PR-ready, and it will never be merged without review and thorough regression testing.
Also, I must say that this specific case of multi-threaded shared cyclic structures isn't something that concerns me, as ORC does the job very well when you have a single owner thread and worker threads that communicate by message passing, or when you keep the graph/tree structure in a blocking owner thread if you need structured concurrency. That's an approach Google's Abseil library also enforces, as it explicitly bans C++ `weak_ptr` and nudges the developer towards `unique_ptr`. There have to be exceptional reasons to go into explicit unsafe cycle collection using `weak_ptr`. Even in game engines, where there is more than one use for such data, the best ones use an ECS with indices instead of cyclic references.
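For reference, the "indices instead of cyclic references" pattern mentioned above can be sketched like this (`EntityId`, `spawn`, and `world` are hypothetical names, not from any particular engine): entities link to each other through indices, so the ownership graph the memory manager sees stays acyclic:

```nim
type
  EntityId = int
  Entity = object
    name: string
    parent: EntityId   # an index into `world`, not a ref; -1 means "no parent"

var world: seq[Entity]

proc spawn(name: string; parent: EntityId = -1): EntityId =
  world.add Entity(name: name, parent: parent)
  world.high   # the new entity's index

let root = spawn("root")
let child = spawn("child", parent = root)
echo world[child].parent == root   # → true: back-link without a cyclic ref
```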
So, I am just happily watching this evolve to see how it can be more efficient than Go's GC, and how it would tie in with the ownership semantics proposed for Nimony.
Btw, I must thank planetis for porting mimalloc to Nim. Really nice!
Great!
Is it ready for experimenting with?
Bad news for YRC. While it appears to work correctly, this program is about 6x slower than mm:orc:
discard """
  output: '''true peak memory: true'''
  cmd: "nim c --gc:orc -d:release $file"
"""

## Concurrent variant of torcbench.
##
## Each thread works entirely on its own private heap — no nodes are shared
## across threads — so the test is valid under both mm:orc and mm:yrc.
## Under mm:orc the cycle collector runs per-thread.
## Under mm:yrc the global lock arbitrates concurrent collectors.

import std/[locks, strutils, times]

const
  NumThreads = 8
  OuterIters = 25 ## per thread; total work ≈ 100 iters as in torcbench
  ListLen = 5000
  TreeIters = 400
  TreeDepth = 8

type
  DLNode = ref object
    value: string
    next: DLNode
    prev {.cursor.}: DLNode # weak back-ref, same as DoublyLinkedNode

  DLList = object # value type, embedded in Node
    head: DLNode
    tail: DLNode

  Node = ref object
    parent: DLList # copy of the list header (head/tail refs)
    le, ri: Node
    self: Node # self-cycle to force cycle detection

proc dlAppend(L: var DLList; s: string) =
  let n = DLNode(value: s)
  if L.head == nil:
    L.head = n
    L.tail = n
  else:
    n.prev = L.tail
    L.tail.next = n
    L.tail = n

proc buildTree(parent: DLList; depth: int): Node =
  if depth == 0:
    return nil
  if depth == 1:
    return Node(parent: parent)
  result = Node(
    parent: parent,
    le: buildTree(parent, depth - 1),
    ri: buildTree(parent, depth - 2))
  result.self = result

# ---------------------------------------------------------------------------
# Per-thread workload — identical to torcbench.main() but self-contained
# ---------------------------------------------------------------------------

proc threadWork(_: pointer) {.thread.} =
  for i in 1 .. OuterIters:
    var leakList: DLList
    for j in 1 .. ListLen:
      leakList.dlAppend(newString(200))
    for k in 0 .. TreeIters:
      discard buildTree(leakList, TreeDepth)
    GC_fullCollect()

var threads: array[NumThreads, Thread[pointer]]
for i in 0 ..< NumThreads:
  createThread(threads[i], threadWork, nil)
joinThreads(threads)

GC_fullCollect()
when not defined(useMalloc):
  echo getOccupiedMem() < 10 * 1024 * 1024, " peak memory: ", getMaxMem() < 10 * 1024 * 1024
Now, given how YRC works, that is not surprising. We need yet another approach...