| mm-switch | Thread-safe | Can handle cycles |
|---|---|---|
| ARC | No | No |
| Atomic ARC | Yes | No |
| ORC | No | Yes |
| YRC | Yes | Yes |
ORC can only handle cycles that stay inside one thread. YRC lifts that: cycles can span threads. For acyclic data it uses atomic reference counting; for cyclic data it uses a write barrier that only runs at pointer assignments — no stop-the-world, no arbitrary safepoints. Any thread can run the cycle collector when needed (no dedicated GC thread), which helps locality and suits embedded systems where every thread counts.
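The "write barrier that only runs at pointer assignments" idea can be sketched roughly like this. This is a toy model only: `Cell`, `cycleCandidates`, and these proc names are made up for illustration and are not YRC's actual internals.

```nim
# Illustrative sketch only -- not YRC's real implementation.
type
  Cell = ref object
    rc: int              # reference count (atomic in a real collector)
    data: string

var cycleCandidates: seq[Cell]   # roots the cycle collector inspects later

proc incRef(c: Cell) =
  if c != nil: inc c.rc

proc decRef(c: Cell) =
  if c != nil:
    dec c.rc
    if c.rc > 0:
      # Still referenced, but it could be part of a dead cycle:
      # remember it as a candidate instead of scanning the whole heap.
      cycleCandidates.add c

proc assign(dest: var Cell; src: Cell) =
  ## The write barrier: it runs only at pointer assignments,
  ## so there is no stop-the-world and no arbitrary safepoints.
  incRef src
  decRef dest
  dest = src
```

The point is that all bookkeeping happens at the assignment itself; any thread can later walk `cycleCandidates` to collect cycles.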
Nothing is free. In our orcbench, YRC is about 1.6× slower than ORC. We think that’s a good trade for full thread-safe cycles.
Questions?
In fact, for orcbench I got the overhead down to 1.3× over ORC by using more atomics instead of locks (`-d:yrcAtomics`), but this variant wasn't verified, so better not to enable it for now...
YRC is the simplest thread-safe garbage collector by far: 550 lines for a provably correct, concurrent, thread-safe cycle collector with real-time opt-outs.
Compare to JVM's G1 collector — you need a flowchart collection just to understand the phases.
It works so well because all my previous attempts simply would not work: different collectors overlapping in execution but needing to ensure that only one of them calls `free` (which one?); subtle interactions like "it's zero, we can free it immediately! But what if it's already registered as a root candidate?" I had to throw away many ideas.
What YRC doesn't need (insane)
❌ No arbitrary STW phases — collector triggered by actual pressure
❌ No per-thread stack scans — RC roots are explicit, merged once
❌ No global sweep phase — `for h in heap: if not marked(h): free(h)` looks simple, but it is a multicore nightmare (multi-threaded deletion during iteration)
❌ No read barriers — mutators read freely, collector catches up
❌ No tricolor invariant write barriers — physical topology always current
❌ No pointer remapping — no copying, no forwarding tables
❌ No remembered sets/card tables — RC buffering is the "barrier"
Each eliminated complexity feeds the next:
Immediate pointer writes → no other write barriers needed
Stripe-buffered RC → no inline zero-ref races
Single merge point → no overlapping collectors
Physical graph tracing → no snapshots needed
Explicit RC roots → no stack scanning
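The "stripe-buffered RC → single merge point" step above can be illustrated with a toy model. The names and data layout here are invented for the sketch; YRC's real stripes look different.

```nim
# Toy model of stripe-buffered reference counting: each thread appends
# +1/-1 events to its own stripe, and counts are only combined at a
# single merge point -- so mutators never race on an inline zero-ref
# check. Purely illustrative, not YRC's actual layout.
import std/tables

const NumStripes = 4            # e.g. one stripe per thread

type Stripe = seq[(int, int)]   # (objectId, delta) events

var stripes: array[NumStripes, Stripe]

proc incRef(stripe, obj: int) = stripes[stripe].add (obj, 1)
proc decRef(stripe, obj: int) = stripes[stripe].add (obj, -1)

proc merge(): Table[int, int] =
  ## The single merge point: the only place where counts are combined.
  for s in stripes.mitems:
    for (obj, delta) in s:
      result.mgetOrPut(obj, 0).inc delta
    s.setLen 0
```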
Another key insight: tracing GC algorithms throw away information, while RC (and everything built on top of it, such as YRC) gets complete incRef/decRef information. So tracing GCs need to recompute what RC already knows. Sure, disregarding this information by not storing it helps throughput, but everything else suffers.
YRC knows EXACTLY:
✅ Every incRef/decRef event (buffered in stripes)
✅ Exact reference count for every object
✅ Precise root set (merged roots with inRootsFlag)
✅ Exact cycle membership (rcSum == edges check)
Tracing GCs must REBUILD all this:
❌ Throw away ref count history → rediscover roots via stack scan
❌ Throw away cycle info → full graph traversal
❌ Throw away locality → generational heuristics/card tables
❌ Throw away incrementalism → snapshot or write barriers
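The `rcSum == edges` check above has a simple reading: if the summed reference counts of a candidate subgraph equal the number of edges internal to it, every reference into the subgraph comes from inside it, so the whole thing is an unreachable cycle. A toy version of that check (`Obj` and this helper are illustrative, not YRC's code):

```nim
# Toy "rcSum == edges" test for an unreachable cycle.
type Obj = ref object
  rc: int
  refs: seq[Obj]

proc isGarbageCycle(subgraph: seq[Obj]): bool =
  var rcSum, internalEdges = 0
  for o in subgraph:
    rcSum += o.rc
    for r in o.refs:
      if r in subgraph:       # `==` on refs is identity
        inc internalEdges
  # All refcounts accounted for by internal edges -> no external owner.
  rcSum == internalEdges
```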
The per-object inc/dec operations that tracing-GC fans call "overhead" are actually priceless metadata!
Sorry for the marketing blurb, but I'm really proud of my discovery...
Excellent work! Doing it in 550 lines is impressive 👍
When will it be ready for testing in Nim 2?
What does the "Y" in YRC stand for?
> When will it be ready for testing in Nim 2?
It is in Nim devel. No idea when the next release will come.
> What does the "Y" in YRC stand for?
Y: "almost the last RC based cycle collector" (Y comes right before Z). You can also interpret the Y as two threads joining.
In German it's pronounced "ürk", the sound I made when I came up with it ("slow shit...")
> In German it's pronounced "ürk", the sound I made when I came up with it
🤣
And, just like that, you also discovered a universal solution to the naming problem!
Great!
Is it ready to experiment with?
Bad news for YRC. While it appears to work correctly, this program is about 6× slower than with mm:orc:
discard """
  output: '''true peak memory: true'''
  cmd: "nim c --gc:orc -d:release $file"
"""

## Concurrent variant of torcbench.
##
## Each thread works entirely on its own private heap — no nodes are shared
## across threads — so the test is valid under both mm:orc and mm:yrc.
## Under mm:orc the cycle collector runs per-thread.
## Under mm:yrc the global lock arbitrates concurrent collectors.

import std/[locks, strutils, times]

const
  NumThreads = 8
  OuterIters = 25 ## per thread; total work ≈ 100 iters as in torcbench
  ListLen = 5000
  TreeIters = 400
  TreeDepth = 8

type
  DLNode = ref object
    value: string
    next: DLNode
    prev {.cursor.}: DLNode # weak back-ref, same as DoublyLinkedNode

  DLList = object # value type, embedded in Node
    head: DLNode
    tail: DLNode

  Node = ref object
    parent: DLList # copy of the list header (head/tail refs)
    le, ri: Node
    self: Node # self-cycle to force cycle detection

proc dlAppend(L: var DLList; s: string) =
  let n = DLNode(value: s)
  if L.head == nil:
    L.head = n
    L.tail = n
  else:
    n.prev = L.tail
    L.tail.next = n
    L.tail = n

proc buildTree(parent: DLList; depth: int): Node =
  if depth == 0:
    return nil
  if depth == 1:
    return Node(parent: parent)
  result = Node(
    parent: parent,
    le: buildTree(parent, depth - 1),
    ri: buildTree(parent, depth - 2))
  result.self = result

# ---------------------------------------------------------------------------
# Per-thread workload — identical to torcbench.main() but self-contained
# ---------------------------------------------------------------------------

proc threadWork(_: pointer) {.thread.} =
  for i in 1 .. OuterIters:
    var leakList: DLList
    for j in 1 .. ListLen:
      leakList.dlAppend(newString(200))
    for k in 0 .. TreeIters:
      discard buildTree(leakList, TreeDepth)
    GC_fullCollect()

var threads: array[NumThreads, Thread[pointer]]
for i in 0 ..< NumThreads:
  createThread(threads[i], threadWork, nil)
joinThreads(threads)

GC_fullCollect()
when not defined(useMalloc):
  echo getOccupiedMem() < 10 * 1024 * 1024, " peak memory: ", getMaxMem() < 10 * 1024 * 1024
Now given how YRC works that is not surprising. We need yet another approach...
Since then many more experiments have been run and a "non-blocking YRC" was developed. It was 20 times slower than ORC. Impractical. We will follow Rust, C++ and Swift and move forward to Nim 3 without a cycle collector.
The good news is that I found a good way to detect most cycles at compile-time, so we can flag the problematic code and offer a nice upgrade path.
> I found a good way to detect most cycles at compile-time

Interesting!
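Detecting cycles at compile time can be thought of as a reachability question over the type graph: an edge T → U means a value of type T can hold a ref to type U, and a type can participate in a runtime cycle only if it can reach itself through ref edges. The following is a sketch of that idea under my own assumptions, not the actual analysis in the Nim compiler:

```nim
# Toy "compile-time" cycle detection over a type graph.
# graph[T] lists the types that T can hold a ref to.
import std/[tables, sets]

proc onCycle(graph: Table[string, seq[string]]; start: string): bool =
  ## DFS: can `start` reach itself through ref edges?
  var seen: HashSet[string]
  var stack = graph.getOrDefault(start)
  while stack.len > 0:
    let t = stack.pop()
    if t == start: return true        # found a path back to `start`
    if not seen.containsOrIncl(t):
      stack.add graph.getOrDefault(t)
  false
```

A compiler doing this could flag only the types for which `onCycle` holds, leaving acyclic types on plain (atomic) RC.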