Hello Nim community!
I was looking through some benchmarks and noticed that Java/JVM languages perform exceptionally well on the binary trees benchmark - running 2-3x faster than other solutions. From what I can tell this is largely due to the JVM's very advanced garbage collectors, which make it exceptionally good at this kind of workload.
I was wondering if it's possible to take a similar approach in Nim. I'm not an expert on Java or GCs, but I believe most of the Java GCs are generational and compacting, meaning that the actual addresses of objects are moved around in memory. I'm aware of Nim's mark-and-sweep collector, but I didn't see any information about whether it uses generations/compaction.
I'm not expecting it to be added to the Nim runtime or anything, but it's something that I was thinking about experimenting with and wanted to know if there are any known roadblocks before I spend a lot of time on it.
Thanks!
That benchmark is already a bit famous among Nim people because of how terrible it is (though it's not even the worst of all the benchmarks on the site). It measures nothing but allocation, deallocation and GC speed, which, especially in a language such as Nim where value types are very common, is just not representative of anything.
Unfortunately allocation and deallocation with --gc:orc and --threads:on is slow, because it has to lock/unlock the heap. But even then, in such an extreme example, classical GCs will be faster than --gc:orc's reference counting, because they can work in bulk.
What the benchmark doesn't show are things like memory usage, or that --gc:arc (and, a bit less so, --gc:orc) brings predictable performance. Also, contrary to Nim's old thread-local GC, it can be used across multiple threads. Admittedly Java's GC can too, but the trade-off is that it's a beast with lots of complexity and, to my knowledge, pauses.
Regarding the compacting (which is another one of those complexities), it can't be used, at least not with a lot of Nim code interoperating with C code, because that requires that addresses aren't moved. It would probably also not work with stack scanning (which is what Nim's old GC does).
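To make the interop point concrete, here is a minimal sketch (the process function and mylib.h header are invented for illustration): once a raw address has been handed to C like this, a compacting collector could not relocate the buffer without breaking the C side.

```nim
# Hypothetical C function taking a raw buffer pointer.
proc process(p: pointer; len: csize_t) {.importc: "process", header: "mylib.h".}

var buf = newSeq[byte](1024)
# The raw address escapes to C; a moving/compacting GC would have to pin
# `buf` here, because it cannot update the pointer the C code holds.
process(addr buf[0], csize_t(buf.len))
```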
Lastly, have you taken a look at the Boehm GC? If you need a classical GC that works with multithreaded Nim code, this is it, though I can't say anything about how well it performs in terms of speed.
I just looked at Boehm and according to this SO post, it doesn't perform any compaction or reference moving (although the Wikipedia page said that it can if integrated fully).
I do realize that the binary trees benchmark is kind of a worst-case scenario for the ARC/ORC approach, and that there are ways to make it much faster, but I thought that a Java-like GC combined with Nim's use of value types would be even better, since you would get the advantages of value types together with the advantages of a high-throughput GC for batch processing.
I was looking at Rust a while back and I think it introduced a Pin concept to ensure that memory doesn't move for cases of C interfacing like you described. I'll look at Nim's source code a little bit more, but it doesn't look like it's set up for moving references at all. If I feel particularly motivated maybe I'll try to modify it and experiment :)
Thanks for your response!
This is what I know about GCs:
RefC is not fast, but it's stable. When the reference count of an object is reduced to 0, the GC checks whether it's still on the heap and frees it if it is.
ORC is fast, but it comes with undefined behavior. Calling =destroy twice without wasMoved results in a segmentation fault. Calling wasMoved without =destroy results in a memory leak.
ARC is even faster, but less stable. It's an extreme version of ORC.
Boehm was designed for C programs. It conservatively scans the stack for values that look like pointers and frees objects that are no longer reachable.
A JVM GC "stops the world." It was designed for a stack-based virtual machine.
Arc is orc without a cycle collector, meaning it just doesn't clean up memory with cycles.
Refc isn't that much slower than orc. From what I know it's one of the faster GCs out there, and orc isn't really a GC.
Calling =destroy twice without wasMoved results in a segmentation fault. Calling wasMoved without =destroy results in a memory leak.
While true, you're not supposed to call them yourself, unless you are building a destructor-based object from raw pointers.
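For context, here is a minimal sketch of such a destructor-based wrapper around raw memory (the Buffer type and its procs are purely illustrative, and the `var T` signature of the hooks assumes a pre-2.0 Nim):

```nim
type
  Buffer = object
    data: ptr UncheckedArray[byte]
    len: int

proc `=destroy`(b: var Buffer) =
  # Runs automatically when a Buffer goes out of scope; no manual call needed.
  if b.data != nil:
    dealloc(b.data)

proc `=copy`(dst: var Buffer; src: Buffer) {.error.}  # make the type move-only

proc initBuffer(len: int): Buffer =
  # Ownership of the raw allocation is handed to the destructor hook.
  Buffer(data: cast[ptr UncheckedArray[byte]](alloc0(len)), len: len)
```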
It's in principle possible to add a Java-style GC to Nim, but precise stack root traversal is in practice just not feasible when you produce C(++) code. While it can be done, it cannot be done efficiently. You don't have to be precise in the stack root traversal, but if you aren't, add about 5 years of development time to your GC project.
If you want GC algorithms beyond ORC I would look into creating a code generator that targets .NET. It too offers parallel GCs and does support value types.
I'm not expecting it to be added to the Nim runtime or anything, but it's something that I was thinking about experimenting with and wanted to know if there are any known roadblocks before I spend a lot of time on it.
Just forget about it, it will drive you crazy. Many papers about GCs are either incomplete or contain severe bugs or both. :-)
I've benchmarked a lot of implementations of the binary tree benchmark for a number of different allocators and garbage collectors. Source code is available here.
We'll go into the details of the benchmark in a moment, but I need to clarify what I wanted to show so I don't get misunderstood:
What I didn't want to do was to make any overall comparison, especially between garbage collectors, because you really can't without implementing a wide variety of real-world programs in a number of different languages, which is infeasible. The relative performance of garbage collectors (and other allocators) can vary greatly between programs and programming languages, and depending on whether you need throughput or latency more, different GCs can be better options.
In addition, the binary trees benchmark is anything but representative. It is sort of the absolute worst case for a GC, because it has a large live heap while allocations happen and is almost all about allocations and deallocations, with hardly any computations happening. I went for the worst case specifically to show the competitiveness of garbage collectors with manual allocation even when the deck was stacked against them. (Note that the benchmark originated as part of the Boehm GC tests in a somewhat different form.)
You will, however, notice that Java and OCaml blow pretty much everybody else out of the water. While some of this is due to how they implement their respective GCs, part of it is also an artifact of the benchmark.
So, this is actually not very representative. Realistic imperative code should not spend more than 10%-20% of its time on allocation-related work, thus you can probably divide the differences in the benchmark by a factor of five or so.
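To put a rough number on that (purely illustrative): if only 20% of a program's runtime is allocation-related, then by Amdahl's law even making that part completely free caps the overall speedup at 1 / 0.8 = 1.25x, whereas in a benchmark where allocation dominates the same GC difference shows up almost in full.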
Most programs that are presumed to have a garbage collection problem actually have a garbage problem, i.e. the code allocates memory more often than necessary, and that incurs much of the overhead.
Functional programming languages generally cannot avoid this; unless they exclusively operate on scalar values or get around it with compiler optimizations, they have a very high allocation rate and require an efficient GC. Imperative languages that can use in-place operations are not as limited.
The JVM has a related problem in that it (currently) lacks value types. For example, if you want to implement a matrix of complex numbers and represent each complex number as a pair of reals, you pretty much end up doing one allocation per number, whereas Nim or even C# can get away with 1-2 allocations for the entire matrix.
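In Nim terms, the "1-2 allocations for the entire matrix" case looks roughly like the sketch below (the Matrix type is just for illustration):

```nim
import std/complex

type
  Matrix = object
    rows, cols: int
    data: seq[Complex64]  # Complex64 is a value type: the numbers are stored
                          # inline in one contiguous buffer

proc initMatrix(rows, cols: int): Matrix =
  # A single backing allocation covers all rows*cols complex numbers.
  Matrix(rows: rows, cols: cols, data: newSeq[Complex64](rows * cols))

proc `[]`(m: Matrix; r, c: int): Complex64 =
  m.data[r * m.cols + c]
```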
So, while OCaml and Java do really, really well, they also need a high-performance GC more than any other language. Imperative languages have the alternative of using an allocation strategy that reduces the percentage of time spent on allocations to the point where it becomes insignificant for throughput. (Latency is a bit more complicated.)
[1] This results from people badly misreading an old paper. For a counterpoint, see e.g. the GC_FREE_SPACE_DIVISOR=10 benchmarks for the Boehm GC, which effectively limit memory overhead to around 10% and are still competitive.
Unfortunately allocation and deallocation with --gc:orc and --threads:on is slow, because it has to lock/unlock the heap, but this could be alleviated soon with mimalloc. But even then, in such an extreme example, classical GCs will be faster than --gc:orc's reference counting, because they can work in bulk.
Is this really the case? AFAIK arc/orc doesn't do any locking but I haven't been keeping up with it much. Any further reading on how this locking operates?
When you create a ref object with gc:arc/gc:orc it calls nimNewObj, which in turn calls alignedAlloc0, which, when the program is compiled with threads:on, calls allocShared0, which in the end, when Nim's integrated allocator is used, is implemented via a lock and unlock of a mutex.
I've profiled the tree benchmark and locking and unlocking was the number one measured hotspot. All Nim benchmarks on that site are compiled with threads:on and gc:orc; IIRC some of the Nim benchmarks made use of parallelism.
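For reference, the kind of code that hits this path is just the plain ref-based tree from the benchmark, roughly like the sketch below; every Node() construction goes through the chain described above.

```nim
# Compile with: nim c -d:release --gc:orc --threads:on binarytrees.nim
type
  Node = ref object
    left, right: Node

proc makeTree(depth: int): Node =
  # Each Node() here is a heap allocation that, with threads:on, ends up in
  # allocShared0 and therefore takes the shared-heap lock.
  if depth == 0:
    Node()
  else:
    Node(left: makeTree(depth - 1), right: makeTree(depth - 1))
```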
[1] This results from people badly misreading an old paper.
FWIW it did match up with my measurements.
And for plenty of domains (scientific computing, embedded devices) optimized reference counting does work better than tracing techniques. For embedded it really helps if objects are freed "immediately", as that translates into less memory consumption, and the same is true for scientific computing when it comes to megabyte-sized matrices.
Wow! Thanks for such an extensive response!
I see what you're saying about it being an unrealistic benchmark, and how Nim and C# simply don't put as much pressure on the GC in real-world programs. I was looking at applying Java GC techniques mostly out of curiosity; I didn't expect a big difference in real-world code, since I know custom allocators are common for high-performance work anyway.
Realistic imperative code should not spend more than 10%-20% of its time on allocation-related work, thus you can probably divide the differences in the benchmark by a factor of five or so.
So it seems like you'd expect to see almost no difference except in the most extreme cases, and even in those cases the Nim program would likely just be producing unnecessary garbage anyway, right?
"Unnecessary garbage" can be misleading. Check out this PR, https://github.com/nim-lang/Nim/pull/20433: it replaces a highly optimized tree structure with plain strings, and these are the results:
| PR | Runtime | Memory consumption |
|---|---|---|
| Before | 11.8s | 670 MiB |
| After | 11.2s | 534 MiB |
The tree wasn't "unnecessary garbage"; it represented the final output of the compiler (the C code in this case). But imperative code and mutable data structures simply allow for manual optimizations that are not obtainable otherwise. The refactorings are beyond the scope of any state-of-the-art compiler optimizer.
So it seems like you'd expect to see almost no difference except in the most extreme cases, and even in those cases the Nim program would likely just be producing unnecessary garbage anyway, right?
Not entirely what I meant, though I may not have been clear. Some imperative algorithms are allocation-heavy even if you write them naturally, and if they occur in a hot loop, you can have a problem.
However, there are a few points to make:
1. You can generally make them more efficient using imperative code. The standard insertion algorithm for red-black trees requires O(1) allocation work per insertion (and also O(1) memory writes overall, but of course still O(log n) memory reads); a purely functional red-black tree implementation requires O(log n) allocation work per insertion.
2. You can reduce the allocation work in many cases. There are a number of common approaches:
(a) Represent e.g. a tree as an array and references as integer offsets within the array.
(b) If your allocator is slow, keep a size-limited free list per data structure, say up to 100 entries. Whenever you free a node, put it on the free list, unless that would exceed its max size, in which case you free it normally. To allocate a node, you go to the free list first, unless it's empty, in which case you allocate normally. This means that for data structures that fluctuate in size, you save a fair amount of allocations (see the sketch after this list).
(c) Represent objects as tuples that reference another data structure. For example, in a string interning implementation I wrote once, I didn't allocate the string that was to be interned separately (in most cases it would have been discarded right after the interning operation), but passed an (original string, start, length) triple. String interning allocated a string only if there wasn't already an interned variant.
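Here is a minimal sketch of approach (b), the size-limited free list, in Nim (the NodePool type and the limit of 100 are just this example's assumptions):

```nim
const MaxFree = 100

type
  Node = ref object
    left, right: Node
    value: int
  NodePool = object
    free: seq[Node]

proc getNode(pool: var NodePool): Node =
  ## Reuse a node from the free list if possible, otherwise allocate normally.
  if pool.free.len > 0:
    result = pool.free.pop()
    result.left = nil
    result.right = nil
    result.value = 0
  else:
    result = Node()

proc releaseNode(pool: var NodePool; n: Node) =
  ## Keep up to MaxFree nodes for reuse; beyond that, just drop the reference
  ## and let the regular memory management free it.
  if pool.free.len < MaxFree:
    pool.free.add n
```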
Note that on contemporary hardware, even with a state-of-the-art GC, there's a limit on how much you can allocate per unit of time. This is difficult to reach within a single thread, but if you use multiple threads and a purely functional style, you can run into it fairly easily. Even the best GC can't fix that.
The core issue that I tend to get back to is that as I wrote above, most programs that run into GC issues have a garbage problem rather than a garbage collection problem. And unlike with most purely functional languages, imperative or multi-paradigm languages can usually avoid it or work around it.
Is this really the case? AFAIK arc/orc doesn't do any locking but I haven't been keeping up with it much. Any further reading on how this locking operates?
Nim memory management has 2 components:
1. The GC or reference-counting scheme (refc, ARC/ORC, ...), which manages ref types.
2. The allocator that actually serves the memory requests, which is based on the TLSF algorithm.
Note: Boehm handles both 1 and 2.
Thing is, TLSF is designed for real-time systems, not for multicore systems, so to handle concurrent requests for memory a lock has been added. This impacts all GCs besides Boehm (ignoring the Go GC, which is untested AFAIK).
The future memory allocator, mimalloc, is thread-safe, with many techniques to handle potential contention from multiple threads (like extreme free-list sharding), and so fixes this performance issue. This is why I avoid the Nim allocator wherever I might be allocating from multiple threads in rapid succession (https://github.com/mratsim/weave/blob/71dc2d7/weave/memory/allocs.nim#L32-L41).
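As a rough illustration (not taken from the code linked above), bypassing the locked Nim allocator by calling a thread-safe allocator directly can look like this; it uses mimalloc's C API and assumes the program is linked against libmimalloc:

```nim
# mimalloc's C API; requires linking against libmimalloc.
proc mi_malloc(size: csize_t): pointer {.importc, header: "mimalloc.h".}
proc mi_free(p: pointer) {.importc, header: "mimalloc.h".}

proc allocBuf(size: int): pointer =
  # Thread-safe allocation that doesn't take Nim's shared-heap lock.
  mi_malloc(csize_t(size))

proc freeBuf(p: pointer) =
  mi_free(p)
```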
Note: I do think TLSF has value but probably as a library?
FWIW back then the TLSF algorithm was heavily modified to become a very good general-purpose allocator for Nim. It's just that nobody ever figured out how to make it work well for multithreading.
Mimalloc is the future and it is O(1) too afaict.