I've patched Nim's standard allocator to support a single shared heap without the global lock. Took ideas from mimalloc.
The allocator uses three kinds of "chunks": small, large, and huge. Allocations are thread-local, but a thread can free a memory block that it doesn't own. For small chunks we use a lock-free shared list structure in addition to the thread-local free list. For large and huge chunks we acquire a lock that belongs to the block's memory region, so there is contention between the allocating and the freeing thread, but not between all threads at the same time. It should scale much better.
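To make the small-chunk path concrete, here is a minimal sketch of the idea using std/atomics. The type and proc names (SmallChunk, freeFromForeignThread, drainSharedList) are made up for illustration; this is not the code from the PR.

import std/atomics

type
  FreeCell = object
    next: ptr FreeCell

  SmallChunk = object
    freeList: ptr FreeCell                 # only touched by the owning thread
    sharedFreeList: Atomic[ptr FreeCell]   # other threads push freed cells here

proc freeFromForeignThread(c: var SmallChunk; cell: ptr FreeCell) =
  ## Lock-free push (Treiber-stack style) used when the freeing thread
  ## does not own the chunk.
  var head = c.sharedFreeList.load(moRelaxed)
  while true:
    cell.next = head
    if c.sharedFreeList.compareExchangeWeak(head, cell, moRelease, moRelaxed):
      break

proc drainSharedList(c: var SmallChunk) =
  ## The owning thread occasionally grabs the whole shared list and merges
  ## it into its private free list; no other thread touches `freeList`.
  var cell = c.sharedFreeList.exchange(nil, moAcquire)
  while cell != nil:
    let nxt = cell.next
    cell.next = c.freeList
    c.freeList = cell
    cell = nxt

Large and huge chunks skip the lock-free list and instead take the lock of the memory region the block belongs to, so only the allocating and the freeing thread can contend on it.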
But the patch needs many more eyes on it: https://github.com/nim-lang/Nim/pull/20492
The thread sanitizer doesn't like it either:
WARNING: ThreadSanitizer: data race (pid=20138)
Read of size 8 at 0x7fa1834800a0 by main thread:
#0 compensateCounters__system_6660 /home/runner/work/Nim/Nim/lib/system/alloc.nim:751:80 (test+0x4cabe4)
#1 rawAlloc__system_6692 /home/runner/work/Nim/Nim/lib/system/alloc.nim:794:5 (test+0x4cb661)
#2 alloc__system_6888 /home/runner/work/Nim/Nim/lib/system/alloc.nim:986:11 (test+0x4c3069)
#3 allocImpl__system_1762 /home/runner/work/Nim/Nim/lib/system/alloc.nim:1062:11 (test+0x4cbbe0)
#4 allocSharedImpl /home/runner/work/Nim/Nim/lib/system/alloc.nim:1118:11 (test+0x4b9fd0)
#5 allocShared0Impl__system_1777 /home/runner/work/Nim/Nim/lib/system/alloc.nim:1121:11 (test+0x4c58d0)
#6 createThread__system_4338 /home/runner/work/Nim/Nim/lib/system/threads.nim:342:24 (test+0x4d143f)
#7 createThread__system_4329 /home/runner/work/Nim/Nim/lib/system/threads.nim:377:2 (test+0x4d1811)
#8 atmt_mupsic_threadeddotnim_Init000 /home/runner/work/Nim/Nim/pkgstemp/lockfreequeues/tests/t_mupsic_threaded.nim:61:7 (test+0x5f6692)
#9 PreMainInner /home/runner/work/Nim/Nim/lib/pure/unittest.nim:225:2 (test+0x72dd2f)
#10 PreMain /home/runner/work/Nim/Nim/lib/pure/unittest.nim:247:2 (test+0x72dd9d)
#11 NimMain /home/runner/work/Nim/Nim/lib/pure/unittest.nim:255:2 (test+0x72dee1)
#12 main /home/runner/work/Nim/Nim/lib/pure/unittest.nim:263:2 (test+0x72df9e)
Previous write of size 8 at 0x7fa1834800a0 by thread T6:
#0 addToSharedFreeList__system_6621 /home/runner/work/Nim/Nim/lib/system/alloc.nim:731:14 (test+0x4be196)
#1 rawDealloc__system_6837 /home/runner/work/Nim/Nim/lib/system/alloc.nim:898:4 (test+0x4bda65)
#2 dealloc__system_6896 /home/runner/work/Nim/Nim/lib/system/alloc.nim:1003:2 (test+0x4be341)
#3 deallocImpl__system_1766 /home/runner/work/Nim/Nim/lib/system/alloc.nim:1068:2 (test+0x4be398)
#4 deallocSharedImpl__system_1779 /home/runner/work/Nim/Nim/lib/system/alloc.nim:1130:2 (test+0x4be3d8)
#5 deallocShared /home/runner/work/Nim/Nim/lib/system/memalloc.nim:314:2 (test+0x4be418)
#6 threadProcWrapper__system_4359 /home/runner/work/Nim/Nim/lib/system/threads.nim:222:2 (test+0x4d13d6)
Awesomeness! This should be a nice bump to performance.
Have you done any benchmarking yet? I was tinkering with one of those benchmark games and wondering why the Rust version was so much faster. It was an allocation-heavy test, and the issue was that turning on threads in the Nim version cut performance; with threads:off they were identical. If I remember, I'll try to run it with this to see how it does. It only used one thread, but I'd like to see the effect this change has.
Assuming no issues are found with these changes, will they land only in 2.0, or will they also make it into the 1.x series?
Maybe once it has proven to be stable in 2.x we can backport it to the 1.x series, but right now it doesn't work, and once it does, backporting it will still be much too risky.
The thread sanitizer doesn't like it either:
The FreeCell.next field should be updated via atomicLoad and atomicStore (or be an Atomic[ptr FreeCell]) to remove that warning.
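Concretely, something along these lines should give the sanitizer the synchronization it wants. This is a minimal sketch assuming std/atomics; setNext/getNext are hypothetical helpers, not the PR's code.

import std/atomics

type
  FreeCell = object
    next: Atomic[ptr FreeCell]   # instead of a plain `ptr FreeCell`

proc setNext(c: ptr FreeCell; value: ptr FreeCell) =
  ## Writer side (e.g. the freeing thread in addToSharedFreeList).
  c.next.store(value, moRelease)

proc getNext(c: ptr FreeCell): ptr FreeCell =
  ## Reader side (e.g. the allocating thread walking the shared list).
  c.next.load(moAcquire)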
mimalloc has a very different fragmentation profile and seems not to coalesce memory blocks at all (?). TLSF was designed for embedded real-time systems; making it work with threads in a novel way cannot hurt.
Advantages and disadvantages will be easier to determine once both can be measured and the results compared.
There is coalescing at the segment level (a 4 MB segment holds multiple pages): https://github.com/microsoft/mimalloc/blob/f2712f4/src/segment.c#L623
That said, in order to provide a heartbeat that expensive GC-style maintenance can be hooked onto and amortized over, mimalloc does not immediately return a freed memory block to the pool of allocatable blocks.
They are only made available again when the page has been fully used. At this point coalescing is free since the whole page has been freed.
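For intuition, here is a rough conceptual sketch of that deferral in Nim; the names (Page, localFree, ...) are invented and this is not mimalloc's actual C implementation.

type
  Block = object
    next: ptr Block

  Page = object
    freeList: ptr Block     # blocks the allocator can hand out right now
    localFree: ptr Block    # blocks freed since the last "heartbeat"

proc free(p: var Page; b: ptr Block) =
  # Freed blocks are parked on `localFree`; they are not allocatable yet.
  b.next = p.localFree
  p.localFree = b

proc alloc(p: var Page): ptr Block =
  if p.freeList == nil:
    # Heartbeat: only when the page's free list is exhausted are the deferred
    # blocks collected, so the expensive maintenance is amortized over many
    # allocations and whole-page reclamation happens in one go.
    p.freeList = p.localFree
    p.localFree = nil
  if p.freeList != nil:
    result = p.freeList
    p.freeList = p.freeList.next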
I found this document: https://os.itec.kit.edu/downloads/2020_BA_Lefeuvre_Toward_Specialization_of_Memory_Management_in_Unikernels.pdf
TLSF is usually about 11% slower but its memory consumption is much better for embedded devices.