Are there any arena allocators for Nim that work with ARC? In particular, one that works on a per-thread basis to avoid the locking overhead of malloc.
There appears to be a memory pool in Weave. However, it looks like it uses some locking (?). Would it be possible to have a simpler arena pool allocator in the standard library?
Would it be possible to have a simpler arena pool allocator in the standard library?
Please write a concrete RFC, many people want arenas and pools but it's not clear what exactly that implies.
Please write a concrete RFC, many people want arenas and pools but it's not clear what exactly that implies.
First I’d like to get a feel for whether there’s anything already out there that’s close enough for my use case.
My interest is more of a memory “cache” per thread that uses malloc to get larger memory blocks but then provides a cache of reusable chunks of memory for the current thread without requiring locks/mutexes. Arenas would seem like a very fast way to provide such thread-local memory caching.
In particular, on embedded devices, using the system’s malloc is expensive. It’s not terrible, but keeping code fast requires a fair bit of effort to avoid allocations. It’d be nice to be able to repeatedly create-and-free a bunch of small objects without the slowdown from lots of calls to malloc. Currently I have to manually pre-allocate objects and reuse them to speed up sections of code.
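To give an idea of the kind of manual reuse I mean, here’s a rough, made-up example (the real code is messier; all names are just for illustration):

# A hypothetical message object that would otherwise be allocated per iteration.
type Msg = ref object
  payload: array[32, byte]
  len: int

# Pre-allocate a fixed set of objects once, up front...
var msgPool: array[8, Msg]
for i in 0 ..< msgPool.len:
  msgPool[i] = Msg()

# ...then reuse them in the hot section instead of allocating each time.
proc handle(i: int) =
  let msg = msgPool[i mod msgPool.len]
  msg.len = 0   # reset the old object rather than creating a new one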
Unfortunately, neither the standaloneheap nor the virtual-pages allocator works well on many MCUs.
I believe @PMunch’s suggestion relies on virtual memory. @Planeti’s link to the fusion lib pool allocator could work… but it’s unclear how it could be used with libraries. Still, it could work well for many use cases.
It’s not currently possible to annotate a block of code or a function to use a custom allocator, is it? That seems like it’d be difficult to implement! Possibly I’m thinking of the “region” based memory management that was experimented with at some point.
I could write an RFC if it’d be helpful. I’ve gone back and forth over whether it’s a worthwhile issue to pursue.
Unfortunately, neither the standaloneheap nor the virtual-pages allocator works well on many MCUs.
This is the key point, I think. What would work well on MCUs? What does malloc do on these systems? standaloneheap might be fixable, I've always been irrationally proud of this hack. :-)
This is a standalone threadsafe memory pool. The block size is fixed at compile-time:
https://github.com/mratsim/weave/blob/71dc2d70/weave/memory/memory_pools.nim
This is the key point, I think. What would work well on MCUs? What does malloc do on these systems? standaloneheap might be fixable, I've always been irrationally proud of this hack. :-)
haha, StandAloneHeap is a pretty great hack! It's one of the first things in the stdlib I read and thought: "that's pretty handy, I like Nim's style".
StandAloneHeap works well on Arduinos. On the ESP32 it was limited to about 1/2 of the RAM size due to linking issues; it had some limit on how large the static objects could be. One solution I thought about was copying StandAloneHeap and making a StandAloneMalloc that'd use malloc once upon initialization. Though it seems like StandAloneHeap would be prone to memory fragmentation?
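Roughly what I had in mind for a StandAloneMalloc, as a sketch only (names and sizes are invented, and it glosses over alignment and freeing):

# Call the system malloc exactly once at startup to grab a big region,
# then hand out chunks from it bump-pointer style, StandAloneHeap-like.
proc c_malloc(size: csize_t): pointer {.importc: "malloc", header: "<stdlib.h>".}

const HeapSize = 64 * 1024      # whatever the MCU can spare

var
  heapBase: uint                # start of the malloc'd region
  heapUsed: int

proc initStandAloneMalloc() =
  heapBase = cast[uint](c_malloc(csize_t(HeapSize)))   # the single malloc call
  heapUsed = 0

proc bumpAlloc(size: int): pointer =
  # no locks and no free list: just advance an offset; individual frees
  # aren't supported, which is where the fragmentation question comes in
  if heapUsed + size > HeapSize: return nil
  result = cast[pointer](heapBase + uint(heapUsed))
  heapUsed += size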
Honestly, I'm not sure how the ESP32's malloc works... except that it's pretty buggy. :/
This is a standalone threadsafe memory pool. The block size is fixed at compile-time:
Thanks! I was reading through that. It looks like it's solving a similar type of problem; I'm not sure if it could be tweaked for my use case. The comments mention it being for cross-thread use, and it looks like it uses atomics to keep track of the index pointer?
MemBlock {.union.} = object
  next: Atomic[pointer]
  ...
After tossing around some thoughts for a bit, I'm thinking of something like this:
On thread (task) initialization, a call can be made to set up a thread-local cache of fixed-size blocks (a rough sketch is below).
It's a pattern I believe Jemalloc and OCaml utilize. The terminology seems to vary, though. It's a bit inefficient in memory space since you have to round up to whatever predefined block sizes are set up, but the tradeoff would be only a couple of branches and no locks or atomics. I'm not sure how dealloc would work; perhaps you could bitmask the address to determine which block it belongs to?
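To make that concrete, here's a very rough sketch of the shape I'm imagining (all names and sizes are invented, and the slab refill is omitted):

const BlockSizes = [32, 64, 128, 256]    # predefined size classes (made up)

type
  FreeNode = ptr object
    next: FreeNode
  ThreadCache = object
    freeLists: array[BlockSizes.len, FreeNode]   # one free list per size class

var cache {.threadvar.}: ThreadCache    # per-thread, so no locks or atomics needed

proc classFor(size: int): int =
  # a couple of branches to pick the size class; -1 means "fall back to malloc"
  for i, s in BlockSizes:
    if size <= s: return i
  -1

proc cachedAlloc(size: int): pointer =
  let c = classFor(size)
  if c < 0: return nil                  # too big: caller uses plain malloc
  let node = cache.freeLists[c]
  if node != nil:
    cache.freeLists[c] = node.next      # pop from this thread's free list
    return node
  # list empty: grab a new slab with one malloc and carve it into blocks (omitted)

proc cachedFree(p: pointer, size: int) =
  let c = classFor(size)
  if c < 0: return                      # wasn't ours; hand back to the system free
  let node = cast[FreeNode](p)
  node.next = cache.freeLists[c]        # push back, still without any locking
  cache.freeLists[c] = node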
Hmmm... would this method be able to abuse the reference count byte before the memory? ;)
This approach could be bonkers or inefficient. Also, I'm not quite sure I grok how osAllocPages works (perhaps it's pretty similar to the algorithm I outlined above?). Perhaps it'd be easier to make a hybrid StandAloneMalloc and use the pages system. I'm open to thoughts. :-)
Nim's allocator is based on the TLSF algorithm, so block sizes and how they relate to fragmentation are a mostly solved problem; well, at least it's very well researched, and TLSF was designed for embedded systems. Alloc/dealloc is O(1).
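For reference, the TLSF size-class mapping boils down to a couple of bit operations; this is a paraphrase of the paper, not the actual allocator code:

import std/bitops

const SLBits = 4   # log2 of the second-level subdivisions; 4 is a common choice

# Map a request size to a (first level, second level) bucket in O(1).
# The first level is the containing power of two, the second level splits
# that range into 2^SLBits linear sub-ranges. Very small sizes need a
# special case that is omitted here.
proc tlsfMapping(size: int): (int, int) =
  let fl = fastLog2(size)                            # index of the highest set bit
  let sl = (size shr (fl - SLBits)) - (1 shl SLBits)
  (fl, sl)

# e.g. tlsfMapping(460) == (8, 12): every size in 448..463 lands in the same
# free list, so finding a fitting block requires no searching.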
Ignoring the threading aspect (and the slow locks) for the moment, it seems all you really need is a version of osalloc.nim that implements osAllocPages via malloc and osDeallocPages via free. It's a ten line fix...
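Very roughly the shape of it (an untested sketch; the real signatures in osalloc.nim may differ slightly):

proc c_malloc(size: csize_t): pointer {.importc: "malloc", header: "<stdlib.h>".}
proc c_free(p: pointer) {.importc: "free", header: "<stdlib.h>".}

proc osAllocPages(size: int): pointer {.inline.} =
  # let the system malloc do the page management instead of mmap/VirtualAlloc
  result = c_malloc(csize_t(size))

proc osDeallocPages(p: pointer, size: int) {.inline.} =
  c_free(p)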
PR is here: https://github.com/nim-lang/Nim/pull/18490
But we also need to think about multi-threading.
PR is here: https://github.com/nim-lang/Nim/pull/18490
Awesome! I'll try it out this week.
But we also need to think about multi-threading.
That'd probably really speed things up when using ARC on some systems. Currently I suspect there are actually 2-3 locks taken when using the system malloc on my current MCUs. Enabling TLSF will probably speed things up a bit by removing the extra mallocs, plus likely improving stability.
One approach would be to have a thread-specific call that sets up a TLSF instance for that thread, if that's possible. I'll need to read up on TLSF a bit, but perhaps it'd be possible to set up a TLSF heap per thread by default? I'm not sure what the overhead of doing that would be. Of course, both of those methods would run into issues when transferring memory between threads, similar to the problem @mratsim's thread pools are trying to solve. Tricky! :-)