In the past two days I've been researching various ways to avoid expensive allocations/deallocations in tight loops on the GPU by reusing memory.
This is also a recurring issue in game programming, so I'd like to share my approach and ask for comments.
Regarding the decay / timed eviction, I'm not sure which formula to use: exponential decay (radioactivity-like), fixed-time decay, or another scheme (I sketch the two options after the code below).
Here is the code; I didn't implement the interface yet:
import tables,
       intsets,
       deques,
       times

type
  BlockSize = int
  Allocator = proc (size: Natural): pointer {.noconv.}
  Deallocator = proc (p: pointer) {.noconv.}

  CachedBlock = tuple[epoch: Time, address: ByteAddress, size: BlockSize]
    ## Describe a contiguous allocated and available memory chunk in the pool:
    ## - starting timestamp of availability,
    ## - starting address,
    ## - size in bytes

  DecayingObjectPool = object
    freeBlocks: Table[BlockSize, ByteAddress]
    evictionQueue: Deque[CachedBlock]
    lazyReused: IntSet
    allocator: Allocator
    deallocator: Deallocator
    ## Object pool / caching allocator with timed eviction
    ## - Keeps track of available block sizes in the pool.
    ## - Free blocks that are too old expire and are returned to the OS/devices via the deallocator.
    ## - Eviction is managed by a queue; however, reallocated objects must be
    ##   removed from the evictionQueue as well and can be at arbitrary positions.
    ##   To avoid costly deletion in the middle of the queue, reused objects are tracked
    ##   in lazyReused and will be removed lazily from the queue when they expire, but will not
    ##   trigger the deallocator.
    when defined(debug) or defined(test):
      idleMem: Natural
      usedMem: Natural
      nbAllocations: Natural
      nbDeallocations: Natural
      cacheHits: Natural
      cacheMisses: Natural

proc initDecayingObjectPool(proc_alloc: Allocator, proc_dealloc: Deallocator): DecayingObjectPool =
  result.allocator = proc_alloc
  result.deallocator = proc_dealloc

when isMainModule:
  let foo = initDecayingObjectPool(alloc0, dealloc)
  echo foo
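
To make the decay question concrete, here is a minimal sketch of the two expiry rules I'm hesitating between, assuming the current times/Duration API. isExpiredFixed, isExpiredExponential, ttl and halfLife are made-up names and parameters, not part of the pool above, and the probabilistic reading of "exponential decay" is only one possible interpretation.

import times, math, random

# Fixed-time decay: a block expires once it has been idle longer than `ttl`.
proc isExpiredFixed(becameIdle, now: Time, ttl: Duration): bool =
  now - becameIdle > ttl

# Exponential decay (radioactivity-like): on each eviction pass the block
# survives with probability 0.5 ^ (idleTime / halfLife), so the longer it
# stays idle, the more likely it is to be handed to the deallocator.
proc isExpiredExponential(becameIdle, now: Time, halfLife: Duration): bool =
  let idleSecs = (now - becameIdle).inSeconds.float
  rand(1.0) > pow(0.5, idleSecs / halfLife.inSeconds.float)

when isMainModule:
  randomize()
  let becameIdle = getTime() - initDuration(seconds = 30)
  echo isExpiredFixed(becameIdle, getTime(), ttl = initDuration(seconds = 10))             # true
  echo isExpiredExponential(becameIdle, getTime(), halfLife = initDuration(seconds = 10))  # usually true

The fixed-time rule is deterministic and easy to reason about; the probabilistic one spreads evictions out over time instead of deallocating everything at the same cutoff.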
I had a similar problem in the past and considered a similar solution.
However, I ended up using a single custom fixed-size hash table with open addressing. The idea is that once the table fills up, it starts throwing away one element every time another one needs to be inserted; the table never grows. It was not as generic, and it was O(1) only for a subset of operations, but it was much faster.
Answer: it depends on which operations you are going to use most. I ended up inserting in 99.99999% of cases, with no del and no lookup.
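
For illustration, here is a minimal sketch of that kind of structure as I understand the description, not the original code: a fixed-capacity, open-addressed table with a bounded probe window that overwrites an occupied slot instead of growing. FixedTable, the 1024 capacity and the 8-slot probe window are all invented for the example.

import hashes

const
  TableSize = 1024   # fixed capacity, the table never grows
  ProbeWindow = 8    # bounded probing keeps insertion O(1)

type
  Slot[K, V] = object
    used: bool
    key: K
    val: V
  FixedTable[K, V] = object
    slots: array[TableSize, Slot[K, V]]

proc put[K, V](t: var FixedTable[K, V], key: K, val: V) =
  ## Linear probing over a small window; if every probed slot is taken,
  ## overwrite the home slot, throwing away whatever entry was there.
  let start = hash(key) and (TableSize - 1)
  for i in 0 ..< ProbeWindow:
    let idx = (start + i) and (TableSize - 1)
    if not t.slots[idx].used or t.slots[idx].key == key:
      t.slots[idx] = Slot[K, V](used: true, key: key, val: val)
      return
  t.slots[start] = Slot[K, V](used: true, key: key, val: val)

proc get[K, V](t: FixedTable[K, V], key: K): (bool, V) =
  ## Lookup mirrors the probe sequence; a miss is possible after an eviction.
  let start = hash(key) and (TableSize - 1)
  for i in 0 ..< ProbeWindow:
    let idx = (start + i) and (TableSize - 1)
    if t.slots[idx].used and t.slots[idx].key == key:
      return (true, t.slots[idx].val)
  (false, default(V))

when isMainModule:
  var cache: FixedTable[int, string]
  cache.put(42, "hello")
  echo cache.get(42)   # (true, "hello")

Insertion stays O(1) and bounded at the cost of silently dropping entries, which suits a workload that is almost all inserts.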
@cdome That's funny, actually. I've used a container that seems to work just like yours, even though the use case was totally different. :D It was for an evolutionary program.
@mratsim If I understand correctly, you need at least 93 MB * 128 images = 11 GB 904 MB per batch?
By the way: did you look at the Rust allocators collection?
Yes, an Nvidia Titan X is quite a common GPU for deep learning and comes with 12 GB of VRAM; the GTX 1080 Ti has 11 GB, and the GTX 1070/1080 have 8 GB. Whatever the GPU, the goal is to saturate it with the biggest batch it can handle.
There is a lot of research going into more memory-efficient networks, and how to optimize memory usage without impacting accuracy.
I had not come across this Rust allocators collection, great. All my CUDA allocator research is detailed there.
@mratsim Well, you mentioned game programming at first, so "serious deep learning" didn't come to my mind. ^^"
As for Rust allocators: you link to the Linux kernel slab allocator on the page you linked. There is at least one slab allocator (probably several) available for Rust.
I found it useful when I worked on my memory caches: