nimforum mirror - A better memory model?

filwit (orginal) [2015-07-26T14:41:28+02:00] view original

Recently I had a conversation with Araq about memory/GC design, comparison, and complications. It got me thinking about ways to improve common GC problems, and I just had an interesting idea about two of the GC designs Nim's default GC uses. Thought I'd start a discussion about them to see if anyone can poke holes in my ideas.

First, the problems I'm trying to solve..

Stack Scanning - Scanning the stack might be fast enough, but it's very tricky to do shared-memory safely and efficiently. You either have to lock/pause threads, or pay for stack variable ref-counts (or equivalent). Neither of which is good from a performance stand point. This is, to my understanding, partially why Nim's default deferred ref-counting GC is thread-local and doesn't manage shared memory (keep in mind, we do have convenient ways to deal with shared memory.. like threadpool.parallel, Boehm GC, etc..).

Cyclic Detection - Scanning for cycles is the 'soft' part of Nim's default soft-realtime GC. It's the only part (currently) that can exceed the real-time limits you impose, and is surely the most time-consuming task a ref-counting system has to perform. The current situation isn't bad, even for very tight realtime constraints, considering you can disable it independently (see GC_disableMarkAndSweep) and re-enable temporarily + force full-collections sparingly on controlled events (like loading screens) or past a certain memory 'pressure' threshold. Even without real-time control we have things like {.acyclic.} to help, and in general Nim's GC is very fast. Still, not having to do it altogether is obviously ideal.

So, now my ideas on how to improve the situation..

Conservative Stack 'Stamping' - The idea is to record the stack location at which a ref object gets instantiates, then, instead of scanning the stack against zero-ref objects, you simply check to see if the current stack position is above the object's record. If the ref-count of an object hits zero before the stack rolls back the object is simply kept alive by a zero-ref table which is then incrementally checked next time something is allocated on the heap.

I think that would avoid the threading concerns with stack scanning.. granted I'm not sure how a stack looks (in terms of memory positions) to multiple threads, so there may be (and probably is) problems I'm unaware of. If you know of any, please let me know. I also don't think this method would be any worse at 'inaccurately' keeping objects alive any more than the existing conservative scanner, but I would also like to hear any thoughts on that as well. The only obvious downside I can think of is that it takes up more memory per-ref-object, but even that may have a solution (a global stack-location-sorted ref list maybe?).

Acyclic Type-Graphs by Default - This idea only really works with more than what Nim's memory management currently does for ptr object. Part of my discussion with Araq was about eventually adding a basic form of compile-time memory safety onto Nim's (currently unsafe) pointer types for maximizing Nim's usefulness in domains where a GC is simply a bad idea (kernels/drivers, embedded devices, some hard-realtime simulations, etc). Something similar to how Rust manages ownership, albeit initially much simpler (no life-time vars, etc).. before I go any further, please note that I neither speak for Araq nor contribute much to the compiler, so all this is entirely theoretical without any plans or promises whatsoever.

That said, the actual idea is about restricting any object from defining a member of any type that could cause cycles. That means a ref object would be just as restricted as a regular one (in terms of what members are allowed). Obviously that is a bit too restrictive, which is why the idea only really works if we had memory-safe ptr object, since those have similar modeling ability but do not cause any cycles (given their lifetime is handled at compile-time). Also, you could have a {.allowCycles.} pragma to alleviate the restrictions for specific types (things like JSON/XML nodes might need this).

If we did have compile-time ptr safety, then defaulting to acyclic makes a lot of sense. Acyclic types means that as soon as the ref-count hits zero we know for sure the object is garbage (at least from the heaps perspective) and never before that point, meaning it would be hard-realtime. It would also make sense to encourage use of ptr object in general, since it's very deterministic and you get compile-time errors instead of runtime ones.

Here's a small example of my thoughts (for clarity)...

type
  Item = object
    x, y: float
  
  Person = object
    name: string
    item1: ptr Item
    item2: ref Item
    buddy: ptr Person
    #buddy2: ref Person # error! (causes cycles)

var guns: seq[Item] = loadGuns()
var swords: seq[ref Item] = loadSwords()

var philip: ref Person
philip.new()

var andrew = Person(
  name: "Anrew",
  item1: guns[0], # works (locks 'guns')
  #item2: guns[0], # error! (not ref)
  item2: swords[0], # works
  buddy: philip) # works

#guns.del(0) # error! ('guns' is locked by 'andrew.item1')
swords.del(0) # works

I think if both ideas actually worked they would play well with one another. Then again, there's probably flaws I'm not seeing here. What do you think?

Jehan (orginal) [2015-07-26T19:35:58+02:00] view original

Your "conservative stack stamping" idea won't work. Aside from the fact that it doesn't correctly determine liveness (it could both extend an object's lifetime unnecessarily or collect it too early), there's simply not an efficient way to implement it in a multi-threaded environment. You'd have to:

Compare the stack reference to possibly dozens (if not hundreds) of stack pointers for each zero count object, one per thread.

In order to correctly determine do that, you'd also have to halt all the threads temporarily (stop the world), which is the last thing you want to do.

Efficient multi-threaded garbage collection for shared heaps is hard and there simply is no easy solution. It's an order of magnitude harder than for thread-local heaps; the problem has been researched for decades by the brightest minds in the fields and it's doubtful that anyone can come up with a silver bullet that avoids the hard parts. You can make it somewhat easier by having a separate shared heap that's separate from thread-local heap and where you accept slower performance, but even then it's far from trivial (write barriers alone can become a lot more complicated if multiple threads can write to a reference at the same time).

As I've said before, the easiest way to get shared memory in a multi-threaded world is not to try and reinvent the wheel, but to follow the established designs of Eiffel and Erlang, which basically allow for shared lockable heaps by separating memory into disjoint data spaces (determining the lifetime of heaps can still be tricky, but it's still an order of magnitude easier than full shared memory GC).

As for cycles, it is practically always possible to avoid them through programmatic effort. Pretty much the only place where they can occur by accident is when you store closures on the heaps (e.g., callbacks) and the environment captures a reference to the heap that directly indirectly also references the object that stores the closure. Cyclic data structures can be transformed into non-cyclic ones using either weak references or representing internal objects in the data structure through handles (distinct int types that are indices for a seq or keys for a tree or trie). The latter is for example what ParaSail does to avoid cycles. A command line option to warn against cyclic references (or even make them errors) would probably be a useful idea.

Note also that the worst case pause time due to dealing with cycles is not necessarily big; it's determined by the longest cycle on the heap, and shouldn't be too big a deal if cycles are short.

filwit (orginal) [2015-07-27T02:27:07+02:00] view original

Yeah, I realized there was some problems with my suggestions a bit after I posted them. Also, I appreciate that there have been many brilliant people pounding there heads against this problem for decades, and that my efforts here most likely wont add anything to the beat of that drum. However, I'm not entirely convinced (yet) that "stack stamping" wouldn't work, or wouldn't be more efficient than stack scanning.. it's just significantly more complicated than my initial thought (mostly from a compile-time perspective).

So, my initial 'stack stamping' idea fails in very basic scenarios.. example:

proc newNum(num:float): ref float =
  # this function illustrates what the compile would do at a low level for heap allocations
  const refCountSize = sizeof int
  const stackLocSize = sizeof int
  
  var stackTop: byte # addr of this is current stack location
  
  # allocate heap (use malloc for illustration purposes)
  var heapObj = malloc refCountSize + stackLocSize + sizeof float
  
  # pseudo construction..
  heapObj[][0] = 1 # initial ref-count
  heapObj[][1] = addr stackTop # stack stamp
  heapObj[][2] = num # actual data
  
  return addr heapObj[][2] # pass back data addr


proc someFunc =
  var a: ref float
  var b: ref float = newNum(123) # stack stamp is here
  a = b # whoops, now a stack var before our stamp points to our object (potential memory bug!)

However, looking at the problem (from a single-threaded perspective) all we'd have to do to fix this is some semantic analysis magic which evaluates functions and stamps heap-allocated object more intelligently.. eg:

proc newNum(num:float): ref float =
  # here we allocate space for the 'stamp' like before, but don't actually set it
  ...

proc someFunc =
  var a: ref int
  var b: ref int = newNum(123)
  
  # injectd by compiler:
  b[][1] = addr a # compiler sets b's stamp as 'addr a' because sem-analysis tells us 'a' is
                  # the first variable in this context which (potentially) references 'b'
  
  a = b # no memory bugs! (b's obj has a's stack stamp.. so collection is prevented until a is gone)

So, at least from a single-threaded perspective, that seems possible.. granted I haven't gone into var parameters/returns.. which obviously complicate the semantic-analysis quite a bit. But, if that works, isn't that an improvement to the existing thread-local scanning (from a performance standpoint)? You trade more heap memory for not needing to scan the stack.

Now, for multi-threaded stamping, as you point out, things would need multiple 'stamps' for each thread's stack (at least, variables marked as 'shared' would need this). I think even a naive approach to this wouldn't be bad if 'shared' was an explicit marker. For instance 'shared' objects could have, instead of a single stamp, a (cached) sequence of stamps which is populated by the initial stack-locations each thread first "borrows" the reference to the shared object. It would require even more semantic analysis and thus may kill some of the undeterminism of shared ownership, but it might actually work safely and efficiently. Comparing against hundreds of stack-stamps per zero-ref object still seems preferable to scanning hundreds of stacks per object, to me.

Either way, I would like to consider this idea a bit more, and I appreciate your arguments, Jehan. Thank you.

Concerning cycles, I believe Nim already scans for cyclic types at compile-time to avoid needing to scan those types at runtime when they change references (please correct me if I'm wrong about that, Araq). So in a way Nim is already more intelligent than my "prevent cyclic types" suggestion.. even if it would be nice to get some warnings or something about types which do have potential cycles in them (not that it's really difficult to figure out manually).

Jehan (orginal) [2015-07-27T05:32:46+02:00] view original

This stack stamping approach still doesn't work, even in the single-threaded case, because references can move up and down the stack arbitrarily. E.g.:

proc alpha(x: var ref int) =
  var y: ref int
  swap x, y

proc beta() =
  var x: ref int
  new x
  alpha(x)

You can't infer anything from the stack location that an object reference was first stored in. You need to track all of them, which you cannot do in a single machine word; without that information, you either overestimate or underestimate lifetime. The closest thing that actually works is region inference region-based memory management (which still tends to overestimate object lifetime). Implemented, inter alia, in Cyclone, one of the languages that influenced Rust.

filwit (orginal) [2015-07-27T10:40:56+02:00] view original

you either overestimate or underestimate lifetime.

Well yes, that's the whole idea.. to overestimate lifetime until the stack shrinks below the object's stamp.. ensuring anything "before" the stamp couldn't possibly retain a reference to the heap-object (therefor only once the stack shrinks below the stamp is it safe to free the object). Similar to how conservative scanning will keep anything alive that looks like a reference into the heap, even if it's actually not. Perhaps it's statistically less likely for a conservative scanner to cause lifetime errors, but in theory the issue is similar.

I really don't see how your code example illustrates where stamping fails. The heap allocated object still retains it's stamp to the 'new x' line, so it doesn't matter what happens after that (since the stamp value never changes after it's initial set, and only the stack references 'x' and 'y' where swapped). The heap-object's stamp is entirely independent of what actual stack references may point to it, it only represents a "guaranteed point" at which the stack must "roll back" too in order for the heap object to be safely destroyed once it's ref-counts hit's zero (both the stack remaining above it's stamp and it's ref-count determine it's lifetime).

Just to be very clear, I'll add some comments to your code example:

proc alpha(x: var ref int) =
  var y: ref int # lets pretend the stack addr here is '2008'
  swap x, y # y = x, x = nil.. but nothing happens to the heap-obj.. even the refc remains..
            # regardless, the stack will never shrink below '2008' until we exit this function

proc beta() =
  var x: ref int # let's pretend the stack addr here is '2004'
  new x # heap obj created here, it looks like (refc:0, stamp:2004, data:0)
  alpha(x)
  # the stack never shrinks below '2004' here, so until we exit this
  # function the heap-obj will always live, regardless if x now points to nil

# lets pretend the stack addr here is '2000'
beta()
# at this point the stack addr is again '2000', which is now lower than our heap-obj's stamp.
# so now the first time it's refc hits 0 we can know it's free to destroy, but never before this point.

Sorry if I'm missing something obvious.

Jehan (orginal) [2015-07-27T13:07:44+02:00] view original

"Overestimating the lifetime" is another term for "memory leak". E.g. the following:

proc main() =
  var s: seq[int]
  for i in 1..1_000_000:
    s = @[i]
main()

You can move s further up the stack as much as you want.

filwit (orginal) [2015-07-27T14:08:57+02:00] view original

Yes I realize there is this concern about the design. The question is if this is a realistic scenario, given that modern computers have so much ram (I know, not everything is a modern PC) and that you rarely allocating large chunks of (garbage) memory within a single function. Technically a similar problem exist with the existing scanning solution, granted I'm sure stamping would keep objects alive much more often than conservative scanning misreads would.

Also, while the main goal of this was to avoid the complications of stack scanning within a multi-threaded/shared-memory application (without throwing efficiency to the wind or spending a decade developing a shared-incremental GC).. it would make some sense to use this when memory is abundant and fallback to a STW scanning (for shared memory) when memory becomes scarce.

Of course, I do agree this needs to be evaluated a lot more.. perhaps there are very common scenarios which cause large amounts of garbage with this system I'm simply not imagining (very possible). The last thing you want is a memory model which leaks so much memory on occasion that it forces a STW collector to kick in. The thing is, this is mostly designed for realtime sensitive applications (otherwise why even care about STW), and those usually tightly control most allocations (because they're stutter prone by nature) and that probably means this model would work well? I'll give it more thought..

The other idea I've had is to not store the stamps within the object themselves but in some GC cache, then at specific stack "roll back points" you traverse that cache backwards up until the current stack location (since they are 'in order' it would be fast) and free each stamp-associated object from the stack (thus preventing objects from being artificially kept alive if stack re-grows past their stamp before their refc hits zero). Anyways.. thanks for the input. This is all new teritory for me, and your experience here is very valuable feedback.

vbtt (orginal) [2015-07-28T00:54:37+02:00] view original

@Jehan:

shared lockable heaps by separating memory into disjoint data spaces

I'd like to back this line of thinking rather than 'lets allow a giant shared memory hairball of objects and then try to efficiently GC it'. There are things that new languages can do to promote good design and I believe default isolated heaps is one of them. Heaps don't have to be tied to threads (like they are currently). Sharing data should be implemented by having both messaging and various types of transactional memory. IOW you can't just share any regular object but you'd put it in a shared memory structure - such as a table or list or counter - with well defined and efficient semantics for concurrent access.

Mirror of forum.nim-lang.org

1473 :: A better memory model?