After exploring various designs for a container (Tensor) that only copies memory when necessary (shallow let and deep var, using the GC refcount) and hitting technical roadblocks, I've settled on this one. Note that this should work nicely with the future =move and =sink operators:
{.experimental.}

type
  Storage[T] = ref object
    refcount: int
    data: seq[T]
    # lock / guard ?

  CowObject[T] = object
    metadata: int
    storage: Storage[T]

proc detach[T](c: var CowObject[T]) =
  # Note: this only works without races if only the main thread can access this.
  # Also, the refcount is only incremented on assignment; slices do not increment it.
  if c.storage.refcount == 1:
    return
  let old_store = c.storage
  var fresh_store: Storage[T]
  new fresh_store
  fresh_store.refcount = 1
  deepCopy(fresh_store.data, old_store.data)
  c.storage = fresh_store
  dec old_store.refcount

proc `=`[T](dst: var CowObject[T]; src: CowObject[T]) =
  if dst.storage != nil:
    dec dst.storage.refcount  # release the storage dst held before, if any
  inc src.storage.refcount
  system.`=`(dst, src)

proc `=destroy`[T](c: var CowObject[T]) =
  # Note from the Nim manual: destructors are tied to a variable
  # and will not trigger for, say, slices.
  dec c.storage.refcount

proc toCowObj[T](s: varargs[T]): CowObject[T] {.noInit.} =
  result.metadata = 1337
  new result.storage
  result.storage.refcount = 1
  result.storage.data = @s

proc `[]`[T](c: CowObject[T], idx: int): T =
  c.storage.data[idx]

proc `[]`[T](c: var CowObject[T], idx: int): var T =
  detach c
  c.storage.data[idx]

proc `[]=`[T](c: var CowObject[T], idx: int, val: T) =
  detach c
  c.storage.data[idx] = val

proc main() =
  let a = toCowObj(1, 2, 3, 4, 5)
  let b = a
  var c = a
  let d = c
  var e = c
  let f = e
  let g = f

  c[1] += 10
  e[2] = 100

  echo "\n\nMemory location"
  echo "a: [1, 2, 3, 4, 5]: " & a.repr
  echo "b: [1, 2, 3, 4, 5]: " & b.repr
  echo "c: [1, 12, 3, 4, 5]: " & c.repr
  echo "d: [1, 2, 3, 4, 5]: " & d.repr
  echo "e: [1, 2, 100, 4, 5]: " & e.repr
  echo "f: [1, 2, 3, 4, 5]: " & f.repr
  echo "g: [1, 2, 3, 4, 5]: " & g.repr

  echo "\n\n"
  echo "a: [1, 2, 3, 4, 5]: " & $a
  echo "b: [1, 2, 3, 4, 5]: " & $b
  echo "c: [1, 12, 3, 4, 5]: " & $c
  echo "d: [1, 2, 3, 4, 5]: " & $d
  echo "e: [1, 2, 100, 4, 5]: " & $e
  echo "f: [1, 2, 3, 4, 5]: " & $f
  echo "g: [1, 2, 3, 4, 5]: " & $g

main()
Critiques welcome :).
I will use this container for Tensors/multidimensional arrays.
It already scales almost perfectly with the number of cores through OpenMP: on 4 cores, a 3.5x speedup for reduce operations like sum and a 4x speedup for map and element-wise operations like matrix addition.
So for now it looks like this, assuming it is compiled with OpenMP:
let a = randomTensor([100, 100], 10) # Main thread: create a 100x100 tensor filled with range 0..9
let b = randomTensor([100, 100], 10) # Main thread: create a 100x100 tensor filled with range 0..9
let c = a + b # All threads do addition on chunks of size 100x100 div thread_count
let sum = c.sum # All threads do partial sum on chunks of size 100x100 div thread_count, we have thread_count partial sums. Main thread will then further reduce on the partial sums.
The let a, let b, let c and let sum statements are all executed serially by the main thread.
I don't have a use case (yet?) to introduce even more parallelism. That would require locks/guards or atomics (not sure we can compare-and-swap a ref, though) in the proposed copy-on-write container as well, which would make it harder to implement and maybe slower for the general use case.
Here is a blog post where a D-lang dev explores reference counting in a garbage-collected language and data races with destructors.
@mratsim Oh, I forgot that Arraymancer always uses OpenMP, so you're talking about threads created by it. I don't use OpenMP that often; that's probably why I forgot about it.
Oh, by the way: the elementary operations you mentioned (addition, sum, etc.) can easily be split into equal chunks. Not so for map, whose cost can depend on the element's value. Would you use dynamic scheduling then?
The idea of a ref-counted coarray is bizarre to me, to be honest. Maybe that's partly because I'm used to per-process coarrays and yours are per-thread coarrays. Still, I don't really like the very idea of ref-counted garbage collection for tensors. Why? Because there is no reason I couldn't have two equally privileged references to the same array inside the same thread. But you have views for slicing anyway, so why not one array and many views over it? That would probably help to avoid some of the problems you mentioned. The only thing is it would need proper destructors... or a Python-like with-context. We use it for files in Nim anyway, so I can't see why not.
@Udiknedormin: Actually, map is the easiest; you can just do:
for i in 0 || (container.len - 1):
  result[i] = mapped_op(container[i])
OpenMP will statically divide the work into chunks of roughly container.len / thread_count iterations each.
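If dynamic scheduling ever became necessary, the || iterator also accepts a custom OpenMP annotation string, so something like this should work (a sketch, untested):

for i in `||`(0, container.len - 1, "parallel for schedule(dynamic)"):
  # Iterations are now handed out to threads dynamically instead of
  # being split into fixed-size chunks up front.
  result[i] = mapped_op(container[i])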
Right now Arraymancer offers two choices: value and ref semantics.
let a = randomTensor([5, 5], 10) # int 5x5 tensor with value between 0..9
let b = a # Copy a in b
let slice = a[0..2, 0..2] # Copy a slice of a in slice
Note that it always copies.
Alternatively you can do
let a = randomTensor([5, 5], 10) # int 5x5 tensor with value between 0..9
let b = a.unsafeView # b is a view of a
let slice = a.unsafeSlice(0..2, 0..2) # slice is a view of a[0..2, 0..2]
The problem I want to solve is that the natural syntax is slow due to the many allocations, but I want to keep value semantics on the surface.
Note: the proper way to have ref semantics in Arraymancer would be:
type
  Tensor[T] = object
    shape: seq[int]          # Store the dimensions, e.g. 5x5
    other_metadata: seq[int] # Just an example
    storage: CpuStorage[T]

  CpuStorage[T] = ref object
    data: seq[T]

  AlternativeCpuStorage {.shallow.} [T] = object
    data: seq[T]
So the copy-on-write I proposed in the OP just adds a refcount to CpuStorage.
Note that the refcount is not used for garbage collection, but to know whether mutation can happen in place or not. Garbage collection is still done by the Nim GC.
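Concretely, a minimal sketch of that refcounted storage (field names illustrative, not the actual Arraymancer code):

type
  CowCpuStorage[T] = ref object
    refcount: int  # only consulted to decide between in-place mutation and copy
    data: seq[T]   # the memory itself is still freed by the Nim GC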
Ah right, indeed. Well, let's say that it's an unsupported use case (*shrugs*) and that Arraymancer tensors are the wrong container for that. I'm not aware of any scientific/numerical-computing situation where an operation depends not only on the tensor size but also on the value of what is wrapped.
Now, regarding copy-on-write in Arraymancer: after tinkering a bit, I am convinced that it is not suitable and that plain reference semantics (or even value semantics) are better.
I've explored using a refcounter or a simpler "shared_bit" boolean that just tracks whether the container was shared at any point (= assignment) or moved (=sink), and they don't cut it, for the following reasons (the shared-bit variant is sketched after the list):
import ../src/arraymancer, sequtils
let ctx = newContext Tensor[int] # Context that will track operations applied on the tensor to compute gradient in the future
let W = ctx.variable toSeq(1..8).toTensor.reshape(2,4) # What is the refcount? --> it's 0
let x = toSeq(11..22).toTensor.reshape(4,3)
let X = ctx.variable x # What is the refcount? it's 1 until x goes out of scope
The refcount and the logic required become hard to follow and will probably lead to an overengineered solution (or test suite).
3. Workaroundability: since copy-on-write must overload assignment, if you want to guarantee a shallow copy, for example, you have to use:
let foo = toCowObj(1, 2, 3, 4, 5)
var bar: CowObject[int]
system.`=`(bar, foo)
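For reference, the shared-bit variant mentioned above would look roughly like this (a sketch with made-up names, not the actual code I tested):

type
  SharedStorage[T] = ref object
    shared: bool  # set on any assignment; never reset by destruction, so pessimistic
    data: seq[T]

  SharedSeq[T] = object
    storage: SharedStorage[T]

proc `=`[T](dst: var SharedSeq[T]; src: SharedSeq[T]) =
  src.storage.shared = true  # mark the storage as shared on assignment
  system.`=`(dst, src)

proc detach[T](s: var SharedSeq[T]) =
  if s.storage.shared:
    # We cannot know whether the other owners are still alive,
    # so once the bit is set we must always copy before mutating.
    var fresh: SharedStorage[T]
    new fresh
    deepCopy(fresh.data, s.storage.data)
    s.storage = fresh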
Let's say I want to do some operations on particles. They should be vectorized (and maybe parallelized; some calculations could also benefit from a GPU), and it would be really nice if particles from the same space grid were in the same place in the sequence, as they will need to access the same grid of the magnetic field when calculating their movement. Why shouldn't I use a 2D or 3D tensor (depending on the simulation) rather than a seq?
Also: numpy provides sorting, mind you.
It should; the refcount is not using GC internals but its own field.
In the end I didn't go with COW; I've put my thoughts here: https://github.com/mratsim/Arraymancer/issues/157#issuecomment-346763773 and https://github.com/mratsim/Arraymancer/issues/160
Overloading = is really troublesome, and I don't think that has changed since then.
Wait, what? Everything changed about it: it now has a spec and a different implementation. Containers with =, =sink and =destroy have never been easier. Caveat: the old default seq implementation doesn't call =destroy.
So ... I was trying to read up on =sink and =destroy, and am a bit confused:
There's https://github.com/nim-lang/Nim/wiki/Destructors, which appears to be outdated, forwarding to https://github.com/nim-lang/Nim/wiki/Destructors,-2nd-edition which in turn refers to https://github.com/nim-lang/Nim/blob/devel/doc/destructors.rst which does not document which version it actually references.
Also, the latest destructors doc refers to lent and owned references, which are only documented there and not in the 1.0.0 language manual. And you mention the "old default seq implementation" which fails to call =destroy, implying that there's a new one? (But how do I choose which one is used?)
I would like to implement a "functional" copy-on-write refcounted sequence (a la APL / J / K / OCaml without references), that is, the semantics of every value of this kind is equivalent to a value type, and thus it can have no cycles, much like refcounted strings (so the refcount is precise).
Araq, hopefully this is not too much to ask, but what would you recommend as an implementation strategy? My idea would be manually managed ptrs to memory, using the =, =sink, =destroy and []= operators to manage the references. It seems like sink, owned and lent may help with removing unneeded refcount operations in some places, but since they are not in the 1.0.0 manual, I am at a loss about whether they can be relied upon to be there, and/or which GC model they assume/require.
Thanks in advance.
@mratsim switched away from CoW, but it is put to great use in APL / J / K implementations; In general, they simulate a "value only" system by using only references and reference counts; Anything that is only referenced once is just modified in place when you want to modify it. So you get easy to debug value semantics, and (most of the time, with very little care required) reference performance. This is in contrast with e.g. Clojure's persistent vector implementation which clones a limb on every modification and thus generates a lot of garbage; or with R (v3, haven't looked at the v4 changes) using inaccurate refcounts, which requires some care to not generate too much garbage.
As --gc:arc and --gc:orc already maintain refcounts (some in memory, some in the AST), sane access to them would greatly simplify such CoW schemes, which otherwise have to duplicate all the refcounts (or use ptrs instead...).
Any ideas / docs / pointers?
Did you watch Andreas's talk on ARC and ORC from Saturday's NimConf? He has a running example where he builds a toy seq from scratch, including the ref-counting.
Basically you can implement your own ref-counting if your object is a non-ref: you give it a ptr to the shared state, put a refcount in the shared state, and then implement the ARC hooks like =destroy and =sink.
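A minimal single-threaded sketch of that pattern (names are mine, not from the talk; no atomics, so it is not thread-safe):

type
  Shared[T] = object  # the handle itself is a non-ref object
    p: ptr tuple[refcount: int, value: T]

proc newShared*[T](value: sink T): Shared[T] =
  result.p = create(tuple[refcount: int, value: T])  # zeroed allocation
  result.p.refcount = 1
  result.p.value = value

proc `=destroy`*[T](s: var Shared[T]) =
  if s.p != nil:
    dec s.p.refcount
    if s.p.refcount == 0:
      `=destroy`(s.p.value)  # destroy the payload before freeing the block
      dealloc(s.p)
    s.p = nil

proc `=sink`*[T](dst: var Shared[T], src: Shared[T]) =
  # A move transfers ownership: drop dst's old state, steal src's pointer.
  if dst.p != src.p:
    `=destroy`(dst)
  dst.p = src.p

proc `=`*[T](dst: var Shared[T], src: Shared[T]) =
  # A copy shares the state and bumps the refcount.
  if dst.p != src.p:
    `=destroy`(dst)
    dst.p = src.p
    if dst.p != nil:
      inc dst.p.refcount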
Here is an RFC on copy-on-write strings: https://github.com/nim-lang/RFCs/issues/221 (no code or pseudo-code, though).
For your own refcounting scheme, as @snej mentions, it's easy to do with destructors; see my atomic refcounted type here: https://github.com/mratsim/weave/blob/9f0c384f/weave/cross_thread_com/flow_events.nim#L173-L201
type
  FlowEvent* = object
    e: EventPtr

  EventPtr = ptr object
    refCount: Atomic[int32]
    # Refcounting is started from 0 and we avoid fetchSub with release semantics
    # in the common case of only one reference being live.

proc `=destroy`*(event: var FlowEvent) =
  if event.e.isNil:
    return

  let count = event.e.refCount.load(moRelaxed)
  fence(moAcquire)
  if count == 0:
    # We have the last reference
    if not event.e.isNil:
      if event.e.kind == Iteration:
        wv_free(event.e.union.iter.singles)
      # Return memory to the memory pool
      recycle(event.e)
  else:
    discard fetchSub(event.e.refCount, 1, moRelease)
  event.e = nil

proc `=sink`*(dst: var FlowEvent, src: FlowEvent) {.inline.} =
  # Don't pay for atomic refcounting when the compiler can prove there is no ref change
  `=destroy`(dst)
  system.`=sink`(dst.e, src.e)

proc `=`*(dst: var FlowEvent, src: FlowEvent) {.inline.} =
  `=destroy`(dst)
  discard fetchAdd(src.e.refCount, 1, moRelaxed)
  dst.e = src.e
To add copy-on-write on top, you need to change the = so that it checks how many references there are and, if there is more than one, detaches/copies; =sink and =destroy can stay as-is.
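For illustration, a detach-before-mutation sketch on top of such a type (my names, single-threaded for clarity; a multi-threaded version must also handle racing owners in the decrement, as the =destroy above does, and the =destroy/=sink/= hooks mirroring the FlowEvent ones are omitted):

type
  CowBuf*[T] = object
    p: ptr BufObj[T]

  BufObj[T] = object
    refCount: int  # same convention as above: starts at 0 for a single owner
    data: seq[T]

proc detach[T](c: var CowBuf[T]) =
  # Copy before mutating if anybody else still holds a reference.
  if c.p.refCount > 0:
    let fresh = create(BufObj[T])  # zeroed, so refCount starts at 0
    fresh.data = c.p.data          # deep copy of the payload (seq value semantics)
    dec c.p.refCount               # drop our share of the old buffer
    c.p = fresh

proc `[]=`*[T](c: var CowBuf[T], i: int, val: T) =
  detach c
  c.p.data[i] = val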
I've read that C++ abandoned CoW for std::string because the atomic ref-counting turned out to be more expensive on average than simply copying the string every time. But of course YMMV; the tradeoff depends on the size of the objects and how expensive they are to copy.
And for objects that won't be used concurrently on multiple threads, one can drop down to plain old ints for the refcounts (and skip the fences). That's probably the approach I'll use — I do need threads, but I'm going to try my best to enforce move semantics (a la Rust) for objects being sent between threads.
@snej @mratsim - Thank you both very much. I've re-watched Andreas' talk and gone through the String CoW and mratsim's example. It looks like I'm covered, but there's one thing that I haven't managed to convince myself of, described below.
Is this enough to guarantee that the compiler won't lend something "behind my back"? By which I mean something like:
proc action(a, b, c, d: cowType) =
  modify(a)
  modify(b)
  modify(c)
  modify(d)

var x = cowType("Hello world")
action(x, x, x, x)
`=destroy`(x)
Now, as far as refcounting goes for liveness, it's enough to count x once on entry to action() (or even zero times, as it is lent). But for CoW purposes we need to count all 4 uses, so that if 3 of them modify it, they each get their own copy (the 4th modification doesn't need one: it can modify x in place, because that's the last copy and it's a sink).
At the very least, it looks like cowType always has to be passed by value for this tracking to occur (even for modify), and var shouldn't really be used. But I have still not been able to convince myself that there is no case in which the RC will be enough for liveness tracking but not enough for CoW tracking. Perhaps I just need to spend more time on it...
Another question I came up with while sketching a solution: I'm trying to implement a (sort-of) language/DSL in which CoW gives everything value semantics, while in-place array and table modifications stay efficient. That, unfortunately, necessitates refcounting. However, much of the time one only calls "pure" functions which are guaranteed not to modify anything, directly or indirectly; in that case the reference is completely lent, and no refcounting is needed.
Nim's effect system should be able to help with that: put a .tag[modify_refcount] on the procs that actually modify the refcount and .tag[] on those that shouldn't, and the compiler would verify that the latter are indeed side-effect-free (w.r.t. refcounting). But how would the refcounting be elided in practice? By passing them as var? Can a template/macro get access to the inferred tags, so it can act on this and perhaps save the incref/decref in this case?
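For what it's worth, the tagging half is easy to sketch (all names hypothetical):

type
  RefcountEffect = object  # user-defined effect marking refcount mutation

proc incref(rc: var int) {.tags: [RefcountEffect].} =
  inc rc

# The compiler rejects this proc if its body (transitively) calls
# anything tagged with RefcountEffect:
proc pureDouble(x: int): int {.tags: [].} =
  x * 2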
I'm hoping to dive into the compiler this weekend to figure this out, but I would still appreciate any help/pointers.