After exploring various designs for a container (Tensor) that only copies memory when necessary (shallow let and deep var, using the GC refcount) and hitting technical roadblocks, I've settled on this one. Note that this should work nicely with the future =move and =sink operators:
{.experimental.}

type
  Storage[T] = ref object
    refcount: int
    data: seq[T]
    # lock / guard ?

  CowObject[T] = object
    metadata: int
    storage: Storage[T]

proc detach[T](c: var CowObject[T]) =
  # Note: this only works without races if only the main thread can access this.
  # Also, the refcount is only incremented on assignment; slices do not increment it.
  if c.storage.refcount == 1:
    return
  let old_store = c.storage
  var fresh_store: Storage[T]
  new fresh_store
  fresh_store.refcount = 1
  deepCopy(fresh_store.data, old_store.data)
  c.storage = fresh_store
  dec old_store.refcount

proc `=`[T](dst: var CowObject[T]; src: CowObject[T]) =
  if dst.storage != nil:
    dec dst.storage.refcount  # release the storage dst held before, if any
  inc src.storage.refcount
  system.`=`(dst, src)

proc `=destroy`[T](c: var CowObject[T]) =
  # Note from the Nim manual: destructors are tied to a variable
  # and will not trigger for, say, slices.
  dec c.storage.refcount

proc toCowObj[T](s: varargs[T]): CowObject[T] {.noInit.} =
  result.metadata = 1337
  new result.storage
  result.storage.refcount = 1
  result.storage.data = @s

proc `[]`[T](c: CowObject[T], idx: int): T =
  c.storage.data[idx]

proc `[]`[T](c: var CowObject[T], idx: int): var T =
  detach c
  c.storage.data[idx]

proc `[]=`[T](c: var CowObject[T], idx: int, val: T) =
  detach c
  c.storage.data[idx] = val

proc main() =
  let a = toCowObj(1, 2, 3, 4, 5)
  let b = a
  var c = a
  let d = c
  var e = c
  let f = e
  let g = f

  c[1] += 10
  e[2] = 100

  echo "\n\nMemory location"
  echo "a: [1, 2, 3, 4, 5]: " & a.repr
  echo "b: [1, 2, 3, 4, 5]: " & b.repr
  echo "c: [1, 12, 3, 4, 5]: " & c.repr
  echo "d: [1, 2, 3, 4, 5]: " & d.repr
  echo "e: [1, 2, 100, 4, 5]: " & e.repr
  echo "f: [1, 2, 3, 4, 5]: " & f.repr
  echo "g: [1, 2, 3, 4, 5]: " & g.repr

  echo "\n\n"
  echo "a: [1, 2, 3, 4, 5]: " & $a
  echo "b: [1, 2, 3, 4, 5]: " & $b
  echo "c: [1, 12, 3, 4, 5]: " & $c
  echo "d: [1, 2, 3, 4, 5]: " & $d
  echo "e: [1, 2, 100, 4, 5]: " & $e
  echo "f: [1, 2, 3, 4, 5]: " & $f
  echo "g: [1, 2, 3, 4, 5]: " & $g

main()
Critiques welcome :).
I will use this container for Tensors/multidimensional arrays.
It already scales almost perfectly with the number of cores through OpenMP: on 4 cores, a 3.5x speedup for reduce operations like sum and a 4x speedup for map and element-wise operations like matrix addition.
So for now it looks like this, assuming it is compiled with OpenMP:
let a = randomTensor([100, 100], 10) # Main thread: create a 100x100 tensor filled with range 0..9
let b = randomTensor([100, 100], 10) # Main thread: create a 100x100 tensor filled with range 0..9
let c = a + b # All threads do addition on chunks of size 100x100 div thread_count
let sum = c.sum # All threads do partial sum on chunks of size 100x100 div thread_count, we have thread_count partial sums. Main thread will then further reduce on the partial sums.
The let a, let b, let c and let sum statements are all executed serially by the main thread.
I don't have a use case (yet?) to introduce even more parallelism. That would require locks/guards or atomics (not sure we can compare-and-swap a ref, though) in the proposed copy-on-write container as well, which would make it harder to implement and maybe slower for the general use case.
Here is a blog post where a D-lang dev explores reference counting in a garbage-collected language and data races with destructors.
@mratsim Oh, I forgot that Arraymancer always uses OpenMP, so you're talking about threads created by it. I don't use OpenMP that often; that's probably why I forgot about it.
Oh, by the way: the elementary operations you mentioned (addition, sum, etc.) can easily be split into equal chunks. Not so for map, whose cost can depend on the element's value. Would you use dynamic scheduling then?
The idea of a ref-counted coarray is bizarre to me, to be honest. Maybe that's partly because I'm used to per-process coarrays and yours are per-thread coarrays. Still, I don't really like the very idea of ref-counted garbage collection for tensors. Why? Because there is no reason I couldn't have two equally privileged references to the same array inside the same thread. But you have views for slicing anyway, so why not one array and many views over it? That would probably help to avoid some of the problems you mentioned. The only thing is it would need proper destructors... or a Python-like with-context. We use it for files in Nim anyway, so I can't see why not.
@Udiknedormin: Actually, map is the easiest; you can just do:
for i in 0 || (container.len - 1):
  result[i] = mapped_op(container[i])
OpenMP will statically divide the work into chunks of roughly container.len / thread_count iterations each.
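If dynamic scheduling ever became necessary, the || iterator also accepts a custom OpenMP annotation string, so something like this should work (a sketch, untested):

for i in `||`(0, container.len - 1, "parallel for schedule(dynamic)"):
  # Iterations are now handed out to threads dynamically instead of
  # being split into fixed-size chunks up front.
  result[i] = mapped_op(container[i])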
Right now Arraymancer offers two choices: value and ref semantics.
let a = randomTensor([5, 5], 10) # int 5x5 tensor with value between 0..9
let b = a # Copy a in b
let slice = a[0..2, 0..2] # Copy a slice of a in slice
Note that it always copies.
Alternatively you can do
let a = randomTensor([5, 5], 10) # int 5x5 tensor with value between 0..9
let b = a.unsafeView # b is a view of a
let slice = a.unsafeSlice(0..2, 0..2) # slice is a view of a[0..2, 0..2]
The problem I want to solve is that the natural syntax is slow due to the many allocations, but I want to keep value semantics on the surface.
Note: the proper way to have ref semantics in Arraymancer would be:
type
  Tensor[T] = object
    shape: seq[int]          # Store the dimensions, e.g. 5x5
    other_metadata: seq[int] # Just an example
    storage: CpuStorage[T]

  CpuStorage[T] = ref object
    data: seq[T]

  AlternativeCpuStorage {.shallow.} [T] = object
    data: seq[T]
So the copy-on-write I proposed in the OP just adds a refcount to CpuStorage.
Note that the refcount is not used for garbage collection, but to know whether mutation can happen in place or not. Garbage collection is still done by the Nim GC.
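Concretely, a minimal sketch of that refcounted storage (field names illustrative, not the actual Arraymancer code):

type
  CowCpuStorage[T] = ref object
    refcount: int  # only consulted to decide between in-place mutation and copy
    data: seq[T]   # the memory itself is still freed by the Nim GC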
Ah right, indeed. Well, let's say that it's an unsupported use case (*shrugs*) and that Arraymancer tensors are the wrong container for that. I'm not aware of any scientific/numerical-computing situation where an operation depends not only on the tensor size but also on the value of what is wrapped.
Now, regarding copy-on-write in Arraymancer: after tinkering a bit, I am convinced that it is not suitable and that plain reference semantics (or even value semantics) are better.
I've explored using a refcounter or a simpler "shared_bit" boolean that just tracks whether the container was shared at any point (= assignment) or moved (=sink), and they don't cut it, for the following reasons (the shared-bit variant is sketched after the list):
import ../src/arraymancer, sequtils
let ctx = newContext Tensor[int] # Context that will track operations applied on the tensor to compute gradient in the future
let W = ctx.variable toSeq(1..8).toTensor.reshape(2,4) # What is the refcount? --> it's 0
let x = toSeq(11..22).toTensor.reshape(4,3)
let X = ctx.variable x # What is the refcount? it's 1 until x goes out of scope
The refcount and the logic required become hard to follow and will probably lead to an overengineered solution (or test suite).
3. Workaroundability: since copy-on-write must overload assignment, if you want to guarantee a shallow copy, for example, you have to use:
let foo = toCowObj(1, 2, 3, 4, 5)
var bar: CowObject[int]
system.`=`(bar, foo)
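For reference, the shared-bit variant mentioned above would look roughly like this (a sketch with made-up names, not the actual code I tested):

type
  SharedStorage[T] = ref object
    shared: bool  # set on any assignment; never reset by destruction, so pessimistic
    data: seq[T]

  SharedSeq[T] = object
    storage: SharedStorage[T]

proc `=`[T](dst: var SharedSeq[T]; src: SharedSeq[T]) =
  src.storage.shared = true  # mark the storage as shared on assignment
  system.`=`(dst, src)

proc detach[T](s: var SharedSeq[T]) =
  if s.storage.shared:
    # We cannot know whether the other owners are still alive,
    # so once the bit is set we must always copy before mutating.
    var fresh: SharedStorage[T]
    new fresh
    deepCopy(fresh.data, s.storage.data)
    s.storage = fresh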
Let's say I want to do some operations on particles. They should be vectorized (and maybe parallelized; some calculations could also benefit from a GPU), and it would be really nice if particles from the same space grid were in the same place in the sequence, as they will need to access the same grid of the magnetic field when calculating their movement. Why shouldn't I use a 2D or 3D tensor (depending on the simulation) rather than a seq?
Also: numpy provides sorting, mind you.
It should; the refcount is not using GC internals but its own field.
In the end I didn't go with COW; I've put my thoughts here: https://github.com/mratsim/Arraymancer/issues/157#issuecomment-346763773 and https://github.com/mratsim/Arraymancer/issues/160
Overloading = is really troublesome, and I don't think that has changed since then.
Wait, what? Everything changed about it: it now has a spec and a different implementation. Containers with =, =sink and =destroy have never been easier. Caveat: the old default seq implementation doesn't call =destroy.
So ... I was trying to read up on =sink and =destroy, and am a bit confused:
There's https://github.com/nim-lang/Nim/wiki/Destructors, which appears to be outdated, forwarding to https://github.com/nim-lang/Nim/wiki/Destructors,-2nd-edition which in turn refers to https://github.com/nim-lang/Nim/blob/devel/doc/destructors.rst which does not document which version it actually references.
Also, the latest destructors doc refers to lent and owned references, which are only documented there and not in the 1.0.0 language manual. And you mention the "old default seq implementation" which fails to call =destroy, implying that there's a new one? (But how do I choose which one is used?)
I would like to implement a "functional" copy-on-write refcounted sequence (a la APL / J / K / OCaml without references), that is, the semantics of every value of this kind is equivalent to a value type, and thus it can have no cycles, much like refcounted strings (so the refcount is precise).
Araq, hopefully this is not too much to ask, but what would you recommend as an implementation strategy? My idea would be manually managed ptrs to memory, using the =, =sink, =destroy and []= operators to manage the references. It seems like sink, owned and lent may help with removing unneeded refcount operations in some places, but since they are not in the 1.0.0 manual, I am at a loss about whether they can be relied upon to be there, and/or which GC model they assume/require.
Thanks in advance.
@mratsim switched away from CoW, but it is put to great use in APL / J / K implementations; In general, they simulate a "value only" system by using only references and reference counts; Anything that is only referenced once is just modified in place when you want to modify it. So you get easy to debug value semantics, and (most of the time, with very little care required) reference performance. This is in contrast with e.g. Clojure's persistent vector implementation which clones a limb on every modification and thus generates a lot of garbage; or with R (v3, haven't looked at the v4 changes) using inaccurate refcounts, which requires some care to not generate too much garbage.
As --gc:arc and --gc:orc already maintain refcounts (some in memory, some in the AST), sane access to them would greatly simplify such CoW schemes, which otherwise have to duplicate all the refcounts (or use ptrs instead...).
Any ideas / docs / pointers?
Did you watch Andreas's talk on ARC and ORC from Saturday's NimConf? He has a running example where he builds a toy seq from scratch, including the ref-counting.
Basically you can implement your own ref-counting if your object is a non-ref: you give it a ptr to the shared state, put a refcount in the shared state, and then implement the ARC hooks like =destroy and =sink.
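A minimal single-threaded sketch of that pattern (names are mine, not from the talk; no atomics, so it is not thread-safe):

type
  Shared[T] = object  # the handle itself is a non-ref object
    p: ptr tuple[refcount: int, value: T]

proc newShared*[T](value: sink T): Shared[T] =
  result.p = create(tuple[refcount: int, value: T])  # zeroed allocation
  result.p.refcount = 1
  result.p.value = value

proc `=destroy`*[T](s: var Shared[T]) =
  if s.p != nil:
    dec s.p.refcount
    if s.p.refcount == 0:
      `=destroy`(s.p.value)  # destroy the payload before freeing the block
      dealloc(s.p)
    s.p = nil

proc `=sink`*[T](dst: var Shared[T], src: Shared[T]) =
  # A move transfers ownership: drop dst's old state, steal src's pointer.
  if dst.p != src.p:
    `=destroy`(dst)
  dst.p = src.p

proc `=`*[T](dst: var Shared[T], src: Shared[T]) =
  # A copy shares the state and bumps the refcount.
  if dst.p != src.p:
    `=destroy`(dst)
    dst.p = src.p
    if dst.p != nil:
      inc dst.p.refcount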
Here is an RFC on copy-on-write strings: https://github.com/nim-lang/RFCs/issues/221 (no code or pseudo-code, though).
For your own refcounting scheme, as @snej mentions, it's easy to do with destructors; see my atomic refcounted type here: https://github.com/mratsim/weave/blob/9f0c384f/weave/cross_thread_com/flow_events.nim#L173-L201
type
  FlowEvent* = object
    e: EventPtr

  EventPtr = ptr object
    refCount: Atomic[int32]
    # Refcounting is started from 0 and we avoid fetchSub with release semantics
    # in the common case of only one reference being live.

proc `=destroy`*(event: var FlowEvent) =
  if event.e.isNil:
    return

  let count = event.e.refCount.load(moRelaxed)
  fence(moAcquire)
  if count == 0:
    # We have the last reference
    if not event.e.isNil:
      if event.e.kind == Iteration:
        wv_free(event.e.union.iter.singles)
      # Return memory to the memory pool
      recycle(event.e)
  else:
    discard fetchSub(event.e.refCount, 1, moRelease)
  event.e = nil

proc `=sink`*(dst: var FlowEvent, src: FlowEvent) {.inline.} =
  # Don't pay for atomic refcounting when the compiler can prove there is no ref change
  `=destroy`(dst)
  system.`=sink`(dst.e, src.e)

proc `=`*(dst: var FlowEvent, src: FlowEvent) {.inline.} =
  `=destroy`(dst)
  discard fetchAdd(src.e.refCount, 1, moRelaxed)
  dst.e = src.e
To add copy-on-write on top, you need to change the = so that it checks how many references there are and, if there is more than one, detaches/copies; =sink and =destroy can stay as-is.
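For illustration, a detach-before-mutation sketch on top of such a type (my names, single-threaded for clarity; a multi-threaded version must also handle racing owners in the decrement, as the =destroy above does, and the =destroy/=sink/= hooks mirroring the FlowEvent ones are omitted):

type
  CowBuf*[T] = object
    p: ptr BufObj[T]

  BufObj[T] = object
    refCount: int  # same convention as above: starts at 0 for a single owner
    data: seq[T]

proc detach[T](c: var CowBuf[T]) =
  # Copy before mutating if anybody else still holds a reference.
  if c.p.refCount > 0:
    let fresh = create(BufObj[T])  # zeroed, so refCount starts at 0
    fresh.data = c.p.data          # deep copy of the payload (seq value semantics)
    dec c.p.refCount               # drop our share of the old buffer
    c.p = fresh

proc `[]=`*[T](c: var CowBuf[T], i: int, val: T) =
  detach c
  c.p.data[i] = val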
I've read that C++ abandoned CoW for std::string because the atomic ref-counting turned out to be more expensive on average than simply copying the string every time. But of course YMMV; the tradeoff depends on the size of the objects and how expensive they are to copy.
And for objects that won't be used concurrently on multiple threads, one can drop down to plain old ints for the refcounts (and skip the fences). That's probably the approach I'll use — I do need threads, but I'm going to try my best to enforce move semantics (a la Rust) for objects being sent between threads.
@snej @mratsim - Thank you both very much. I've re-watched Andreas' talk and gone through the String CoW and mratsim's example. It looks like I'm covered, but there's one thing that I haven't managed to convince myself of, described below.
Is this enough to guarantee that the compiler won't lend something "behind my back"? By which I mean something like:
proc action(a, b, c, d: cowType) =
  modify(a)
  modify(b)
  modify(c)
  modify(d)

var x = cowType("Hello world")
action(x, x, x, x)
`=destroy`(x)
Now, as far as refcounting goes for liveness, it's enough to count x once on entry to action() (or even zero times, as it is lent). But for CoW purposes we need to count all 4 uses, so that if 3 of them modify it, they each get their own copy (the 4th modification doesn't need one: it can modify x in place, because that's the last copy and it's a sink).
At the very least, it looks like cowType always has to be passed by value for this tracking to occur (even for modify), and var shouldn't really be used. But I have still not been able to convince myself that there is no case in which the RC will be enough for liveness tracking but not enough for CoW tracking. Perhaps I just need to spend more time on it...
Another question I came up with while sketching a solution: I'm trying to implement a (sort-of) language/DSL in which CoW gives everything value semantics, while in-place array and table modifications stay efficient. That, unfortunately, necessitates refcounting. However, much of the time one only calls "pure" functions which are guaranteed not to modify anything, directly or indirectly; in that case the reference is completely lent, and no refcounting is needed.
Nim's effect system should be able to help with that: put a .tag[modify_refcount] on the procs that actually modify the refcount and .tag[] on those that shouldn't, and the compiler would verify that the latter are indeed side-effect-free (w.r.t. refcounting). But how would the refcounting be elided in practice? By passing them as var? Can a template/macro get access to the inferred tags, so it can act on this and perhaps save the incref/decref in this case?
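For what it's worth, the tagging half is easy to sketch (all names hypothetical):

type
  RefcountEffect = object  # user-defined effect marking refcount mutation

proc incref(rc: var int) {.tags: [RefcountEffect].} =
  inc rc

# The compiler rejects this proc if its body (transitively) calls
# anything tagged with RefcountEffect:
proc pureDouble(x: int): int {.tags: [].} =
  x * 2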
I'm hoping to dive into the compiler this weekend to figure this out, but I would still appreciate any help/pointers.