I didn't find any statements that ARC and Weave don't work together, but they seem not to work together even with a trivial example. I found some older forum discussion by mratsim on the topic: https://forum.nim-lang.org/t/6352#39259
Toggling loadBalance(Weave) in and out changes when it crashes with ARC, which makes me think this has something to do with how the main thread is handled, but this is just a wild guess. For all I know I'm missing some important Weave command that makes this work.
(sorry for the awkward example - it's a reduction from the affected piece of my actual project)
import weave

type
  Node = ref object
    id: int
    parent: Node

var globalID {.threadvar.}: int

proc inc(head: var Node) =
  let old = head
  new head
  inc globalID
  head.id = old.id + 1
  head.parent = old

proc multiMain =
  init(Weave)
  parallelFor i in 1..8:
    # make and grow a linked list
    # one link at a time
    var list: Node
    new list
    list.parent = nil
    var timesPrinted = 1
    for time in 0..20_000_000:
      # grow ref chain (linked list)
      inc list
      if time mod 4_000_000 == 0:
        stdout.write $timesPrinted
        stdout.flushFile
        inc timesPrinted
      loadBalance(Weave)
  exit(Weave)

multiMain()
quit(0)
That would be fantastic for porting!
In case you're curious about a user experience with ARC, I'm confused about how a custom destructor should actually clear each ref object of my list. I saw some posts that showed using dispose(node), but it seems the modern dispose is meant for inter-thread freeing only. So currently I just ensure that there are no references to the current node, then leave it dangling in memory and march on with the rest of the list, hoping it gets cleared. How should I "delete" the ref object here, or is it happening automatically?
I tried looking at memory usage to infer whether memory was indeed being cleared up, but the Linux per-process memory usage and Nim's own occupied-memory figure report very different things: the OS shows an ever-increasing blob of memory (roughly 100 MB/s) while Nim's total stays under 1 MB.
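A tiny sketch of how Nim's side of that comparison can be read, using the standard getOccupiedMem/getTotalMem counters (they only cover memory managed by Nim's allocator, which is part of why the two numbers can diverge so much):

# Illustrative only: Nim's view of its own heap. Pages the allocator has
# mapped but not yet returned to the OS, thread stacks, etc. still show up
# in the process figure the OS reports.
echo "occupied by Nim: ", getOccupiedMem() div 1024, " KiB"
echo "total reserved:  ", getTotalMem() div 1024, " KiB"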
Warning, untested:
proc `=destroy`(x: Node) {.nodestroy.} =
  # Walk the chain iteratively and break each link before destroying it,
  # so freeing a long list cannot recurse and blow the stack.
  var it = x
  while it != nil:
    let nxt = it.parent
    it.parent = nil
    `=destroy`(it[])
    it = nxt
Your code is problematic.
You use threadvar: with Weave your loop iterations can run on any worker thread, so a {.threadvar.} like globalID is not a reliable per-task (or even per-iteration) counter.
Regarding slowness, Nim's allocator uses a simple lock strategy with --threads:on. This doesn't scale when you do many repeated small allocations; performance is killed by lock contention. Use an object pool.
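A minimal sketch of what such a pool could look like (NodePool, getNode and recycle are hypothetical names, reusing the Node type from the example above; a real arena would pre-allocate nodes in blocks):

type
  NodePool = object
    free: seq[Node]            # recycled nodes, owned by a single task

proc getNode(pool: var NodePool): Node =
  # Reuse a recycled node when possible, so the shared allocator
  # (and its lock) is only touched when the pool is empty.
  if pool.free.len > 0:
    result = pool.free.pop()
    result.id = 0
    result.parent = nil
  else:
    new result

proc recycle(pool: var NodePool, n: Node) =
  n.parent = nil               # drop the chain reference before reuse
  pool.free.add n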
Thank you! I did not know about the locking as a bottleneck. I've made a pooling/arena mechanism and it helps as you said.
I'm very confused about how to avoid threadvar, if I understand you correctly. It sounds like global ID won't be unique, and threadvar is similar to using global ID. So with Weave, how is thread-local memory achieved? For instance, each thread having a global counter for something. Avoiding threadvar, I would have attempted a global list on the heap and use Global ID to index it, but you're saying that won't work either.
The issue I'm working on is still speed under ARC, which remains slower for me at slightly larger thread counts.
I'm very confused about how to avoid threadvar, if I understand you correctly. It sounds like global ID won't be unique, and threadvar is similar to using global ID. So with Weave, how is thread-local memory achieved? For instance, each thread having a global counter for something. Avoiding threadvar, I would have attempted a global list on the heap and use Global ID to index it, but you're saying that won't work either.
It's not just Weave: multithreading runtimes in general require the liberty to schedule your tasks on any available thread as they see fit when you use data parallelism (parallel for). In that case, what you usually need is not thread-local memory but "task-local" memory.
In Weave this is done with the parallelForStaged construct, which allows you to set up a local context before splitting of the task occurs.
Your parallel loop goes in a loop: section, and your local context is created and destroyed in the prologue: and epilogue: sections.
https://github.com/mratsim/weave#parallel-for-staged
import weave

proc sumReduce(n: int): int =
  let res = result.addr # For mutation we need to capture the address.

  parallelForStaged i in 0 .. n:
    captures: {res}
    awaitable: iLoop
    prologue:
      var localSum = 0
    loop:
      localSum += i
    epilogue:
      echo "Thread ", getThreadID(Weave), ": localsum = ", localSum
      res[].atomicInc(localSum)

  let wasLastThread = sync(iLoop)

init(Weave)
let sum1M = sumReduce(1000000)
echo "Sum reduce(0..1000000): ", sum1M
doAssert sum1M == 500_000_500_000
exit(Weave)
Alternatively, you allocate a heap array with your "thread-local" contexts and spawn a task per context.
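A minimal sketch of that alternative, assuming Weave's spawn/syncRoot API and a hypothetical Ctx type (illustrative only, not from the project above):

import weave

type
  Ctx = object                 # hypothetical per-task context
    id: int
    counter: int

proc work(ctx: ptr Ctx) =
  # Each spawned task owns exactly one context, so no threadvar is needed.
  for i in 0 ..< 1_000:
    inc ctx.counter

init(Weave)
var contexts = newSeq[Ctx](8)  # one context per task, allocated on the heap
for i in 0 ..< contexts.len:
  contexts[i].id = i
  spawn work(addr contexts[i])
syncRoot(Weave)                # global barrier: wait for all spawned tasks
echo "counter of context 0: ", contexts[0].counter
exit(Weave)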