Totally new to Nim. I've been going through the forums and I think I understand how to achieve what I want, but I'm looking for verification and some clarification.
I want to load a significant amount of data into memory, and then run ad-hoc queries against it (via an http interface). Loading a copy of the data per thread isn't realistic.
On startup, I can load all the data from the main thread. When a request comes in, I can safely and effectively launch X worker threads, feeding each one a partition of the data to operate on. None of the workers write to this shared global data, but they each have their own result structure. Once all the workers are done, the thread handling the requests takes each worker result and merges them together.
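The shape I have in mind, as a minimal sketch with `std/threadpool` FlowVars (the word-count payload is just an invented stand-in for my real queries):

```nim
import std/[tables, threadpool]

# hypothetical worker: counts words in its partition and returns its
# own result table; nothing shared is written
proc countWords(part: seq[string]): CountTable[string] =
  for w in part:
    result.inc(w)

# the request-handling thread fans out one task per partition, then
# merges the per-worker tables into a single result
proc handle(parts: seq[seq[string]]): CountTable[string] =
  var pending: seq[FlowVar[CountTable[string]]]
  for part in parts:
    pending.add(spawn countWords(part))
  for fv in pending:
    result.merge(^fv)   # blocks until that worker is done

let merged = handle(@[@["a", "b"], @["b", "c"]])
doAssert merged["b"] == 2
```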
If I wanted the workers to share a single result (say, a thread-safe dictionary), that isn't particularly well supported in Nim today, right?
Also, I assume it's OK for the main thread to update the shared global data, so long as it applies some type of locking: either globally, by stopping the web server from handling requests while it updates, or with more granularity, where the workers would need to read-lock their partition. I haven't looked at the web component at all, but I assume the main thread could launch the web server in a separate thread, then run an infinite loop with a sleep, checking every minute whether the data needs updating.
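For the coarse-grained variant, roughly this is what I picture (a minimal sketch with `std/locks`; the names are invented, and note the stdlib `Lock` is exclusive, not a read/write lock):

```nim
import std/locks

var dataLock: Lock
initLock(dataLock)

var sharedData = @[1, 2, 3]   # hypothetical global loaded at startup

proc refresh(newData: seq[int]) =
  # the updater holds the lock for the duration of the swap; workers
  # would take the same lock (withLock) before reading their partition
  withLock dataLock:
    sharedData = newData

refresh(@[4, 5, 6])
doAssert sharedData == @[4, 5, 6]
```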
Am I on the right track?
Nim today has two kinds of heaps. By default, you use per-thread, garbage-collected heaps. There is also a shared heap, which is not garbage collected and requires manual allocation.
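For example, shared-heap memory comes from `allocShared`/`allocShared0` and must be freed by hand (a sketch with an invented `Record` type):

```nim
type Record = object
  id: int
  value: float

# allocShared0 hands out zeroed memory from the shared heap; it is
# invisible to every thread's GC, so it must be freed manually
let p = cast[ptr Record](allocShared0(sizeof(Record)))
p.id = 42
p.value = 3.14
doAssert p.id == 42
# ... any thread may read/write through p ...
deallocShared(p)
```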
What you can do is:
In order to update the shared data safely, you will need to use some locking, or perform the operations in an order such that the shared data structure is always valid (if that is possible at all), for example by constructing a parallel structure and updating a single pointer at the end (or doing the same piecewise).
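The "update a single pointer at the end" idea can be sketched with `std/atomics` (the `Snapshot` type and its fields are invented; mind the caveat about freeing the old snapshot):

```nim
import std/atomics

type Snapshot = object
  version: int
  total: int

# readers load `current` and work off that snapshot; the updater builds
# a fresh one off to the side and publishes it with one atomic swap
var current: Atomic[ptr Snapshot]

proc publish(version, total: int) =
  let fresh = cast[ptr Snapshot](allocShared0(sizeof(Snapshot)))
  fresh.version = version
  fresh.total = total
  let old = current.exchange(fresh)
  if old != nil:
    # CAUTION: freeing immediately is only safe if no reader can still
    # hold `old`; real code needs a grace period or reference counting
    deallocShared(old)

proc readTotal(): int =
  let snap = current.load()
  if snap != nil: snap.total else: 0

publish(1, 10)
doAssert readTotal() == 10
```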
First, thanks a lot for helping me out.
I'm sorry to sound stubbornly stuck on my approach, but I'm curious why you'd recommend that I manually allocate memory for the global data. At least according to this post from Araq, as long as the GC'd data is created by the main thread, I can share it. That's an old post, though, so maybe it's no longer true?
I came up with this preliminary code:
import std/[math, threadpool]

proc run(threadCount: int) =
  let chunkSize = int(ceil(data.len / threadCount))
  for i in 0 ..< threadCount:
    let start = i * chunkSize
    let stop = if i == threadCount - 1: data.len else: start + chunkSize
    var slice = data[start ..< stop]
    spawn process(slice)
  sync()
But I realize that this passes a deep copy of slice. If slice is large, this is a significant performance and memory hit.
Instead, I now call
spawn process(addr(slice))
and get the data back via:
proc process(p: ptr seq[Data]) {.thread.} =
  var data = cast[ptr seq[Data]](p)[]
  ...
This seems to be working, and it seems to be quite efficient. Of course, if I plan on having my master thread update data, I'll need to add some locks. That's fine.
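One variant I'm also considering, which sidesteps taking `addr` of the loop-local slice entirely (that pointer could outlive the iteration it was made in), is to pass a pointer to the whole seq plus an index range, so each worker reads its window in place. A sketch (the `Data` fields and the summing payload are invented):

```nim
import std/threadpool

type Data = object
  value: int

proc process(p: ptr seq[Data]; start, stop: int): int =
  # read this worker's window through the pointer; nothing is
  # deep-copied per task
  for i in start ..< stop:
    result += p[][i].value

proc run(data: var seq[Data]; threadCount: int): int =
  let chunkSize = (data.len + threadCount - 1) div threadCount
  var pending: seq[FlowVar[int]]
  for i in 0 ..< threadCount:
    let start = i * chunkSize
    let stop = min(start + chunkSize, data.len)
    pending.add(spawn process(addr data, start, stop))
  for fv in pending:
    result += ^fv

var data = newSeq[Data](10)
for i in 0 ..< data.len:
  data[i] = Data(value: i + 1)
doAssert run(data, 3) == 55   # 1 + 2 + ... + 10
```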
In general, is this reasonable? More specifically, it seems like the advice around cast is don't use it unless you know what you're doing. So I don't know what I'm doing, but I'd like to learn. What are the dangers/pitfalls? This feels "safe" to me because I know that my global GC'd data is going to outlive the call to process. Is there something else to fear?
I have to say I am really not sure myself, but I would guess that the line
var data = cast[ptr seq[Data]](p)[]
performs a copy of the data in the thread-local heap. But I might be wrong.
Actually, I think it creates a copy of the pointer (which is mostly harmless / useless). I believe I should just be doing:
var data = p[]
I'm still not sure about the general approach, but thanks, it seems OK so far. Maybe I'll get lucky and someone else will chime in.
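One caveat on that last line: Nim seqs are value types, so `var data = p[]` still copies the whole seq into the thread-local heap; only the `cast` was redundant. To read the shared data in place, index through the pointer instead (sketch with a hypothetical `Data` type):

```nim
type Data = object
  value: int

proc total(p: ptr seq[Data]): int =
  # p[][i] dereferences and reads element i in place; no seq copy
  for i in 0 ..< p[].len:
    result += p[][i].value

var data = @[Data(value: 1), Data(value: 2), Data(value: 3)]
doAssert total(addr data) == 6
```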