First, thank you Araq for creating such a fantastic language.
I just discovered Nimrod last week, so sorry for my noob question, but I'm a bit stuck with threads and global variables.
If I define and fill an array (or a seq) and then update it in subsequent threads, I can no longer access that array after joinThreads.
Here is some sample code: http://pastebin.com/MaX3J4U1
If my threads only read the array, it remains accessible afterwards.
I tried everything I could find (pointers, ref, GC_ref/GC_unref), but nothing worked.
Does anyone have a hint for me? Thanks
What happens is that each thread has its own thread-local heap (similar to what Erlang does). Aside from the fact that assigning references (to strings) from one thread's heap to another thread's is inherently unsafe if you don't know what you are doing, the heap containing the strings will be destroyed when the thread terminates, which is where the segmentation fault comes from.
You can use channels to transmit values between threads in a way that doesn't create cross-heap references, or you can use --gc:boehm (which uses a conservative and thread-safe mark-and-sweep collector with a single shared heap instead of the deferred reference counting implementation with its multiple heaps).
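For illustration, a minimal sketch of the channel approach (illustrative code, old-style Nimrod names; compile with --threads:on):

.. code-block:: nimrod

  # minimal sketch: values sent over a channel are deep-copied,
  # so no cross-heap references are created
  var
    chan: TChannel[string]
    worker: TThread[int]

  proc produce(id: int) {.thread.} =
    # this string lives on the worker's local heap;
    # send() copies it into the channel
    chan.send("hello from thread " & $id)

  open(chan)
  createThread(worker, produce, 1)
  echo chan.recv()      # the receiver gets its own copy
  joinThread(worker)
  close(chan)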
I believe that explicit support for shared heaps is planned for the future.
Thank you for your help Jehan.
I use threads to crunch gigabytes of data, so using channels doesn't seem appropriate (am I wrong?). I tried --gc:boehm but got "could not load: boehmgc.dll" (I'm on Windows 8, 64-bit), and building it myself doesn't work on my system.
I also tried --gc:none, with no effect: shouldn't that have been another option?
Do you know if a shared heap is planned for the next release (0.9.4)?
--gc:none only works for short-running processes or when you're in full control of every allocation (kernels written in Nimrod use it).
Nimrod already provides a shared heap via allocShared, but this requires manual memory management.
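Roughly like this (an illustrative sketch, not a complete program; the record type is made up):

.. code-block:: nimrod

  type
    TResult = object       # flat record: no GC'd fields, safe to share
      key: int64
      value: int

  const n = 1_000_000

  # zeroed block on the shared heap, visible to every thread
  var results = cast[ptr array[0..n-1, TResult]](
    allocShared0(n * sizeof(TResult)))

  results[0].key = 42      # any thread may read or write this memory
  results[0].value = 600

  deallocShared(results)   # manual: nothing frees it for you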
Channels also work efficiently when you cast the ref to a ptr; this way no deep copy is performed. You can use GC_ref and GC_unref on the ref to ensure it lives long enough for any thread working on the data. There are other ways to ensure the same; structured fork&join parallelism should work too. One important thing to keep in mind is that you can't transfer ownership between threads when using Nimrod's GC. For instance, having multiple threads build up some data structure in parallel cannot work at all.
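In sketch form (illustrative names; the dummy thread parameter only exists to satisfy createThread's signature):

.. code-block:: nimrod

  type
    TData = object
      rows: seq[string]

  var chan: TChannel[ptr TData]   # the ptr itself is copied, not the data

  proc consume(dummy: int) {.thread.} =
    let p = chan.recv()           # no deep copy: we only got a pointer
    echo p.rows.len               # read-only access to the owner's data

  proc main() =
    var d: ref TData
    new(d)
    d.rows = @["AAA", "BBB", "CCC"]
    GC_ref(d)                     # pin it so the owner's GC keeps it alive

    var th: TThread[int]
    open(chan)
    createThread(th, consume, 0)
    chan.send(cast[ptr TData](d)) # cast ref -> ptr, as described above
    joinThread(th)
    close(chan)

    GC_unref(d)                   # release once no thread needs it anymore

  main()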
Shared memory requires a good ownership model to be correct (no race conditions, data races, or deadlocks), so personally I don't consider Nimrod's way of conflating it with manual memory management too bad. However, with the current model there are lots of traps that we really need to document.
A better concurrency model is in the works, but it is unlikely to happen before version 0.9.8. The major blocking issue is that the design I favor is really complex...
Araq, thank you for this extremely interesting and thorough answer. I will definitely try these options.
BTW can we donate to the Nimrod project?
Thanks for the links dom96 (it's the only way I can contribute to Nimrod at the moment).
allocShared works well, but a "lock-free hash table" (I saw this name in the IRC logs) would probably be a better fit for my needs. Can we expect this in the near future?
Thank you for your help MFlamer. I will explain in a bit more detail what I want to do.
Basically I need to perform aggregations of key/value pairs:
====== ======
key    value
====== ======
AAA    100
BBB    200
CCC    300
BBB    400
AAA    500
====== ======
should be converted into:
====== ======
key    value
====== ======
AAA    600
BBB    600
CCC    300
====== ======
The source table has between 100M and 1G rows.
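In code, the aggregation step amounts to something like this toy, single-threaded sketch:

.. code-block:: nimrod

  import tables

  # toy version of the aggregation: sum values per key
  var agg = initTable[string, int]()
  for k, v in items([("AAA", 100), ("BBB", 200), ("CCC", 300),
                     ("BBB", 400), ("AAA", 500)]):
    if agg.hasKey(k): agg[k] = agg[k] + v
    else: agg[k] = v

  for k, v in pairs(agg):
    echo k, " ", v    # AAA 600, BBB 600, CCC 300 (in some order)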
At the moment, my main process splits the source table into several parts and assigns each one to a thread. Each thread performs an aggregation with its own hash table and writes the result into allocShared memory. Then the main process aggregates all these sub-aggregations with one final hash table.
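As a rough sketch of that design (illustrative names and sizes, not my actual code, and untested):

.. code-block:: nimrod

  import tables

  const
    nThreads = 4
    maxKeys  = 1_000_000

  type
    TPair = object                    # flat record: no GC'd fields
      key: array[0..15, char]         # fixed-size key buffer
      value: int
    TShared = object
      len: int
      pairs: array[0..maxKeys-1, TPair]

  var results: array[0..nThreads-1, ptr TShared]

  proc work(id: int) {.thread.} =
    # sub-aggregation with a hash table on this thread's local heap
    var agg = initTable[string, int]()
    # ... fill `agg` from this thread's part of the source table ...
    # then copy the flat results into shared memory:
    for k, v in pairs(agg):
      var p: TPair
      for i in 0 .. min(len(k), 16) - 1: p.key[i] = k[i]
      p.value = v
      results[id].pairs[results[id].len] = p
      inc(results[id].len)

  var threads: array[0..nThreads-1, TThread[int]]
  for i in 0 .. nThreads-1:
    results[i] = cast[ptr TShared](allocShared0(sizeof(TShared)))
    createThread(threads[i], work, i)
  joinThreads(threads)
  # the main thread then merges results[0..nThreads-1] with one final
  # hash table and frees each block with deallocShared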
I was wondering if I could use your "LockFree HashTable" as the one and only shared hash table used by all the threads. But I don't know: 1/ whether it could work that way, 2/ whether it is the most efficient way of doing it.
And besides, you probably already have a lot of things to do, so feel free to decline. Either way, I really appreciate your offer.
Wow thanks a lot MFlamer! I didn't expect something so soon! Unfortunately I can't try it today but tomorrow I will work my way through it. Thanks again, much appreciated.
@Jehan: I really appreciate your advice, and it sounds like you really are a few years ahead of me. In my current implementation (I say current, but I'm not finished yet) each thread creates and updates its own allocShared block, then, at the end, updates a global allocShared block with just a pointer to its own: is that what you call "multiple shared heaps"? And, if I understand correctly, you are saying it would perform more efficiently than a single allocShared block?
The (difficult to avoid) problem with allocShared() is not how many distinct calls or shared data structures you have in your program, but that allocShared() requires a global lock (because you can't have multiple threads concurrently manipulate the underlying data structures of the allocator). This means that while one thread is calling allocShared(), no other thread can. Because your workload seems to be pretty allocation-heavy, this would result in effective serialization, where threads mostly wait, taking turns to allocate the next chunk of memory.
It should in principle not be difficult to have multiple shared heaps and pass the heap as an additional optional argument to allocShared(), though; this way you can get away without threads blocking each other. I haven't tried it yet, but the code in system/alloc.nim looks straightforward enough. I'm not sure what Araq's plans are for that (the obvious other – but hardly trivial – solution is to get rid of the global heap lock).
@radsoc
I don't see your problem here: you need each thread to do its sub-aggregation in its own thread-local hash table and only write the flat results into shared memory at the end. You can pretend you don't have global variables and pass ptrs to them around instead, but this only makes code review and static analysis harder. Since the threads only need to access the shared memory at the end to store their results, this solution should really fly.
@Jehan: I plan to implement allocShared as a wrapper around the lock-free memory allocator here: http://locklessinc.com/ (maybe later followed by our own implementation). We'll see.
@radsoc: My current solution is to use a (tuned version of) the Boehm GC for parallel programs that require shared memory, which sidesteps most of these issues (at the expense of unpredictable pause times, but I don't really need predictable pause times for the kind of batch processing that I do there).
@Araq: You may still need to think about how you're going to deal with the issue of NUMA architectures if you have a single shared heap.
@Araq: you're right, and that's why I wanted to do some benchmarks. I ran a test case against a 1 GB file (100M rows, 1M distinct keys, and a constant value). On my i7-4500U laptop, I get this:
Duration - thread:3 - 25M k/v pairs aggregated into 1M k/v pairs : 9951 ms - 1M k/v pairs saved to allocShared : 352 ms
Duration - thread:0 - 25M k/v pairs aggregated into 1M k/v pairs : 9963 ms - 1M k/v pairs saved to allocShared : 365 ms
Duration - thread:1 - 25M k/v pairs aggregated into 1M k/v pairs : 10031 ms - 1M k/v pairs saved to allocShared : 362 ms
Duration - thread:2 - 25M k/v pairs aggregated into 1M k/v pairs : 10036 ms - 1M k/v pairs saved to allocShared : 368 ms
Duration - 4M k/v pairs aggregated into 1M k/v pairs : 781 ms
Duration - 1M k/v pairs saved into a JSON file : 576 ms
Duration - Total : 11846 ms
The memory usage always stays below 1.3 GB (<600 MB for the process + memfile). There is still room for improvement, but I'm quite happy with these results. The bottleneck is really the sub-aggregation done by each thread. I will try the lock-free hash table.
@Jehan: Thank you for sharing your current solution, but unfortunately I didn't manage to compile the Boehm GC on my Windows machine.