nimforum mirror - Threads and global variables

radsoc (orginal) [2013-12-01T19:14:55+01:00] view original

First, thank you Araq for creating such a fantastic language.

I just discovered Nimrod last week, so sorry for my noob question but I'm a bit stuck with threads and global variables.

If I define and fill an array (or a seq) and then update it in subsequent threads, I can no longer access this array after the "joinThreads".

Here is a sample code http://pastebin.com/MaX3J4U1

If I only read my array in my threads, the array remains untouched.

I tried everything I could find (pointers, ref, GC_ref/unref) but none worked.

Has someone a hint for me ? Thanks

Jehan (orginal) [2013-12-01T19:59:22+01:00] view original

What happens is that each thread has its own thread-local heap (similar to what Erlang does). Aside from the fact that assigning references (to strings) from one thread's heap to another thread's is inherently unsafe if you don't know what you are doing, the heap containing the strings will be destroyed when the thread terminates, which is where the segmentation fault comes from.

You can use channels to transmit values between threads in a way that doesn't create cross-heap references, or you can use --gc:boehm (which uses a conservative and thread-safe mark and sweep collector with a single shared heap instead of the deferred reference counting implementation with its multiple heaps).

I believe that explicit support for shared heaps is planned for the future.

radsoc (orginal) [2013-12-01T21:50:09+01:00] view original

Thank you for your help Jehan.

I use threads for crunching GB of data so using channels seems not appropriate (am I wrong ?). I tried --gc:boehm but got "could not load: boehmgc.dll" (I'm on Windows 8 64 bits) and building it doesn't work on my system.

But I also tried --gc:none with no effect : it should have been another option, no ?

Do you know if shared heap is planned for the next release (0.9.4) ?

Araq (orginal) [2013-12-01T22:24:17+01:00] view original

--gc:none only works for short running processes or when you're in full control over every allocation (kernels written in Nimrod use it).

Nimrod already provides a shared heap via allocShared but this requires manual memory management.

Channels also work efficiently when you cast the ref to a ptr. This way no deep copy is performed. You can use GC_ref and GC_unref the ref to ensure it lives long enough for any thread working on the data. There are other ways to ensure the same, structured fork&join parallelism should work too. One important thing you need to keep in mind is that you can't transfer ownership between threads when using Nimrod's GC. For instance, having multiple threads building up some data structure in parallel cannot work at all.

Shared memory requires a good ownership model to be correct (= no race conditions, data races and deadlocks), so personally I consider Nimrod's way to conflate it with manual memory management not too bad. However with the current model there are lots of traps that we really need to document.

A better concurrency model is in the works but is unlikely to happen before version 0.9.8. Major blocking issue is that the design I favor is really complex...

radsoc (orginal) [2013-12-01T23:03:44+01:00] view original

Araq, thank you for this extremely interesting and thorough answer. I will definitely try these options.

BTW can we donate to the Nimrod project?

dom96 (orginal) [2013-12-02T01:41:09+01:00] view original

Yes, there are donation links in the community tab on the homepage: http://nimrod-lang.org/community.html

radsoc (orginal) [2013-12-02T15:23:53+01:00] view original

Thanks for the links dom96 (it's the only way I can contribute to Nimrod at the moment).

allocShared works well but a "lockfree hash table" (I saw this name in the irc logs) would probably be a better fit for my needs. Can we expect this in the near future?

MFlamer (orginal) [2013-12-02T17:58:12+01:00] view original

I have started an implementation of a lock Free Hash Table in Nimrod. It is somewhat functioning depending on your requirements. If you are serious about using it, let me know more about what you are trying to do and I will work on it some more. I've recently gotten side tracked on some other projects.

radsoc (orginal) [2013-12-02T19:10:58+01:00] view original

Thank you for your help MFlamer. I will explain a little more what I want to do.

Basically I need to perform aggregations of key/value pairs:

====== ====== key value ====== ====== AAA 100 BBB 200 CCC 300 BBB 400 AAA 500 ====== ======

should be converted into:

====== ====== key value ====== ====== AAA 600 BBB 600 CCC 300 ====== ======

The source table has between 100M and 1G rows.

At the moment, my main process splits the source table into several parts and assigns each one a thread. Each thread performs an aggregation with its own Hash Table and updates an allocShared memory with the result. Then the main process aggregates all these sub-aggregations with another last Hash Table.

I was wondering if I could use your "LockFree HashTable" as the one and only shared Hash Table used by all the threads. But I don't know: 1/ if this could work that way 2/ if this is the most efficient way of doing.

And besides, you probably already have a lot of things to do, so feel free to decline. Either way, I really appreciate your offer.

MFlamer (orginal) [2013-12-03T06:26:38+01:00] view original

radsoc, The table is in the nimrod compiler repo at https://github.com/Araq/Nimrod/blob/master/lib/pure/collections/LockFreeHash.nim. The compiler is really easy to download and bootstrap if you don't already have this version. I'll be on IRC tomorrow and also try to check the logs daily. Ping me if you want some help. I'd be happy to have someone put it to use and it will probably motivate me to continue working on it. If you compile lockfreehash.nim and run, you will see the main limitation of the table. The test case there uses a string as the key and a tuple as the values. Because the underlying atomic operations are limited in size, the table actually creates copies of these in shared memory and stores the pointers in the table. The table does not have any versioning scheme yet (that's next) so, it can't guarantee the value you are working with is the most recent. If you can live with a 30bit int as the value it is stored directly in the table and this is not a problem, there will be no conflicts. I would be interested to see how it performs against your existing strategy.

radsoc (orginal) [2013-12-03T09:40:53+01:00] view original

Wow thanks a lot MFlamer! I didn't expect something so soon! Unfortunately I can't try it today but tomorrow I will work my way through it. Thanks again, much appreciated.

@Jehan : I really appreciate your advice and it sounds like you really are a few years ahead of me. In my current implementation (I say current but I'm not finished yet) each thread creates and updates its own sharedAlloc, then, at the end, updates a global sharedAlloc with only a pointer to its own sharedAlloc : is it what you call "multiple shared heaps"? And, if I understand correctly, you say it would perform more efficiently than with a single sharedAlloc?

Jehan (orginal) [2013-12-03T11:01:35+01:00] view original

The (difficult to avoid) problem with allocShared() is not how many distinct calls or shared data structures you have in your program, but that allocShared() requires a global lock (because you can't have multiple threads concurrently manipulate the underlying data structures of the allocator). This means that while one thread is calling allocShared(), no other thread can. Because your workload seems to be pretty allocation-heavy, this would result in effective serialization, where threads mostly wait taking turns to allocate the next chunk of memory.

It should in principle not be difficult to have multiple shared heaps and pass the heap as an additional optional argument to allocShared(), though; this way you can get away without threads blocking each other. I haven't tried it yet, but the code in system/alloc.nim looks straightforward enough. I'm not sure what Araq's plans are for that (the obvious other – but hardly trivial – solution is to get rid of the global heap lock).

radsoc (orginal) [2013-12-03T16:34:54+01:00] view original

Very interesting, Jehan, thank you for sharing your insights and expertise. But unfortunately what you find "difficult" is just like the Great Wall Of China for me. I've seen the AcquireSys(HeapLock) at the beginning of the allocShared proc but I guess deleting it is not an option. Maybe I'd better wait for Araq's next concurrency model. But in the meantime, I will do some benchmarks with the different available options. Thanks again.

Araq (orginal) [2013-12-03T17:08:49+01:00] view original

@radsoc

I don't see your problem here: You need

a global shared table (allocShared) GST

a lock for that

threads working on their own TTable/TCountTable (you can use the thread local GC for this, so the stdlib's collections work fine)

when a thread is done, it grabs the lock and merges its result into GST

You can pretend you don't have global variables and pass ptr's to them around instead but this only makes code review and static analysis harder. Since only at the end threads need to access the shared memory to store the results, this solution should really fly.

@Jehan I plan to implement allocShared as a wrapper to the lockfree memory allocator here: http://locklessinc.com/ Maybe later followed by our own implementation. We'll see.

Jehan (orginal) [2013-12-04T00:17:48+01:00] view original

@radsoc: My current solution is to use a (tuned version of) the Boehm GC for parallel programs that require shared memory, which sidesteps most of these issues (at the expense of unpredictable pause times, but I don't really need predictable pause times for the kind of batch processing that I do there).

@Araq: You may still need to think about how you're going to deal with the issue of NUMA architectures if you have a single shared heap.

Araq (orginal) [2013-12-04T00:38:02+01:00] view original

Well I don't plan to get rid of the thread local heaps altogether. But there is no reason we can't try to make allocShared as fast as reasonable.

Jehan (orginal) [2013-12-04T01:17:19+01:00] view original

What I'm talking about is that with a single shared heap you can't really control memory affinity for shared data. That's a separate issue from having thread-local heaps.

radsoc (orginal) [2013-12-04T16:42:58+01:00] view original

@Araq : you're right, and that's why I wanted to do some benchmarks. I performed a test case against a 1GB file (100M rows - 1M different keys - and value is a const). On my i7-4500U laptop, I get this:


Duration - thread:3 - 25M k/v pairs aggregated into 1M k/v pairs : 9951 ms - 1M k/v pairs saved to allocShared : 352 ms
Duration - thread:0 - 25M k/v pairs aggregated into 1M k/v pairs : 9963 ms - 1M k/v pairs saved to allocShared : 365 ms
Duration - thread:1 - 25M k/v pairs aggregated into 1M k/v pairs : 10031 ms - 1M k/v pairs saved to allocShared : 362 ms
Duration - thread:2 - 25M k/v pairs aggregated into 1M k/v pairs : 10036 ms - 1M k/v pairs saved to allocShared : 368 ms
Duration - 4M k/v pairs aggregated into 1M k/v pairs : 781 ms
Duration - 1M k/v pairs saved into a JSON file       : 576 ms
Duration - Total                                     : 11846 ms

The memory usage always stays below 1,3G (<600M for the process + memfile). There is still room for improvements but I'm quite happy with these results. The bottleneck is really the sub-aggregation done by each thread. I will try the lock free hash table.

@Jehan : Thank you for unveiling your current solution, but unfortunately I didn't manage to compile GC Boehm on my windows machine

MFlamer (orginal) [2013-12-04T17:24:42+01:00] view original

Cool, ping me on IRC if you need any help

Mirror of forum.nim-lang.org

307 :: Threads and global variables