I have been pulling my hair out over a strange slow-down when using spawn. Things worked fine on Mac, but on Linux my multithreaded Nim program was slower than my multiprocessing Python program.
Both the Nim and Python versions are mostly C code, the exact same C code in fact. When I run everything in the main thread of a single process, Nim is a little faster than Python, probably because of start-up time. I added microsecond timing to the C code, so I am fairly confident that the slow-down for threaded code is in the actual C code, not in Nim thread setup (deep-copying etc.).
The C function takes about 0.03s per call from Nim, single-process and main thread. It takes about the same from Python in that configuration, and also when using Python's multiprocessing module. But as soon as I use Nim threads, the C code takes about 0.25s for each call.
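For anyone who wants to reproduce the measurement: the post does not show the actual instrumentation, so the following is only a minimal sketch of microsecond timing around a C call. The names now_us, time_call_us and do_work are hypothetical, and do_work is a stand-in for the real function.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() under strict -std=c99/c11 */
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current time in microseconds on a monotonic clock
   (unaffected by wall-clock adjustments). */
static int64_t now_us(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;  /* silence unused-variable warnings when NDEBUG is set */
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* Hypothetical stand-in for the real C function being measured. */
static void do_work(void) {
    volatile unsigned long acc = 0;
    for (unsigned long i = 0; i < 1000000UL; i++) acc += i;
}

/* Call fn once and return the elapsed time in microseconds. */
static int64_t time_call_us(void (*fn)(void)) {
    int64_t start = now_us();
    fn();
    return now_us() - start;
}
```

Because the timer wraps only the C call, anything Nim does before or after it (spawn overhead, deep copies) stays outside the measured interval.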
I tried compiling with -O2 instead of -O3 (which I think is smart anyway). No apparent difference.
I also tried limiting the Nim threadpool to exactly 1 thread (by modifying Nim/lib/pure/concurrency/threadpool.nim:setup()). Again, no difference.
My best guess for the slow-down is false sharing, but I guess if someone really wants to dive into this bioinformatics example, I could try to provide a full test case.
I'm not sure that it's worth the effort. I think we might actually be better off with a Nim version of Python's multiprocessing. It solves so many problems.
Does such a thing exist? Am I crazy to want this? Thoughts?
For reference (nearly up-to-date):
Does such a thing exist?
No, but osproc plus marshal can give you the building blocks. :-)
Interesting issue.
I've narrowed the problem down to a single line of code: a calloc() of 2MB to 20MB, called many times. I guess the standard malloc/calloc uses a locking call for large allocations.
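A rough way to check this outside of Nim is to time the same calloc()/free() loop once on the calling thread and once on a freshly created pthread. This is only a sketch under assumptions: alloc_job, alloc_loop and alloc_loop_in_thread are hypothetical names, and the loop touches one byte per 4 KiB so the allocation is actually backed by memory.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

typedef struct {
    size_t bytes;        /* size of each calloc() request */
    int reps;            /* number of calloc()/free() rounds */
    int64_t elapsed_us;  /* filled in by alloc_loop() */
} alloc_job;

static int64_t mono_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* Allocate, touch, and free one zeroed buffer `reps` times; record total time.
   The void* signature lets the same function serve as a pthread entry point. */
static void *alloc_loop(void *arg) {
    alloc_job *job = arg;
    int64_t start = mono_us();
    for (int i = 0; i < job->reps; i++) {
        unsigned char *buf = calloc(job->bytes, 1);
        assert(buf != NULL);
        /* Volatile writes keep the compiler from optimizing the buffer away. */
        volatile unsigned char *p = buf;
        for (size_t off = 0; off < job->bytes; off += 4096) p[off] = 1;
        free(buf);
    }
    job->elapsed_us = mono_us() - start;
    return NULL;
}

/* Run the same loop on a new thread and return its elapsed microseconds. */
static int64_t alloc_loop_in_thread(alloc_job *job) {
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, alloc_loop, job);
    assert(rc == 0);
    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    (void)rc;
    return job->elapsed_us;
}
```

Calling alloc_loop(&job) directly and then alloc_loop_in_thread(&job) with sizes between 2MB and 20MB should show whether the allocation alone behaves differently off the main thread. Build with -pthread.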
So there are probably several solutions:
This was probably not a case of "False Sharing". Still, I don't quite understand why the same code is fast in the main thread. Wouldn't the large allocation still use a lock?
If you're allocating that much memory, you may be better off using straight up mmap(). That said, it may well be the case that the calloc() implementation simply dispatches to mmap() already for large allocations and it's something in the Linux kernel that's actually the underlying cause.
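As a minimal sketch of the mmap() route (big_alloc and big_free are hypothetical helper names, not part of any library): an anonymous private mapping is zero-filled by the kernel, so it can stand in for calloc() on large buffers. The caller has to remember the length for munmap().

```c
#define _DEFAULT_SOURCE 1  /* exposes MAP_ANONYMOUS under strict -std=c99/c11 */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map `bytes` of zero-filled, private, anonymous memory.
   Returns NULL on failure. */
static void *big_alloc(size_t bytes) {
    assert(bytes > 0);  /* mmap() rejects a zero length */
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Unmap a region obtained from big_alloc(); `bytes` must match the
   length that was mapped. Returns 0 on success, -1 on error. */
static int big_free(void *p, size_t bytes) {
    return munmap(p, bytes);
}
```

This bypasses the malloc implementation entirely, so if the slow-down persists with it, the cause is more likely in the kernel's mapping and page-fault path than in the allocator.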
As for multiprocessing, I'd use fork() instead of osproc. Also, marshal has some limitations in its current incarnation: