I am trying to parallelize some code and started looking at the actors module. However, I wasn't getting the expected boost from completely independent parallel tasks, so I started writing a small test case:
import os, times, actors, osproc

const
  num_files = 20
  delay = 500

proc worker(a: int) {.thread.} =
  # Does heavy stuff
  os.sleep(a)

proc serial_test() =
  echo "Running serial test"
  for i in 1 .. num_files: worker(delay)

proc parallel_test() =
  let cpus = countProcessors()
  echo "Running in parallel with cpu count: ", cpus
  var pool: TActorPool[int, void]
  pool.createActorPool(cpus)
  for i in 1 .. num_files:
    pool.spawn(delay, worker)
  pool.sync()

proc test() =
  when defined(release): echo "Running in release mode"
  else: echo "Running in debug mode"
  let t1 = epoch_time()
  serial_test()
  let t2 = epoch_time()
  parallel_test()
  let t3 = epoch_time()
  echo "Spent in serial queue ", $(t2 - t1)
  echo "Spent in parallel queue ", $(t3 - t2)

when isMainModule:
  test()
  echo "Test finished successfully"
When I compile and run this in debug mode I get:
Running in debug mode
Running serial test
Running in parallel with cpu count: 8
Spent in serial queue 10.0160059928894
Spent in parallel queue 4.069833993911743
Test finished successfully
Given that there are 20 tasks of 500 ms each and the machine has 8 cores, I was expecting the parallel version to run in about 1.5 s. Is the slowdown to 4 s due to the polling mentioned in the documentation? I was indeed hoping to parallelize tasks of under 1 s each, so maybe I should look at other implementations. Is there a more optimal parallel queue-consumer example somewhere that runs on 0.9.4?
But more surprising was the release mode version:
Running in release mode
Running serial test
Running in parallel with cpu count: 8
Spent in serial queue 10.01532292366028
Spent in parallel queue 10.07979393005371
Test finished successfully
Polling really takes its sweet time… Your problem is that os.sleep does idle waiting, which doesn't utilize the CPU cores at all, so any speedup at all is an implementation artifact.
Use threads that do actual work, e.g.:
proc worker(a: int) {.thread.} =
  # Does heavy stuff
  for i in 1 .. a * 2000000:
    var x {.volatile.} = i
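To sanity-check the fix before plugging it back into the pool, a minimal sketch (assuming the 0.9.4-era epochTime from the times module) that times one busy-loop task serially, so you can see it actually burns CPU time instead of idling:

```nim
import times

proc worker(a: int) {.thread.} =
  # Busy loop standing in for real work; the volatile var keeps
  # the compiler from optimizing the whole loop away in release mode.
  for i in 1 .. a * 2000000:
    var x {.volatile.} = i

when isMainModule:
  let t0 = epochTime()
  worker(500)
  echo "One task took ", epochTime() - t0, " seconds"
```

With a worker like this, the serial run should take roughly num_files times one task's runtime, and the pooled run should shrink towards that divided by the core count (subject to the Turbo Boost and load-imbalance caveats below).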
Other problems you may run into: Intel's Turbo Boost and similar technologies mean that you can't always get an N-fold speedup from an N-core processor. Also, if the runtimes of individual tasks differ too much, you may only be able to utilize part of your cores towards the end of a computation.