nimforum mirror - Parallel example for computing pi efficiently is actually slow

onde [2020-02-10T04:17:28+01:00]

Dear all,

I've timed the parallel example from the docs and found that it runs much slower than a single-threaded version. With top -H I can see that the threads are idle or underutilized. See below for the code. I've bumped the number of iterations to 1,000,000, set the number of threads to the number of processors on my machine (also tried with the default), and compiled with nim c -d:release and --threads:on as appropriate.

Wall clock times:

no threading: 0:00.09

parallel: 0:21.85
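
Timings like these can also be taken inside the program instead of with an external timer. The following is only a minimal sketch of one way to do that, not the method used for the numbers above; leibnizPi and timed are hypothetical names introduced for illustration.

```nim
import std/[monotimes, times]

proc leibnizPi(n: int): float =
  ## Sequential sum of the Leibniz terms for k = 0..n.
  var sign = 1.0
  for k in 0..n:
    result += 4.0 * sign / (2.0 * float(k) + 1.0)
    sign = -sign

template timed(label: string, body: untyped) =
  ## Runs `body` once and prints the elapsed wall-clock time.
  let start = getMonoTime()
  body
  echo label, ": ", (getMonoTime() - start).inMicroseconds, " us"

var approx = 0.0
timed("sequential"):
  approx = leibnizPi(1_000_000)
echo approx
```

Wrapping the threaded variant in the same template would give a like-for-like comparison within one binary.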

Am I misusing this code or misunderstanding something?

Thanks, Andreas

No threading:

import strutils, math

proc term(k: float): float = 4 * math.pow(-1, k) / (2*k + 1)

proc pi(n: int): float =
  var ch = newSeq[float](n+1)
  for k in 0..ch.high:
    ch[k] = term(float(k))
  for k in 0..ch.high:
    result += ch[k]

echo formatFloat(pi(1000000))

Parallel version:

# Compute PI in an inefficient way
import strutils, math
import threadpool
import cpuinfo
{.experimental: "parallel".}

let nProc = countProcessors()
setMaxPoolSize(nProc)

proc term(k: float): float = 4 * math.pow(-1, k) / (2*k + 1)

proc pi(n: int): float =
  var ch = newSeq[float](n+1)
  parallel:
    for k in 0..ch.high:
      ch[k] = spawn term(float(k))
  for k in 0..ch.high:
    result += ch[k]

echo formatFloat(pi(1000000))

Yardanico [2020-02-10T08:21:30+01:00]

Even the first line of the parallel version says "# Compute PI in an inefficient way". It doesn't really make it faster. It just shows how parallel/spawn can be used.

onde [2020-02-10T08:37:17+01:00]

Argh. Looks like I wanted to read this differently

DeletedUser [2020-02-10T08:46:12+01:00]

Maybe try Weave. https://github.com/mratsim/weave

treeform [2020-02-10T18:55:01+01:00]

It uses threads to do a very small amount of work: 4 * math.pow(-1, k) / (2*k + 1). Creating a thread is a very heavyweight operation, while doing a little math is really cheap, so most of the time is spent on thread bookkeeping. Your computation needs to justify running in a thread, which this one does not: it's just an example of how to use parallel, not an efficient implementation.

DIzer [2020-02-11T00:52:41+01:00]

The example is very bad. If you want to write multithreaded apps, you should consider the following facts:

At the low level, threads (and processes) are OS objects and are managed by the OS (the so-called "parallel abilities" of any programming language are just sugar on top of them).

Threads and processes are limited resources, and each carries overhead: as much as 1-4 MB per process (and ~100 KB per thread).

It takes time to create a thread and to switch between threads: roughly 30,000 processor cycles to create a thread and roughly 2,000-3,000 cycles to switch between them.

Typically, the number of threads for numerical calculations should not exceed the number of workers (cores) in your system.
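
As a rough illustration of the last point, the work can be split into one coarse task per core, with each task summing its own slice of the series and returning a single float. This is only a sketch under assumptions: partialSum and chunkedPi are hypothetical names, the sign is alternated by parity instead of calling math.pow, and it needs to be compiled with --threads:on wherever that is not the default.

```nim
import std/[threadpool, cpuinfo]

proc partialSum(lo, hi: int): float =
  ## Sums the Leibniz terms for k in lo..hi (inclusive) inside one task.
  for k in lo..hi:
    let sign = if k mod 2 == 0: 1.0 else: -1.0
    result += 4.0 * sign / (2.0 * float(k) + 1.0)

proc chunkedPi(n: int): float =
  ## Spawns one task per detected core instead of one task per term.
  let parts = max(1, countProcessors())  # countProcessors may return 0
  let total = n + 1                       # terms k = 0..n
  var pending = newSeq[FlowVar[float]](parts)
  for i in 0 ..< parts:
    let lo = i * total div parts
    let hi = (i + 1) * total div parts - 1  # empty range if lo > hi
    pending[i] = spawn partialSum(lo, hi)
  for fv in pending:
    result += ^fv                         # blocks until that task is done

echo chunkedPi(1_000_000)
```

With this layout the per-task bookkeeping is paid only a handful of times, rather than once per term.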

onde [2020-02-11T04:08:13+01:00]

Yeah I didn't think about this carefully. Thanks for all your replies

rforcen [2021-08-09T12:36:45+02:00]

I create countProcessors() (CP) threads, each of which processes a chunk of size ch.len/CP.


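import math, threadpool, cpuinfo
{.experimental: "parallel".}

proc mt_pi(n: int): float =
  # Fills the part of ch assigned to task i: n full chunks plus one remainder chunk.
  proc term(i, n: int, ch: var seq[float]) =
    let
      size = ch.len
      chunk_sz = size div n
      rfrom = i * chunk_sz
      # task i == n is the extra one and takes everything left after n full chunks
      rto = if i >= n: size else: (i+1) * chunk_sz

    for index in rfrom..<rto: # process a chunk in this thread
      ch[index] = 4 * math.pow(-1, index.float) / (2*index.float + 1)

  let nth = countProcessors()

  var ch = newSeq[float](n+1)

  parallel:
    for k in 0..nth: # nth full chunks plus the remainder chunk
      spawn term(k, nth, ch)

  for k in 0..ch.high: result += ch[k]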
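
The chunk arithmetic is the part that is easiest to get subtly wrong, so here is a small stand-alone sketch of one alternative way to compute the bounds. It is an assumption for illustration, not the code from the post: chunkBounds is a hypothetical helper that splits 0 ..< size into exactly `parts` contiguous ranges whose lengths differ by at most one, so no separate remainder chunk is needed.

```nim
# Hypothetical helper: index range of chunk i when 0 ..< size is split
# into `parts` contiguous chunks of nearly equal length.
proc chunkBounds(size, i, parts: int): Slice[int] =
  let lo = i * size div parts        # first index of chunk i
  let hi = (i + 1) * size div parts  # one past the last index of chunk i
  lo ..< hi

# Example: 11 items over 4 chunks; every index lands in exactly one chunk.
for i in 0 ..< 4:
  echo i, ": ", chunkBounds(11, i, 4)
```

Because consecutive chunks share their boundary (the `hi` of chunk i is the `lo` of chunk i+1), the ranges cover every index exactly once for any size and any positive number of parts.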

juancarlospaco [2021-08-09T14:32:45+02:00]

If you are interested, send a pull request to the documentation IMHO...

rforcen [2021-08-09T19:45:31+02:00]

A first attempt at a more general solution. I'm considering the parallel implementation in Rust's rayon crate, which provides parallel iterators, maps, etc.


proc par_apply*[T](v: var seq[T], fnc: proc(i: int): T) =
  # Index range handled by task i; the task with i == nth takes the remainder.
  proc chunk_range(size, i, nth: int): Slice[int] =
    let
      chunk_sz = size div nth
      rfrom = i * chunk_sz
      rto = if i >= nth: size else: (i+1) * chunk_sz
    rfrom..<rto

  # Applies fnc to every index in the chunk assigned to task i.
  proc chunk_apply(fnc: proc(i: int): T, i, n: int, v: var seq[T]) =
    for j in chunk_range(v.len, i, n):
      v[j] = fnc(j)

  let nth = countProcessors()

  parallel:
    for i in 0..nth: # nth full chunks plus the remainder chunk
      spawn chunk_apply(fnc, i, nth, v)

rforcen [2021-08-09T19:53:01+02:00]

Good idea. As mentioned, Rust's rayon parallel implementation is quite convenient, as it provides parallel iterators such as (0..n).into_par_iter().map(|index| <closure>).collect()

Mirror of forum.nim-lang.org
