I am trying to experiment with various sorting algorithms on big files. These files are structured into lines of fixed length. Hence a natural approach is to map the files into memory, split them into n chunks, where n is the number of processors (no problem, as line lengths are fixed), sort the lines in each chunk in parallel, and finally reassemble the pieces with a merge operation.
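The mapping step itself is straightforward with the stdlib memfiles module; a minimal sketch of just that part (the file name and line length are illustrative):

import memfiles

const lineLen = 100                  # illustrative; all lines have this length
var mf = memfiles.open("big.txt", mode = fmReadWrite)
let step = lineLen + 1               # one line plus its newline = one chunk
let lines = mf.size div step         # fixed length makes splitting trivial
echo "mapped ", mf.size, " bytes = ", lines, " lines"
mf.close()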
I am not sure what the best way to do this is in Nim. On the one hand, I will have to work on unsafe data - essentially a giant array of bytes. I have written some procs that expose a somewhat safer interface to this blob, taking care of proper indexing and iteration. So I have something along the lines of
type
  cell* = uint8
  UncheckedArray{.unchecked.}[T] = array[1, T]
  Chunk* = object
    size: int
    data: ptr UncheckedArray[cell]
  Blob* = object
    size: int   # number of chunks
    step: int   # cells per chunk
    data: ptr UncheckedArray[cell]

# pointer to the start of the i-th chunk; indexing an unchecked array
# past its declared bound is legal, so this is pointer arithmetic in disguise
template dataAt(blob: Blob, offset: int): expr =
  cast[ptr UncheckedArray[cell]](addr blob.data[blob.step * offset])

proc chunkSize*(blob: Blob): int {.inline.} = blob.step * sizeof(cell)

proc `[]`*(blob: Blob, i: int): Chunk =
  assert i >= 0 and i < blob.size
  Chunk(size: blob.step, data: blob.dataAt(i))

proc `[]`*(blob: Blob, s: Slice[int]): Blob =
  assert s.a >= 0 and s.b < blob.size
  # slices are inclusive, hence the + 1
  Blob(size: s.b - s.a + 1, step: blob.step, data: blob.dataAt(s.a))

proc `[]=`*(blob: var Blob, i: int, chunk: Chunk) =
  assert i >= 0 and i < blob.size and blob.step == chunk.size
  copyMem(dest = blob.dataAt(i), source = chunk.data, size = chunkSize(blob))
and so on. Essentially I would like to treat this blob as a collection of chunks of fixed size (the line length + 1) and redefine the various operations to make sure that client code never makes errors such as pointing halfway through a chunk.
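For instance, the iteration mentioned above boils down to an items iterator over chunks, roughly along these lines (the radix pass further down relies on it):

iterator items*(blob: Blob): Chunk =
  # yield each fixed-size chunk in order, so client code can write
  # "for d in data" without touching raw offsets
  for i in 0 ..< blob.size:
    yield blob[i]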
On the other hand, it seems that the parallel and spawn constructs perform compile-time checks on array bounds and disjointness. Is it possible to make them work with structures like the above? Or is there a better way to parallelize this task?
EDIT: I have tried using a parallel block:
import threadpool  # parallel/spawn live here; compile with --threads:on

parallel:
  for i in 0 ..< cores:
    if i < cores - 1:
      var slice = content[((i * size) div cores) ..< (((i + 1) * size) div cores)]
      spawn mysort(slice)
    else:
      var slice = content[((i * size) div cores) ..< size]
      spawn mysort(slice)
This does compile, but running it on 4 cores takes 4 times the time it takes running it single-threaded. I guess this is due to some implicit locking that Nim introduces somewhere, but I have no clue where to look.
"This does compile, but running it on 4 cores takes 4 times the time it takes running it single-threaded"
That is strange. Please try running it with a profiler.
proc pass(data: var Blob, shift: int) =
  # one pass of LSD radix sort: distribute the chunks into 256 buckets
  # keyed on the byte at position `shift`, then write them back in
  # bucket order; nthByte, copy and free are helpers not shown here
  var buckets: seq[seq[Chunk]] = newSeq[seq[Chunk]](256)
  for i in 0 ..< 256:
    buckets[i] = @[]
  for d in data:
    let b = nthByte(d, shift).int
    buckets[b].add(copy(d))   # copy, because this memory is overwritten below
  var count = 0
  for bucket in buckets:
    for d in bucket:
      data[count] = d
      free(d)
      count += 1

proc radix*(data: var Blob) =
  # sort lexicographically: one pass per byte, from the least significant
  # (the last byte of each line) to the most significant (the first)
  for i in countdown(chunkSize(data) - 1, 0):
    pass(data, i)
I have made a few modifications since I first posted - in particular, the intermediate structures that I use during the sort (the buckets) are now sequences instead of blobs. This has improved the multithreaded running time: it now takes 20 seconds with one core and 34 seconds with four cores.
Still, I expected a 4x speedup, since what I am doing is embarrassingly parallel (at least the first step of sorting the four slices independently - I still have to merge them together).
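For that final step the plan is a simple k-way merge of the sorted slices; a sketch of what I have in mind, where cmpChunks is a hypothetical byte-wise comparison helper (not part of the code above):

proc cmpChunks(a, b: Chunk): int =
  # chunks have equal fixed size, so a plain memory comparison works
  cmpMem(a.data, b.data, a.size)

proc merge(slices: seq[Blob], output: var Blob) =
  # repeatedly pick the slice whose current head chunk is smallest;
  # output must be a separate buffer, since writing back in place
  # would clobber chunks that have not been read yet
  var heads = newSeq[int](slices.len)   # next unread chunk per slice
  for count in 0 ..< output.size:
    var best = -1
    for i in 0 ..< slices.len:
      if heads[i] < slices[i].size and (best < 0 or
          cmpChunks(slices[i][heads[i]], slices[best][heads[best]]) < 0):
        best = i
    output[count] = slices[best][heads[best]]
    heads[best] += 1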
Does Nim introduce some locking when running code in a parallel block? Also, I am not sure how to run Nim programs through a profiler - any hints?
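From the manual it looks like Nim ships an embedded sampling profiler, so something along these lines might be the way (I have not tried it yet; the file name is illustrative):

# in the main module:
import nimprof

# then compile and run with:
#   nim c --profiler:on --stackTrace:on -d:release mysort.nim
#   ./mysort
# which should write a sampled call graph to profile_results.txt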
I have tried reducing this to the bare minimum, but it is quite long for a forum post, and it links against the Intel Performance Primitives, so I am not sure how to share it.
But in short, what it does is map the file, wrap the mapping in a Blob, split it into one slice per core, and spawn the radix sort above on each slice inside a parallel block.
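In sketch form, where fromMemory is a hypothetical constructor that wraps the mapped memory in a Blob (everything else uses the definitions above):

import memfiles, threadpool
from osproc import countProcessors

const lineLen = 100  # illustrative

proc mysort(slice: Blob) =
  # Blob is just a (pointer, size, step) view, so a by-value copy
  # still sorts the underlying mapped memory in place
  var s = slice
  radix(s)

proc main() =
  var mf = memfiles.open("input.txt", mode = fmReadWrite)
  var content = fromMemory(mf.mem, mf.size, lineLen + 1)  # hypothetical helper
  let cores = countProcessors()
  let size = content.size
  parallel:
    for i in 0 ..< cores:
      if i < cores - 1:
        var slice = content[((i * size) div cores) ..< (((i + 1) * size) div cores)]
        spawn mysort(slice)
      else:
        var slice = content[((i * size) div cores) ..< size]
        spawn mysort(slice)
  # a k-way merge of the sorted slices, as sketched earlier, would follow
  mf.close()

main()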
I would expect each thread to take about 1/4 of the total time, but in fact there is an improvement of only about 20%. What I would like to understand is whether the parallel/spawn mechanism of Nim in fact performs some form of locking that I do not see in my client code. Do you have any ideas on how to track this down?