I am trying to experiment with various sorting algorithms on big files. These files are structured into lines of fixed length. Hence a natural approach is to map the files into memory, split them into n chunks, where n is the number of processors (no problem, as line lengths are fixed), sort the lines in each chunk in parallel, and finally reassemble the pieces with a merge operation.
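The mapping step itself is straightforward with the stdlib memfiles module; a minimal sketch of just that part (the file name and line length are illustrative):

import memfiles

const lineLen = 100                  # illustrative; all lines have this length
var mf = memfiles.open("big.txt", mode = fmReadWrite)
let step = lineLen + 1               # one line plus its newline = one chunk
let lines = mf.size div step         # fixed length makes splitting trivial
echo "mapped ", mf.size, " bytes = ", lines, " lines"
mf.close()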
I am not sure what the best way to do this is in Nim. On the one hand, I will have to work on unsafe data - essentially a giant array of bytes. I have written some procs that expose a somewhat safer interface to this blob, taking care of proper indexing and iteration. So I have something along the lines of
type
  cell* = uint8
  UncheckedArray{.unchecked.}[T] = array[1, T]
  Chunk* = object
    size: int
    data: ptr UncheckedArray[cell]
  Blob* = object
    size: int   # number of chunks
    step: int   # cells per chunk
    data: ptr UncheckedArray[cell]

# pointer to the start of the i-th chunk; indexing an unchecked array
# past its declared bound is legal, so this is pointer arithmetic in disguise
template dataAt(blob: Blob, offset: int): expr =
  cast[ptr UncheckedArray[cell]](addr blob.data[blob.step * offset])

proc chunkSize*(blob: Blob): int {.inline.} = blob.step * sizeof(cell)

proc `[]`*(blob: Blob, i: int): Chunk =
  assert i >= 0 and i < blob.size
  Chunk(size: blob.step, data: blob.dataAt(i))

proc `[]`*(blob: Blob, s: Slice[int]): Blob =
  assert s.a >= 0 and s.b < blob.size
  # slices are inclusive, hence the + 1
  Blob(size: s.b - s.a + 1, step: blob.step, data: blob.dataAt(s.a))

proc `[]=`*(blob: var Blob, i: int, chunk: Chunk) =
  assert i >= 0 and i < blob.size and blob.step == chunk.size
  copyMem(dest = blob.dataAt(i), source = chunk.data, size = chunkSize(blob))
and so on. Essentially I would like to treat this blob as a collection of chunks of fixed size (the line length + 1) and redefine the various operations to make sure that client code never makes errors such as pointing halfway through a chunk.
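For instance, the iteration mentioned above boils down to an items iterator over chunks, roughly along these lines (the radix pass further down relies on it):

iterator items*(blob: Blob): Chunk =
  # yield each fixed-size chunk in order, so client code can write
  # "for d in data" without touching raw offsets
  for i in 0 ..< blob.size:
    yield blob[i]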
On the other hand, it seems that the parallel and spawn constructs perform compile-time checks on array bounds and disjointness. Is it possible to make them work with structures like the above? Or is there a better way to parallelize this task?
EDIT: I have tried using a parallel block:
import threadpool  # parallel/spawn live here; compile with --threads:on

parallel:
  for i in 0 ..< cores:
    if i < cores - 1:
      var slice = content[((i * size) div cores) ..< (((i + 1) * size) div cores)]
      spawn mysort(slice)
    else:
      var slice = content[((i * size) div cores) ..< size]
      spawn mysort(slice)
This does compile, but running it on 4 cores takes 4 times the time it takes running it single-threaded. I guess this is due to some implicit locking that Nim introduces somewhere, but I have no clue where to look.
"This does compile, but running it on 4 cores takes 4 times the time it takes running it single-threaded"
That is strange. Please try running it with a profiler.
proc pass(data: var Blob, shift: int) =
  # one pass of LSD radix sort: distribute the chunks into 256 buckets
  # keyed on the byte at position `shift`, then write them back in
  # bucket order; nthByte, copy and free are helpers not shown here
  var buckets: seq[seq[Chunk]] = newSeq[seq[Chunk]](256)
  for i in 0 ..< 256:
    buckets[i] = @[]
  for d in data:
    let b = nthByte(d, shift).int
    buckets[b].add(copy(d))   # copy, because this memory is overwritten below
  var count = 0
  for bucket in buckets:
    for d in bucket:
      data[count] = d
      free(d)
      count += 1

proc radix*(data: var Blob) =
  # sort lexicographically: one pass per byte, from the least significant
  # (the last byte of each line) to the most significant (the first)
  for i in countdown(chunkSize(data) - 1, 0):
    pass(data, i)
I have made a few modifications since I first posted - in particular, the intermediate structures that I use during the sort (the buckets) are now sequences instead of blobs. This has improved the multithreaded running time: it now takes 20 seconds with one core and 34 seconds with four cores.
Still, I expected a 4x speedup, since what I am doing is embarrassingly parallel (at least the first step of sorting the four slices independently - I still have to merge them together).
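For that final step the plan is a simple k-way merge of the sorted slices; a sketch of what I have in mind, where cmpChunks is a hypothetical byte-wise comparison helper (not part of the code above):

proc cmpChunks(a, b: Chunk): int =
  # chunks have equal fixed size, so a plain memory comparison works
  cmpMem(a.data, b.data, a.size)

proc merge(slices: seq[Blob], output: var Blob) =
  # repeatedly pick the slice whose current head chunk is smallest;
  # output must be a separate buffer, since writing back in place
  # would clobber chunks that have not been read yet
  var heads = newSeq[int](slices.len)   # next unread chunk per slice
  for count in 0 ..< output.size:
    var best = -1
    for i in 0 ..< slices.len:
      if heads[i] < slices[i].size and (best < 0 or
          cmpChunks(slices[i][heads[i]], slices[best][heads[best]]) < 0):
        best = i
    output[count] = slices[best][heads[best]]
    heads[best] += 1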
Does Nim introduce some locking when running code in a parallel block? Also, I am not sure how to run Nim programs through a profiler - any hints?
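From the manual it looks like Nim ships an embedded sampling profiler, so something along these lines might be the way (I have not tried it yet; the file name is illustrative):

# in the main module:
import nimprof

# then compile and run with:
#   nim c --profiler:on --stackTrace:on -d:release mysort.nim
#   ./mysort
# which should write a sampled call graph to profile_results.txt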
I have tried reducing this to the bare minimum, but it is quite long for a forum post, and it links against the Intel Performance Primitives, so I am not sure how to share it.
But in short, what it does is map the file, wrap the mapping in a Blob, split it into one slice per core, and spawn the radix sort above on each slice inside a parallel block.
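In sketch form, where fromMemory is a hypothetical constructor that wraps the mapped memory in a Blob (everything else uses the definitions above):

import memfiles, threadpool
from osproc import countProcessors

const lineLen = 100  # illustrative

proc mysort(slice: Blob) =
  # Blob is just a (pointer, size, step) view, so a by-value copy
  # still sorts the underlying mapped memory in place
  var s = slice
  radix(s)

proc main() =
  var mf = memfiles.open("input.txt", mode = fmReadWrite)
  var content = fromMemory(mf.mem, mf.size, lineLen + 1)  # hypothetical helper
  let cores = countProcessors()
  let size = content.size
  parallel:
    for i in 0 ..< cores:
      if i < cores - 1:
        var slice = content[((i * size) div cores) ..< (((i + 1) * size) div cores)]
        spawn mysort(slice)
      else:
        var slice = content[((i * size) div cores) ..< size]
        spawn mysort(slice)
  # a k-way merge of the sorted slices, as sketched earlier, would follow
  mf.close()

main()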
I would expect each thread to take about 1/4 of the total time, but in fact there is an improvement of only about 20%. What I would like to understand is whether the parallel/spawn mechanism of Nim in fact performs some form of locking that I do not see in my client code. Do you have any ideas on how to track this down?