I have something like the following:
let value:int32 = r0[col0].int32 + r0[col1].int32 * 2 + ...
where r0 has type ptr UncheckedArray[uint8].
How can I avoid adding .int32 each time I use an item from the UncheckedArray?
# rename to whatever, col can also be typed
template get(col): int32 = r0[col].int32
let value:int32 = get(col0) + get(col1) * 2 + ...
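If you also want it to work across several row pointers (r0, r1, r2 below) in the same expression, a small typed accessor is another option. This is only a sketch; the Row alias matches the type you already use, but the at name is just illustrative:

# Hypothetical helper: widen on access so the call sites stay clean.
type Row = ptr UncheckedArray[uint8]

func at(r: Row, col: SomeInteger): int32 {.inline.} = r[col].int32

# let value = r0.at(col0) + r1.at(col1) * 2 + ...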
let value:int32 = r0[col0].int32 + r0[col1].int32 * 2 + r0[col2].int32 +
                  r1[col0].int32 * 2 + r1[col1].int32 * 4 + r1[col2].int32 * 2 +
                  r2[col0].int32 + r2[col1].int32 * 2 + r2[col2].int32

w1[col1] = (value / 16.0).uint8
I am looking at how to improve this performance-wise, without getting into SIMD (which I have never used, by the way). I think all those type conversions are killing the performance I should be achieving.
Type conversion from uint8 to uint32 (or uint64) is technically free because registers are 32 or 64 bits wide.
If anything it's less costly, because at a low level loading an 8-bit value requires movzb (mov to register and zero-extend the byte), which has slightly higher latency than a plain mov for uint32 and uint64.
The issue in your convolution is a memory bottleneck, not compute.
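If you want to convince yourself of that on your own machine (just a suggestion, not something from this thread), you can isolate the conversion and look at what ends up in the binary, e.g. with objdump -d after compiling with -d:danger:

# Tiny probe: the uint8 -> int32 conversion is only a zero-extension
# (a movzx-style instruction), not real arithmetic.
proc widen(x: uint8): int32 {.exportc.} = x.int32

when isMainModule:
  echo widen(200'u8)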
You are probably right. This is the first time I am doing this sort of thing, but my feeling is that even small modifications to the algorithm (avoiding conversions as much as possible, for example) have a significant impact. I think the reason is that I am iterating over 100_000 frames of 640x480 pixels.
For instance, I have just reached 1200fps simply by replacing how the row numbers are calculated. Changing this:
template clamp(val:cint, max_val:cint):untyped =
  min( max(val, 0), max_val).uint
.....
for row1 in 0..<height:
  let row0 = clamp(row1-1, h)
  let row2 = clamp(row1+1, h)
......
into:
iterator `...`[T](ini:T, `end`:T): tuple[a:uint, b:uint, c:uint] =
  let ini = ini.uint
  let `end` = `end`.uint
  yield (a:ini, b:ini, c:ini+1)
  for i in ini+1..`end`-1:
    yield (a:i-1, b:i, c:i+1)
  yield (a:`end`-1, b:`end`, c:`end`)
...
for (row0, row1, row2) in 0...h:
....
The calculation now looks like this:
#import math
import ../vapoursynth
import math

iterator `...`[T](ini:T, `end`:T): tuple[a:uint, b:uint, c:uint] =
  let ini = ini.uint
  let `end` = `end`.uint
  yield (a:ini, b:ini, c:ini+1)
  for i in ini+1..`end`-1:
    yield (a:i-1, b:i, c:i+1)
  yield (a:`end`-1, b:`end`, c:`end`)
proc apply_kernel*(src:ptr VSFrameRef, dst:ptr VSFrameRef, kernel:array[9, int32], mul:int, den:int) =
  #let n = (( math.sqrt(kernel.len.float).int - 1 ) / 2).int
  for pn in 0..<src.numPlanes:  # pn = Plane Number
    # These cost 60fps (if I take them outside of this loop)
    let height = src.height( pn )
    let width = src.width( pn )
    let h = (height-1)
    let w = (width-1)

    for (row0, row1, row2) in 0...h:
      let r0 = src[pn, row0]
      let r1 = src[pn, row1]
      let r2 = src[pn, row2]
      let w1 = dst[pn, row1]
      for (col0, col1, col2) in 0...w:
        let value:int32 = r0[col0].int32 + r0[col1].int32 * 2 + r0[col2].int32 +
                          r1[col0].int32 * 2 + r1[col1].int32 * 4 + r1[col2].int32 * 2 +
                          r2[col0].int32 + r2[col1].int32 * 2 + r2[col2].int32
        w1[col1] = (value / den).uint8
With this I have managed to go from the original 80fps to 1200fps, but I am still very far from what I should be able to reach.
The analogous C++ code (using float instead of int32) for reference is way faster:
auto srcp = reinterpret_cast<const float*>(vsapi->getReadPtr(frame, plane));
auto dstp = reinterpret_cast<float*>(vsapi->getWritePtr(dst, plane));
auto h = vsapi->getFrameHeight(frame, plane);
auto w = vsapi->getFrameWidth(frame, plane);

auto gauss = [&](auto y, auto x) {
    auto clamp = [&](auto val, auto bound) {
        return std::min(std::max(val, 0), bound - 1);
    };
    auto above   = srcp + clamp(y - 1, h) * w;
    auto current = srcp + y * w;
    auto below   = srcp + clamp(y + 1, h) * w;
    auto conv = above[clamp(x - 1, w)]       + above[x] * 2   + above[clamp(x + 1, w)] +
                current[clamp(x - 1, w)] * 2 + current[x] * 4 + current[clamp(x + 1, w)] * 2 +
                below[clamp(x - 1, w)]       + below[x] * 2   + below[clamp(x + 1, w)];
    return conv / 16;
};

for (auto y = 0; y < h; y++)
    for (auto x = 0; x < w; x++)
        (dstp + y * w)[x] = gauss(y, x);
When you mention a memory bottleneck, what do you mean? Where should I look?
Thanks a lot
If you have instructions to reproduce the benchmark in Nim and C++ I can help.
I just need a repo to clone, the scripts to run and the dataset. I already have Vapoursynth working.
Ideally you have a profiler like Intel VTune or Apple Instruments to dive into the assembly.
For example, this is my approach to debugging a performance issue: https://github.com/nim-lang/Nim/issues/9514
For memory bottlenecks it's a bit different, I use the roofline model as mentioned in my convolution optimization resources: https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md#computational-complexity
For example, I know that matrix multiplication and convolution can reach 90% of the peak CPU GFlop/s because their arithmetic intensity is high (i.e. you do over 10 operations (add/mul) per byte), so if you don't reach that performance it's because you are moving bytes around too much instead of computing with them.
The theoretical peak of your CPU is easy to compute:
CpuGHz * VectorWidth * InstrPerCycle * FlopPerInstr
For a CPU that supports AVX2 on float32 (so packing 8 float32) and can issue 2 fused multiply-adds (FMA) per cycle at 3 GHz, we have:
3 (GHz) * 8 (packed float32 in AVX) * 2 (FMA per cycle) * 2 (FMA = 1 add + 1 mul) = 96 GFlop/s
And then the usual way to benchmark a numerical algorithm is: you know the number of operations your algorithm requires, you divide that by the time spent doing them, and you get your actual Flop/s, which you compare with the theoretical peak. If you only reach, say, 20% of the peak, you have a memory bottleneck and probably need to repack your data before processing to optimize cache usage; if not, you need to look into SIMD vectorization, prefetching, ...
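To make that bookkeeping concrete, here is a minimal sketch in Nim. The CPU numbers are the assumed 3 GHz AVX2 example above; the 17 ops/pixel (9 muls + 8 adds for the 3x3 kernel) and the 29.53 s / 100_000 frames figure quoted later in this thread are only placeholders, plug in your own:

import std/strformat

# Assumed machine: 3 GHz, AVX2 (8 x float32), 2 FMA/cycle, 2 flops per FMA.
const
  cpuGHz      = 3.0
  vectorWidth = 8.0
  fmaPerCycle = 2.0
  flopPerFma  = 2.0
  peakGFlops  = cpuGHz * vectorWidth * fmaPerCycle * flopPerFma   # = 96 GFlop/s

proc reportEfficiency(totalFlop, seconds: float) =
  ## Compare measured throughput against the theoretical peak.
  let achieved = totalFlop / seconds / 1e9
  echo &"{achieved:.2f} GFlop/s -> {100.0 * achieved / peakGFlops:.1f}% of peak ({peakGFlops:.0f} GFlop/s)"

# Example: 100_000 frames of 640x480, ~17 ops per pixel, done in 29.53 s.
reportEfficiency(100_000.0 * 640.0 * 480.0 * 17.0, 29.53)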
All of that is quite complex, so what I can do is get you to the performance of the naive C++ implementation.
Going beyond that is something I want to do, but it's time-consuming and I feel it would be better to spend my time on an image processing compiler similar to what's discussed here: https://github.com/mratsim/Arraymancer/issues/347#issuecomment-459351890 and with a proof of concept here:
Nim stuff
I have updated the repository. I have a local package.json with:
[ { "name": "vapoursynth", "url": "https://github.com/mantielero/VapourSynth.nim", "method": "git", "tags": [ "video", "vapoursynth" ], "description": "Video processing based on Vapoursynth", "license": "BSD" } ]
With that you only need:
nimble install vapoursynth
The files to play with are custom_filter.nim, which makes use of DrawFrame.nim. You need these two files in the same folder; compile them like:
nim c -d:release -d:danger custom_filter
And in order to get the performance, just run it:
$ ./custom_filter
I will check how to do the same in C++.
C++ stuff
In order to test the C++ version:
$ g++ -Wall -O3 -shared -fPIC -I. -o libfilter.so GaussBlur.cxx
Then create a test_filter.vpy like:

import vapoursynth as vs
core = vs.get_core()
core.std.LoadPlugin(path='./libfilter.so')
core.std.SetMaxCPU('none')
clip = core.std.BlankClip(format=vs.GRAYS, length=100000, fpsnum=24000, fpsden=1001, keep=True)
clip = core.testc.GaussBlur(clip)
clip.set_output()
$ vspipe test_filter.vpy /dev/null
Output 100000 frames in 29.53 seconds (3386.27 fps)
So I am just getting 3386fps with the pure C++ version while I am getting 1200fps with the Nim version.
I am not comparing apples to apples here:
proc getPtr(frame:ptr VSFrameRef, plane:cint):uint =
  ## Retrieves an `uint` rather than the pointer in order to perform pointer arithmetic with it
  cast[uint]( API.getWritePtr(frame, plane) )

proc stride(frame:ptr VSFrameRef, plane:cint):uint =
  API.getStride( frame, plane ).uint

#type
#  Row = ptr UncheckedArray[uint8]

proc `[]`*(frame:ptr VSFrameRef, plane:cint, row:uint):ptr UncheckedArray[uint8] =
  let ini:uint = frame.getPtr(plane)  # Don't use other thing here
  let stride:uint = frame.getStride( plane )
  result = cast[ptr UncheckedArray[uint8]]( ini + row * stride )
  #assert(result != nil)

proc `[]`*(frame:ptr VSFrameRef, plane:cint, row:cint):ptr UncheckedArray[uint8] =
  let ini:uint = frame.getPtr(plane)  # Don't use other thing here
  let stride:uint = frame.getStride( plane )
  result = cast[ptr UncheckedArray[uint8]]( ini + row.uint * stride )

proc `[]`*(frame:ptr VSFrameRef, plane:cint ):Plane =
  let ini = frame.getPtr(plane)
  let stride = frame.stride(plane)
  return Plane(ini:ini, stride:stride)

proc `[]`*(plane:Plane, row:cint):Row =  #ptr UncheckedArray[uint8] =
  cast[ptr UncheckedArray[uint8]]( plane.ini + row.uint * plane.stride )
Instead, I suggest you use a scheme like the following: get the plane / query the API only once, and use a plane object that holds the metadata. This way the compiler can do inlining and loop hoisting on all those accesses.
The main issue with calling the VapourSynth API repeatedly is that all those function calls pollute your cache and require the compiler to move function parameters and results to and from the stack in an inner loop.
import ../src/vapoursynth

type Plane*[T: SomeNumber] = object
  # T depends on the frame format:
  # - sample_type (uint or float)
  # - bitsPerSample (8, 16, 32, 64)
  # The plane must be instantiated with
  # the proper uint8 ... 64, float32, float64
  data: ptr UncheckedArray[T]
  rows: int
  cols: int
  rowStride: int

func getPlane*(frame: ptr VSFrameRef, plane: int, T: typedesc[SomeInteger]): Plane[T] {.inline.} =
  ## Extract a plane
  result.data = cast[ptr UncheckedArray[T]](API.getWritePtr(frame, plane))
  result.rows = API.getFrameHeight(frame, plane)
  result.cols = API.getFrameWidth(frame, plane)
  result.rowStride = API.getStride(frame, plane)

template `[]`*[T](plane: Plane[T], row: int, col: int): T =
  plane.data[row * plane.rowStride + col]

template `[]=`*[T](plane: Plane[T], row: int, col: int, val: T) =
  plane.data[row * plane.rowStride + col] = val
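With those accessors your kernel would look something like the sketch below. It is only a sketch: it assumes it lives in the same module as Plane (the fields above are not exported), it assumes 8-bit samples as in the rest of the thread, and the borders are simply skipped instead of clamped:

proc gaussBlur(src, dst: ptr VSFrameRef, pn: int) =
  # Query the API once per plane, then index through the Plane object.
  let s = src.getPlane(pn, uint8)
  let d = dst.getPlane(pn, uint8)
  for r in 1 ..< s.rows - 1:
    for c in 1 ..< s.cols - 1:
      let v = s[r-1, c-1].int32     + s[r-1, c].int32 * 2 + s[r-1, c+1].int32 +
              s[r,   c-1].int32 * 2 + s[r,   c].int32 * 4 + s[r,   c+1].int32 * 2 +
              s[r+1, c-1].int32     + s[r+1, c].int32 * 2 + s[r+1, c+1].int32
      d[r, c] = uint8(v div 16)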
Regarding the iteration strategy, your scheme does 3 times more memory accesses than a basic scheme for each row and each column (9x in total). Instead you should iterate once per row/column and accumulate in place with dst[dstIdx] += src[srcIdx] * filter[filterIdx].
A starting point could be my basic convolution scheme in Laser: https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_direct_convolution.nim#L8-L73
It's for batches of images with a potentially different number of planes (called channels) between input and output (for example RGB --> grayscale), but the iteration should be faster than yours.
Note that here I tracked the indices manually instead of using an intermediate Plane object, but both methods should perform the same with an optimizing compiler.
proc conv2d_direct*[T](
      oim: var Tensor[T],   # Output images
      iim: Tensor[T],       # Input images
      ishape: TensorShape,  # Shape of a single image
      kernel: Tensor[T],    # filter kernel
      kshape: KernelShape,  # kernel shape (should be const)
      padding: Padding,     # Padding (should be const)
      strides: Strides      # Strides (should be const)
    ) =
  ## oim must be zero-initialized
  # Reminder: convolution deep learning == cross-correlation signal processing
  assert ishape.c == kshape.c_in
  let out_shape = conv2d_out_shape(ishape, kshape, padding, strides)
  assert oim.len == out_shape.n * out_shape.c * out_shape.h * out_shape.w

  let
    N = ishape.n
    H = ishape.h
    W = ishape.w
    outH = out_shape.h
    outW = out_shape.w
    kdata = cast[ptr UncheckedArray[T]](kernel[0].unsafeaddr)

  let # Should be const but padding.h causes problems and padding[0] indexing
      # doesn't work in a generic proc
    C_out = kshape.c_out
    C_in = kshape.c_in
    kH = kshape.kH
    kW = kshape.kW
    pH = padding.h
    pW = padding.w
    sH = strides.h
    sW = strides.w

  let odata = cast[ptr UncheckedArray[T]](oim[0].addr)
  let idata = cast[ptr UncheckedArray[T]](iim[0].unsafeaddr)

  const arithmetic_intensity = 12 # number of FLOP per byte for the convolution
  const flop_per_elem = arithmetic_intensity div sizeof(T)
  let parallelize = OMP_MEMORY_BOUND_GRAIN_SIZE div flop_per_elem < outH * outW

  for n in 0 ..< N:
    for co in 0 ..< C_out:
      for ci in 0 ..< C_in:
        # We parallelize over the image height to deal with cases
        # where we have a single image or a low number of channels
        # We assume no false sharing with a proper grain size
        omp_parallel_if(parallelize):
          omp_for(oh, outH, use_simd = true, nowait = true): # output height
            let ih = sH * oh                                  # input height
            for ow in 0 ..< outW:                             # output width
              let iw = sW * ow                                # input width
              # The following should be loop hoisted by the compiler
              let oidx = ow + outW * (oh + outH * (co + C_out * n)) # odata[n][co][oh][ow]
              # Entering conv kernel region
              for krow in 0 ..< kH:
                let row = ih + krow - pH
                if row <% H: # Unsigned '<' does 0 <= row < H.
                  for kcol in 0 ..< kW:
                    let col = iw + kcol - pW
                    if col <% W: # Unsigned '<' does 0 <= col < W.
                      let iidx = col + W * (row + H * (ci + C_in * n))      # idata[n][ci][row][col]
                      let kidx = kcol + kW * (krow + kH * (ci + C_in * co)) # kdata[co][ci][krow][kcol]
                      odata[oidx] += idata[iidx] * kdata[kidx]
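For the single-plane, 3x3 case of this thread, the same accumulate-over-the-kernel idea boils down to the reduced sketch below. It is only a sketch, using the Plane[T] accessors proposed above, the unsigned-compare trick for the borders, and a row-major kernel layout (border pixels get a partial sum here rather than clamped samples):

proc convolve3x3*[T](dst: Plane[T], src: Plane[T], kernel: array[9, int32], den: int32) =
  for r in 0 ..< src.rows:
    for c in 0 ..< src.cols:
      var acc: int32 = 0
      for kr in 0 ..< 3:
        let row = r + kr - 1
        if row <% src.rows:               # unsigned compare: 0 <= row < rows
          for kc in 0 ..< 3:
            let col = c + kc - 1
            if col <% src.cols:           # unsigned compare: 0 <= col < cols
              acc += src[row, col].int32 * kernel[kc + 3*kr]
      dst[r, c] = T(acc div den)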
I used both strategies as can be seen here:
proc `[]`*(frame:ptr VSFrameRef, plane:cint ):Plane =
  let ini = frame.getPtr(plane)
  let stride = frame.stride(plane)
  return Plane(ini:ini, stride:stride)
I will give it another shot as you suggest.
Regarding the triple iteration, I had realized that, but I don't want to change it (yet) until I get performance similar to the C++ version with a similar algorithm.
I tried to implement the filter with Arraymancer but I got worse results.
My short-term objective is to reach on the order of 3000fps using float32 and multithreading (staying close to the C++ algorithm).
The next step is to eliminate the triple loop over the rows and to see if I can reduce the number of times I call getStride and getPtr.
In the long term I'd like to try vectorization.
Well, now I am amazed, but in the opposite direction.
I need to check whether I have done something really wrong, but I have managed to get 6074fps using float32 (I was expecting 3000fps) and 10700fps using int32, just by taking advantage of multithreading.
I am aware of this. I am just scaffolding the library, mainly to avoid doing something fundamentally wrong (like reaching 80fps instead of >1000fps ;oP). In the future the library should be more general; in this case I take advantage of the fact that it is always 1 byte per sample. To be honest I don't know why the C++ version is using float: the API states that both the read and write pointers are uint8. Doing a floating point version is something that I should try in the future.
I'm pretty sure it's supposed to be void* and that sampleType and bitsPerSample tell you what to cast it to.
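If/when the wrapper becomes more general, the dispatch could look roughly like the sketch below. getFrameFormat, sampleType, bitsPerSample and stInteger mirror the VapourSynth C API, but the exact identifiers exposed by the Nim wrapper, and the doFilter proc, are assumptions here:

# Pick the element type from the frame format instead of hard-coding uint8.
let fmt = API.getFrameFormat(frame)
if fmt.sampleType == stInteger:
  case fmt.bitsPerSample
  of 8:  doFilter(frame.getPlane(pn, uint8))
  of 16: doFilter(frame.getPlane(pn, uint16))
  else:  discard
else:
  # 32-bit float frames would need getPlane's typedesc constraint
  # relaxed from SomeInteger to SomeNumber.
  discard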
Anyway, very nice to hear your improvements.
Arraymancer indexing via t[i, j] would be slow because:
I.e., hopefully in the future the way to specify a convolution will be similar to Halide (https://halide-lang.org/assets/lectures/Halide_CVPR_intro.pdf), and that would lead to optimal code:
proc convolve(A, F: Function): Function =
  # Generator for the convolution function
  var i, j: Domain
  var convH, convW: Function

  # Definition of the result function
  # Filter F = [1, 2, 1]
  # So the 2D application is:
  #   [[1, 2, 1],
  #    [2, 4, 2],
  #    [1, 2, 1]]
  convH[i, j] = A[i-1, j] * F[0] + A[i, j] * F[1] + A[i+1, j] * F[2]
  convW[i, j] = convH[i, j-1] * F[0] + convH[i, j] * F[1] + convH[i, j+1] * F[2]

  # Optimization of the definition - completely optional
  convH.compute_at(convW, j) # Compute the required region of convH before convW starts (i.e. merged)
  # convH.compute_root()     # Alternatively do the height convolve before the width convolve (i.e. 2 steps)
  convW.tile(i, ii, 64)      # Split i in tiles of size 64, with ii as the tile iterator
  convW.parallel(i)          # Iterating on tiles is done in parallel
  convW.tile(j, jj, 64)      # Split j in tiles of size 64 as well, with jj as the tile iterator
  convW.vectorize(jj, 8)     # Iterating on jj is done with size 8 vectors if possible (i.e. AVX on x86)

  return convW

generate convolve:
  proc foobar(a: Tensor[float32], filter: array[3, float32]): Tensor[float32]
(From Lux: https://github.com/numforge/laser/blob/d1e6ae61/laser/lux_compiler/lux_dsl.nim#L43-L58)
The importance of loop tiling (also called loop blocking) is highlighted in this simple transposition benchmark: https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/transpose/transpose_bench.nim
There is a factor of 3x between the simple parallel transposition (|| is the OpenMP parallel for loop) and the tiled versions below:
proc benchNaive(a: seq[float32], nb_samples: int) =
  var output = newSeq[float32](out_size)
  withCompilerOptimHints()
  let pa{.restrict.} = cast[ptr UncheckedArray[float32]](a[0].unsafeAddr)
  let po{.restrict.} = cast[ptr UncheckedArray[float32]](output[0].addr)

  bench("Naive transpose"):
    discard
  do:
    for i in `||`(0, M-1):
      for j in `||`(0, N-1, "simd"): # This only add "#pragma omp simd"
        po[i+j*M] = pa[j+i*N]
1D tiling
proc benchCacheBlocking(a: seq[float32], nb_samples: int) =
  var output = newSeq[float32](out_size)
  withCompilerOptimHints()
  let pa{.restrict.} = a[0].unsafeAddr
  let po{.restrict.} = output[0].addr
  const blck = 64

  bench("Cache blocking"):
    discard
  do:
    {.emit: """
    // No min function in C ...
    #define min(a,b) (((a)<(b))?(a):(b))

    #pragma omp parallel for
    for (int i = 0; i < `M`; i+=`blck`)
      for (int j = 0; j < `N`; ++j)
        #pragma omp simd
        for (int ii = i; ii < min(i+`blck`,`M`); ++ii)
          `po`[ii+j*`M`] = `pa`[j+ii*`N`];
    """.}
And 2D tiling
proc bench2Dtiling(a: seq[float32], nb_samples: int) =
  var output = newSeq[float32](out_size)
  withCompilerOptimHints()
  let pa{.restrict.} = a[0].unsafeAddr
  let po{.restrict.} = output[0].addr
  const blck = 128

  bench("2D Tiling"):
    discard
  do:
    {.emit: """
    // No min function in C ...
    #define min(a,b) (((a)<(b))?(a):(b))

    #pragma omp parallel for collapse(2)
    for (int i = 0; i < `M`; i+=`blck`)
      for (int j = 0; j < `N`; j+=`blck`)
        for (int ii = i; ii<i+`blck` && ii<`M`; ii++)
          #pragma omp simd
          for (int jj = j; jj<min(j+`blck`,`N`); jj++)
            `po`[ii+jj*`M`] = `pa`[jj+ii*`N`];
    """.}
And you can gain about 20% more performance with prefetching:
proc bench2DtilingExchangedPrefetch(a: seq[float32], nb_samples: int) =
  var output = newSeq[float32](out_size)
  withCompilerOptimHints()
  let pa{.restrict.} = a[0].unsafeAddr
  let po{.restrict.} = output[0].addr
  const blck = 32

  bench("2D Tiling + Prefetch - input row iteration"):
    discard
  do:
    {.emit: """
    #pragma omp parallel for collapse(2)
    for (int j = 0; j < `N`; j+=`blck`)
      for (int i = 0; i < `M`; i+=`blck`)
        for (int jj = j; jj<j+`blck` && jj<`N`; jj++) {
          #pragma omp simd
          for (int ii = i; ii<min(i+`blck`,`M`); ii++)
            `po`[ii+jj*`M`] = `pa`[jj+ii*`N`];
          __builtin_prefetch(&`pa`[(i+1)*`N`], 0, 1); // Prefetch read with low temporal locality
        }
    """.}
Most of the low-hanging fruit for accelerating convolution is similar to accelerating transposition, so I suggest you try to understand those benchmarks and techniques.
Beyond the convolution research I already linked in the previous post, if you really want to go down the rabbit hole, additional reads are:
- 6 layers of tiling, 2 for each of the M, N, K dimensions (since we do M-by-K * K-by-N -> M-by-N matrices).
- repacking the data into temporary storage and changing the data layout so that all tile accesses are contiguous on the M and N dimensions (i.e. transposing so that we access K-by-M and K-by-N).
- extra tiling for vectorization, and careful tracking of low-level register usage.
- all the tiling is carefully tuned to properly use the L1 and L2 caches and to limit TLB misses.
- prefetching
- in actual code: https://github.com/numforge/laser/blob/d1e6ae61/laser/primitives/matrix_multiplication/gemm_tiling.nim#L284-L305