Just for the record, the C++ version is here, using templates. And this one uses the low-level API.
Let me try asking less general questions:
What is preferred: using UncheckedArray, or using these templates? I guess either is better than performing a memcpy into a Nim structure.
I would prefer UncheckedArray, since it's integrated directly into Nim and also because I personally like its semantics.
If you want to expose an API in a safer way, UncheckedArrays also have the advantage that they can be cast into openArrays.
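As a minimal sketch of that idea (names and buffer are mine, not from VapourSynth.nim): a raw buffer viewed through ptr UncheckedArray can be exposed as a bounds-delimited openArray via toOpenArray, without copying:

```nim
# Sketch: wrap a raw buffer as ptr UncheckedArray, then expose it
# safely as an openArray via toOpenArray (no copy is made).
var backing = [1'i32, 2, 3, 4]
let raw = cast[ptr UncheckedArray[int32]](addr backing[0])

proc total(a: openArray[int32]): int32 =
  for x in a:
    result += x

echo total(toOpenArray(raw, 0, backing.high))   # prints 10
```

The openArray view keeps the caller's API safe (length is known, iteration is bounds-delimited) while the UncheckedArray keeps the interop side zero-copy.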
At this stage I can only recommend NOT to use VapourSynth.nim. It is too slow (not because of Nim, but I cannot find out why).
Testing the convolution filter in Python:
import vapoursynth as vs
core = vs.get_core()
core.std.SetMaxCPU('none')
clip = core.std.BlankClip(format=vs.GRAYS, length=100000, fpsnum=24000, fpsden=1001, keep=True)
clip = core.std.Convolution(clip, matrix=[1,2,1,2,4,2,1,2,1])
clip.set_output()
So I get:
$ vspipe test.vpy /dev/null
Output 100000 frames in 26.73 seconds (3740.91 fps)
My version:
import ../vapoursynth
import options
BlankClip(format = pfGrayS.int.some,
          width = 640.some,
          height = 480.some,
          length = 100000.some,
          fpsnum = 24000.some,
          fpsden = 1001.some,
          keep = 1.some).Convolution(@[1.0, 2.0, 1.0, 2.0, 4.0, 2.0, 1.0, 2.0, 1.0]).Null
so:
$ nim c -f --gc:none -d:release -d:danger modifyframe
$ time ./modifyframe
real    0m37,872s
user    0m38,989s
sys     0m1,997s
which is 2640.47 fps.
On the other hand, you can create your own filters. In that regard, I have managed to apply a simple Gauss filter to 100000 frames in:
$ time ./modifyframe
real    8m25,425s
user    8m24,112s
sys     0m5,422s
which is 198 fps. Way too slow compared with the C++ version.
Please note that your first task would be to make it work with the default refc GC and with the ARC GC.
For performance you would never use a seq[seq] for your filter matrix, but a contiguous block of memory which lives in the cache permanently. And you would want to use SIMD instructions. For SIMD you may try to code it manually (I think we have a SIMD module provided by mratsim), or you can write Nim code that the Nim compiler turns into something for which the C compiler can emit SIMD instructions.
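The second approach can be sketched like this (a hypothetical example, not code from VapourSynth.nim): a tight, branch-free loop over contiguous float32 buffers that the C backend can usually auto-vectorize when compiled with something like --passC:"-O3 -march=native":

```nim
# Sketch: a SIMD-friendly loop. Contiguous memory, no branches, no
# bounds checks; the C compiler can auto-vectorize this at -O3.
proc scaleAdd(dst, src: ptr UncheckedArray[float32]; n: int; k: float32) =
  for i in 0 ..< n:
    dst[i] = dst[i] + src[i] * k   # fused multiply-add per element

var a = newSeq[float32](8)         # starts zeroed
var b = newSeq[float32](8)
for i in 0 ..< 8:
  b[i] = float32(i)
scaleAdd(cast[ptr UncheckedArray[float32]](addr a[0]),
         cast[ptr UncheckedArray[float32]](addr b[0]), 8, 2.0)
# a is now [0, 2, 4, 6, 8, 10, 12, 14]
```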
Working with C filters
When I work with C filters (like Convolution), it works fine with the refc GC. In that case, I don't know why I get 2640 fps vs 3740 fps in the Python version; they should be roughly the same. When I call a filter within an existing plugin, I just call one function:
...
result = API.invoke(plug, "Convolution".cstring, args)
I don't do any calculation with the frames. The only thing that differs from Python is that, in order to test the filter, they use vspipe. I created a Null filter in pure Nim which is really simple (it just asks for a frame, a pointer, and frees it):
proc Null*(vsmap: ptr VSMap) =
  let node = getFirstNode(vsmap)
  let vinfo = API.getVideoInfo(node)  # video info pointer
  for i in 0 ..< vinfo.numFrames:
    let frame = node.getFrame(i)
    #let frame = API.getFrame(i.cint, node, nil, 0.cint)
    freeFrame(frame)
  API.freeMap(vsmap)
  API.freeNode(node)
Working with Nim filters
I have two problems here:
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
which is not a very helpful message, so I don't know where to start.
This is the source of your slowness
let kernel = @[@[1, 2, 1],
@[2, 4, 2],
@[1, 2, 1] ]
Instead you should use
let kernel = [[1, 2, 1],
[2, 4, 2],
[1, 2, 1]]
You absolutely want all your frequently used data laid out contiguously in memory; seq[seq[T]] is the worst thing you can do for speed.
On current CPUs, the bottleneck in image-processing code is memory access: optimize your memory accesses and you will be fast.
Also, you want to use ptr UncheckedArray to avoid the bounds checks on every array access.
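A small sketch of that (hypothetical plane and names, not the VapourSynth.nim API): indexing a flat plane buffer through ptr UncheckedArray with an explicit stride emits no bounds check, which matters in a per-pixel inner loop.

```nim
# Sketch: flat row-major plane indexed as p[row * width + col].
# Unlike seq indexing, ptr UncheckedArray indexing emits no bounds check.
let width = 4
let height = 2
var plane = newSeq[uint8](width * height)
let p = cast[ptr UncheckedArray[uint8]](addr plane[0])

for row in 0 ..< height:
  for col in 0 ..< width:
    p[row * width + col] = uint8(row * width + col)
```

The trade-off is that you are responsible for staying in bounds yourself; the usual pattern is to validate the dimensions once, outside the hot loop.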
A basic convolution would be written this way: https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_direct_convolution.nim#L8-L73 (note that this is a convolution for a batch of images in NCHW format for N batch, C color channels, H height, W width)
In terms of performance it reaches about 1.5% of the theoretical peak.
Further speed improvements are much more involved; you can reach about 10% of the theoretical maximum by implementing your convolution as a matrix multiplication from an optimized BLAS library: https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_im2col.nim.
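The im2col trick behind that link can be sketched in a few lines (flat buffers, hypothetical helper name): each 3x3 patch of the image is unrolled into one row of a matrix, after which the whole convolution is a single matrix product that an optimized BLAS can run near peak.

```nim
# Sketch of im2col for a 3x3 kernel on a flat row-major image:
# each 3x3 patch becomes one 9-element row, producing an
# (outH*outW) x 9 matrix; convolution then reduces to a matmul.
proc im2col(img: openArray[float32]; h, w: int): seq[float32] =
  let outH = h - 2
  let outW = w - 2
  result = newSeqOfCap[float32](outH * outW * 9)
  for r in 0 ..< outH:
    for c in 0 ..< outW:
      for kr in 0 .. 2:
        for kc in 0 .. 2:
          result.add img[(r + kr) * w + (c + kc)]

var img = newSeq[float32](16)      # a 4x4 test image, values 0..15
for i in 0 ..< 16:
  img[i] = float32(i)
let cols = im2col(img, 4, 4)       # 2x2 output positions, 9 taps each
```

The cost is the extra memory traffic of materializing the patch matrix, which is why this lands around 10% of peak rather than higher.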
Achieving about 80%~90% of the theoretical maximum is something I haven't managed yet, but the steps are detailed here: https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md, in particular in the Intel paper Anatomy of High-performance Deep Learning Convolutions on SIMD Architecture (2018-08).
Looking into the C++ code, it looks like a naive implementation that would use about 1.5% of theoretical peak.
For the absolute max performance, you can reach over 110% of the theoretical peak by using Winograd convolutions, as your kernel is of shape 3x3. Winograd convolution "cheats" by exploiting this special 3x3 form to avoid doing useless operations, hence going over 100% of the theoretical CPU peak; a short high-level explanation is here: https://blog.usejournal.com/understanding-winograd-fast-convolution-a75458744ff
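The 1-D building block of that trick, F(2,3), fits in a few lines (my own sketch, following the standard Winograd formulas): it produces two outputs of a 3-tap convolution with 4 multiplications instead of 6, and the 2-D 3x3 case is built by nesting this transform.

```nim
# Sketch of Winograd F(2,3): two outputs of a 3-tap convolution
# using 4 multiplications instead of the naive 6.
proc winogradF23(d: array[4, float]; g: array[3, float]): (float, float) =
  let
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
  # y0 = d0*g0 + d1*g1 + d2*g2,  y1 = d1*g0 + d2*g1 + d3*g2
  (m1 + m2 + m3, m2 - m3 - m4)

let (y0, y1) = winogradF23([1.0, 2, 3, 4], [1.0, 2, 1])
# y0 = 1*1 + 2*2 + 3*1 = 8,  y1 = 2*1 + 3*2 + 4*1 = 12
```

In practice the kernel transform (the halved sums of g) is precomputed once per filter, so only the data-side additions and the 4 multiplies remain in the hot loop.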
I've been planning to write an image-processing and deep-learning compiler to make it easy to write high-performance deep learning and image-processing code, but I don't have enough time :/.
Still slow but getting better.
import ../vapoursynth
import math
template clamp(val: int, max_val: int): untyped =
  min(max(val, 0), max_val)

proc apply_kernel*(src: ptr VSFrameRef, dst: ptr VSFrameRef,
                   kernel: array[9, int32], mul: int, den: int) =
  let fi = API.getFrameFormat(src)  # format information
  let n = (math.sqrt(kernel.len.float).int - 1) div 2  # kernel radius
  for i in 0 ..< fi.numPlanes:
    var srcPlane = src[i]
    var dstPlane = dst[i]
    let height = srcPlane.height
    let width = srcPlane.width
    for row in 0 ..< height:
      for col in 0 ..< width:
        # Clamp neighbour coordinates at the plane borders
        let row0 = clamp(row - 1, height - 1)
        let row2 = clamp(row + 1, height - 1)
        let col0 = clamp(col - 1, width - 1)
        let col2 = clamp(col + 1, width - 1)
        let value: int32 =
          srcPlane[row0, col0]     + srcPlane[row0, col] * 2 + srcPlane[row0, col2] +
          srcPlane[row,  col0] * 2 + srcPlane[row,  col] * 4 + srcPlane[row,  col2] * 2 +
          srcPlane[row2, col0]     + srcPlane[row2, col] * 2 + srcPlane[row2, col2]
        dstPlane[row, col] = (value.int * mul / den).uint8
(I know I am not using kernel or n)