Just for the record, the C++ version is here, using templates. And this one uses the low-level API.
Let me try asking less general questions:
What is preferred: using UncheckedArray, or using these templates? I guess either is better than performing a memcpy into a Nim structure.
I would prefer UncheckedArray, since it's integrated directly into Nim and also because I personally like its semantics.
If you want to expose an API in a safer way, UncheckedArrays also have the advantage that they can be cast into openArrays.
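As a minimal sketch of that idea (names and buffer are mine, not from VapourSynth.nim): a raw buffer viewed through ptr UncheckedArray can be exposed as a bounds-delimited openArray via toOpenArray, without copying:

```nim
# Sketch: wrap a raw buffer as ptr UncheckedArray, then expose it
# safely as an openArray via toOpenArray (no copy is made).
var backing = [1'i32, 2, 3, 4]
let raw = cast[ptr UncheckedArray[int32]](addr backing[0])

proc total(a: openArray[int32]): int32 =
  for x in a:
    result += x

echo total(toOpenArray(raw, 0, backing.high))   # prints 10
```

The openArray view keeps the caller's API safe (length is known, iteration is bounds-delimited) while the UncheckedArray keeps the interop side zero-copy.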
At this stage I can only recommend NOT to use VapourSynth.nim. It is too slow (not because of Nim, but I cannot find out why).
Testing the convolution filter in Python:
import vapoursynth as vs
core = vs.get_core()
core.std.SetMaxCPU('none')
clip = core.std.BlankClip(format=vs.GRAYS, length=100000, fpsnum=24000, fpsden=1001, keep=True)
clip = core.std.Convolution(clip, matrix=[1,2,1,2,4,2,1,2,1])
clip.set_output()
So I get:
$ vspipe test.vpy /dev/null
Output 100000 frames in 26.73 seconds (3740.91 fps)
My version:
import ../vapoursynth
import options
BlankClip(format = pfGrayS.int.some,
          width = 640.some,
          height = 480.some,
          length = 100000.some,
          fpsnum = 24000.some,
          fpsden = 1001.some,
          keep = 1.some).Convolution(@[1.0, 2.0, 1.0, 2.0, 4.0, 2.0, 1.0, 2.0, 1.0]).Null
so:
$ nim c -f --gc:none -d:release -d:danger modifyframe
$ time ./modifyframe
real    0m37,872s
user    0m38,989s
sys     0m1,997s
which is 2640.47 fps.
On the other hand, you can create your own filters. In that regard, I have managed to apply a simple Gauss filter to 100000 frames in:
$ time ./modifyframe
real    8m25,425s
user    8m24,112s
sys     0m5,422s
which is 198 fps. Way too slow compared with the C++ version.
Please note that your first task would be to make it work with the default refc GC and with the ARC GC.
For performance you would never use a seq[seq] for your filter matrix, but a contiguous block of memory which lives in the cache permanently. And you would want to use SIMD instructions. For SIMD you may try to code it manually (I think we have a SIMD module provided by mratsim), or you can write Nim code that the Nim compiler turns into something for which the C compiler can emit SIMD instructions.
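The second approach can be sketched like this (a hypothetical example, not code from VapourSynth.nim): a tight, branch-free loop over contiguous float32 buffers that the C backend can usually auto-vectorize when compiled with something like --passC:"-O3 -march=native":

```nim
# Sketch: a SIMD-friendly loop. Contiguous memory, no branches, no
# bounds checks; the C compiler can auto-vectorize this at -O3.
proc scaleAdd(dst, src: ptr UncheckedArray[float32]; n: int; k: float32) =
  for i in 0 ..< n:
    dst[i] = dst[i] + src[i] * k   # fused multiply-add per element

var a = newSeq[float32](8)         # starts zeroed
var b = newSeq[float32](8)
for i in 0 ..< 8:
  b[i] = float32(i)
scaleAdd(cast[ptr UncheckedArray[float32]](addr a[0]),
         cast[ptr UncheckedArray[float32]](addr b[0]), 8, 2.0)
# a is now [0, 2, 4, 6, 8, 10, 12, 14]
```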
Working with C filters
When I work with C filters (like Convolution), it works fine with the refc GC. In that case, I don't know why I get 2640 fps vs 3740 fps in the Python version; they should be roughly the same. When I call a filter within an existing plugin, I just call one function:
...
result = API.invoke(plug, "Convolution".cstring, args)
I don't do any calculation with the frames. The only thing that differs from Python is that, in order to test the filter, they use vspipe. I created a Null filter in pure Nim which is really simple (it just asks for a frame, a pointer, and frees it):
proc Null*(vsmap: ptr VSMap) =
  let node = getFirstNode(vsmap)
  let vinfo = API.getVideoInfo(node)  # video info pointer
  for i in 0 ..< vinfo.numFrames:
    let frame = node.getFrame(i)
    #let frame = API.getFrame(i.cint, node, nil, 0.cint)
    freeFrame(frame)
  API.freeMap(vsmap)
  API.freeNode(node)
Working with Nim filters
I have two problems here:
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
which is not a very helpful message, so I don't know where to start.
This is the source of your slowness
let kernel = @[@[1, 2, 1],
@[2, 4, 2],
@[1, 2, 1] ]
Instead you should use
let kernel = [[1, 2, 1],
[2, 4, 2],
[1, 2, 1]]
You absolutely want all your frequently used data laid out contiguously in memory; seq[seq[T]] is the worst thing you can do for speed.
On current CPUs, the bottleneck in image-processing code is memory access: optimize your memory accesses and you will be fast.
Also, you want to use ptr UncheckedArray to avoid the bounds checks on every array access.
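A small sketch of that (hypothetical plane and names, not the VapourSynth.nim API): indexing a flat plane buffer through ptr UncheckedArray with an explicit stride emits no bounds check, which matters in a per-pixel inner loop.

```nim
# Sketch: flat row-major plane indexed as p[row * width + col].
# Unlike seq indexing, ptr UncheckedArray indexing emits no bounds check.
let width = 4
let height = 2
var plane = newSeq[uint8](width * height)
let p = cast[ptr UncheckedArray[uint8]](addr plane[0])

for row in 0 ..< height:
  for col in 0 ..< width:
    p[row * width + col] = uint8(row * width + col)
```

The trade-off is that you are responsible for staying in bounds yourself; the usual pattern is to validate the dimensions once, outside the hot loop.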
A basic convolution would be written this way: https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_direct_convolution.nim#L8-L73 (note that this is a convolution for a batch of images in NCHW format for N batch, C color channels, H height, W width)
In terms of performance it reaches about 1.5% of the theoretical peak.
Further speed improvements are much more involved; you can reach about 10% of the theoretical maximum by implementing your convolution as a matrix multiplication from an optimized BLAS library: https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_im2col.nim.
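The im2col trick behind that link can be sketched in a few lines (flat buffers, hypothetical helper name): each 3x3 patch of the image is unrolled into one row of a matrix, after which the whole convolution is a single matrix product that an optimized BLAS can run near peak.

```nim
# Sketch of im2col for a 3x3 kernel on a flat row-major image:
# each 3x3 patch becomes one 9-element row, producing an
# (outH*outW) x 9 matrix; convolution then reduces to a matmul.
proc im2col(img: openArray[float32]; h, w: int): seq[float32] =
  let outH = h - 2
  let outW = w - 2
  result = newSeqOfCap[float32](outH * outW * 9)
  for r in 0 ..< outH:
    for c in 0 ..< outW:
      for kr in 0 .. 2:
        for kc in 0 .. 2:
          result.add img[(r + kr) * w + (c + kc)]

var img = newSeq[float32](16)      # a 4x4 test image, values 0..15
for i in 0 ..< 16:
  img[i] = float32(i)
let cols = im2col(img, 4, 4)       # 2x2 output positions, 9 taps each
```

The cost is the extra memory traffic of materializing the patch matrix, which is why this lands around 10% of peak rather than higher.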
Achieving about 80%~90% of the theoretical maximum is something I haven't managed yet, but the steps are detailed here: https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md, in particular in the Intel paper Anatomy of High-performance Deep Learning Convolutions on SIMD Architecture (2018-08).
Looking into the C++ code, it looks like a naive implementation that would use about 1.5% of theoretical peak.
For the absolute max performance, you can reach over 110% of the theoretical peak by using Winograd convolutions, as your kernel is of shape 3x3. Winograd convolution "cheats" by exploiting this special 3x3 form to avoid doing useless operations, hence going over 100% of the theoretical CPU peak; a short high-level explanation is here: https://blog.usejournal.com/understanding-winograd-fast-convolution-a75458744ff
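The 1-D building block of that trick, F(2,3), fits in a few lines (my own sketch, following the standard Winograd formulas): it produces two outputs of a 3-tap convolution with 4 multiplications instead of 6, and the 2-D 3x3 case is built by nesting this transform.

```nim
# Sketch of Winograd F(2,3): two outputs of a 3-tap convolution
# using 4 multiplications instead of the naive 6.
proc winogradF23(d: array[4, float]; g: array[3, float]): (float, float) =
  let
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
  # y0 = d0*g0 + d1*g1 + d2*g2,  y1 = d1*g0 + d2*g1 + d3*g2
  (m1 + m2 + m3, m2 - m3 - m4)

let (y0, y1) = winogradF23([1.0, 2, 3, 4], [1.0, 2, 1])
# y0 = 1*1 + 2*2 + 3*1 = 8,  y1 = 2*1 + 3*2 + 4*1 = 12
```

In practice the kernel transform (the halved sums of g) is precomputed once per filter, so only the data-side additions and the 4 multiplies remain in the hot loop.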
I've been planning to write an image-processing and deep-learning compiler to make it easy to write high-performance deep learning and image-processing code, but I don't have enough time :/.
Still slow but getting better.
import ../vapoursynth
import math
template clamp(val: int, max_val: int): untyped =
  min(max(val, 0), max_val)

proc apply_kernel*(src: ptr VSFrameRef, dst: ptr VSFrameRef,
                   kernel: array[9, int32], mul: int, den: int) =
  let fi = API.getFrameFormat(src)  # format information
  let n = (math.sqrt(kernel.len.float).int - 1) div 2  # kernel radius
  for i in 0 ..< fi.numPlanes:
    var srcPlane = src[i]
    var dstPlane = dst[i]
    let height = srcPlane.height
    let width = srcPlane.width
    for row in 0 ..< height:
      for col in 0 ..< width:
        # Clamp neighbour coordinates at the plane borders
        let row0 = clamp(row - 1, height - 1)
        let row2 = clamp(row + 1, height - 1)
        let col0 = clamp(col - 1, width - 1)
        let col2 = clamp(col + 1, width - 1)
        let value: int32 =
          srcPlane[row0, col0]     + srcPlane[row0, col] * 2 + srcPlane[row0, col2] +
          srcPlane[row,  col0] * 2 + srcPlane[row,  col] * 4 + srcPlane[row,  col2] * 2 +
          srcPlane[row2, col0]     + srcPlane[row2, col] * 2 + srcPlane[row2, col2]
        dstPlane[row, col] = (value.int * mul / den).uint8
(I know I am not using kernel or n)