Hey all, @guzba and I have been working on a 2D graphics library for an even bigger project. It's taken us some time, so we're excited to have a version 1.0!
Pixie is a 2D graphics library similar to Cairo and Skia written entirely in Nim.
https://github.com/treeform/pixie
Features:
Supports butt, round and square line caps as well as miter, bevel and round line joins for paths.
For blending, we support many different blend modes including Normal, Darken, Multiply, Overlay, etc. For masking, we support various modes including subtract and intersect.
Many operations are SIMD accelerated and we're happy with the performance we have today. It'll only get better from here too.
One interesting thing we are doing to test complex path fills, strokes and anti-aliasing is rendering some complete icon sets. Here are 5 of them:
Free Flat Color Icons: https://raw.githubusercontent.com/treeform/pixie/master/tests/images/svg/flat-color-icons.png
Simple Icons: https://raw.githubusercontent.com/treeform/pixie/master/tests/images/svg/simple-icons.png
Ionicons: https://raw.githubusercontent.com/treeform/pixie/master/tests/images/svg/ionicons.png
Tabler Icons: https://raw.githubusercontent.com/treeform/pixie/master/tests/images/svg/tabler-icons.png
Bootstrap: https://raw.githubusercontent.com/treeform/pixie/master/tests/images/svg/twbs-icons.png
What's cool is these entire repos of SVG files are loaded and then rasterized by 100% Nim code, then compressed as PNG, again using 100% Nim code. Fun stuff.
Regarding performance, I have seen that you use tiling from time to time https://github.com/treeform/pixie/blob/d459f26/src/pixie/images.nim#L656-L715 but not always https://github.com/treeform/pixie/blob/d459f26/src/pixie/images.nim#L376-L427
Simple tiling is an easy optimization that gives a 3x to 50x speedup by improving cache usage and optimizing data movement, which is always the bottleneck in stencil computations like blur or convolution. See several variations on transposition kernels:
Results https://github.com/numforge/laser/blob/d1e6ae6/benchmarks/transpose/transpose_bench.nim#L698-L817
# Collapsed OpenMP - input row iteration
# Collected 250 samples in 9.003 seconds
# Average time: 36.012 ms
# Stddev time: 1.561 ms
# Min time: 34.149 ms
# Max time: 50.441 ms
# Perf: 0.222 GMEMOPs/s
# ...
# 2D Tiling
# Collected 250 samples in 3.232 seconds
# Average time: 12.927 ms
# Stddev time: 0.782 ms
# Min time: 12.036 ms
# Max time: 19.736 ms
# Perf: 0.619 GMEMOPs/s
In terms of easy optimizations, for 2D blur for example, if your convolution kernel has central symmetry, consider using separable convolution, so you would apply
# | 1 2 1 |
# | 2 4 2 |
# | 1 2 1 |
as
#               | 1 |
# | 1 2 1 |  *  | 2 |
#               | 1 |
And this enables new optimizations by reusing registers (paper: https://hal.inria.fr/hal-01094906/document); some experiments: https://github.com/SciNim/impulse/blob/26e25e7/benchmarks/image_filters/filter2d_separable_parallel.nim#L73-L229
Unfortunately, implementing the absolute fastest convolution kernel is a huge timesink (theoretically I can improve my experiments by about 25x), but if at some point those become a large bottleneck, all my research is there, including optimization techniques for register, L1, L2 and TLB cache usage.
Feel free to ask me if you want help finding quick wins to accelerate processing of certain inputs. For example, I can help you check whether you are using 1%, 10% or over 90% of the CPU's capabilities: https://github.com/numforge/laser/blob/d1e6ae6/benchmarks/convolution/conv2d_bench.nim#L112-L170
You're right, we have not gotten around to SIMDing the blur operation yet. What do you mean by tiling? For blur we do the thing where we blur everything in the X direction, then in the Y direction. Instead of doing an NxN blur we do a 2N blur, which is already much faster. I think this is what you mean by your matrix diagram: we have decomposed the blur convolution kernel.
If you have a simpler, faster way to do blur we would take it! It would be great to optimize 2D shadow blur, as it might be the most common blur for 2D UIs, and it's just a single-channel 8-bit image mask.
Original (in C)
#pragma omp parallel for simd collapse(2)
for (int i = 0; i < M; i++)
  for (int j = 0; j < N; j++)
    po[i + j*M] = pa[j + i*N];
1D tiling, blck = 64
#pragma omp parallel for
for (int i = 0; i < M; i += blck)
  for (int j = 0; j < N; ++j)
    #pragma omp simd
    for (int ii = i; ii < min(i+blck, M); ++ii)
      po[ii + j*M] = pa[j + ii*N];
2D tiling
#pragma omp parallel for collapse(2)
for (int i = 0; i < M; i += blck)
  for (int j = 0; j < N; j += blck)
    for (int ii = i; ii < min(i+blck, M); ii++)
      #pragma omp simd
      for (int jj = j; jj < min(j+blck, N); jj++)
        po[ii + jj*M] = pa[jj + ii*N];
Regarding blurring X then Y: yes, and you can actually merge the two passes, along with further optimizations, but those are lower impact IIRC.
For fast blur the reference is Halide https://halide-lang.org/.
It seems your 3 C examples (from Halide?) do an image copy. I tried to port them to Nim:
import pixie, benchy

let
  w = 100
  h = 1000

var a = newImage(w, h)
var b = newImage(w, h)

var ad = cast[ptr UncheckedArray[uint32]](a.data[0].addr)
var bd = cast[ptr UncheckedArray[uint32]](b.data[0].addr)

timeIt "simple image":
  for y in 0 ..< h:
    for x in 0 ..< w:
      a[x, y] = b[x, y]
  keep(a)

timeIt "simple image unsafe":
  for y in 0 ..< h:
    for x in 0 ..< w:
      a.setRgbaUnsafe(x, y, b.getRgbaUnsafe(x, y))
  keep(a)

timeIt "simple unchecked array":
  for y in 0 ..< h:
    for x in 0 ..< w:
      ad[y * w + x] = bd[y * w + x]
  keep(a)

timeIt "1d tile":
  let blck = 64
  for y in 0 ..< h:
    for x in countup(0, w-1, blck):
      # 1d tile
      for bx in x ..< min(x+blck, w):
        ad[y * w + bx] = bd[y * w + bx]
  keep(a)

timeIt "2d tile":
  let blck = 64
  for y in countup(0, h-1, blck):
    for x in countup(0, w-1, blck):
      # 2d tile
      for by in y ..< min(y+blck, h):
        for bx in x ..< min(x+blck, w):
          ad[by * w + bx] = bd[by * w + bx]
  keep(a)
At first I was not getting any improvements, but then with the UncheckedArray I saw a huge speedup. I don't see the 1D or 2D tile going any faster, though. Do I need to use OpenMP with #pragma omp parallel to really see the tiling benefit?
nim c -r -d:danger --passC:"-march=native -O3" tiles.nim
name ............................... min time avg time std dv times
simple image ....................... 0.373 ms 0.378 ms ±0.007 x1000
simple image unsafe ................ 0.343 ms 0.347 ms ±0.004 x1000
simple unchecked array ............. 0.082 ms 0.083 ms ±0.002 x1000
1d tile ............................ 0.083 ms 0.089 ms ±0.003 x1000
2d tile ............................ 0.087 ms 0.090 ms ±0.003 x1000
No, that shouldn't be needed. But you need the image to be too big to fit in cache. I used 2000x4000 in my bench, so if everything you use is 32x32, obviously tiling doesn't help, but if you work on large wallpapers or photos it's worth it. It is especially useful if you get images that are 200x2000 or 2000x200 with equal probability, as one of them will be less cache-friendly if you tile in only one direction.
Yes, it just does a copy, to study and showcase that data movement alone is the bottleneck.
You can repro my benchmark with:
git clone https://github.com/numforge/laser
nim c -d:danger -r benchmarks/transpose/transpose_bench.nim
On my desktop machine with an overclocked i9-9980XE I get the following. GMEMOPs/s means billions of a[j, i] = b[i, j] per second (float32 memory copies per second):
Warmup: 0.9517 s, result 224 (displayed to avoid compiler optimizing warmup away)
A matrix shape: (M: 4000, N: 2000)
Output shape: (M: 2000, N: 4000)
Required number of operations: 8.000 millions
Required bytes: 32.000 MB
Arithmetic intensity: 0.250 FLOP/byte
Laser ForEachStrided
Collected 250 samples in 8.620 seconds
Average time: 34.181 ms
Stddev time: 0.504 ms
Min time: 33.855 ms
Max time: 39.033 ms
Perf: 0.234 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Naive transpose
Collected 250 samples in 7.587 seconds
Average time: 30.347 ms
Stddev time: 0.385 ms
Min time: 29.916 ms
Max time: 32.548 ms
Perf: 0.264 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Naive transpose - input row iteration
Collected 250 samples in 8.189 seconds
Average time: 32.757 ms
Stddev time: 0.326 ms
Min time: 32.342 ms
Max time: 34.886 ms
Perf: 0.244 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Collapsed OpenMP
Collected 250 samples in 7.422 seconds
Average time: 29.687 ms
Stddev time: 0.230 ms
Min time: 29.333 ms
Max time: 31.006 ms
Perf: 0.269 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Collapsed OpenMP - input row iteration
Collected 250 samples in 8.166 seconds
Average time: 32.665 ms
Stddev time: 0.265 ms
Min time: 32.331 ms
Max time: 34.674 ms
Perf: 0.245 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Cache blocking
Collected 250 samples in 1.669 seconds
Average time: 6.677 ms
Stddev time: 0.196 ms
Min time: 6.463 ms
Max time: 7.944 ms
Perf: 1.198 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Cache blocking - input row iteration
Collected 250 samples in 3.348 seconds
Average time: 13.392 ms
Stddev time: 0.102 ms
Min time: 13.243 ms
Max time: 14.154 ms
Perf: 0.597 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
2D Tiling
Collected 250 samples in 2.763 seconds
Average time: 11.052 ms
Stddev time: 0.450 ms
Min time: 10.762 ms
Max time: 14.452 ms
Perf: 0.724 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
2D Tiling - input row iteration
Collected 250 samples in 1.785 seconds
Average time: 7.139 ms
Stddev time: 0.169 ms
Min time: 6.863 ms
Max time: 8.109 ms
Perf: 1.121 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Cache blocking with Prefetch
Collected 250 samples in 1.881 seconds
Average time: 7.522 ms
Stddev time: 0.163 ms
Min time: 7.314 ms
Max time: 8.560 ms
Perf: 1.064 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
2D Tiling + Prefetch - input row iteration
Collected 250 samples in 2.090 seconds
Average time: 8.359 ms
Stddev time: 0.158 ms
Min time: 8.138 ms
Max time: 9.398 ms
Perf: 0.957 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Production implementation
Collected 250 samples in 2.017 seconds
Average time: 8.068 ms
Stddev time: 0.343 ms
Min time: 7.804 ms
Max time: 10.085 ms
Perf: 0.992 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
I guess we can continue the perf discussion on the Pixie tracker, because others might want to grab the stage to ask how to use Pixie, build it, or build stuff with it. Don't hesitate to ping me.
Update!
We have added GIF support. Now you can load those images from the 90s!
We have also made blurs way faster; @mratsim made us take a closer look at their performance. Right now we blur everything in the Y direction, then in the X direction. But it's much faster to read and write in the X direction because memory is linear, so the first Y-direction blur went against the "grain" of memory. If you just flip the first step 90 degrees, it aligns with the grain again! So we still blur in the Y and the X direction, but the first Y step is rotated for speed:
before
name ................................... min time avg time std dv runs
images blur ............................ 1113.121 ms 1122.303 ms ±9.992 x5
masks blur .............................. 386.647 ms 389.918 ms ±2.851 x13
after
images blur ............................. 834.241 ms 839.677 ms ±8.135 x6
masks blur .............................. 242.845 ms 244.885 ms ±1.822 x21
They are the same thing; we just came up with a better name and wanted to rename flippy, but also totally changed the API. Pixie is kind of flippy 2.0 with way more functionality. It started as flippy 2.0 but it ended up as flippy 20.0 :)
Don't use flippy, use pixie instead.