Hey all, @guzba and I have been working on a 2D graphics library for an even bigger project. It's taken us some time, so we're excited to have a version 1.0!
Pixie is a 2D graphics library similar to Cairo and Skia written entirely in Nim.
https://github.com/treeform/pixie
Features:
Supports butt, round and square line caps as well as miter, bevel and round line joins for paths.
For blending, we support many different blend modes including Normal, Darken, Multiply, Overlay, etc. For masking, we support various modes including subtract and intersect.
Many operations are SIMD accelerated and we're happy with the performance we have today. It'll only get better from here too.
One interesting thing we are doing to test complex path fills, strokes and anti-aliasing is rendering some complete icon sets. Here are 5 of them:
Free Flat Color Icons: https://raw.githubusercontent.com/treeform/pixie/master/tests/images/svg/flat-color-icons.png
Simple Icons: https://raw.githubusercontent.com/treeform/pixie/master/tests/images/svg/simple-icons.png
Ionicons: https://raw.githubusercontent.com/treeform/pixie/master/tests/images/svg/ionicons.png
Tabler Icons: https://raw.githubusercontent.com/treeform/pixie/master/tests/images/svg/tabler-icons.png
Bootstrap: https://raw.githubusercontent.com/treeform/pixie/master/tests/images/svg/twbs-icons.png
What's cool is these entire repos of SVG files are loaded and then rasterized by 100% Nim code, then compressed as PNG, again using 100% Nim code. Fun stuff.
Regarding performance, I have seen that you use tiling from time to time https://github.com/treeform/pixie/blob/d459f26/src/pixie/images.nim#L656-L715 but not always https://github.com/treeform/pixie/blob/d459f26/src/pixie/images.nim#L376-L427
Simple tiling is an easy optimization that gives a 3x to 50x speedup by improving cache usage and optimizing data movement, which is always the bottleneck in stencil computations like blur or convolution. See several variations on transposition kernels:
Results https://github.com/numforge/laser/blob/d1e6ae6/benchmarks/transpose/transpose_bench.nim#L698-L817
# Collapsed OpenMP - input row iteration
# Collected 250 samples in 9.003 seconds
# Average time: 36.012 ms
# Stddev time: 1.561 ms
# Min time: 34.149 ms
# Max time: 50.441 ms
# Perf: 0.222 GMEMOPs/s
# ...
# 2D Tiling
# Collected 250 samples in 3.232 seconds
# Average time: 12.927 ms
# Stddev time: 0.782 ms
# Min time: 12.036 ms
# Max time: 19.736 ms
# Perf: 0.619 GMEMOPs/s
In terms of easy optimizations, for 2D blur for example, if your convolution kernel has central symmetry, consider using separable convolution, so you would apply
# | 1 2 1 |
# | 2 4 2 |
# | 1 2 1 |
as
#               | 1 |
# | 1 2 1 |  *  | 2 |
#               | 1 |
And this enables new optimizations by reusing registers (paper: https://hal.inria.fr/hal-01094906/document); some experiments: https://github.com/SciNim/impulse/blob/26e25e7/benchmarks/image_filters/filter2d_separable_parallel.nim#L73-L229
Unfortunately, implementing the absolute fastest convolution kernel is a huge timesink (theoretically I can improve my experiments by about 25x), but if at some point those become a large bottleneck, all my research is there, including optimization techniques for register, L1, L2 and TLB cache usage.
Feel free to ask me if you want help finding quick wins to accelerate processing of certain inputs. For example, I can help you check whether you are using 1%, 10% or over 90% of the CPU's capabilities: https://github.com/numforge/laser/blob/d1e6ae6/benchmarks/convolution/conv2d_bench.nim#L112-L170
You're right, we have not gotten around to SIMDing the blur operation yet. What do you mean by tiling? For blur we do the thing where we blur everything in the X direction, then in the Y direction. Instead of doing an NxN blur we do a 2N blur, which is already much faster. I think this is what you mean by your matrix diagram: we have decomposed the blur convolution kernel.
If you have a simpler, faster way to do blur we would take it! It would be great to optimize 2D shadow blur, as it might be the most common blur for 2D UIs, and it's just a single-channel 8-bit image mask.
Original (in C)
#pragma omp parallel for simd collapse(2)
for (int i = 0; i < M; i++)
  for (int j = 0; j < N; j++)
    po[i + j*M] = pa[j + i*N];
1D tiling, blck = 64
#pragma omp parallel for
for (int i = 0; i < M; i += blck)
  for (int j = 0; j < N; ++j)
    #pragma omp simd
    for (int ii = i; ii < min(i+blck, M); ++ii)
      po[ii + j*M] = pa[j + ii*N];
2D tiling
#pragma omp parallel for collapse(2)
for (int i = 0; i < M; i += blck)
  for (int j = 0; j < N; j += blck)
    for (int ii = i; ii < min(i+blck, M); ii++)
      #pragma omp simd
      for (int jj = j; jj < min(j+blck, N); jj++)
        po[ii + jj*M] = pa[jj + ii*N];
Regarding blurring X then Y: yes, and you can actually merge the two passes, along with further optimizations, but those are lower impact IIRC.
For fast blur the reference is Halide https://halide-lang.org/.
It seems your 3 C examples (from Halide?) do an image copy. I tried to port them to Nim:
import pixie, benchy

let
  w = 100
  h = 1000

var a = newImage(w, h)
var b = newImage(w, h)

var ad = cast[ptr UncheckedArray[uint32]](a.data[0].addr)
var bd = cast[ptr UncheckedArray[uint32]](b.data[0].addr)

timeIt "simple image":
  for y in 0 ..< h:
    for x in 0 ..< w:
      a[x, y] = b[x, y]
  keep(a)

timeIt "simple image unsafe":
  for y in 0 ..< h:
    for x in 0 ..< w:
      a.setRgbaUnsafe(x, y, b.getRgbaUnsafe(x, y))
  keep(a)

timeIt "simple unchecked array":
  for y in 0 ..< h:
    for x in 0 ..< w:
      ad[y * w + x] = bd[y * w + x]
  keep(a)

timeIt "1d tile":
  let blck = 64
  for y in 0 ..< h:
    for x in countup(0, w-1, blck):
      # 1d tile
      for bx in x ..< min(x+blck, w):
        ad[y * w + bx] = bd[y * w + bx]
  keep(a)

timeIt "2d tile":
  let blck = 64
  for y in countup(0, h-1, blck):
    for x in countup(0, w-1, blck):
      # 2d tile
      for by in y ..< min(y+blck, h):
        for bx in x ..< min(x+blck, w):
          ad[by * w + bx] = bd[by * w + bx]
  keep(a)
At first I was not getting any improvements, but then with the UncheckedArray I saw a huge speedup. I don't see the 1D or 2D tile going any faster, though. Do I need to use OpenMP with #pragma omp parallel to really see the tiling benefit?
nim c -r -d:danger --passC:"-march=native -O3" tiles.nim
name ............................... min time avg time std dv times
simple image ....................... 0.373 ms 0.378 ms ±0.007 x1000
simple image unsafe ................ 0.343 ms 0.347 ms ±0.004 x1000
simple unchecked array ............. 0.082 ms 0.083 ms ±0.002 x1000
1d tile ............................ 0.083 ms 0.089 ms ±0.003 x1000
2d tile ............................ 0.087 ms 0.090 ms ±0.003 x1000
No, that shouldn't be needed. But you need the image to be too big to fit in cache. I used 2000x4000 in my bench, so if everything you use is 32x32, obviously tiling doesn't help, but if you work on large wallpapers or photos it's worth it. It is especially useful if you get images that are 200x2000 or 2000x200 with equal probability, as one of them will be less cache-friendly if you tile in only one direction.
Yes, it just does a copy, to study and showcase that data movement alone is the bottleneck.
You can repro my benchmark with:
git clone https://github.com/numforge/laser
nim c -d:danger -r benchmarks/transpose/transpose_bench.nim
On my desktop machine with an overclocked i9-9980XE I get the following. GMEMOPs/s means billions of a[j, i] = b[i, j] per second (float32 memory copies per second):
Warmup: 0.9517 s, result 224 (displayed to avoid compiler optimizing warmup away)
A matrix shape: (M: 4000, N: 2000)
Output shape: (M: 2000, N: 4000)
Required number of operations: 8.000 millions
Required bytes: 32.000 MB
Arithmetic intensity: 0.250 FLOP/byte
Laser ForEachStrided
Collected 250 samples in 8.620 seconds
Average time: 34.181 ms
Stddev time: 0.504 ms
Min time: 33.855 ms
Max time: 39.033 ms
Perf: 0.234 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Naive transpose
Collected 250 samples in 7.587 seconds
Average time: 30.347 ms
Stddev time: 0.385 ms
Min time: 29.916 ms
Max time: 32.548 ms
Perf: 0.264 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Naive transpose - input row iteration
Collected 250 samples in 8.189 seconds
Average time: 32.757 ms
Stddev time: 0.326 ms
Min time: 32.342 ms
Max time: 34.886 ms
Perf: 0.244 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Collapsed OpenMP
Collected 250 samples in 7.422 seconds
Average time: 29.687 ms
Stddev time: 0.230 ms
Min time: 29.333 ms
Max time: 31.006 ms
Perf: 0.269 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Collapsed OpenMP - input row iteration
Collected 250 samples in 8.166 seconds
Average time: 32.665 ms
Stddev time: 0.265 ms
Min time: 32.331 ms
Max time: 34.674 ms
Perf: 0.245 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Cache blocking
Collected 250 samples in 1.669 seconds
Average time: 6.677 ms
Stddev time: 0.196 ms
Min time: 6.463 ms
Max time: 7.944 ms
Perf: 1.198 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Cache blocking - input row iteration
Collected 250 samples in 3.348 seconds
Average time: 13.392 ms
Stddev time: 0.102 ms
Min time: 13.243 ms
Max time: 14.154 ms
Perf: 0.597 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
2D Tiling
Collected 250 samples in 2.763 seconds
Average time: 11.052 ms
Stddev time: 0.450 ms
Min time: 10.762 ms
Max time: 14.452 ms
Perf: 0.724 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
2D Tiling - input row iteration
Collected 250 samples in 1.785 seconds
Average time: 7.139 ms
Stddev time: 0.169 ms
Min time: 6.863 ms
Max time: 8.109 ms
Perf: 1.121 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Cache blocking with Prefetch
Collected 250 samples in 1.881 seconds
Average time: 7.522 ms
Stddev time: 0.163 ms
Min time: 7.314 ms
Max time: 8.560 ms
Perf: 1.064 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
2D Tiling + Prefetch - input row iteration
Collected 250 samples in 2.090 seconds
Average time: 8.359 ms
Stddev time: 0.158 ms
Min time: 8.138 ms
Max time: 9.398 ms
Perf: 0.957 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
Production implementation
Collected 250 samples in 2.017 seconds
Average time: 8.068 ms
Stddev time: 0.343 ms
Min time: 7.804 ms
Max time: 10.085 ms
Perf: 0.992 GMEMOPs/s
Display output[1] to make sure it's not optimized away
0.7808474898338318
I guess we can continue the perf discussion on the Pixie tracker, because others might want to grab the stage to ask how to use Pixie, build it, or build stuff with it. Don't hesitate to ping me.
Update!
We have added GIF support. Now you can load those images from the 90s!
We have also made blurs way faster; @mratsim made us take a closer look at their performance. Right now we blur everything in the Y direction, then in the X direction. But it's much faster to read and write in the X direction because memory is linear, so the first Y-direction blur went against the "grain" of memory. If you just flip the first step 90 degrees, it aligns with the grain again! So we still blur in the Y and the X direction, but the first Y step is rotated for speed:
before
name ................................... min time avg time std dv runs
images blur ............................ 1113.121 ms 1122.303 ms ±9.992 x5
masks blur .............................. 386.647 ms 389.918 ms ±2.851 x13
after
images blur ............................. 834.241 ms 839.677 ms ±8.135 x6
masks blur .............................. 242.845 ms 244.885 ms ±1.822 x21
They are the same thing; we just came up with a better name and wanted to rename flippy, but also totally changed the API. Pixie is kind of flippy 2.0 with way more functionality. It started as flippy 2.0 but it ended up as flippy 20.0 :)
Don't use flippy, use pixie instead.