Hi all, new user here, so apologies if I'm missing something obvious, or if I'm posting this question in the wrong place. I've been playing with an old (no updates in the last two years) project on github: https://github.com/jesvedberg/tpix
This is a Nim program that takes an image from a file and writes it to stdout in a particular format with special escape characters, which results in the kitty terminal emulator drawing the image on screen. It uses the pixie library (which I've heard good things about) for image processing. What has surprised me is how slow it is compared to viu, a program written in Rust that does the same thing. I'm wondering if image processing really is this much slower in Nim, or if (as wouldn't at all surprise me) I'm doing something wrong. Note that I am compiling with the -d:release flag on the latest version of Nim (2.0.2).
In greater detail, the program does the following:
1) Decode the input image from its file format.
2) Resize the image to fit the terminal.
3) Encode the resized image as a png.
4) Send the result to the terminal in chunks using kitty's escape codes.
This process takes ~180ms, whereas the viu program (which may not follow the same steps, I haven't looked at its source) does this in 30ms.
I tried using the profiler and discovered that a large majority of the time was in step 3, encoding the image as a png. This isn't necessary, as you can send raw pixel data to kitty instead. So I replaced this step with a simple function that turns the image's pixel data into a string. With this change, the program instead takes ~85ms.
So we got a major speed boost, but we're still taking far longer than viu. Rerunning the profiler suggests that most of the remaining time is in steps 1 and 2: decoding and resizing the input image. I don't think there's much I can do about those steps.
Any pointers to what I might be doing wrong or who I could ask would be greatly appreciated. Thanks!
Resizing can't be the only issue because if I tell it not to resize at all, it still takes 80-90ms, whereas the other program consistently takes 25-30ms regardless of size.
...actually, it's pretty interesting that the nim program takes 85ms with no resize, but it takes 30ms if I resize it to very small. That would suggest that the majority of the time isn't being spent on resizing or encoding the original image. That time deficit has to be going towards either generating the output or printing it to stdout, even though the profiler says most of the time isn't spent doing that. So now I'm doubting the profiler.
We tried to make pixie really fast; I doubt your problem is pixie's speed.
After looking at the viu program in Rust (sorry, I don't know Rust well), it looks like it uses the viuer crate to actually print the image. The way it appears to do it is to write the image to a tmp file and have the kitty terminal read it. So the entire viu program eventually just prints
echo "\x1b_Gf=32,s=100,v=100,c=10,r=10,a=T,t=t;path/to/file.png\x1b\\"
See: https://docs.rs/viuer/latest/src/viuer/printer/kitty.rs.html#135-145
A single-line program that just writes a short string to stdout will be faster than one that decodes/resizes/encodes and then sends it block by block through the terminal codes.
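In Nim the same trick is only a couple of lines (a rough sketch, not viu's actual code; note that kitty actually wants the file path base64-encoded in the payload):

import std/base64

# Let the terminal read the file itself: t=t marks the payload as a
# (temporary) file path, f=100 declares PNG data, a=T displays it.
stdout.write "\x1b_Gf=100,a=T,t=t;" & encode("path/to/file.png") & "\x1b\\"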
The code I have, in its current form, shares these first two steps with viu, but I suspect it doesn't do them as efficiently as it could. I know you're a primary pixie developer, so would you mind if I picked your brain on this? This is what the code is doing now:
1) Get a pixie Image from a file or from stdin: filename.readImage or stdin.readLine.readImage
2) Extract raw RGBA values from the pixie Image: let rawData = encode(imgData(img)) ## where imgData is a function that I wrote. It's very possible there's already a better way to do this?
proc imgData(img: Image): string =
  for d in img.data:
    result.add(char(d.r))
    result.add(char(d.g))
    result.add(char(d.b))
    result.add(char(d.a))
Thanks.
That loop
proc imgData(img: Image): string =
  for d in img.data:
    result.add(char(d.r))
    result.add(char(d.g))
    result.add(char(d.b))
    result.add(char(d.a))
is a crazy slow way of doing this.
If you just need raw RGBA values, just use image.data. If you need a pointer to it, use image.data[0].addr, and use copyMem if you need it copied somewhere.
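Something like this (an untested sketch) does it in a single copy:

import pixie

proc imgData(img: Image): string =
  # Each pixel in img.data is 4 bytes (RGBA) stored contiguously,
  # so the whole buffer can be copied with one copyMem.
  result = newString(img.data.len * 4)
  if img.data.len > 0:
    copyMem(result[0].addr, img.data[0].addr, result.len)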
The implementation details make an enormous difference here.
As pointed out by guzba, using a sequence the way you are using it here will be slow. It's a very common way to go about things if you are coming from a language like Python.
A very naive way of implementing a dynamically sized array would do something like this under the hood:
When you initialize an array, the CPU allocates memory for that new array.
When you add an item to the array, the CPU then allocates a new, larger block of memory, copies all the existing items into it, writes the new item, and deallocates the old block.
So every time a new item is added there is: one allocation, a copy of every existing item, and one deallocation.
As you can see that is a lot of work being done every time a new item is added.
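In code, the naive scheme above looks something like this (a toy illustration, not Nim's real seq implementation; real seqs reserve extra capacity so appends are amortized O(1)):

proc naiveAdd(arr: var seq[char], item: char) =
  var bigger = newSeq[char](arr.len + 1) # allocate a new, larger block
  for i in 0 ..< arr.len:                # copy every existing item
    bigger[i] = arr[i]
  bigger[arr.len] = item                 # write the new item
  arr = bigger                           # the old block is deallocated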
To avoid this you instead initialize the original array with a length equal to the number of items you need it to hold:
var pixelArray: array[imageWidth * imageHeight * 4, char] # array sizes must be compile-time constants in Nim
Here I'm just multiplying by 4 for the RGBA values.
This way there is only one memory allocation and zero deallocations.
If you want to use a sequence, you can initialize it with type and length using newSeq:
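var pixelSeq = newSeq[char](imageWidth * imageHeight * 4) # one up-front allocation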
I'm actually not sure if there is any performance penalty for using a seq instead of an array in Nim as long as you initialize with a length and don't delete from it.
But as guzba pointed out, you already have the pixels in the image array so this was only to add a little more explanation.
(Disclaimer: my explanation of the implementation of a dynamic length array here is just theoretical and only for a base understanding)
P.S. I told you treeform and guzba would get on it if you posted it here instead of Reddit :-)
This is a much better explanation of what I was trying to say.
One other tip: if you need to write raw pixels to a file, you can use the pointer + len version of writeFile along with the image.data[0].addr pointer and image.data.len * 4 as the byte length.
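For example (a quick sketch of the same idea using the stdlib's writeBuffer, which takes a pointer and a length; the input file name is just for illustration):

import pixie

let img = readImage("input.png")   # assumed input for the example
let f = open("pixels.raw", fmWrite)
# Write the raw RGBA bytes straight from pixie's pixel buffer,
# 4 bytes per pixel, with no intermediate string.
discard f.writeBuffer(img.data[0].addr, img.data.len * 4)
f.close()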
Thanks for the suggestions. I will play around with this over the weekend.
I had considered initializing the data structure at the desired final size, but then didn't do it. I guess what happened was that the Nim profiler told me this part of the code was taking up a pretty small percentage of the total time, so I just didn't worry about it. Later, when I had reason to distrust the profiler, I tried removing one step at a time, and it turned out this step takes nearly 25% of the total time on my test case (~20ms out of ~85ms); some of that might also be in the call to encode, I'd have to check. So some improvement here would be nice.
Hi. I wrote tpix. I haven't updated tpix in a long time for two reasons: the first is that I bought a MacBook Air and have mostly been using iTerm2 instead of Kitty since then, and the second is that tpix was more or less feature complete as far as I was concerned (though I have been considering adding support for the iTerm2 image protocol as well).
I didn't really spend much time optimizing tpix, but as you've shown there is definitely room to do so, especially when running it on a local computer. The big reason I never felt it was necessary is that my main use case was running it on remote systems, where transferring the data is generally much slower than any other step, so it made sense to me to keep things simple: just convert the image to a compressed png file, split it into chunks, and send it to Kitty.
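Roughly, the chunked transfer works like this (a from-memory sketch, not tpix's actual code; kitty takes at most 4096 bytes of base64 per escape code, with m=1 on every chunk except the last):

import std/base64

proc sendToKitty(png: string) =
  let payload = encode(png)   # kitty payloads are base64-encoded
  var i = 0
  while i < payload.len:
    let hi = min(i + 4096, payload.len)
    let m = if hi < payload.len: "1" else: "0"
    if i == 0:
      # the first chunk carries the control keys; f=100 means PNG data
      stdout.write "\x1b_Gf=100,a=T,m=" & m & ";" & payload[i ..< hi] & "\x1b\\"
    else:
      stdout.write "\x1b_Gm=" & m & ";" & payload[i ..< hi] & "\x1b\\"
    i = hi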
In your list of steps that tpix performs, you forgot to mention converting the data to base64. I suspect that step is faster than converting the image to a png, though I don't really know.
The conversion to base64 does take some time, but yeah, it's not the slowest operation.