I recently recompiled my app, ported it to Nimrod, FPC, etc., and ran it on my Core 2 Duo E6300 machine (64-bit Xubuntu 12.04.4, 5 GB RAM).
https://github.com/exhu/alimg/tree/master/nimrod -- any comments and improvements are welcome!
The results are measured in real seconds as reported by the time command:
All the sources are available at https://github.com/exhu/alimg/
The app is a picture dithering tool that optimizes images from 32-bit RGBA to 16-bit RGBA color. Originally written in Java, it was used to optimize textures for a commercial cell phone game.
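For context, the core of the task is a per-pixel quantization step roughly like the sketch below (illustration only, in Nim; this is not the repository's code, it assumes an RGBA4444 target layout, and it shows only the naive truncation, while the actual tool also dithers to reduce the resulting banding):

    import std/strutils  # toHex, for printing the packed pixel

    # Pack a 32-bit RGBA pixel into 16-bit RGBA4444 by keeping the
    # top 4 bits of each channel.
    proc quantize4444(r, g, b, a: uint8): uint16 =
      result = (uint16(r shr 4) shl 12) or
               (uint16(g shr 4) shl 8) or
               (uint16(b shr 4) shl 4) or
                uint16(a shr 4)

    when isMainModule:
      echo toHex(quantize4444(255, 128, 64, 255))  # F84F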
What are the results? The most recent running times are published here, built with the most recent Nimrod.
All the sources for FPC, C++, Nimrod, and Java are in the aforementioned git repository. To get the input data, you need a 32-bit RGBA PNG image (with transparent areas) converted into a .buf file with the https://github.com/exhu/alimg/tree/master/java/img2buf utility. I usually run the tests on an 800x600 PNG image.
Much more useful would be to use a profiler, tell us where the bottlenecks are and how the generated assembly looks for C++ and for Nimrod, and suggest which code sequences GCC's / Clang's optimizers dislike... I will do that eventually, but we also have 238 open bugs ...
For now, I can only guess that Java's bump-pointer allocator improves cache locality for your code, so Java ends up being fastest.
Real times, 5 runs per executable:

bufdither_fpc:           0m4.609s  0m5.246s  0m5.269s  0m4.933s  0m5.572s
bufdither_nimrod_clang:  0m6.047s  0m5.408s  0m5.381s  0m6.621s  0m5.396s
bufdither_nimrod_gcc:    0m5.721s  0m6.898s  0m6.709s  0m6.809s  0m5.827s
bufdither_cpp_clang:     0m5.467s  0m6.947s  0m5.605s  0m5.930s  0m7.221s
bufdither_cpp_gcc:       0m8.296s  0m8.209s  0m6.946s  0m7.948s  0m6.986s
bufdither_java:          0m3.366s  0m3.164s  0m4.031s  0m3.851s  0m3.806s
Each executable is run 5 times; you know, today's OSes can't give exact timings from a single run. It seems that Nimrod generates C code that gcc and clang like about equally. However, the standalone C++ code is treated differently by clang++ and g++. Java still wins here; the VM has indeed improved enormously. You can get the precompiled executables and runner script here: https://dl.dropboxusercontent.com/u/22124591/Programs/mytest.zip Just execute ./run to regenerate my results.
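If you prefer not to use the shell script, a rough equivalent of the timing loop looks like this (sketch only; the binary and input file names below are placeholders, not the actual contents of the zip):

    import std/[osproc, monotimes, times]

    # Placeholder command; substitute the binaries and .buf file from the zip.
    const cmd = "./bufdither_nimrod_gcc test.buf"

    for run in 1..5:
      let start = getMonoTime()
      discard execCmd(cmd)
      echo "run ", run, ": ", (getMonoTime() - start).inMilliseconds, " ms"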
@Mścigniew The other ones are not finished, not optimized, or just inadequate for the task (e.g. Python or Lua).
@leledumbo Those are 32-bit executables; I tested 64-bit ones.
@exhu
If something is not optimised, yes, it shouldn't be published. But I don't see a reason not to publish timings of what you consider inadequate. They are slow? OK. But how slow? And some people do prototypes of data crunching code in Python.
Those are 32-bit executables; I tested 64-bit ones.
Oops, OK. Unfortunately, I don't have a 64-bit OS installed :(
Java is great at small benchmarks but fails miserably in real applications, both in perceived performance and RAM usage =)
Can't agree more =)
Java is great at small benchmarks but fails miserably in real applications, both in perceived performance and RAM usage
It would be cool if you redid the benchmarks under various conditions (huge files, looping the program for an hour or so [not restarting it, to avoid startup costs, and avoiding GC passes], etc.). Write a script/program to monitor resource usage and print the results as graphs of how much CPU/RAM each version used over time; a starting point for the monitoring part is sketched below. Then we'd have a real (well... at least more meaningful in terms of analyzing long-term performance & resource consumption) metric rather than a small benchmark that means almost nothing.
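Something like this could be that starting point (Linux-only, since it reads /proc; the binary and input names are invented):

    import std/[osproc, os, strutils]

    # Launch the benchmark binary and sample its resident memory once per
    # second until it exits; each sample could go into a CSV for plotting.
    let p = startProcess("./bufdither_nimrod_gcc", args = ["test.buf"],
                         options = {poParentStreams})
    while p.running:
      for line in readFile("/proc/" & $p.processID & "/status").splitLines:
        if line.startsWith("VmRSS:"):
          echo line.strip()
      sleep(1000)
    echo "exit code: ", p.waitForExit()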
@NewGuy -- desktop speed is what matters to me; I'm not against Java on servers. I look at startup time and memory usage because I don't care about servers. A JIT cannot give you constant perceived performance; it is always jerky for a noticeable time.
@zahary, the methods are there to provide some level of abstraction; otherwise it all turns into unmaintainable, tightly packed, cryptic code. Well, templates can probably be useful in Nimrod here.
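For example (invented names, nothing from my repo), a template keeps the named abstraction but is expanded at compile time, so there is nothing to dispatch at run time:

    # A zero-cost abstraction: each call site compiles to the same code
    # as writing the expression inline by hand.
    template clampToByte(x: int): uint8 =
      uint8(max(0, min(255, x)))

    echo clampToByte(300)   # 255
    echo clampToByte(-17)   # 0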
If you are to believe Adobe, image libraries are better served by generic code:
http://www.boost.org/doc/libs/1_55_0/libs/gil/doc/index.html
Any code featuring dynamic dispatch is almost by definition more cryptic than the equivalent statically bound code. If the compiler can't figure out what to do when you use "go to definition", then it must be harder for a human to follow the code as well. Dynamic dispatch is used to enable polymorphic collections and to reduce the generated code size, not to aid comprehension.
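A small, made-up example of the difference (Nim sketch, nothing from the repository):

    type
      PixelSource = ref object of RootObj
      BufSource = ref object of PixelSource
        data: seq[uint8]

    # Dynamic dispatch: the body that runs is chosen at run time, so the
    # backend C compiler cannot inline it into a per-pixel loop.
    method getPixel(s: PixelSource; i: int): uint8 {.base.} = 0'u8
    method getPixel(s: BufSource; i: int): uint8 = s.data[i]

    # Static binding: resolved at compile time and trivially inlinable.
    proc getPixelFast(s: BufSource; i: int): uint8 {.inline.} = s.data[i]

    let src = BufSource(data: @[10'u8, 20, 30])
    echo src.getPixel(1)      # 20, via the method table
    echo src.getPixelFast(1)  # 20, statically bound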
OK, I optimized the Nimrod version to use all inline procs, no dynamic dispatch, etc. Now it runs as fast as the Java 6 version (3.6 sec Java, 3.7 sec Nimrod); however, Java 7 manages to run it in 2.5 sec =) Probably GCC's optimizer lags behind Java's...