I recently recompiled my app, ported it to Nimrod, FPC, etc., and ran it on my Core 2 Duo E6300 machine (64-bit Xubuntu 12.04.4, 5 GB RAM).
https://github.com/exhu/alimg/tree/master/nimrod -- any comments and improvements are welcome!
The results are measured in real seconds as reported by the time command:
All the sources are available at https://github.com/exhu/alimg/
The app is a picture dithering tool that optimizes images from 32-bit RGBA to 16-bit RGBA color. Originally written in Java, it was used to optimize textures for a commercial cell phone game.
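For context, the core of the task is a per-pixel quantization step roughly like the sketch below (illustration only, in Nim; this is not the repository's code, it assumes an RGBA4444 target layout, and it shows only the naive truncation, while the actual tool also dithers to reduce the resulting banding):

    import std/strutils  # toHex, for printing the packed pixel

    # Pack a 32-bit RGBA pixel into 16-bit RGBA4444 by keeping the
    # top 4 bits of each channel.
    proc quantize4444(r, g, b, a: uint8): uint16 =
      result = (uint16(r shr 4) shl 12) or
               (uint16(g shr 4) shl 8) or
               (uint16(b shr 4) shl 4) or
                uint16(a shr 4)

    when isMainModule:
      echo toHex(quantize4444(255, 128, 64, 255))  # F84F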
What are the results? The most recent running times are published here, built with the most recent Nimrod.
All the sources for FPC, C++, Nimrod, and Java are in the aforementioned git repository. To get the input data, you need a 32-bit RGBA PNG image (with transparent areas) converted into a .buf file with the https://github.com/exhu/alimg/tree/master/java/img2buf utility. I usually run the tests on an 800x600 PNG image.
Much more useful would be to use a profiler, tell us where the bottlenecks are and how the generated assembly looks for C++ and for Nimrod, and suggest which code sequences GCC's / Clang's optimizers dislike... I will do that eventually, but we also have 238 open bugs ...
For now, I can only guess that Java's bump-pointer allocator improves cache locality for your code, so Java ends up being fastest.
Real times, 5 runs per executable:

bufdither_fpc:           0m4.609s  0m5.246s  0m5.269s  0m4.933s  0m5.572s
bufdither_nimrod_clang:  0m6.047s  0m5.408s  0m5.381s  0m6.621s  0m5.396s
bufdither_nimrod_gcc:    0m5.721s  0m6.898s  0m6.709s  0m6.809s  0m5.827s
bufdither_cpp_clang:     0m5.467s  0m6.947s  0m5.605s  0m5.930s  0m7.221s
bufdither_cpp_gcc:       0m8.296s  0m8.209s  0m6.946s  0m7.948s  0m6.986s
bufdither_java:          0m3.366s  0m3.164s  0m4.031s  0m3.851s  0m3.806s
Each executable is run 5 times; you know, today's OSes can't give exact timings from a single run. It seems that Nimrod generates C code that gcc and clang like about equally. However, the standalone C++ code is treated differently by clang++ and g++. Java still wins here; the VM has indeed improved enormously. You can get the precompiled executables and runner script here: https://dl.dropboxusercontent.com/u/22124591/Programs/mytest.zip Just execute ./run to regenerate my results.
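If you prefer not to use the shell script, a rough equivalent of the timing loop looks like this (sketch only; the binary and input file names below are placeholders, not the actual contents of the zip):

    import std/[osproc, monotimes, times]

    # Placeholder command; substitute the binaries and .buf file from the zip.
    const cmd = "./bufdither_nimrod_gcc test.buf"

    for run in 1..5:
      let start = getMonoTime()
      discard execCmd(cmd)
      echo "run ", run, ": ", (getMonoTime() - start).inMilliseconds, " ms"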
@Mścigniew The other ones are not finished, not optimized, or just inadequate for the task (e.g. Python or Lua).
@leledumbo Those are 32-bit executables; I tested 64-bit ones.
@exhu
If something is not optimised, yes, it shouldn't be published. But I don't see a reason not to publish timings of what you consider inadequate. They are slow? OK. But how slow? And some people do prototypes of data crunching code in Python.
Those are 32-bit executables; I tested 64-bit ones.
Oops, OK. Unfortunately, I don't have a 64-bit OS installed :(
Java is great at small benchmarks but fails miserably in real applications, both in perceived performance and RAM usage =)
Can't agree more =)
Java is great at small benchmarks but fails miserably in real applications, both in perceived performance and RAM usage
It would be cool if you redid the benchmarks under various conditions (huge files, looping the program for an hour or so [not restarting it, to avoid startup costs, and avoiding GC passes], etc.). Write a script/program to monitor resource usage and print the results as graphs of how much CPU/RAM each version used over time; a starting point for the monitoring part is sketched below. Then we'd have a real (well... at least more meaningful in terms of analyzing long-term performance & resource consumption) metric rather than a small benchmark that means almost nothing.
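Something like this could be that starting point (Linux-only, since it reads /proc; the binary and input names are invented):

    import std/[osproc, os, strutils]

    # Launch the benchmark binary and sample its resident memory once per
    # second until it exits; each sample could go into a CSV for plotting.
    let p = startProcess("./bufdither_nimrod_gcc", args = ["test.buf"],
                         options = {poParentStreams})
    while p.running:
      for line in readFile("/proc/" & $p.processID & "/status").splitLines:
        if line.startsWith("VmRSS:"):
          echo line.strip()
      sleep(1000)
    echo "exit code: ", p.waitForExit()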
@NewGuy -- desktop speed is what matters to me; I'm not against Java on servers. I look at startup time and memory usage because I don't care about servers. A JIT cannot give you constant perceived performance; it is always jerky for a noticeable time.
@zahary, the methods are there to provide some level of abstraction; otherwise it all turns into unmaintainable, tightly packed, cryptic code. Well, templates can probably be useful in Nimrod here.
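For example (invented names, nothing from my repo), a template keeps the named abstraction but is expanded at compile time, so there is nothing to dispatch at run time:

    # A zero-cost abstraction: each call site compiles to the same code
    # as writing the expression inline by hand.
    template clampToByte(x: int): uint8 =
      uint8(max(0, min(255, x)))

    echo clampToByte(300)   # 255
    echo clampToByte(-17)   # 0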
If you are to believe Adobe, image libraries are better served by generic code:
http://www.boost.org/doc/libs/1_55_0/libs/gil/doc/index.html
Any code featuring dynamic dispatch is almost by definition more cryptic than the equivalent statically bound code. If the compiler can't figure out what to do when you use "go to definition", then it must be harder for a human to follow the code as well. Dynamic dispatch is used to enable polymorphic collections and to reduce the generated code size, not to aid comprehension.
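A small, made-up example of the difference (Nim sketch, nothing from the repository):

    type
      PixelSource = ref object of RootObj
      BufSource = ref object of PixelSource
        data: seq[uint8]

    # Dynamic dispatch: the body that runs is chosen at run time, so the
    # backend C compiler cannot inline it into a per-pixel loop.
    method getPixel(s: PixelSource; i: int): uint8 {.base.} = 0'u8
    method getPixel(s: BufSource; i: int): uint8 = s.data[i]

    # Static binding: resolved at compile time and trivially inlinable.
    proc getPixelFast(s: BufSource; i: int): uint8 {.inline.} = s.data[i]

    let src = BufSource(data: @[10'u8, 20, 30])
    echo src.getPixel(1)      # 20, via the method table
    echo src.getPixelFast(1)  # 20, statically bound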
OK, I optimized the Nimrod version to use all inline procs, no dynamic dispatch, etc. Now it runs as fast as the Java 6 version (3.6 sec Java, 3.7 sec Nimrod); however, Java 7 manages to run it in 2.5 sec =) Probably GCC's optimizer lags behind Java's...