nimforum mirror - Help me optimize this small Nim port to the speed of the original C version

r3c [2019-08-25T17:46:37+02:00]

I'm almost finished porting the original Quake map compiling tools from 1996, written in C.

To be more precise, it's the tool that calculates the lighting for the map. It's nothing special, just a straight 1:1 port. After initial benchmarks I ran on various maps, it turned out that the Nim version is almost twice as slow as the original C version.

I found the hotspot: it's in the function called TestRay. Basically, it checks whether a straight line can be drawn between a light and a sample point on a face of the map. To check whether a blocking face lies between these two 3D points, the BSP tree is walked recursively until it reaches the leaves or hits some other stopping condition. A single test involves 10,000 to 20,000 checks, flips and flops; there are 1296 sample points on a face, and an average map has tens of thousands of faces.
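For readers unfamiliar with this kind of test, here is a minimal C sketch of a recursive line-versus-BSP check. It is not the tool's actual code: the names bsp_node, line_is_clear and the CONTENTS_* markers are made up for illustration, and it recurses for brevity where a tuned version would typically keep its own explicit stack.

```c
#include <stddef.h>

/* Hypothetical leaf markers for this sketch: a negative child index
   means "this is a leaf" and encodes what the leaf contains. */
#define CONTENTS_EMPTY (-1)
#define CONTENTS_SOLID (-2)

typedef struct {
    float normal[3];   /* splitting plane normal */
    float dist;        /* plane equation: dot(normal, p) == dist */
    int   children[2]; /* [0] = front side, [1] = back side */
} bsp_node;

/* Signed distance of point p from the node's splitting plane. */
static float plane_side(const bsp_node *n, const float p[3]) {
    return n->normal[0] * p[0] + n->normal[1] * p[1]
         + n->normal[2] * p[2] - n->dist;
}

/* Returns 1 if the segment a-b touches no solid leaf, 0 otherwise. */
int line_is_clear(const bsp_node *nodes, int node,
                  const float a[3], const float b[3]) {
    if (node < 0)                       /* reached a leaf */
        return node != CONTENTS_SOLID;

    const bsp_node *n = &nodes[node];
    float da = plane_side(n, a);
    float db = plane_side(n, b);

    /* Both endpoints on the same side: descend into that child only. */
    if (da >= 0.0f && db >= 0.0f)
        return line_is_clear(nodes, n->children[0], a, b);
    if (da < 0.0f && db < 0.0f)
        return line_is_clear(nodes, n->children[1], a, b);

    /* The segment crosses the plane: split it at the intersection
       point and test both halves, starting with the side of a. */
    float t = da / (da - db);
    float mid[3];
    for (int i = 0; i < 3; i++)
        mid[i] = a[i] + t * (b[i] - a[i]);

    int near_side = da < 0.0f;
    return line_is_clear(nodes, n->children[near_side], a, mid)
        && line_is_clear(nodes, n->children[!near_side], mid, b);
}
```

Every plane test here is a handful of float32 multiplies, adds and comparisons, so anything that slows those down gets multiplied by the number of nodes visited per sample point.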

As in the C version, I'm using ptrmath. The original code is here, and the Nim version is here

Stefan_Salewski [2019-08-25T20:56:27+02:00]

Did you initially convert it with c2nim?

I have no idea about your goal of course -- low level code is still low level even when coded in low level Nim...

First a remark about your proc

proc `toref`[T](x: var T): ref T =
  cast[ref typeof(x)](x.addr)

That cannot really work: you cannot cast an arbitrary variable of type T to a ref. In Nim, a ref is a GC-traced reference to an object; a refcount and type information are involved, which is much more than a plain pointer. So delete this proc now!

Have you verified the size of

type
  tnode_t* = object
    nodetype*: int32
    normal*: array[3, float32]
    dist*: float32
    children*: array[2, int32]
    pad*: int32

I am not sure, but I would guess that if you do not mark this object with the pure pragma, additional type information may be included in each object instance, making it larger and reducing the number of instances that fit in the cache.
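On the C side, one way to pin down the expected layout is a compile-time check. The sketch below mirrors the Nim definition above field by field (the field names follow the Nim port, not necessarily the original C header) and assumes a 4-byte float; tnode_size is a hypothetical helper added only so the value can be printed or compared at run time.

```c
#include <stddef.h>
#include <stdint.h>

/* C mirror of the Nim tnode_t shown above. */
typedef struct {
    int32_t nodetype;
    float   normal[3];
    float   dist;
    int32_t children[2];
    int32_t pad;
} tnode_t;

/* Compilation fails if the layout is not the expected one. */
_Static_assert(sizeof(float) == 4, "this sketch assumes a 32-bit float");
_Static_assert(sizeof(tnode_t) == 32, "tnode_t should be 32 bytes");
_Static_assert(offsetof(tnode_t, dist) == 16, "dist should be at offset 16");
_Static_assert(offsetof(tnode_t, children) == 20, "children should be at offset 20");

/* Hypothetical helper: lets a test harness or printf report the size. */
size_t tnode_size(void) {
    return sizeof(tnode_t);
}
```

If sizeof on the Nim object reports the same number, any hidden type header can be ruled out as the cause of a slowdown.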

I can't see other issues at first glance; you may have to compare line by line. Maybe inspect Nim's intermediate C code. Tiny differences can make a big difference in performance. For example, the original C code may be written so that the C compiler generates good SIMD instructions, which may not happen with Nim's C output.

But of course it is great that your Nim version works at all!

mratsim [2019-08-26T09:58:26+02:00]

Do you compile with -d:release and -d:danger?

r3c [2019-08-26T13:41:13+02:00]

@Stefan_Salewski thanks for the feedback. I compared the sizes of the structs, and they are the same as in the C version:

("Tnode_t size:      ", 32)
("Tracestack_t size: ", 20)

Stefan_Salewski [2019-08-26T15:24:12+02:00]

One more idea: which optimization option do you pass to gcc? Nim passes -O3 to gcc by default, while C programs most of the time use only -O2. -O3 can significantly grow the executable size. Generally -O3 should not be slower than -O2, but in rare cases it may be.

And you may try the option -flto for link-time optimization, or use clang instead of gcc.

Can you test with --gc:none ?

mratsim [2019-08-26T17:19:55+02:00]

You need to compile with both -d:release (removes stacktraces and uses -O3) and -d:danger.

Using -d:danger only is like compiling C with no optimization.

I'll have a look later.

In my experience Nim is as fast as C, especially for low-level stuff.

r3c [2019-08-26T18:11:22+02:00]

I'm using MinGW 4.9.2. I also tried TDM GCC 5.1, but it's even slower with it. 4.9.2 compiles faster, produces faster code and a smaller binary.

Tried with:

-O2: no difference
-flto: sped up the parser, which uses .split(), but did nothing for TestLine
--gc:none: did nothing except raise the memory usage from 5 to 330 MB

Stefan_Salewski [2019-08-26T18:48:19+02:00]

Nim uses half of the memory

That is strange.

Are you running both of your tests on a 32-bit OS, or both on a 64-bit OS?

One more remark: You used int32 and float32 data types in your Nim version -- that seems to answer my initial question, you have not used c2nim for code transfer.

I think there is a good reason why Araq invented cint and cfloat, so why not use them?

And generally, I would recommend using c2nim; it mostly works even for C++ sources. After using c2nim I generally still do a manual line-by-line comparison, but it saves me from typing errors and from a lot of mindless typing.

Stefan_Salewski [2019-08-26T19:04:10+02:00]

One more short look...

tstack_p--;

tstack_p -= 1

You know that C pointer arithmetic is very different from plain integer arithmetic?
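To make the difference concrete, here is a small C sketch. The struct stack_entry is a hypothetical stand-in for a 20-byte stack element (three floats and two ints on typical platforms), not the tool's actual definition.

```c
#include <stddef.h>

/* Hypothetical 20-byte element, used only to illustrate the point. */
typedef struct {
    float backpt[3];
    int   side;
    int   node;
} stack_entry;

/* Returns how many bytes the address changes when a pointer to
   stack_entry is decremented once. */
ptrdiff_t bytes_moved_by_decrement(void) {
    stack_entry stack[4];
    stack_entry *p = &stack[2];
    char *before = (char *)p;

    p--;  /* steps back one whole element, not one byte */

    return before - (char *)p;
}
```

The function returns sizeof(stack_entry), so the C decrement moves the address by a whole element. A port that subtracted 1 from the raw address would move by a single byte and end up in the middle of an element, which is why the Nim side needs operators that scale by the element size.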

r3c [2019-08-26T19:34:21+02:00]

Yes, I have hidden converters :) They're in ptrmath.nim from @Jehan

About c2nim: yes, I converted the code with it initially, but I changed the types such as cint, cfloat and cuint. If I remember correctly, the object types had some issues with their sizes. They also had the {.bycopy.} pragma.

mratsim [2019-08-26T19:36:04+02:00]

So after a first look through the generated C and assembly code, there is a bug in the handling of unary minus (-) for float32 literals.

Test case:

proc main() =
  
  let z = 10.0'f32
  
  if z > -0.1'f32:
    echo "more"
  
  if z < 0.1'f32:
    echo "less"

main()

Generated C code

N_LIB_PRIVATE N_NIMCALL(void, main__WT5bdlPWc6VHEZkxs56sUA)(void) {
        NF32 z;
        z = 1.0000000000000000e+01f;
        {
                if (!(-1.0000000000000001e-01 < z)) goto LA3_;
                echoBinSafe(TM__tJFdVcCXt79a8K7xYCw4R7g_2, 1);
        }
        LA3_: ;
        {
                if (!(z < 1.0000000000000001e-01f)) goto LA7_;
                echoBinSafe(TM__tJFdVcCXt79a8K7xYCw4R7g_4, 1);
        }
        LA7_: ;
}

Notice how the negated literal is emitted as a double, "-1.0000000000000001e-01", without the f suffix, while the positive literal is emitted as a single-precision float, "1.0000000000000001e-01f".

Benchmarking shows that a lot of time is spent on single to double conversion via cvtss2sd instruction.

I also expect that several parts of your code should use 0.0f or 0.0'f32 instead of 0.0 to ensure that float32 is used.
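The effect of the missing f suffix can be reproduced in plain C. In the sketch below (the function names are made up for illustration), the first comparison converts z to double because the literal is a double, while the second keeps both operands as float on targets that evaluate float expressions in single precision, such as x86-64 with SSE.

```c
/* Literal without suffix is a double: z is converted to double before
   the comparison (a cvtss2sd on x86-64 when it is not constant-folded). */
int above_double_literal(float z) {
    return z > 0.1;
}

/* Literal with the f suffix is a float: no conversion of z is needed. */
int above_float_literal(float z) {
    return z > 0.1f;
}
```

The two functions even disagree for z = 0.1f: the float nearest to 0.1 is slightly larger than the double nearest to 0.1, so above_double_literal(0.1f) returns 1 while above_float_literal(0.1f) returns 0. That makes it easy to verify which precision a given comparison is really carried out in.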

Unfortunately I can't compile the original code on Linux easily due to the MSVC build system. Note that your filename/imports have casing issues on Linux as well.

Feel free to raise the unary minus (-) issue on the tracker; otherwise I'll do it later when I have more time.

r3c [2019-08-26T20:31:58+02:00]

I can confirm this: without 'f32:

 Time [LightWorld   ] 3.094s
CPU Time [WriteEntities] 0.256s

Jehan [2019-08-27T14:46:26+02:00]

For what it's worth, I wrote ptrmath primarily for C/C++ interop. Modern compilers are generally smart enough to generate identical code, and using pointer arithmetic can even inhibit optimizations in some cases. You'll see performance benefits only for certain edge cases and may lose performance in other edge cases. Obviously, it can also help with porting C/C++ code, but one shouldn't expect performance gains out of such a port.
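As a rough illustration of that point, the two hypothetical C functions below compute the same sum, once with array indexing and once with pointer arithmetic. With optimization enabled, current gcc and clang will usually compile both loops to very similar machine code, though that is worth checking in the assembly output rather than taking on faith.

```c
#include <stddef.h>

/* Sum using array indexing. */
float sum_indexed(const float *v, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Same sum using explicit pointer arithmetic. */
float sum_pointer(const float *v, size_t n) {
    float total = 0.0f;
    for (const float *p = v, *end = v + n; p != end; p++)
        total += *p;
    return total;
}
```

Both versions add the elements in the same order, so they return identical results; the choice between them is mostly about readability and ease of porting.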

mratsim [2019-08-27T15:59:27+02:00]

Fix for negative float32 literals incoming - https://github.com/nim-lang/Nim/pull/12063/files

r3c [2019-08-27T22:09:29+02:00]

Tested on devel, it's fixed now. That was fast :P

Mirror of forum.nim-lang.org