proc main =
var s = 0
for i in 0 .. 1000000000:
if i * i == 49:
#if i == 7:
s += 1
echo s
main()
For this test, using "if i == 7" instead of "if i * i == 49" gives half runtime on my box. This is a bit surprising as multiply is considered very fast generally.
Nim Compiler Version 0.18.0 [Linux: amd64]
On my machine and debug mode i == 7 takes 12 seconds to run and i * i == 49 takes 20 seconds. With release mode it was like 1.3 for i * i == 49 and 0.8 for the other. A little faster than double the speed, which means it's at least faster than i == -7 or i == 7.
Based on this online benchmark, you'll see multiplication is about 3 times slower in C++
Hlaaftana, thanks for confirming. I was really tired yesterday and though I was doing something wrong.
[EDIT] : Thanks for C++ example, really interesting, I was not aware of that page.
cdome, your message makes not too much sense for me currently.
Indeed I tried using unlikely() for this test, which made no difference, as expected, as automatic branch prediction should work fine here.
Indeed I wonder if Nim supports branchless cmov instructions at all currently, maybe not because of so much goto use? At least I have not seen a cmov in Nims assembler listings yet.
My feeling is, that mul is indeed slow in this special loop -- will inspect assembly listing when I have some more time.
First version 500ms
LBB0_1: # =>This Inner Loop Header: Depth=1
movdqa xmm3, xmm5
movdqa xmm6, xmm5
paddd xmm5, xmmword ptr [__xmm@00000008000000080000000800000008]
add esi, -8
pcmpeqd xmm3, xmm0
pcmpeqd xmm6, xmm2
psubd xmm4, xmm3
psubd xmm1, xmm6
jne LBB0_1
With mutiplication 2500ms(5x slower):
LBB0_3: # =>This Inner Loop Header: Depth=1
movdqu xmm6, xmmword ptr [esp + 16] # 16-byte Reload
add esi, -8
movdqa xmm4, xmm6
pshufd xmm5, xmm6, 245 # xmm5 = xmm6[1,1,3,3]
movdqa xmm3, xmm6
paddd xmm6, xmmword ptr [__xmm@00000008000000080000000800000008]
pmuludq xmm4, xmm4
pmuludq xmm5, xmm5
paddd xmm3, xmm1
pshufd xmm4, xmm4, 232 # xmm4 = xmm4[0,2,2,3]
pshufd xmm5, xmm5, 232 # xmm5 = xmm5[0,2,2,3]
punpckldq xmm4, xmm5 # xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1]
pshufd xmm5, xmm3, 245 # xmm5 = xmm3[1,1,3,3]
pmuludq xmm3, xmm3
pmuludq xmm5, xmm5
pshufd xmm3, xmm3, 232 # xmm3 = xmm3[0,2,2,3]
pcmpeqd xmm4, xmm2
movdqu xmmword ptr [esp + 16], xmm6 # 16-byte Spill
pshufd xmm5, xmm5, 232 # xmm5 = xmm5[0,2,2,3]
psubd xmm7, xmm4
punpckldq xmm3, xmm5 # xmm3 = xmm3[0],xmm5[0],xmm3[1],xmm5[1]
pcmpeqd xmm3, xmm2
psubd xmm0, xmm3
jne LBB0_3
If compiled with tcc perfomance almost the same.cmov will likely be slower than a branch misprediction in the 49 case because it happens much less often. See here for some benchmark: https://github.com/xiadz/cmov.
Also Nim is able to generate cmov, it depends on GCC Clang but rewriting the code like this can help:
proc main =
var s = 0
for i in 0 .. 1000000000:
s += if i * i == 49: 1 else: 0
echo s
main()
Now if you really want the fastest speed don't generate CMOV or if/else, generate ADC:
proc main =
var s = 0
for i in 0 .. 1000000000:
s += int(i * i == 49)
echo s
main()
Some timing and latency information: http://www.agner.org/optimize/instruction_tables.pdf
I tried on my machine.
The ASM for the == 7 is completely different from the other (80 ms run on my i5-5257U):
ASM for the 3 propositions I did is the same though (350-400ms ms run on my i5-5257U):
So I think the 7/49 should be a command-line supplied value to avoid the compiler optimizing it away (unless it's really hardcoded in your library.)
Thanks for your comments.
Starting point for these investigations was
https://github.com/StefanSalewski/nim-chess4/blob/master/engine.nim#L522
it.df.abs == BishopID or it.df.abs == QueenID
I was thinking of avoiding the abs() -- and then discovered the strange behaviour of above loops.