nimforum mirror - Why is integer multiplication so slow here?

Stefan_Salewski (orginal) [2018-04-22T23:57:10+02:00] view original

proc main =
  var s = 0
  for i in 0 .. 1000000000:
    if i * i == 49:
    #if i == 7:
      s += 1
  echo s

main()

For this test, using "if i == 7" instead of "if i * i == 49" gives half runtime on my box. This is a bit surprising as multiply is considered very fast generally.

Nim Compiler Version 0.18.0 [Linux: amd64]

DeletedUser (orginal) [2018-04-23T01:52:15+02:00] view original

On my machine and debug mode i == 7 takes 12 seconds to run and i * i == 49 takes 20 seconds. With release mode it was like 1.3 for i * i == 49 and 0.8 for the other. A little faster than double the speed, which means it's at least faster than i == -7 or i == 7.

Based on this online benchmark, you'll see multiplication is about 3 times slower in C++

cdome (orginal) [2018-04-23T09:52:06+02:00] view original

with "if i == 7" branch misprediction in CPU happens much more often then with "if i == 49". Try to implement in the branchless way, it does help sometimes.

Stefan_Salewski (orginal) [2018-04-23T10:14:19+02:00] view original

Hlaaftana, thanks for confirming. I was really tired yesterday and though I was doing something wrong.

[EDIT] : Thanks for C++ example, really interesting, I was not aware of that page.

cdome, your message makes not too much sense for me currently.

Indeed I tried using unlikely() for this test, which made no difference, as expected, as automatic branch prediction should work fine here.

Indeed I wonder if Nim supports branchless cmov instructions at all currently, maybe not because of so much goto use? At least I have not seen a cmov in Nims assembler listings yet.

My feeling is, that mul is indeed slow in this special loop -- will inspect assembly listing when I have some more time.

sleepyqt (orginal) [2018-04-23T10:43:43+02:00] view original

First version 500ms


LBB0_1:                                 # =>This Inner Loop Header: Depth=1
  movdqa  xmm3, xmm5
  movdqa  xmm6, xmm5
  paddd xmm5, xmmword ptr [__xmm@00000008000000080000000800000008]
  add esi, -8
  pcmpeqd xmm3, xmm0
  pcmpeqd xmm6, xmm2
  psubd xmm4, xmm3
  psubd xmm1, xmm6
  jne LBB0_1

With mutiplication 2500ms(5x slower):


LBB0_3:                                 # =>This Inner Loop Header: Depth=1
  movdqu  xmm6, xmmword ptr [esp + 16] # 16-byte Reload
  add esi, -8
  movdqa  xmm4, xmm6
  pshufd  xmm5, xmm6, 245         # xmm5 = xmm6[1,1,3,3]
  movdqa  xmm3, xmm6
  paddd xmm6, xmmword ptr [__xmm@00000008000000080000000800000008]
  pmuludq xmm4, xmm4
  pmuludq xmm5, xmm5
  paddd xmm3, xmm1
  pshufd  xmm4, xmm4, 232         # xmm4 = xmm4[0,2,2,3]
  pshufd  xmm5, xmm5, 232         # xmm5 = xmm5[0,2,2,3]
  punpckldq xmm4, xmm5      # xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1]
  pshufd  xmm5, xmm3, 245         # xmm5 = xmm3[1,1,3,3]
  pmuludq xmm3, xmm3
  pmuludq xmm5, xmm5
  pshufd  xmm3, xmm3, 232         # xmm3 = xmm3[0,2,2,3]
  pcmpeqd xmm4, xmm2
  movdqu  xmmword ptr [esp + 16], xmm6 # 16-byte Spill
  pshufd  xmm5, xmm5, 232         # xmm5 = xmm5[0,2,2,3]
  psubd xmm7, xmm4
  punpckldq xmm3, xmm5      # xmm3 = xmm3[0],xmm5[0],xmm3[1],xmm5[1]
  pcmpeqd xmm3, xmm2
  psubd xmm0, xmm3
  jne LBB0_3

If compiled with tcc perfomance almost the same.

mratsim (orginal) [2018-04-24T09:02:38+02:00] view original

cmov will likely be slower than a branch misprediction in the 49 case because it happens much less often. See here for some benchmark: https://github.com/xiadz/cmov.

Also Nim is able to generate cmov, it depends on GCC Clang but rewriting the code like this can help:

proc main =
  var s = 0
  for i in 0 .. 1000000000:
    s += if i * i == 49: 1 else: 0
  echo s

main()

Now if you really want the fastest speed don't generate CMOV or if/else, generate ADC:

proc main =
  var s = 0
  for i in 0 .. 1000000000:
    s += int(i * i == 49)
  echo s

main()

Some timing and latency information: http://www.agner.org/optimize/instruction_tables.pdf

I tried on my machine.

The ASM for the == 7 is completely different from the other (80 ms run on my i5-5257U):

https://imgur.com/a/s64XHqp

ASM for the 3 propositions I did is the same though (350-400ms ms run on my i5-5257U):

https://imgur.com/a/8AvcRsm

So I think the 7/49 should be a command-line supplied value to avoid the compiler optimizing it away (unless it's really hardcoded in your library.)

Stefan_Salewski (orginal) [2018-04-24T10:02:46+02:00] view original

Thanks for your comments.

Starting point for these investigations was

https://github.com/StefanSalewski/nim-chess4/blob/master/engine.nim#L522


it.df.abs == BishopID or it.df.abs == QueenID

I was thinking of avoiding the abs() -- and then discovered the strange behaviour of above loops.

Mirror of forum.nim-lang.org

3775 :: Why is integer multiplication so slow here?