I still wonder what the disadvantages of static proc parameters are.
Is it an increase in compile time, or are there other issues?
I ask because I recently suggested in
https://github.com/nim-lang/Nim/issues/10910#issuecomment-490977890
proc `^`*[T](x: T, y: static[Natural]): T {.inline.} =
  when y < 7:
    when y == 0:
      result = T(1)
    when y == 1:
      result = x
    when y == 2:
      result = x * x
    when y == 3:
      result = x * x * x
    when y == 4:
      result = x * x
      result *= result
    when y == 5:
      result = x * x
      result *= (result * x)
    when y == 6:
      result = x * x
      result *= (result * result)
  else:
    result = math.`^`(x, y)
# or
proc `^`*[T](x: T, y: static[Natural]): T {.inline.} =
  when y < 10:
    result = T(1)
    var i = y
    while i > 0:
      result *= x
      dec(i)
  else:
    result = math.`^`(x, y)
to make small integer powers more Nim-like.
Miran's choice was instead to add this code for small powers:
case y
of 0: result = 1
of 1: result = x
of 2: result = x * x
of 3: result = x * x * x
in https://github.com/nim-lang/Nim/blob/devel/lib/pure/math.nim#L966
That seems to be an improvement, but I cannot imagine that it leads to optimal code. (Maybe the case statement is compiled to a cmov instruction that avoids a slow branch, but the ^ proc is still not inlined unless we compile with -flto.)
Well, I have tested the running time, without -flto but with -d:release of course.
Miran's fix was an improvement for ^ 2 (square) but still slower than plain "*". Both of my suggestions were as fast as plain "*", and the assembler was perfect.
So I assumed there must be some drawback to my solution.
OK, will look at assembler.
OK, here it is:
First my well known Nim test code:
# nim c -d:release k.nim
import random, math

proc main =
  var s, j: int
  for i in 0 .. 10000000:
    j = rand(7)
    #s += j * j # j ^ 2
    s += j ^ 2
  echo s

main()
Compiled with nim c -d:release k.nim using gcc 9.1:
.file "k.c"
.L33:
movl $7, %edi
call rand_v7jZDEs4VOsrcpvk0yo8Rg
movq %rax, %rdi
movl $2, %esi
call roof__e6fgxN584SyDK8XF8s1uig
addq %rax, %rbp
decq %rbx
jne .L33
.file "stdlib_math.c"
.text
.p2align 4
.globl roof__e6fgxN584SyDK8XF8s1uig
.hidden roof__e6fgxN584SyDK8XF8s1uig
.type roof__e6fgxN584SyDK8XF8s1uig, @function
roof__e6fgxN584SyDK8XF8s1uig:
.LFB3:
.cfi_startproc
movq %rdi, %rax
cmpq $2, %rsi
je .L2
jg .L3
testq %rsi, %rsi
je .L9
cmpq $1, %rsi
jne .L5
ret
.p2align 4,,10
.p2align 3
.L3:
cmpq $3, %rsi
jne .L5
movq %rdi, %rdx
imulq %rdi, %rdx
imulq %rdx, %rax
ret
.p2align 4,,10
.p2align 3
.L9:
movl $1, %eax
ret
.p2align 4,,10
.p2align 3
.L5:
movq %rax, %rdx
movl $1, %eax
jmp .L7
.p2align 4,,10
.p2align 3
.L22:
imulq %rdx, %rdx
.L7:
testb $1, %sil
je .L8
imulq %rdx, %rax
.L8:
shrq %rsi
jne .L22
ret
.p2align 4,,10
.p2align 3
.L2:
imulq %rdi, %rax
ret
.cfi_endproc
.LFE3:
.size roof__e6fgxN584SyDK8XF8s1uig, .-roof__e6fgxN584SyDK8XF8s1uig
.ident "GCC: (Gentoo 9.1.0 p1.0) 9.1.0"
.section .note.GNU-stack,"",@progbits
This is not surprising. As the roof proc is not inlined, the full assembler proc is called.
And to verify, here is the timing with your roof proc:
$ time ./k
175032899
real 0m0.172s
Plain j * j gives
$ time ./k
175032899
real 0m0.164s
It is not a big difference of course, but note that the timing includes the rand() and loop overhead. For Python we would be happy with the roof proc, but this is Nim. And as I wrote above, my static suggestion gives the same timing as j * j. Of course we can use -flto, and maybe we should make that the default; then everything is inlined and your roof proc works perfectly.
The main disadvantage is that you get one proc per static instantiation value, which usually increases the code size (and is less efficient on the instruction cache).
It doesn't really matter in this case, as the proc is inlined and compiles down to a single machine instruction.
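To illustrate the point about one proc per static value, here is a minimal sketch. The name ipow is hypothetical and not from the thread; the idea is only that each distinct static exponent gives its own instantiation of the generic proc.

```nim
# Hypothetical helper (not from the thread): integer power with a static exponent.
proc ipow[T](x: T, y: static[Natural]): T {.inline.} =
  result = T(1)
  for i in 1 .. y:
    result *= x

let a = 3
echo ipow(a, 2)  # instantiates ipow for y = 2, prints 9
echo ipow(a, 3)  # a second, separate instantiation for y = 3, prints 27
```

With many different exponents in a program this would mean many copies of the proc body, which is where the code size concern comes from.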
compilers nowadays often do call specialization for constant arguments: they create alternate functions for specific constant values based on metrics collected during compilation, instead of doing it blindly. this means static ends up being a noisy premature optimization if used for this reason.
the story is similar to inline in c - obsoleted by better compilers, effectively.
this will work better if you also enable LTO - which incidentally removes the need for the nim {.inline.} noise in your code too.
In that case ^ is a library function and will be in a different module from where it's used. So you need either static or {.inline.} so that respectively Nim VM or the C compiler does constant folding.
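A rough sketch of the {.inline.} alternative, with an ordinary run-time exponent. The name ipowInline is hypothetical, and the case branches simply mirror the small-power code quoted earlier in the thread. Whether the C compiler actually reduces a call with a literal exponent to a single multiply depends on the compiler and flags, so treat that as an assumption to check in the generated assembler.

```nim
# Hypothetical name; y is a normal run-time parameter here, not static.
proc ipowInline[T](x: T, y: Natural): T {.inline.} =
  case y
  of 0: result = T(1)
  of 1: result = x
  of 2: result = x * x
  of 3: result = x * x * x
  else:
    # Fallback for larger exponents: plain repeated multiplication.
    result = T(1)
    for i in 1 .. y:
      result *= x

let j = 5
echo ipowInline(j, 2)  # prints 25
echo ipowInline(j, 5)  # prints 3125
```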
Also, while C compilers certainly do constant propagation of add, mul, or/and/xor and shifts, I'm not sure they do that for the power function.
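If you do not want to rely on the C compiler at all, one option is to let Nim evaluate the call itself when all arguments are known at compile time. A minimal sketch, with a hypothetical helper name ipowPlain that is not from the thread:

```nim
# Hypothetical helper; an ordinary proc, neither static nor inline.
proc ipowPlain(x: int, y: Natural): int =
  result = 1
  for i in 1 .. y:
    result *= x

const k = ipowPlain(3, 4)  # evaluated by the Nim VM during compilation
static: doAssert k == 81   # checked at compile time
echo k                     # prints 81
```

This only helps when the base is a constant too, so it does not cover the j ^ 2 case from the benchmark above.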