Recently I read: http://lemire.me/blog/2014/02/14/getting-good-performance-in-go-by-rewriting-parts-in-c/
So I thought it was time to try it in Nim:
import random
import times
# https://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/Other-Builtins.html
# Nim/lib/system/arithm.nim
when sizeof(clong) == 8:
proc popcount[T: int64|int](a: T): cint {.
importc: "__builtin_popcountl", nodecl, nosideeffect.}
proc test =
var
s: array[1024, int]
c: cint
d: int
t: float
for i in 0 .. s.high: s[i] = random(8) # = 3
t = cpuTime()
for i in s:
c += popcount(i)
echo "Time for popcount(): ", cpuTime() - t
echo c
t = cpuTime()
for i in s:
d += i
echo "Time for add: ", cpuTime() - t
echo d
test()
$ nim c -d:release pc.nim
$ ./pc
Time for popcount(): 3.999999999999989e-06
1492
Time for add: 1.000000000000024e-06
3449
Less than 4 ns for a popcount operation.
[EDIT]
As Prof Lemire just told me:
So on a 4GHz processor, you ought to get down to 0.25 ns/popcnt.
And indeed, I forgot gcc option -march=native on my box. With that and -O3 popcount loop is more than two times faster than above, very close to plain add operation.