Dear All,
I'm trying to compare a simple discrete-time simulation in Nim and Julia; I've put a gist here. I wrote the code to be as similar as possible between the two languages, but the Nim version is about 3-4x slower than my Julia code (0.21 s vs 0.06 s). I'm including the Nim code below - any idea why there might be such a difference? In addition, when I add profiling, the simulation time goes down - I would have thought profiling would add overhead.
import random, random.xorshift, random.common
import math
import times
# Sample from Bin(n, p) via the geometric method: accumulate log-uniform
# increments until the running sum drops below log(1 - p).
proc randbn(n: int64; p: float64; rng: var RNG): int64 =
  let logq: float64 = math.ln(1.0 - p)
  var x: int64 = 0
  var sum: float64 = 0.0
  while true:
    sum += math.ln(rng.random()) / float64(n - x)
    if sum < logq:
      return x
    x = x + 1
proc sir(t: int64; u: array[4, int64]; du: var array[4, int64];
         parms: array[5, float64]; r: var RNG): int64 {.discardable.} =
  let S = u[0]
  let I = u[1]
  let R = u[2]
  let Y = u[3]
  let beta = parms[0]
  let gamma = parms[1]
  let iota = parms[2]
  let N = parms[3]
  let dt = parms[4]
  let lambd = beta*(float64(I) + iota)/N
  let ifrac = 1.0 - exp(-lambd*dt)
  let rfrac = 1.0 - exp(-gamma*dt)
  let infection = int(randbn(S, ifrac, r))
  let recovery = int(randbn(I, rfrac, r))
  du[0] = S - infection
  du[1] = I + infection - recovery
  du[2] = R + recovery
  du[3] = Y + infection
  result = 1
proc simulate(): float64 =
  let parms: array[5, float64] = [2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1]
  var seed: uint64 = 123
  var r = initXorshift128Plus(seed)
  let tf: int64 = 540
  let nsims: int64 = 1000
  var yvec: array[1000, int64]
  for i in 1..nsims:
    var u: array[4, int64] = [60.int64, 1, 342, 0]
    var du: array[4, int64] = [0.int64, 0, 0, 0]
    for j in 1..tf:
      discard sir(j, u, du, parms, r)
      u = du
    yvec[i-1] = u[3]
  echo yvec
  result = float64(sum(yvec))/float64(nsims)
let t0=cpuTime()
let m = simulate()
let t1=cpuTime()
echo t1-t0
echo m
Additionally: 32-bit ints might be faster than 64-bit ones; Julia might opt to use them by default, while Nim defaults to 64-bit ints on x86-64.
But locally, compiling Nim with -d:release, I see Nim being slightly faster than Julia: 0.22 s instead of 0.28 s.
And can you please try to avoid the ugly cast:
#var u:array[4,int64] = cast[array[4,int64]]([60,1,342,0])
var u: array[4, int64] = [60.int64, 1, 342, 0] # should work -- if not tell Araq
Hi all,
A few things:
You have to tell us whether your box is 32- or 64-bit. The size of the data (4 or 8 bytes) can make a difference.
And maybe tell us your gcc version and gcc's optimization level. Is it the default -O3?
And you might check whether your random() proc is inlined. Recently we had a case where a plain proc from the standard library was not inlined, which can cause a serious slowdown. You can compile the C code with -flto to ensure inlining via link-time optimization.
(edit: this post only appeared on 12/14 because of the moderation delay)
Hello, I played a bit with it (config: i5-2675QM, 64-bit Linux, gcc 5.4.0, nim 0.17.2, julia 0.6.1), and I also found that, with the -d:release flag, the Nim version is 3-4x slower than the Julia one.
However, with different flags to the C compiler, the results varied considerably. With the flag --passC:"-ffast-math -march=native" the Nim version became slightly faster...
This reminded me of past troubles, so I tried to replace the system libm with openlibm (incidentally the math lib used by julia) and this time the Nim version was as fast as the Julia one.
So, as the code calls the math functions log and exp quite often in the main loop, I would say this is mainly an issue with the math library / math optimizations.
Hi @stefan_salewski
My box is 64-bit (Intel Core i7-5500U CPU @ 2.40GHz x 4), with gcc 5.4.0. AFAIK the default optimisation level is -O3; how does one check via Nim?
I compiled using the following:
nim c -d:release --passC:"-flto" sir
but it made no difference in the runtime. I also made a mistake in the above code (now corrected, plus getting rid of the cast), but it didn't make a difference either.
Fast-math is not magic: it buys speed by relaxing strict IEEE-754 floating-point semantics.
For example, one of the biggest speedups fast-math brings is assuming that (a + b) + c is equivalent to a + (b + c) (associativity), which is not true in floating-point math due to rounding.
This is key for reductions like sum([float32 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]): without this assumption the CPU has to wait for the previous result before adding the next term (a "loop-carried dependency").
With this assumption, the compiler can unroll the loop and do result0 += input[i]; result1 += input[i+1], improving throughput 2x. Note that compilers already do this automatically for integers, because integer addition is associative.
You can read a very thorough benchmark of the effect of increasing the number of accumulators here, and see that when compiling with fast-math this is exactly what the compiler does.
In terms of speed, from my point of view, the only advantage Julia has over Nim is that it is always JITted through LLVM.
This means that you can use the GCC vector extension and be sure it compiles to optimized SSE, AVX or NEON (on ARM) code without having to deal with runtime CPU detection or Microsoft Visual C++ quirks.
It's much more work to replicate that in Nim (unless you constrain yourself to GCC/Clang as the backend).
Example of vector extension in Nim
import math, times, random

{.emit: "typedef float Float32x8 __attribute__ ((vector_size (32)));".}
# auto-fallback if the CPU only supports 16-byte/128-bit wide vectors.
type Float32x8 {.importc, bycopy.} = object
  raw: array[8, float32]

func `+`(a, b: Float32x8): Float32x8 =
  {.emit: "`result` = `a` + `b`;".}  # <--- Notice that it uses the natural C syntax, no _mm256_add_ps

func double(v: Float32x8): Float32x8 =
  result = v + v

proc main() =
  var b = [0'f32, 0, 0, 0, 0, 0, 0, 0]
  var start = cpuTime()
  for i in 0 ..< 1000000:
    let a = cast[Float32x8]([float32 rand(1.0), 2, rand(1.0), 4, rand(1.0), 6, rand(1.0), 8])
    b = cast[array[8, float32]](double(double(a)))
  var stop = cpuTime()
  echo b
  echo stop - start

when isMainModule:
  main()