Dear All,
I'm trying to compare a simple discrete-time simulation in Nim and Julia; I've put a gist here. I wrote the code to be as similar as possible between the two languages, but the Nim version is about 3-4x slower than my Julia code (0.21 s vs 0.06 s). I'm including the Nim code below - any idea why there might be such a difference? In addition, when I add profiling, the simulation time goes down - I would have thought profiling would add overhead.
import random, random.xorshift, random.common
import math
import times
# Sample from Bin(n, p) via the geometric method: accumulate log-uniform
# increments until the running sum drops below log(1 - p).
proc randbn(n: int64; p: float64; rng: var RNG): int64 =
  let logq: float64 = math.ln(1.0 - p)
  var x: int64 = 0
  var sum: float64 = 0.0
  while true:
    sum += math.ln(rng.random()) / float64(n - x)
    if sum < logq:
      return x
    x = x + 1
proc sir(t: int64; u: array[4, int64]; du: var array[4, int64];
         parms: array[5, float64]; r: var RNG): int64 {.discardable.} =
  let S = u[0]
  let I = u[1]
  let R = u[2]
  let Y = u[3]
  let beta = parms[0]
  let gamma = parms[1]
  let iota = parms[2]
  let N = parms[3]
  let dt = parms[4]
  let lambd = beta*(float64(I) + iota)/N
  let ifrac = 1.0 - exp(-lambd*dt)
  let rfrac = 1.0 - exp(-gamma*dt)
  let infection = int(randbn(S, ifrac, r))
  let recovery = int(randbn(I, rfrac, r))
  du[0] = S - infection
  du[1] = I + infection - recovery
  du[2] = R + recovery
  du[3] = Y + infection
  result = 1
proc simulate(): float64 =
  let parms: array[5, float64] = [2.62110617498984, 0.5384615384615384, 0.5, 403.0, 0.1]
  var seed: uint64 = 123
  var r = initXorshift128Plus(seed)
  let tf: int64 = 540
  let nsims: int64 = 1000
  var yvec: array[1000, int64]
  for i in 1..nsims:
    var u: array[4, int64] = [60.int64, 1, 342, 0]
    var du: array[4, int64] = [0.int64, 0, 0, 0]
    for j in 1..tf:
      discard sir(j, u, du, parms, r)
      u = du
    yvec[i-1] = u[3]
  echo yvec
  result = float64(sum(yvec))/float64(nsims)
let t0=cpuTime()
let m = simulate()
let t1=cpuTime()
echo t1-t0
echo m
Additionally: 32-bit ints might be faster than 64-bit ones; Julia might opt to use them by default, while Nim defaults to 64-bit ints on x86-64.
But locally, compiling Nim with -d:release, I see Nim being slightly faster than Julia: 0.22 s instead of 0.28 s.
And can you please try to avoid the ugly cast:
#var u:array[4,int64] = cast[array[4,int64]]([60,1,342,0])
var u: array[4, int64] = [60.int64, 1, 342, 0] # should work -- if not tell Araq
Hi all,
A few things:
You have to tell us whether your box is 32- or 64-bit. The size of the data (4 or 8 bytes) can make a difference.
And maybe tell us your gcc version and gcc's optimization level. Is it the default -O3?
And you might check whether your random() proc is inlined. Recently we had a case where a plain proc from the standard library was not inlined, which can cause a serious slowdown. You can compile the C code with -flto to ensure inlining via link-time optimization.
(edit: this post only appeared on 12/14 because of the moderation delay)
Hello, I played a bit with it (config: i5-2675QM, 64-bit Linux, gcc 5.4.0, nim 0.17.2, julia 0.6.1), and I also found that, with the -d:release flag, the Nim version is 3-4x slower than the Julia one.
However, with different flags to the C compiler, the results varied considerably. With the flag --passC:"-ffast-math -march=native" the Nim version became slightly faster...
This reminded me of past troubles, so I tried to replace the system libm with openlibm (incidentally the math lib used by julia) and this time the Nim version was as fast as the Julia one.
So, as the code calls the math functions log and exp quite often in the main loop, I would say this is mainly an issue with the math library / math optimizations.
Hi @stefan_salewski
My box is 64-bit (Intel Core i7-5500U CPU @ 2.40GHz x 4), with gcc 5.4.0. AFAIK the default optimisation level is -O3; how does one check via Nim?
I compiled using the following:
nim c -d:release --passC:"-flto" sir
but it made no difference in the runtime. I also made a mistake in the above code (now corrected, plus getting rid of the cast), but it didn't make a difference either.
Fast-math is not magic: it buys speed by relaxing strict IEEE-754 floating-point semantics.
For example, one of the biggest speedups fast-math brings is assuming that (a + b) + c is equivalent to a + (b + c) (associativity), which is not true in floating-point math due to rounding.
This is key for reductions like sum([float32 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]): without this assumption the CPU has to wait for the previous result before adding the next term (a "loop-carried dependency").
With this assumption, the compiler can unroll the loop and do result0 += input[i]; result1 += input[i+1], improving throughput 2x. Note that compilers already do this automatically for integers, because integer addition is associative.
You can read a very thorough benchmark of the effect of increasing the number of accumulators here, and see that when compiling with fast-math this is exactly what the compiler does.
In terms of speed, from my point of view, the only advantage Julia has over Nim is that it is always JITted through LLVM.
This means that you can use the GCC vector extension and be sure it compiles to optimized SSE, AVX or NEON (on ARM) code without having to deal with runtime CPU detection or Microsoft Visual C++ quirks.
It's much more work to replicate that in Nim (unless you constrain yourself to GCC/Clang as the backend).
Example of vector extension in Nim
import math, times, random

{.emit: "typedef float Float32x8 __attribute__ ((vector_size (32)));".}
# auto-fallback if the CPU only supports 16-byte/128-bit wide vectors.
type Float32x8 {.importc, bycopy.} = object
  raw: array[8, float32]

func `+`(a, b: Float32x8): Float32x8 =
  {.emit: "`result` = `a` + `b`;".}  # <--- Notice that it uses the natural C syntax, no _mm256_add_ps

func double(v: Float32x8): Float32x8 =
  result = v + v

proc main() =
  var b = [0'f32, 0, 0, 0, 0, 0, 0, 0]
  var start = cpuTime()
  for i in 0 ..< 1000000:
    let a = cast[Float32x8]([float32 rand(1.0), 2, rand(1.0), 4, rand(1.0), 6, rand(1.0), 8])
    b = cast[array[8, float32]](double(double(a)))
  var stop = cpuTime()
  echo b
  echo stop - start

when isMainModule:
  main()