I was reading this: "Golang's Real-time GC in Theory and Practice" and thought I'd try the same code in Nim.
Here's my initial cut, anyone see anything that should be changed/improved to make the benchmark more comparable/meaningful?
# run with 'nim c -r -d:release bm.nim'
import strutils
import times

const windowSize = 200000
const msgCount = 1000000

type msg = ref array[1024, byte]
type buffer = ref array[windowSize, msg]

var worst: float

proc mkMessage( n: int ): msg =
  result = new( msg )
  for i in items( result[] ):
    result[i] = byte( n )

proc pushMsg( b: var buffer, highID: int ) =
  let start = cpuTime()
  let m = mkMessage( highID )
  b[highID mod windowSize] = m
  let elapsed = cpuTime() - start
  if elapsed > worst:
    worst = elapsed

proc main() =
  var b = new( buffer )
  for i in 0..<msgCount:
    pushMsg( b, i )
  let worst_ms = formatFloat( worst*1000, format=ffDecimal, precision=2 )
  echo( "Worst push time: ", worst_ms, "ms" )

when isMainModule:
  main()
I also don't know enough about the various garbage collectors to identify which would produce the best numbers, but I thought that would be good information to include in the top comment block, e.g. something like:
# Try with nim c -r -d:release --gc:stack bm.nim for best performance!
Once there are no concerns/objections I'll submit it back to their repo for inclusion. Thanks!
Thanks for the link, sounds interesting, will read soon...
For your test code, I am not really sure what the purpose is. Nim's realtime GC is tunable, so you can get very short pause times with a tradeoff in throughput. See
http://nim-lang.org/docs/gc.html
You seem to test only the pause time for the default configuration, which may be tuned more towards throughput. (To my understanding, pause times are not critical for most software, or at least pause times of a few ms are fully OK, even for action games. For hardware control, pauses may be critical, but if we do not allocate GC memory in critical parts, then the GC will not run in those critical parts.) And while you are testing, you may want to test the Boehm GC as well.
I hope my understanding is not too wrong; I have never spent time investigating or understanding the GC.
For your code itself there is some room for improvement of course, for example
for i in items( result[] ):
  result[i] = byte( n )
should be writable as
for i in mitems( result[] ):
  i = byte( n )
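To make the difference concrete, here is a tiny standalone sketch of the suggested rewrite (a minimal example of mine, not from the benchmark; the names are made up):

```nim
# `items` yields copies of the values, which is why the original loop
# had to index the array; `mitems` yields mutable slots, so assigning
# to the loop variable writes straight into the array.
var buf: array[8, byte]
let n = 42
for b in mitems(buf):
  b = byte(n)
doAssert buf[0] == byte(42) and buf[7] == byte(42)
```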
I was planning on porting these benchmarks to Nim myself, happy to see that you beat me to it :)
I will play around with your version and see if I can speed it up.
Here is my version:
# Compile and run with 'nim c -r -d:useRealtimeGC -d:release main.nim'
import strutils
import times

const
  windowSize = 200000
  msgCount = 1000000

type
  Msg = seq[byte]
  Buffer = seq[Msg]

var worst: float

proc mkMessage(n: int): Msg =
  result = newSeq[byte](1024)
  for i in 0 .. <result.len:
    result[i] = byte(n)

proc pushMsg(b: var Buffer, highID: int) =
  let start = epochTime()
  let m = mkMessage(highID)
  b[highID mod windowSize] = m
  GC_step(10_000) # 10ms
  let elapsed = epochTime() - start
  if elapsed > worst:
    worst = elapsed

proc main() =
  GC_disable()
  var b = newSeq[Msg](windowSize)
  for i in 0 .. <msgCount:
    pushMsg(b, i)
  let worst_ms = formatFloat(worst * 1000, format=ffDecimal, precision=2)
  echo("Worst push time: ", worst_ms, "ms")
  echo(GC_getStatistics())

when isMainModule:
  main()
Things I've changed:
Strangely, switching to seq causes an overall slowdown (the max push time is unaffected, but the application takes longer to complete).
On my machine the typical worst push time is 10ms, both for my version of the code and yours. The addition of GC_step doesn't really help, but it is nice to show off. Go 1.5 (maybe I should upgrade) gets 8ms. So Nim is 2ms off, which isn't bad. As pointed out by /u/matthieum on Reddit, the power of Nim is the fact that you can specify exactly where the GC should run (as I've done using the GC_step call).
BTW, stack isn't a GC.
Here is my version; I switched to measuring nanoseconds to get numbers different from 0:
# Compile and run with 'nim c -r -d:useRealtimeGC -d:release main.nim'
import strutils
#import times
include "$lib/system/timers"

const
  windowSize = 200000
  msgCount = 1000000

type
  Msg = seq[byte]
  Buffer = seq[Msg]

var worst: Nanos

proc mkMessage(n: int): Msg =
  result = newSeq[byte](1024)
  for i in 0 .. <result.len:
    result[i] = byte(n)

proc pushMsg(b: var Buffer, highID: int) =
  let start = getTicks()
  let m = mkMessage(highID)
  # seqs copy, we don't want a copy here, so I used 'shallowCopy'.
  # Alternatively the 'ref array' version could be used.
  # Note that due to Nim's design we could also *re-use* the memory
  # completely, eliminating the GC runs, but this would make this
  # benchmark pointless.
  shallowCopy(b[highID mod windowSize], m)
  GC_step(500) # 0.5ms, but the GC never takes this long.
  let elapsed = getTicks() - start
  if elapsed > worst:
    worst = elapsed

proc main() =
  GC_disable()
  # The mark-and-sweep backup collector is only required for cyclic data
  # structures, but in this program we generate no cycles. Nim allows us
  # to specify this behaviour directly, so IMO it's fair to disable it.
  # Nim's GC docs also mention this fact, it's not undocumented voodoo:
  GC_disableMarkAndSweep()
  var b = newSeq[Msg](windowSize)
  for i in 0 .. <msgCount:
    pushMsg(b, i)
  echo("Worst push time: ", worst, " nanoseconds")
  echo(GC_getStatistics())

when isMainModule:
  main()
On my machine I get roughly 23,000 nanoseconds.
Take with a big grain of salt!
On my Win7 64-bit system, running @Araq's version 25 times (values in nanoseconds):

Count   25
Min     204,940
Mean    975,141
Max     8,422,127
StdDev  1,601,464

So a fairly wide deviation of values (and a much slower system).
Follow-on comments:
Removed extraneous spaces after ( and before ).
You'll take the spaces from my cold, dead fingers. ;)
Rewrote code to make it more idiomatic.
Using seq[byte] instead of ref array[1024, byte] (this is not idiomatic at all)
Tracking. I used the array/ref array because that seemed a more comparable data structure to Go's array & slice types but I agree that idiomatic code is probably better for a benchmark like this.
Switched from cpuTime to epochTime; the former only measures CPU time(!).
I originally used epochTime, but the standard library docs for cpuTime say:
This may be more useful for benchmarking than epochTime.
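For what it's worth, the difference is easy to demonstrate: cpuTime only counts time the CPU spends executing this process, so a sleep is invisible to it, while epochTime measures wall-clock time (a minimal sketch of mine, not from the thread):

```nim
import times, os

let cpuStart  = cpuTime()
let wallStart = epochTime()
sleep(200)  # the process is idle: wall time passes, CPU time does not
let cpuElapsed  = cpuTime() - cpuStart
let wallElapsed = epochTime() - wallStart
doAssert wallElapsed >= 0.15  # the wall clock saw the sleep
doAssert cpuElapsed < 0.1     # CPU time barely moved
```

That is exactly why cpuTime can be the better choice for benchmarking pure computation, but misleading for measuring pauses that include waiting.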
Thanks for all the feedback everyone!
Ok, on my Mac this one produces better timings.
# Compile and run with 'nim c -r -d:useRealtimeGC -d:release main.nim'
import strutils
#import times
include "$lib/system/timers"

const
  windowSize = 200000
  msgCount = 1000000

type
  Msg = seq[byte]
  Buffer = seq[Msg]

var worst: Nanos

proc mkMessage(n: int): Msg =
  result = newSeq[byte](1024)
  for i in 0 .. <result.len:
    result[i] = byte(n)

proc pushMsg0(b: var Buffer, highID: int) =
  # warmup:
  let m = mkMessage(highID)
  shallowCopy(b[highID mod windowSize], m)

proc pushMsg1(b: var Buffer, highID: int) =
  # with benchmarking:
  let start = getTicks()
  let m = mkMessage(highID)
  shallowCopy(b[highID mod windowSize], m)
  let elapsed = getTicks() - start
  if elapsed > worst:
    worst = elapsed

proc main() =
  # Don't use GC_disable() and GC_step(). Instead use GC_setMaxPause().
  GC_disableMarkAndSweep()
  GC_setMaxPause(300)
  var b = newSeq[Msg](windowSize)
  # We need to warm up Nim's memory allocator so that most of the time
  # is not spent in mmap()... Hopefully later versions of Nim will be
  # smarter and allocate larger pieces from the OS:
  for i in 0 .. <msgCount:
    pushMsg0(b, i)
  # now, after warmup, we can measure things:
  for i in 0 .. <msgCount:
    pushMsg1(b, i)
  echo("Worst push time: ", worst, " nanoseconds")
  echo(GC_getStatistics())

when isMainModule:
  main()
On my slower machine, I get latencies of max 0.3ms.
For comparison purposes, @Araq's second solution gives me (values in nanoseconds):

Count   25
Min     272,765
Mean    1,271,322
Max     7,401,332
StdDev  1,503,168

which over 25 tests is not much different from the first test.
Turning on cycle detection doesn't seem to affect the pause times for me. I still get sub-millisecond pauses for Araq's Nim snippet.
This is the snippet I'm using:
# Compile and run with 'nim c -r -d:useRealtimeGC -d:release main.nim'
import strutils
#import times
include "$lib/system/timers"

const
  windowSize = 200000
  msgCount = 1000000

type
  Msg = seq[byte]
  Buffer = seq[Msg]

var worst: Nanos

proc mkMessage(n: int): Msg =
  result = newSeq[byte](1024)
  for i in 0 .. <result.len:
    result[i] = byte(n)

proc pushMsg0(b: var Buffer, highID: int) =
  # warmup:
  let m = mkMessage(highID)
  shallowCopy(b[highID mod windowSize], m)

proc pushMsg1(b: var Buffer, highID: int) =
  # with benchmarking:
  let start = getTicks()
  let m = mkMessage(highID)
  shallowCopy(b[highID mod windowSize], m)
  let elapsed = getTicks() - start
  if elapsed > worst:
    worst = elapsed

proc main() =
  # Don't use GC_disable() and GC_step(). Instead use GC_setMaxPause().
  # GC_disableMarkAndSweep()
  GC_setMaxPause(300)
  var b = newSeq[Msg](windowSize)
  # We need to warm up Nim's memory allocator so that most of the time
  # is not spent in mmap()... Hopefully later versions of Nim will be
  # smarter and allocate larger pieces from the OS:
  for i in 0 .. <msgCount:
    pushMsg0(b, i)
  # now, after warmup, we can measure things:
  for i in 0 .. <msgCount:
    pushMsg1(b, i)
  echo("Worst push time: ", worst, " nanoseconds")

when isMainModule:
  main()
@Varriount
You are right; for your example I get pause times between 44 and 110us on Linux, while for Araq's first example GC_disableMarkAndSweep() was necessary to get such small values.
@Araq
Well you can always switch it off and see what happens,
And what may happen? I assume that some cyclic memory may not be freed, so one may get a slow memory leak, with maybe out-of-memory errors after some months of uptime?
Breaking up cycles via ptr -- well before that we have to detect the cycles of course :-)
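A sketch of what "breaking up cycles via ptr" can look like in practice (my own illustrative example; the type and field names are made up): the traced ref graph stays acyclic because the back link is an untraced ptr, so the mark-and-sweep cycle collector has nothing to find.

```nim
type
  NodeObj = object
    value: int
    next: Node          # forward link: a ref, traced by the GC
    prev: ptr NodeObj   # back link: a raw ptr, so no ref cycle exists
  Node = ref NodeObj

var a = Node(value: 1)
var b = Node(value: 2)
a.next = b              # a -> b via ref
b.prev = addr(a[])      # b -> a via ptr: the GC does not follow this
doAssert b.prev.value == 1
doAssert a.next.value == 2
```

The trade-off is manual lifetime management: the ptr must never outlive the node it points to, since the GC knows nothing about it.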
Thanks @dom96!
It kept sliding down my todo list unfortunately.
For reference, here is a comparison of the performance of most of the languages in this benchmark:
https://gist.github.com/dom96/77b32e36b62377b2e7cadf09575b8883