Of course not, there is a lot of allocation work involved. But about 500 ns (a few thousand CPU cycles) is indeed more than I expected. I discovered this when tuning my chess engine -- using one more tiny seq noticeably reduced performance. Seqs are convenient, of course, because of the existing add() operation. A plain array allocated on the stack is faster, and a fixed size is fine because I only have to store the moves or captures for the current position, but I would have to track the current add position myself. Chess search is recursive, so I cannot use one single global seq.

My current idea is to use a global seq as a first buffer for collecting captures, and only swap() that buffer with a newly allocated seq when it turns out that captures are actually available. (I guess swap will not copy the data, only the pointers, so it should be fast.) Or I may use my own container type allocated on the stack; in that case I would have to define my own add() and sort() operations. I don't think I will run into trouble with stack size...
import times, random

type
  MyBuffer = tuple
    d: array[128, int]
    len: int

proc test(): int =
  var s = newSeqOfCap[int](128)
  s.len

proc tx(): int =
  var s: MyBuffer
  s.len

proc t0(): int =
  random(7)

proc main =
  var t: float # cpuTime()
  var i: int
  t = cpuTime()
  i = test()
  i = test()
  i = test()
  i = test()
  i = test()
  i = test()
  i = test()
  i = test()
  i = test()
  i = test()
  echo "10 * test: ", cpuTime() - t
  t = cpuTime()
  i = tx()
  i = tx()
  i = tx()
  i = tx()
  i = tx()
  i = tx()
  i = tx()
  i = tx()
  i = tx()
  i = tx()
  echo "10 * tx: ", cpuTime() - t
  t = cpuTime()
  i = t0()
  i = t0()
  i = t0()
  i = t0()
  i = t0()
  i = t0()
  i = t0()
  i = t0()
  i = t0()
  i = t0()
  echo "10 * t0: ", cpuTime() - t

main()
# nim c -d:release t.nim
# 10 * test: 5.000000000000013e-06
# 10 * tx: 0.0
# 10 * t0: 0.0
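For the stack-allocated container idea mentioned at the top (a fixed array plus my own add() and sort()), a minimal sketch could look like the following. The Move type, its fields and the scoring are placeholders for illustration, not the engine's real types:

type
  Move = tuple[src, dst, score: int]   # placeholder; not the engine's real move type
  MoveBuffer = object
    data: array[128, Move]             # fixed size is enough for one position
    len: int

proc add(b: var MoveBuffer; m: Move) =
  b.data[b.len] = m
  inc b.len

proc sortByScore(b: var MoveBuffer) =
  # simple insertion sort over the used part of the fixed array,
  # highest score first
  for i in 1 ..< b.len:
    let m = b.data[i]
    var j = i
    while j > 0 and b.data[j-1].score < m.score:
      b.data[j] = b.data[j-1]
      dec j
    b.data[j] = m

var buf: MoveBuffer        # in the engine this would be a local inside the recursive search proc
buf.add((src: 12, dst: 28, score: 0))
buf.add((src: 6, dst: 21, score: 5))
buf.sortByScore()
for i in 0 ..< buf.len:
  echo buf.data[i]

An insertion sort over just the used part of the array is enough here, and nothing is heap-allocated.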
This is a problem for me too. My solution is to re-use seqs as often as possible.
I have this (in my Ubuntu14 VM):
10 * test: 2.599999999999998e-05
10 * tx: 2.000000000000049e-06
10 * t0: 2.999999999999965e-06
But in your benchmark, you need to use i so the compiler cannot elide any code. When I accumulate i:
10 * test: 4.799999999999987e-05
10 * tx: 1.999999999999832e-06
10 * t0: 2.000000000000265e-06
i=21
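Re-using a seq usually means clearing it with setLen(0) between uses, which keeps the already-allocated backing storage so no fresh allocation is needed. A minimal sketch (generateCaptures and its body are just illustrative):

# Minimal sketch of re-using one seq: setLen(0) clears the length but keeps
# the capacity, so the seq is only allocated once.
var captures = newSeqOfCap[int](128)    # allocated once

proc generateCaptures(position: int) =
  captures.setLen(0)                    # reset, capacity is kept
  for m in 0 ..< (position mod 8):      # stand-in for real move generation
    captures.add(m)

generateCaptures(13)
echo captures.len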
A better way to test Nim's allocator is to compare it with the system malloc implementation. Of course it's going to be slow when compared with a stack allocation.
Below is my version of the above benchmark. I haven't tested it on Windows - the worst that might happen is that you get a rather large file called 'nul' full of numbers.
import times, random

proc malloc(size: uint): pointer {.header: "<stdlib.h>", importc: "malloc".}
proc free(p: pointer) {.header: "<stdlib.h>", importc: "free".}

const bufferSize = 128

type
  MyBuffer = array[bufferSize, int]

proc testStackAllocation(): int =
  var s: MyBuffer
  result = cast[int](addr s)

proc testSequenceAllocation(): int =
  var s = newSeqOfCap[int](bufferSize)
  result = cast[int](addr s)

proc testMallocAllocation(): int =
  var res = malloc(uint(sizeof(MyBuffer)))
  free(res)
  result = cast[int](res)

proc testRandom(): int =
  result = random(7)

proc main =
  # Use writing to /dev/null to prevent compiler optimizations
  when defined(posix):
    let nullfh = open("/dev/null", fmReadWrite)
  else:
    let nullfh = open("nul", fmReadWrite) # untested!

  var baseline: float

  # Establish a baseline time of a really simple operation + writing to the
  # null device. This way we can essentially measure how fast the null device
  # can be written to, and factor that out of other measurements.
  let z = cpuTime()
  for i in 0..10_000_000:
    nullfh.write(i)
  baseline = cpuTime() - z
  echo "Baseline time: ", baseline

  # Template to run test procedures.
  template runProc(testProc: typed, testName: string): untyped =
    let t = cpuTime()
    for _ in 0..10_000_000:
      var i = testProc()
      nullfh.write(i)
    echo "Time for ", testName, ": ", (cpuTime() - t) - baseline

  # Test:
  # - Stack allocation (which is usually a bump allocator)
  # - Malloc allocation (system defined)
  # - Nim allocation
  # - Random number generation
  runProc(testStackAllocation, "stack allocation test")
  runProc(testSequenceAllocation, "sequence allocation test")
  runProc(testMallocAllocation, "malloc allocation test")
  runProc(testRandom, "random number generation test")

main()
Output using various GC backends (macOS Sierra, 2.5 GHz Intel Core i7):
# nim c -d:release --passC:"-flto" --passL:"-flto" --gc:markAndSweep benchmark.nim && ./benchmark
Baseline time: 1.072926
Time for stack allocation test: 0.1643110000000001
Time for sequence allocation test: 1.142116
Time for malloc allocation test: 0.8133139999999999
Time for random number generation test: -0.1466420000000002
# nim c -d:release --passC:"-flto" --passL:"-flto" benchmark.nim && ./benchmark
Baseline time: 1.053752
Time for stack allocation test: 0.1456770000000001
Time for sequence allocation test: 1.227938
Time for malloc allocation test: 0.8247639999999994
Time for random number generation test: -0.1094160000000002
# Version of the benchmark with mark and sweep cycle collection disabled
# nim c -d:release --passC:"-flto" --passL:"-flto" benchmark.nim && ./benchmark
Baseline time: 1.075432
Time for stack allocation test: 0.146002
Time for sequence allocation test: 1.189319
Time for malloc allocation test: 0.7623239999999996
Time for random number generation test: -0.1765280000000002
As you can see, Nim's allocator is slower than the system malloc allocator, but not by much (I chalk some of this up to the compiler being able to use better inlining and intrinsics for malloc). Neither comes close to stack allocation, but again, that's expected. Turning off cycle detection lowers the amount of time taken up by deallocation.
@Varriount, I added a benchmark with alloca and, using your compiler args, it was consistently slightly faster than stack allocation via an array for those sizes. Maybe because alloca doesn't zero the memory on initialization while a Nim array does?
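The alloca benchmark itself isn't shown; a hedged sketch of how it might be wired up, importing alloca the same way malloc is imported above, could look like this. Note that the memory only lives until the enclosing proc returns, and, unlike a Nim array, it is not zero-initialized:

# Hedged sketch: an alloca-based variant of the tests above.
proc alloca(size: uint): pointer {.header: "<alloca.h>", importc: "alloca".}

const bufferSize = 128

proc testAllocaAllocation(): int =
  # allocated on the caller's stack frame, freed automatically on return
  var res = alloca(uint(bufferSize * sizeof(int)))
  result = cast[int](res)

echo testAllocaAllocation()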
s = newSeqOfCap[int](128)
s = newSeq[int]()
For me, a newSeq[int]() takes about 23 nanoseconds and a newSeqOfCap[int](128) takes about 90 nanoseconds (2.5 GHz Intel Core i7), including the time for garbage collection, averaged over 100,000,000 allocations.
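The measurement loop behind those numbers isn't shown; a sketch of how such an average might be taken (the iteration count and the accumulator are just illustrative) could be:

# Hedged sketch: average the cost of one allocation over many iterations.
import times

proc main =
  const n = 10_000_000                 # fewer iterations than above, for brevity
  var acc = 0
  let t = cpuTime()
  for i in 0 ..< n:
    var s = newSeqOfCap[int](128)      # swap in newSeq[int]() to compare
    acc += s.len                       # use s so the allocation is not elided
  echo "avg ns per allocation: ", (cpuTime() - t) / float(n) * 1e9
  echo acc

main()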
This sounds about right. You have some unavoidable costs doing the allocation proper, and then for newSeqOfCap[int](128) you also have to initialize about 1K of additional memory (128 x 8-byte ints).
If zeroed memory is required semantically, then it's better to compare against calloc, not malloc, in an apples-to-apples sense.
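A calloc-based variant of testMallocAllocation above would be the zeroed-memory counterpart; a hedged sketch:

# Hedged sketch: like testMallocAllocation, but calloc returns
# zero-initialized memory, matching the semantics of a fresh seq or array.
proc calloc(nmemb, size: uint): pointer {.header: "<stdlib.h>", importc: "calloc".}
proc free(p: pointer) {.header: "<stdlib.h>", importc: "free".}

const bufferSize = 128

proc testCallocAllocation(): int =
  var res = calloc(uint(bufferSize), uint(sizeof(int)))
  free(res)
  result = cast[int](res)

echo testCallocAllocation()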
Also, at least on Linux, you need to be careful in benchmarking this sort of thing. As you get to larger allocations the OS can actually give the process many linked copies of a copy-on-write 4096-byte page of all zero bytes. Because of the C-O-W, the initial allocation alone can seem much cheaper than it really is; the work is deferred and done lazily as the memory is written to.
This zero page business can also cause confusion with calloc-never-write read-only workloads in benchmarks (say calloc and then sum all the ints) because on x86 the cache is physically mapped and that single zero page can be fully L1 resident. Such L1 residence can give a "false" speed advantage compared to more aged/populated memory. [ This latter point is not relevant here, but seemed worth pointing out anyway -- it was how I personally discovered Linux was finally doing this optimization. ]
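A minimal sketch of the calloc-never-write pattern being described (the buffer size is just an assumption for illustration):

# Hedged sketch of the calloc-then-sum pitfall: summing a large, freshly
# calloc'd buffer that is never written can look unrealistically fast,
# because the kernel may back it with copy-on-write references to a single
# shared zero page that stays cache-resident.
import times

proc calloc(nmemb, size: uint): pointer {.header: "<stdlib.h>", importc: "calloc".}
proc free(p: pointer) {.header: "<stdlib.h>", importc: "free".}

proc main =
  const n = 64 * 1024 * 1024                       # 512 MB of ints, never written
  var a = cast[ptr UncheckedArray[int]](calloc(uint(n), uint(sizeof(int))))
  let t = cpuTime()
  var sum = 0
  for i in 0 ..< n:
    sum += a[i]                                    # read-only pass over the buffer
  echo "sum: ", sum, "  time: ", cpuTime() - t
  free(a)

main()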