Hi, I am experiencing that the sample function is more than two times slower than manually indexing into rand(size-1):
import sequtils
import random
import times
proc main() =
const
size = 10_000_000
times = 100_000_000
let
xs = newSeqWith(size, rand(0..10))
var start = cpuTime()
for i in 0..<times:
let x = xs.sample
echo "time taken sample: ", cpuTime() - start
start = cpuTime()
for i in 0..<times:
let x = xs[rand(size - 1)]
echo "time taken rand: ", cpuTime() - start
main()
Compiling with -d=danger on my PC:
time taken sample: 4.956046000000001
time taken rand: 1.924525
Looking at the implementation I can't see how that would make sense, other than it being generic. Any ideas on why or if this is expected?
I can reproduce the slowness by approximately the same factor.
Though when building with lto enabled (--passC:"-flto") sample is for some reason even a bit faster.
I am getting even larger difference between sample and rand(size-1) with flto passed to C
time taken sample: 1.021981
time taken rand: 0.098158
nim c -r -d=danger --passC:-flto --passC:-ffast-math hue.nim
Nim Compiler Version 1.4.4 [MacOSX: amd64]
In random.nim, sample is function which consists of
result = a[r.rand(a.low..a.high)]
So, I think it is the function call overhead which makes the difference, especially since there is no inline pragma. it's that size is const
for i in 0..<times:
sum +=xs[rand(xs.len - 1)]
is slow as well.Note that if you look at the implementation for proc rand*(r: var Rand; max: Natural), it is possible to do this more quickly in a loop if max does not change in said loop. The mod is constant over the loop and randMax mod Ui(max) can be computed just once instead of every function call.
There is still a final range reduction mod. I am pretty sure that can be turned into a floating point multiply (probably making the result even faster than the gcc-const-optimized variant). Of course, dirtying the FP registers/state can make all your context switches slower, but that probably will not matter in numerics heavy workloads.
This may get someone started on a PR or at least get @shirleyquirk a-benchmarkin':
import random # Lemire2018 https://r-libre.teluq.ca/1437/
proc mul(a, b: uint64; hi, lo: var uint64) =
# Store hi & low parts of full product a*b;
# Note VM & JS branches need doing
when defined(gcc)or defined(llvm_gcc)or defined(clang):
{.emit: """__uint128_t r = `a`; r *= `b`;
*`hi` = r >> 64; *`lo` = r;""".}
elif defined(windows) and not defined(tcc):
proc umul128(a, b: uint64, c: ptr uint64): uint64
{.importc: "_umul128", header: "intrin.h".}
lo = umul128(a, b, addr hi)
# else: some fallback impl using narrower arithmetic
proc randExclusive*(r: var Rand; s: uint64): uint64 =
## Sample on range [0, s) given [0, 2^64) rng.
var x = r.next
var lo, hi: uint64
mul(x, s, hi, lo)
if lo < s:
let t = (not s) mod s # not s == -s in 2's compl
while lo < t:
x = r.next
mul(x, s, hi, lo)
hi
proc rand*(r: var Rand; max: Natural): int =
## Much faster than same in `random` module.
int(r.randExclusive uint64(max + 1))
when isMainModule:
var sum = 0
for trial in 1 .. 1_000_000_000:
sum += randState.rand(10)
echo sum
The paper says the Go stdlib uses this method, FWIW.I've just added under when isMainModule
var randState:Rand
Then your code goes into an endless loop (even for for trial in 1..1) Also, I am pretty sure that, besides being slow, proc rand(r: var Rand; max: Natural) is also buggy/biased. Git VC history says it used to be correct, but then someone hastily re-did some logic to make it range-inclusive. Briefly,
if x <= randMax - (randMax mod Ui(max)):
return int(x mod (uint64(max) + 1u64))
should be
if x < randMax - (randMax mod (Ui(max) + Ui(1))):
return int(x mod (Ui(max) + Ui(1)))
This does have follow-on impact for many derived algorithms, but it would be better still to just replace it with my faster version above, filled out for JavaScript BigInts & 32-bit fallbacks a la hashes.nim:proc hash(int).Nice shirleyquirk!
I guess I should just use arrays when performance really matters and I know the size compile time. Would you do it like this?
import sequtils
import random
import times
const
size = 10_000_000
iterations = 100_000_000
type
MyArray = array[size, int]
proc main() =
var xs: MyArray
for i in 0..<size:
xs[i] = rand(0..10)
var start = cpuTime()
for i in 0..<iterations:
let x = xs.sample
echo "time taken sample: ", cpuTime() - start
start = cpuTime()
for i in 0..<iterations:
let x = xs[rand(size - 1)]
echo "time taken rand: ", cpuTime() - start
main()