nimforum mirror - array sample slower than indexing into rand(size-1)

peheje (orginal) [2021-03-21T16:22:27+01:00] view original

Hi, I am experiencing that the sample function is more than two times slower than manually indexing into rand(size-1):

import sequtils
import random
import times


proc main() =
    const
        size = 10_000_000
        times = 100_000_000
    let
        xs = newSeqWith(size, rand(0..10))
    
    var start = cpuTime()
    for i in 0..<times:
        let x = xs.sample
    echo "time taken sample: ", cpuTime() - start
    
    start = cpuTime()
    for i in 0..<times:
        let x = xs[rand(size - 1)]
    echo "time taken rand: ", cpuTime() - start


main()

Compiling with -d=danger on my PC:

time taken sample: 4.956046000000001

time taken rand: 1.924525

Looking at the implementation I can't see how that would make sense, other than it being generic. Any ideas on why or if this is expected?

doofenstein (orginal) [2021-03-21T16:28:10+01:00] view original

I can reproduce the slowness by approximately the same factor.

Though when building with lto enabled (--passC:"-flto") sample is for some reason even a bit faster.

peheje (orginal) [2021-03-21T16:54:59+01:00] view original

I am getting even larger difference between sample and rand(size-1) with flto passed to C

time taken sample: 1.021981

time taken rand: 0.098158

nim c -r -d=danger --passC:-flto --passC:-ffast-math hue.nim

Nim Compiler Version 1.4.4 [MacOSX: amd64]

doofenstein (orginal) [2021-03-21T17:05:16+01:00] view original

opps right I have similar results, for me it's 0.7 and 0.07 I misread the number by one magnitude haha

HJarausch (orginal) [2021-03-21T18:14:48+01:00] view original

In random.nim, sample is function which consists of

result = a[r.rand(a.low..a.high)]

So, I think it is the function call overhead which makes the difference, especially since there is no inline pragma.

shirleyquirk (orginal) [2021-03-21T21:22:55+01:00] view original

it's that size is const

for i in 0..<times:
  sum +=xs[rand(xs.len - 1)]

is slow as well.

cblake (orginal) [2021-03-22T12:33:00+01:00] view original

Note that if you look at the implementation for proc rand*(r: var Rand; max: Natural), it is possible to do this more quickly in a loop if max does not change in said loop. The mod is constant over the loop and randMax mod Ui(max) can be computed just once instead of every function call.

There is still a final range reduction mod. I am pretty sure that can be turned into a floating point multiply (probably making the result even faster than the gcc-const-optimized variant). Of course, dirtying the FP registers/state can make all your context switches slower, but that probably will not matter in numerics heavy workloads.

cblake (orginal) [2021-03-22T15:27:17+01:00] view original

Also, for what it's worth, I think proc rand*(r: var Rand; max Natural) could be profitably upgraded/sped up with the techniques of Lemire 2018. But you can also just use random.next() and do that on your own. { It would also be unsurprising if @mratsim had that implemented deep in the corners of one of his large libraries... :-) }

cblake (orginal) [2021-03-22T17:19:32+01:00] view original

This may get someone started on a PR or at least get @shirleyquirk a-benchmarkin':

import random # Lemire2018 https://r-libre.teluq.ca/1437/

proc mul(a, b: uint64; hi, lo: var uint64) =
  # Store hi & low parts of full product a*b;
  # Note VM & JS branches need doing
  when defined(gcc)or defined(llvm_gcc)or defined(clang):
    {.emit: """__uint128_t r = `a`; r *= `b`;
               *`hi` = r >> 64; *`lo` = r;""".}
  elif defined(windows) and not defined(tcc):
    proc umul128(a, b: uint64, c: ptr uint64): uint64
      {.importc: "_umul128", header: "intrin.h".}
    lo = umul128(a, b, addr hi)
# else: some fallback impl using narrower arithmetic

proc randExclusive*(r: var Rand; s: uint64): uint64 =
  ## Sample on range [0, s) given [0, 2^64) rng.
  var x = r.next
  var lo, hi: uint64
  mul(x, s, hi, lo)
  if lo < s:
    let t = (not s) mod s # not s == -s in 2's compl
    while lo < t:
      x = r.next
      mul(x, s, hi, lo)
  hi

proc rand*(r: var Rand; max: Natural): int =
  ## Much faster than same in `random` module.
  int(r.randExclusive uint64(max + 1))

when isMainModule:
  var sum = 0
  for trial in 1 .. 1_000_000_000:
    sum += randState.rand(10)
  echo sum

The paper says the Go stdlib uses this method, FWIW.

HJarausch (orginal) [2021-03-22T18:02:43+01:00] view original

I've just added under when isMainModule

var randState:Rand

Then your code goes into an endless loop (even for for trial in 1..1)

cblake (orginal) [2021-03-22T19:11:38+01:00] view original

Your state cannot start as all zero which is the default without init. See usage of the random module.

cblake (orginal) [2021-03-23T14:48:14+01:00] view original

Also, I am pretty sure that, besides being slow, proc rand(r: var Rand; max: Natural) is also buggy/biased. Git VC history says it used to be correct, but then someone hastily re-did some logic to make it range-inclusive. Briefly,

if x <= randMax - (randMax mod Ui(max)):
  return int(x mod (uint64(max) + 1u64))

should be

if x < randMax - (randMax mod (Ui(max) + Ui(1))):
  return int(x mod (Ui(max) + Ui(1)))

This does have follow-on impact for many derived algorithms, but it would be better still to just replace it with my faster version above, filled out for JavaScript BigInts & 32-bit fallbacks a la hashes.nim:proc hash(int).

peheje (orginal) [2021-03-23T19:00:48+01:00] view original

Nice shirleyquirk!

I guess I should just use arrays when performance really matters and I know the size compile time. Would you do it like this?

import sequtils
import random
import times

const
    size = 10_000_000
    iterations = 100_000_000

type
    MyArray = array[size, int]

proc main() =
    
    var xs: MyArray
    
    for i in 0..<size:
        xs[i] = rand(0..10)
    
    var start = cpuTime()
    for i in 0..<iterations:
        let x = xs.sample
    echo "time taken sample: ", cpuTime() - start
    
    start = cpuTime()
    for i in 0..<iterations:
        let x = xs[rand(size - 1)]
    echo "time taken rand: ", cpuTime() - start


main()

Mirror of forum.nim-lang.org

7673 :: array sample slower than indexing into rand(size-1)