nimforum mirror - collecting keys and randomly picking values from tables

archnim (orginal) [2022-09-16T19:29:12+02:00] view original

Hello world !

What is the most direct way to assemble all the keys of a table in a seq ? I mean, is there a proc in the std lib that does it ?

Concerning the proc sample in the random module, it doesn't apply to a table. Is there a proc elsewhere that does ?

Yardanico (orginal) [2022-09-16T20:10:11+02:00] view original

There's no stdlib proc that does this "directly", but you can of course always compose different things together :)

One simple way to do what you want is:

import std/[tables, sequtils, random]

# initialize the random seed so we get different results each time
randomize()

let myTable = {"a": "hello", "b": "world", "c": "forum"}.toTable()

# toSeq takes an iterator and saves its results into a seq
let keys = toSeq(myTable.keys)

# get random value
echo myTable[sample(keys)]

tcheran (orginal) [2022-09-16T20:28:48+02:00] view original

what about collect (you need to import sugar) and iteration?

import std/tables, sugar

var a = {
  'y': 1,
  'm': 2,
  'c': 3,
  'a': 4,
  }.toOrderedTable

let keyseq = collect(newSeq):
    for key in a.keys: key

assert keyseq == @['y','m','c','a']

Yardanico (orginal) [2022-09-16T20:31:02+02:00] view original

Yeah this works too, it's the same as using toSeq, just a more "explicit" way of doing it.

tcheran (orginal) [2022-09-16T20:53:34+02:00] view original

actually I like

let keys = toSeq(myTable.keys)

even more, I was writing my reply before seeing your solution

cblake (orginal) [2022-09-16T22:08:33+02:00] view original

You can also sample many values with Yardanico's 2nd idea with no unneeded string allocations if you build a sets.HashSet[int] or intsets.IntSet or even set if Table.len < 65536. Call it nums. All allow a fast-ish i in nums test.

That is for sampling without replacement. For sampling with replacement you probably want to make nums into a tables.CountTable and repeat myTable[key] however many duplicates the count of random indices implies.

Depending upon how you are using the samples, you might also structure things as an iterator just yield -ing sample values instead of creating a seq.

alexeypetrushin (orginal) [2022-09-17T13:23:51+02:00] view original

You can do it like this, nice and clean.

 Nim
let keys = mytable.keys
let sample = mytable.values.shuffle.take(10)

With some helpers.

 Nim
proc keys*[K, V](table: Table[K, V] | ref Table[K, V]): seq[K] =
  for k in table.keys: result.add k

proc values*[K, V](table: Table[K, V] | ref Table[K, V]): seq[V] =
  for v in table.values: result.add v

proc take*[T](s: openarray[T], n: int): seq[T] =
  if n < s.len: s[0..(n - 1)] else: s.to_seq

func shuffle*[T](list: openarray[T], seed = 1): seq[T] =
  var rand = random.init_rand(seed)
  result = list.to_seq
  random.shuffle(rand, result)

func shuffle*[T](list: openarray[T], rand: random.Rand): seq[T] =
  result = list.to_seq
  random.shuffle(rand, result)

Of course there's overhead doing it like that, but I don't care about performance (except in 1% codebase, the bottleneck code).

Zoom (orginal) [2022-09-17T15:49:22+02:00] view original

I think this is neither clearer nor nicer than let keys = toSeq(myTable.keys); let smpl = myTable[sample(keys)]. At the very least it wastes memory and allocations (seq->seq->seq) in the process. Instead of bunch of helper procs which may never be reused I'd focus on solving the issue as it's stated.

import std/[random, tables, intsets, math]

let myTable = {"a": "hello", "b": "world", "c": "forum", "d": "bye", "e": "cu"}.toTable()

proc sampleValues[K, V](t: Table[K, V]; size: Natural = 1): seq[V] =
  ## Returns a random seq with a set of values of table `t`
  
  var i = size
  assert i <= t.len() # replace i with `min(len, size)` to force never failing
  let high = t.len()-1
  let lenSqrt = floor(sqrt(float(t.len()))).Natural()
  var sampleIds = initIntSet()
  
  if size <= lenSqrt:
    # Dumb rejection sampling. Perhaps don't use this for crypto
    while i > 0:
      let sample = rand(high)
      if not sampleIds.containsOrIncl(sample):
        i.dec(1)
  else:
    var sampleIdSeq = newSeqOfCap[int](i)
    for id in 0..high:
      sampleIdSeq.add(id)
    shuffle(sampleIdSeq)
    sampleIdSeq.setLen(i)
    sampleIds = toIntSet(sampleIdSeq)
    i = 0
  
  for v in values(t): # i == 0
    if i in sampleIds:
      result.add(v)
    i.inc()
  #debugEcho sampleIds


randomize()
echo myTable.sampleValues(1)
echo myTable.sampleValues(5)

alexeypetrushin (orginal) [2022-09-17T16:17:36+02:00] view original

I think this is neither clearer nor nicer than let keys = toSeq(myTable.keys)

That's an old story about red/blue functions and it's already being discussed here.

At the very least it wastes memory and allocations

Of course, and I said so. I did that consciously, to sacrifice performance in non-critical paths and gain in reducing cost of development and ease maintenance.

Instead of bunch of helper procs which may never be reused

True, but not in my case. As I use my own nim std, which is designed in that way (basically a clone of Ruby STD library). It was surprisingly easy to implement and has very little code, I spent maybe a two or tree days to write it. So, in my case, I constantly do reuse these (and others) helpers.

Zoom (orginal) [2022-09-17T17:58:14+02:00] view original

Well, if developers are consciously sacrificing anything to reduce cost of development and casually mentioning their personal stdlibs when answering forum posts then we're really doomed. :D

alexeypetrushin (orginal) [2022-09-17T20:05:00+02:00] view original

Ha ha, good point, you got me :)

cblake (orginal) [2022-09-17T21:11:50+02:00] view original

I make no claim to this being the ultimate best solution, but the direction of this thread seems more concrete than my earlier vague comments. A simple seq will actually do with some wasted space (probably a favorable space-time trade-off.) So, presented purely for your consideration/to make those more concrete/expand The Menu:

import std/[random, tables, sugar]

iterator samples[K, V](t: Table[K, V]; size=1): (K, V) =
  ## Yields a random sample (with replacement) of pairs
  ## in table `t`.  (This assumes keys are unique, not
  ## using `.add` multi-map functionality.)
  var reps = newSeq[int](t.len) # Fill w/size samples
  for i in 1..size:             #..of range 0..<t.len.
    reps[rand(t.len - 1)].inc
  var i = 0                    # Yield samples w/correct
  for (k, v) in pairs(t):      #..repeated value count.
    for j in 1..reps[i]:
      yield (k, v)  # Repeats do always appear together
    inc i           #..BUT order is also Table-order --
                    #..only as random as hash is good.
randomize()
let myTable = {"a": "hello", "b": "world", "c": "forum",
               "d": "bye", "e": "cu"}.toTable()
for (k, v) in myTable.samples(4):
  echo v

let someKeys = collect(newSeq):
  for (k, v) in myTable.samples(4): k
echo "someKeys: ", someKeys

let someVals = collect(newSeq):
  for (k, v) in myTable.samples(4): v
echo "someVals: ", someVals

If one were doing this a lot, then one could have the iterator take a var seq and zero it each time to realize zero allocations.

In terms of what space is allocated, that can be reduced by using a newSeq[uint8]. The width of that counter can be based on the requested sample size since you cannot have a value > size show up in a reps slot. Even with newSeq[int], it should already be a small fraction of the space used to store a Table[string,string]. ( @archnim never even said what types he had in his Table though or if sampling was with/without replacement. )

As a terminology point that might help someone someday, I try to use the terms "random sample" for a "with replacement style" and "random subset" for the "without replacement" style. The lack of care in the wider world will probably always leave a need for clarification, though.

Mirror of forum.nim-lang.org

9473 :: collecting keys and randomly picking values from tables