I have the following procs, which I think could be added to algorithm.nim and/or other libraries.
What do you think?
I, for one, don't like that there is no pluggable "random".
I am also not sure whether these iterators already exist somewhere.
import os
import math
#import random # uncomment to use BlaXpirit's random module

{.push hint[XDeclaredButNotUsed]:off.}

# create an Iterator over lines in a file
proc genIter*(filename: string): iterator (): string =
  result = iterator (): string =
    for line in filename.lines:
      yield line

# create an Iterator over lines in a sequence of files
proc genIter*(filenames: seq[string]): iterator (): string =
  result = iterator (): string =
    for filename in filenames:
      for line in filename.lines:
        yield line

{.pop.}

# Reservoir sampling
# This picks 'n' random, non-repeating elements from
# a collection with an unknown number of elements
proc reservoirSamples*[T](iter: iterator():T, n: Natural): seq[T] =
  result = newSeq[T](n)
  var idx = 0
  for e in iter():
    if idx < n:
      result[idx] = e
    else:
      # r must be uniform in 0..idx; both calls are assumed to exclude their upper bound
      when compiles(randomInt):
        let r = randomInt(idx + 1)
      else:
        let r = random(idx + 1)
      if r < n:
        result[r] = e
    inc idx
  # fewer than n elements were seen: drop the unused slots
  if idx < n:
    result.setLen(idx)

when isMainModule:
  when not compiles(randomInt):
    randomize()
  let wordlist_1 = "/Users/nothearaq/Documents/wordlist.txt"
  let wordlist_2 = "/usr/share/dict/words"
  let wordlist_3 = "/Volumes/SSD/10-million-combos.txt" # that's about enough for tests :)
  let wordlists = @[wordlist_1, wordlist_2, wordlist_3]
  # select 10 random lines from about 10.5 million lines in three files
  for x in reservoirSamples(genIter(wordlists), 10):
    echo x
  discard wordlists
@BlaXpirit That is why I said I want some kind of pluggable random for it. As long as your "random" is not part of the standard library, it cannot be used in any of the stdlib's modules.
EDIT: I updated the code to use your module as an alternative, "just" by importing it. But I don't think this would get accepted as a PR for algorithm.nim.
I'm also doubtful that sampling algorithms and file iteration should be part of the same module.
@Jehan
Yeah. But the algorithm is about unknown-length data, whereas everything I see in algorithm.nim uses openarray, which would require loading my example files into RAM.
From that point of view they actually belong together more than not. Iterators are the only way to construct unknown-length data without allocating RAM for all elements. Hence my question about existing iterators like that.
And I 100% agree that there should be some RNG interface (soon)!
OderWat: Iterators are the only way to construct unknown-length data without allocating RAM for all elements.
Er, no. What about streams?
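To make the streams point concrete, here is a minimal sketch (not taken from anyone's code in this thread) that walks a stream line by line while holding only the current line in memory. It assumes the `std/streams` module of a recent Nim; `countLines` is a hypothetical helper name, and a `StringStream` stands in for a real file stream.

```nim
import std/streams

# Hypothetical helper: counts lines while keeping only one line in memory.
proc countLines(s: Stream): int =
  var line = ""
  # readLine refills `line` and returns false once the stream is exhausted
  while s.readLine(line):
    inc result

when isMainModule:
  # A StringStream stands in for a file; a file stream is consumed the same way.
  let s = newStringStream("alpha\nbeta\ngamma")
  echo countLines(s)
  s.close()
```

The same loop shape could feed a sampler directly, without wrapping the source in an iterator first.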
OderWat: Yeah. But the algorithm is about unknown-length data, whereas everything I see in algorithm.nim uses openarray, which would require loading my example files into RAM.
I don't dispute that. My question is whether it belongs in the stdlib or in a separate library.
OderWat: From that point of view they actually belong together more than not. Iterators are the only way to construct unknown-length data without allocating RAM for all elements.
Well, the first iterator is basically a closure version of system.lines. This seems to belong in some sort of I/O module, not in a collection of RNG-related algorithms. Remember that reservoir sampling can also be used, e.g., with algorithmically generated data. And genIter is probably too generic a name for the functionality of iterating over the lines in a file or list of files, too.
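To illustrate the point about algorithmically generated data, here is a rough sketch of reservoir sampling fed by a generator instead of a file. It assumes a recent Nim where `std/random` provides `rand(max)` with an inclusive upper bound; `reservoirSample` and `squares` are names made up for this example, not existing stdlib procs.

```nim
import std/random

# Sketch of Algorithm R over any closure iterator.
proc reservoirSample[T](iter: iterator(): T, n: Natural): seq[T] =
  result = newSeqOfCap[T](n)
  var seen = 0
  for e in iter():
    if seen < n:
      result.add e            # fill the reservoir first
    else:
      let r = rand(seen)      # uniform in 0..seen (inclusive)
      if r < n:
        result[r] = e         # keep e with probability n / (seen + 1)
    inc seen

# Hypothetical generator: squares of 1..limit, computed on demand.
proc squares(limit: int): iterator(): int =
  result = iterator(): int =
    for i in 1..limit:
      yield i * i

when isMainModule:
  randomize()
  echo reservoirSample(squares(1000), 5)
```

Nothing here touches the file system, which is one argument for keeping the sampler separate from the line iterators.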
As an aside, there should probably be a template/macro to turn an inline iterator into a closure version, something like:
template closureIterator[T](iter: iterator(): T): auto {.immediate.} =
  (iterator(): type(iter()) =
    for item in iter: yield item)
(If there isn't something like that already and I'm missing it.)
This way you could just write reservoirSamples(closureIterator(file.lines)).
@Araq Right. I forgot about streams, as they didn't occur to me as a use case. Actually, I am not clear about the difference between a stream of objects and an iterator.
@Jehan That template is cool but was not obvious to me. Something like that probably should be in the manual (+ stdlib?)
The template works great, Jehan! Do want it in the standard library.
OderWat, how about we just add this reservoirSamples to my random library?