How could I echo the first 10 values from the wordFequencies.sort() hereunder?
import strutils
import tables

proc main() =
  var wordFequencies = initCountTable[string]()
  # Populating dict from file:
  for line in lines "100-0.txt":
    for word in line.split(" "):
      wordFequencies.inc(word)
  wordFequencies.sort()

when isMainModule:
  main()
If you only want a "top 10"-style extraction, you can do less work with a HeapQueue[tuple[count: int; key: K]], as in https://github.com/c-blake/cligen/blob/master/examples/newest.nim (see also the heapqueue module).
Using that, you just iterate over (key, count) in the CountTable[K] and retain the top 10 (or M) in a heap that you can pop out in the correct order (or possibly its reverse). That is O(N * log(M)), where N = CountTable.len = the number of distinct keys.
Doing it the sort-then-filter way, as you were requesting, costs the whole O(N * log(N)) sort. So if log(N) is much greater than log(M) it will be much slower; e.g., with N = 30,000 distinct words and M = 10, log(N)/log(M) is about 4.5.
Basically, rolling your own filtering sort is not hard and is faster.
Here is the worked-out logic for your program above { with your spelling error "wordFequencies" fixed ;-) }:
import strutils, tables, heapqueue

var wordFrequencies = initCountTable[string]()
for line in lines "100-0.txt":
  for word in line.split(" "):
    wordFrequencies.inc(word)

# min-heap with q[0]=min
var q = initHeapQueue[tuple[cnt: int; word: string]]()
for word, cnt in wordFrequencies:
  if q.len < 10:          # 10 could/should be abstracted
    q.push((cnt, word))
  elif cnt > q[0].cnt:    # retain 1st seen on a tied count
    discard q.replace((cnt, word))
while q.len > 0:          # q now has the top 10 items
  echo q.pop              # print in ascending order
That second "bundle" of code could just become some
iterator topN[T](ct: CountTable[T], N=10) = ...
with the echo turned into a yield. Then you could instead just say:
for entry in wordFrequencies.topN(10):
  echo entry
Oh, and though you did not say/ask: since this kind of thing is often a little "test drive benchmark", I should note that besides compiling with -d:danger, you would be better off using the split iterator and another loop level/nest.
Beyond that, with string keys as you have here, it is likely faster to use a Table with histo.mgetOrPut(word, 0).inc, since Table saves hash codes: on a successful lookup/update it compares hash codes before doing the final string comparison, and on novel keys it does no string comparison at all.
Depending upon the shape of your distribution, Table could be dramatically faster, but those saved hash codes use a little space, too. So if the shape is such that the CountTable can stay in L1/L2 cache but the regular Table cannot, it might not win (or win by as much), for example.
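For concreteness, a minimal sketch of that counting loop with a plain Table (same input file as your program; the top-10 extraction above is unchanged):

import strutils, tables

var histo = initTable[string, int]()
for line in lines "100-0.txt":
  for word in line.split(" "):
    # mgetOrPut returns a mutable reference to the value for `word`,
    # inserting 0 first if the key is new; .inc then bumps it in place.
    histo.mgetOrPut(word, 0).inc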
There are other things you can do, like using cligen/mslice stuff, but the above ideas are probably the lowest hanging fruit.
With zero_functional, one would add
wordFrequencies.pairs() --> take(10).foreach(echo it[0] & " " & $it[1])
after the sort.

Ok. I realize I also never told @Serge, who never replied, about needing to do wordFrequencies.sort. So, as penance, I did this series of six successively faster versions. First is @gemath's, then the no-deps variant, then one using the heapqueue I mentioned, then Table[string, int] instead of CountTable[string], then (for a big 2x time drop) one using cligen mfile/mslice, and then one pre-sizing the table with a good guess for average bytes per unique word. First comes the code and then the timings on an i7-6700k@4.7GHz, with hopefully easy-to-reproduce input data. To keep things simple, I only include the iterator definition once.
# whist00
import strutils, tables, zero_functional

proc main() =
  var histo = initCountTable[string]()
  for line in lines "/dev/stdin":
    for word in line.split(" "):
      if word.len > 0:
        histo.inc(word)
  histo.sort
  histo.pairs() --> take(10).foreach(echo it[0] & " " & $it[1])

main()
# whist0i; same as whist00 but the post-build step is iterative, not functional
  histo.sort            # maybe the 'real' answer to @Serge's question
  var i = 10
  for word, cnt in histo:
    echo cnt, " ", word
    i.dec
    if i == 0: break
# whist1; now introduce `iterator topN`
import strutils, tables, heapqueue

iterator topN[T](h: CountTable[T]|Table[T, int], n=10):
    tuple[cnt: int; key: T] =
  var q = initHeapQueue[tuple[cnt: int; key: T]]()
  for key, cnt in h:
    if q.len < n:
      q.push((cnt, key))
    elif cnt > q[0].cnt:      # retain 1st seen on tied cnt
      discard q.replace((cnt, key))
  while q.len > 0:            # q now has top n entries
    yield q.pop               # yield in ascending order

proc main() =
  var histo = initCountTable[string]()
  for line in lines "/dev/stdin":
    for word in line.split(" "):
      if word.len > 0: histo.inc(word)
  for tup in histo.topN: echo tup[0], " ", tup[1]

main()
# whist2: move from `CountTable[string]` to `Table[string, int]`
proc main() =
  var histo = initTable[string, int]()
  for line in lines "/dev/stdin":
    for word in line.split(" "):
      if word.len > 0: histo.mgetOrPut(word, 0).inc
  for tup in histo.topN: echo tup[0], " ", tup[1]
# whist3; move to memory-mapped IO and from `string` to slices
import cligen/[mfile, mslice]

proc main() =
  let mf = mopen("/dev/stdin")
  var histo = initTable[MSlice, int]()
  for line in mf.mSlices:
    for word in line.mSlices(' '):
      if word.len > 0: histo.mgetOrPut(word, 0).inc
  for tup in histo.topN: echo tup[0], " ", tup[1]
# whist4: presize the table; everything is the same except this line
  var histo = initTable[MSlice, int](rightSize(mf.len div 45))
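The div 45 there is just a guessed average of input bytes per distinct word, and rightSize (from the tables module of that era) converts an expected element count into the initial capacity the table wants, so that it (ideally) never grows/rehashes while counting. A minimal sketch of the arithmetic, with a hypothetical 1 MiB input:

import tables

# Hypothetical numbers: ~45 input bytes per distinct word is only a
# guess; tune it to your own data.
let bytes = 1_048_576          # pretend the mmap'd file is 1 MiB
echo rightSize(bytes div 45)   # capacity request passed to initTable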
Then we run these six versions against a text file consisting of the last 50,000 lines of the alphabetically, recursively globbed source in the Nim git repository. There are Zsh-isms here (setopt extendedglob numericglobsort), and I use a little utime program I have that times the execution of its argument like /usr/bin/time but to nanoseconds, plus a little program that emits Parzen-interpolated quantiles, where quantile 0 is the sample minimum.
nim-repo..908bbb250f$ cat **.nim|tail -n50000 > /tmp/j
$ buncha PGO compiles with my nim-pgo script; panics:on -d:useMalloc --gc:arc, Linux
$ for p in whist[0-4]*(x.); echo $p $((repeat 100 utime ./$p < j >/dev/null)|& quantiles 0 .05 .1)
whist00  0.02391167  0.02400840  0.02403021
whist0i  0.02335627  0.02340394  0.02345088
whist1   0.02189838  0.02192395  0.02201866
whist2   0.01853665  0.01885521  0.01904099
whist3   0.00922274  0.00943208  0.00952293
whist4   0.00760648  0.00772719  0.00777402
My /tmp is a /dev/shm filesystem, so there is no device IO here; the columns are the 0 (i.e. minimum), .05, and .1 quantiles in seconds. To see ratios relative to the first row as we optimize, we can pipe through a little awk:
$ awk '{print $1,$2/0.02391167,$3/0.02400840,$4/0.02403021}'
whist00 1 1 1
whist0i 0.976773 0.974823 0.975892
whist1 0.915803 0.913178 0.916291
whist2 0.775214 0.785359 0.792377
whist3 0.3857 0.392866 0.39629
whist4 0.318107 0.321854 0.32351
Not bad. We eventually do over 3x better than the initial version, going from 24 ms to 7.6 ms. As @gemath alludes to, the hit for zero_functional is only about 2.5%. The topN iterator surprised me with how little it saved (at that slow stage), only another 6%. Then regular Table is 1.2x faster than CountTable, MSlice with memory-mapped IO is about 2.0x faster than using string, and presizing gets you another 1.2x speed-up.
As a cross check on my input data, this is the output of just whist4:
1201 of
1296 echo
1317 """
1389 let
1934 doAssert
2180 proc
2543 #
2574 var
3441 ==
11863 =
The outputs are the same across all programs, except that whist0* prints those 10 lines in reverse order. It is obviously easy and fast to use algorithm.reverse to flip them if desired.
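For instance, a minimal sketch of that flip (reusing histo and the topN iterator from whist1, where the key type is string):

import algorithm

var top: seq[tuple[cnt: int; key: string]]
for tup in histo.topN:
  top.add tup                 # topN yields in ascending count order
top.reverse                   # in-place flip from std algorithm
for tup in top:
  echo tup.cnt, " ", tup.key  # now largest count first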