How could I echo the first 10 values from the wordFequencies.sort() hereunder?
import strutils
import tables

proc main() =
  var wordFequencies = initCountTable[string]()
  # Populating dict from file:
  for line in lines "100-0.txt":
    for word in line.split(" "):
      wordFequencies.inc(word)
  wordFequencies.sort()

when isMainModule:
  main()
If you only want a "top 10"-style extraction, you can do less work with a HeapQueue[tuple[count: int; key: K]], as in https://github.com/c-blake/cligen/blob/master/examples/newest.nim (see also the heapqueue module).
Using that, you just iterate over (key, count) in the CountTable[K] and retain the top 10 (or M) in a heap that you can pop out in the correct order (or possibly its reverse). That is O(N * log(M)), where N = CountTable.len = the number of distinct keys.
Doing it the sort-then-filter way, as you were requesting, costs the whole O(N * log(N)) sort. So if log(N) is much greater than log(M) it will be much slower; e.g., with N = 30,000 distinct words and M = 10, log(N)/log(M) is about 4.5.
Basically, rolling your own filtering sort is not hard and is faster.
Here is the worked-out logic for your program above { with your spelling error "wordFequencies" fixed ;-) }:
import strutils, tables, heapqueue

var wordFrequencies = initCountTable[string]()
for line in lines "100-0.txt":
  for word in line.split(" "):
    wordFrequencies.inc(word)

# min-heap with q[0]=min
var q = initHeapQueue[tuple[cnt: int; word: string]]()
for word, cnt in wordFrequencies:
  if q.len < 10:          # 10 could/should be abstracted
    q.push((cnt, word))
  elif cnt > q[0].cnt:    # retain 1st seen on a tied count
    discard q.replace((cnt, word))
while q.len > 0:          # q now has the top 10 items
  echo q.pop              # print in ascending order
That second "bundle" of code could just become some
iterator topN[T](ct: CountTable[T], N=10) = ...
with the echo turned into a yield. Then you could instead just say:
for entry in wordFrequencies.topN(10):
  echo entry
Oh, and though you did not say/ask: since this kind of thing is often a little "test drive benchmark", I should note that besides compiling with -d:danger, you would be better off using the split iterator and another loop level/nest.
Beyond that, with string keys as you have here, it is likely faster to use a Table with histo.mgetOrPut(word, 0).inc, since Table saves hash codes: on a successful lookup/update it compares hash codes before doing the final string comparison, and on novel keys it does no string comparison at all.
Depending upon the shape of your distribution, Table could be dramatically faster, but those saved hash codes use a little space, too. So if the shape is such that the CountTable can stay in L1/L2 cache but the regular Table cannot, it might not win (or win by as much), for example.
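For concreteness, a minimal sketch of that counting loop with a plain Table (same input file as your program; the top-10 extraction above is unchanged):

import strutils, tables

var histo = initTable[string, int]()
for line in lines "100-0.txt":
  for word in line.split(" "):
    # mgetOrPut returns a mutable reference to the value for `word`,
    # inserting 0 first if the key is new; .inc then bumps it in place.
    histo.mgetOrPut(word, 0).inc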
There are other things you can do, like using cligen/mslice stuff, but the above ideas are probably the lowest hanging fruit.
With zero_functional, one would add
wordFrequencies.pairs() --> take(10).foreach(echo it[0] & " " & $it[1])
after the sort.

Ok. I realize I also never told @Serge, who never replied, about needing to do wordFrequencies.sort. So, as penance, I did this series of six successively faster versions. First is @gemath's, then the no-deps variant, then one using the heapqueue I mentioned, then Table[string, int] instead of CountTable[string], then (for a big 2x time drop) one using cligen mfile/mslice, and then one pre-sizing the table with a good guess for average bytes per unique word. First comes the code and then the timings on an i7-6700k@4.7GHz, with hopefully easy-to-reproduce input data. To keep things simple, I only include the iterator definition once.
# whist00
import strutils, tables, zero_functional

proc main() =
  var histo = initCountTable[string]()
  for line in lines "/dev/stdin":
    for word in line.split(" "):
      if word.len > 0:
        histo.inc(word)
  histo.sort
  histo.pairs() --> take(10).foreach(echo it[0] & " " & $it[1])

main()
# whist0i; same as whist00 but the post-build step is iterative, not functional
  histo.sort            # maybe the 'real' answer to @Serge's question
  var i = 10
  for word, cnt in histo:
    echo cnt, " ", word
    i.dec
    if i == 0: break
# whist1; now introduce `iterator topN`
import strutils, tables, heapqueue

iterator topN[T](h: CountTable[T]|Table[T, int], n=10):
    tuple[cnt: int; key: T] =
  var q = initHeapQueue[tuple[cnt: int; key: T]]()
  for key, cnt in h:
    if q.len < n:
      q.push((cnt, key))
    elif cnt > q[0].cnt:      # retain 1st seen on tied cnt
      discard q.replace((cnt, key))
  while q.len > 0:            # q now has top n entries
    yield q.pop               # yield in ascending order

proc main() =
  var histo = initCountTable[string]()
  for line in lines "/dev/stdin":
    for word in line.split(" "):
      if word.len > 0: histo.inc(word)
  for tup in histo.topN: echo tup[0], " ", tup[1]

main()
# whist2: move from `CountTable[string]` to `Table[string, int]`
proc main() =
  var histo = initTable[string, int]()
  for line in lines "/dev/stdin":
    for word in line.split(" "):
      if word.len > 0: histo.mgetOrPut(word, 0).inc
  for tup in histo.topN: echo tup[0], " ", tup[1]
# whist3; move to memory-mapped IO and from `string` to slices
import cligen/[mfile, mslice]

proc main() =
  let mf = mopen("/dev/stdin")
  var histo = initTable[MSlice, int]()
  for line in mf.mSlices:
    for word in line.mSlices(' '):
      if word.len > 0: histo.mgetOrPut(word, 0).inc
  for tup in histo.topN: echo tup[0], " ", tup[1]
# whist4: presize the table; everything is the same except this line
  var histo = initTable[MSlice, int](rightSize(mf.len div 45))
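The div 45 there is just a guessed average of input bytes per distinct word, and rightSize (from the tables module of that era) converts an expected element count into the initial capacity the table wants, so that it (ideally) never grows/rehashes while counting. A minimal sketch of the arithmetic, with a hypothetical 1 MiB input:

import tables

# Hypothetical numbers: ~45 input bytes per distinct word is only a
# guess; tune it to your own data.
let bytes = 1_048_576          # pretend the mmap'd file is 1 MiB
echo rightSize(bytes div 45)   # capacity request passed to initTable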
Then we run these six versions against a text file consisting of the last 50,000 lines of the alphabetically, recursively globbed source in the Nim git repository. There are Zsh-isms here (setopt extendedglob numericglobsort), and I use a little utime program I have that times the execution of its argument like /usr/bin/time but to nanoseconds, plus a little program that emits Parzen-interpolated quantiles, where quantile 0 is the sample minimum.
nim-repo..908bbb250f$ cat **.nim|tail -n50000 > /tmp/j
$ buncha PGO compiles with my nim-pgo script; panics:on -d:useMalloc --gc:arc, Linux
$ for p in whist[0-4]*(x.); echo $p $((repeat 100 utime ./$p < j >/dev/null)|& quantiles 0 .05 .1)
whist00  0.02391167  0.02400840  0.02403021
whist0i  0.02335627  0.02340394  0.02345088
whist1   0.02189838  0.02192395  0.02201866
whist2   0.01853665  0.01885521  0.01904099
whist3   0.00922274  0.00943208  0.00952293
whist4   0.00760648  0.00772719  0.00777402
My /tmp is a /dev/shm filesystem, so there is no device IO here; the columns are the 0 (i.e. minimum), .05, and .1 quantiles in seconds. To see ratios relative to the first row as we optimize, we can pipe through a little awk:
$ awk '{print $1,$2/0.02391167,$3/0.02400840,$4/0.02403021}'
whist00 1 1 1
whist0i 0.976773 0.974823 0.975892
whist1 0.915803 0.913178 0.916291
whist2 0.775214 0.785359 0.792377
whist3 0.3857 0.392866 0.39629
whist4 0.318107 0.321854 0.32351
Not bad. We eventually do over 3x better than the initial version, going from 24 ms to 7.6 ms. As @gemath alludes to, the hit for zero_functional is only about 2.5%. The topN iterator surprised me with how little it saved (at that slow stage), only another 6%. Then regular Table is 1.2x faster than CountTable, MSlice with memory-mapped IO is about 2.0x faster than using string, and presizing gets you another 1.2x speed-up.
As a cross check on my input data, this is the output of just whist4:
1201 of
1296 echo
1317 """
1389 let
1934 doAssert
2180 proc
2543 #
2574 var
3441 ==
11863 =
The outputs are the same across all programs, except that whist0* prints those 10 lines in reverse order. It is obviously easy and fast to use algorithm.reverse to flip them if desired.
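For instance, a minimal sketch of that flip (reusing histo and the topN iterator from whist1, where the key type is string):

import algorithm

var top: seq[tuple[cnt: int; key: string]]
for tup in histo.topN:
  top.add tup                 # topN yields in ascending count order
top.reverse                   # in-place flip from std algorithm
for tup in top:
  echo tup.cnt, " ", tup.key  # now largest count first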