nimforum mirror - Suggestions for optimization?

lagerratrobe (orginal) [2020-02-11T01:42:06+01:00] view original

Hi All,

I read about Nim for the 1st time last week and decided to make something over the weekend. I was pleasantly surprised that I was able to cobble something together that did what I wanted it to do, but am a bit surprised that it's not faster. Maybe some of you could give some suggestions on how to improve performance?

The goal was to write a program that takes a word as input from the user, reads a local file of dictionary words, and returns all of the words in the dictionary that are anagrams of the user’s original word.

The code I came up with is this.


# anagram_redux.nim

# ---- Imports ----------------------------------------------------------------#
import strutils
import algorithm
import os
import strformat
import tables
import unidecode

# ---- Converts a word into it's lower-case, sorted "signature" ---------------#
proc makeSignature(word: string): string =
  let ascii = unidecode(word)
  let lower = strutils.toLowerAscii(ascii)
  let sorted_word = algorithm.sorted(lower, system.cmp)
  let signature = sorted_word.join()
  result = signature

# ---- Main body of program ---------------------------------------------------#
proc main =
  # -- Parse the lookup word from the argument passed in
  if os.paramCount() < 1:
      quit("Usage: anagram_redux <word>")
  let lookup_word = os.paramStr(1)
  let lookup_signature = makeSignature(lookup_word)
  echo strformat.fmt("Looking up '{lookup_word}'")
  
  # -- Create empty table to store results in
  var anagram_table = tables.initTable[string, seq[string]]()
  
  # -- Open the text file and process it
  let file = open("dict.txt")
  
  for word in file.lines():
    let signature = makeSignature(word)
    if anagram_table.hasKey(signature):
      anagram_table[signature].add(word)
    else:
      anagram_table[signature] = @[word]
  
  # -- Lookup user's word via its signature in the anagram_table
  if anagram_table[lookup_signature].len == 1:
    echo strformat.fmt("'{lookup_word}' does not have any anagrams")
  else:
    echo anagram_table[lookup_signature]

main()

The "dict.txt" file with dictionary words is on github at https://github.com/lagerratrobe/nim_dev. For comparison, here are the performance numbers I get from this and similar implementations in Python and R.

Nim unoptimized


real    0m1.809s
user    0m1.782s
sys     0m0.012s

Nim --opt:speed


real    0m0.599s
user    0m0.575s
sys     0m0.016s

Python


real    0m0.348s
user    0m0.325s
sys     0m0.020s


  user  system elapsed
  1.773   0.020   1.803

lagerratrobe (orginal) [2020-02-11T02:35:51+01:00] view original

Answering my own question on this. Using the "-d: release" compile option made all the difference. With that, I was able to reduce the time to:


real    0m0.286s
user    0m0.269s
sys     0m0.012s

DeletedUser (orginal) [2020-02-11T06:02:24+01:00] view original

Tiny optimization to create less copies:

proc makeSignature(word: string): string =
  let ascii = unidecode(word)
  result = strutils.toLowerAscii(ascii)
  result.sort(lower, system.cmp) # or algorithm.sort(result, lower, system.cmp)

Note that there is also the strtabs module for efficient tables of strings to strings and case insensitivity.

lagerratrobe (orginal) [2020-02-11T08:15:14+01:00] view original

Oh! strtabs would have saved some time. Still, it was good to understand how to use the core features.

Thank you for the feedback.

SolitudeSF (orginal) [2020-02-11T09:10:30+01:00] view original

use -d:danger if you need performance

cblake (orginal) [2020-02-11T10:07:13+01:00] view original

Besides @SolitudeSF's important suggestion, an algorithmic improvement you can use is to replace algorithm.sorted(lower, system.cmp) with a hand-written "counting sort". For ASCII (i.e. 1 byte textual characters) this is (very?) easy and should be quite a bit faster than algorithm.sorted. Something like this:

proc sortByLetter(word: string): string =
    var cnt: array[26, int8]
    for ch in word:           #simple counting sort
      cnt[ord(ch) - ord('A')].inc
    for i, c in cnt:
      for n in 0 ..< c:
        result.add chr(ord('A') + i)

You probably need some toUpper in there if your dictionary is stored in lowercase (although that could also be done prior to entry to the above proc or you could also just change A to a). That int8 type covers words up to 255 characters long, but I think the longest real word in any spoken language with dictionaries is less than that. You could always count and print a warning or switch that to int16 or int32 which could help if chemical elements or really crazy stuff is possibly in play.

The same idea could be adapted to unicode but then you would probably want a Table, not an array which would slow it down more than the above minor modifications.

I actually think this is a good problem context in which to introduce the oft neglected counting sort which can also be used with (unlike here) non-empty satellite data (at some slightly increased code baggage).

juancarlospaco (orginal) [2020-02-11T10:20:50+01:00] view original

const file = staticRead("dict.txt").strip.splitLines

cblake (orginal) [2020-02-11T11:22:45+01:00] view original

BTW, the Unicode version of this with 65536 to 2e6 "letters" (depending upon if you can restrict to "one Unicode plane") is definitely trickier. You might think a 2 pass radix sort would help, but since most words are short most of your sorts are very small. Indeed, even insertion sort might well beat the algorithm.sort merge sort in some average sense for this problem. Insertion sort tends to beat almost everything for N < 16..32 on modern CPUs due to cache effects. Almost any dictionary will have average, median, and mode word lengths below 16 just because long words are unpopular. So, just having a "wrapper sort" that switches to insertion for N < 20 and falls back to algorithm.sort for bigger is probably best. It would also be possibly valuable to the community if you looked into the stdlib algorithm.sort and had it switch to insertion at small N. It doesn't right now.

The trouble you run into with most array/table based approaches is that the alphabet is so much larger than the word length. So, a Fenwick Tree or anything with "implicit order" from the indices is just too big to iterate over effectively. Without implicit order you wind up still having to sort "used letters" and letter repeats are not usually very dramatic. About the only thing I can think of that might help Unicode (beyond just in-place/d-danger tricks) in that counting-sort-style approach is another structure from Preston Briggs in a 1993 Rice University tech report/thesis. That sparse-dense thing can sometimes be useful to manage iteration over very sparse subsets of small-ish universes like this Unicode code point space. It's conceivable you could use that as a Unicode histogram, but you would still have to sort a small array (number of unique letters). So, it's hard to say it'd be faster than the insertion-or-merge idea, but it's obscure enough to be worth mentioning as also neglected.

Of course, if you find yourself running this calculation a lot on a small-ish fixed set of basically static dictionaries, the single biggest optimization you might do is to simply save the signature index to a file. You could extend @juancarlospaco 's idea and have a const Table built into the binary executable on a per dictionary basis (e.g. 1 program file per dict). You would absolutely have to time several hundred or even thousand anagram queries to get a reading. Or if you wanted a generic program to work with many dictionary files then instead of a Table you could sort the signatures themselves paired with their words. Then you could just save that sorted list to a file and do lookup via binary search on the file with at most O(lg(Nwords)) disk probes per anagram query (plus time to open/mmap the file). You could also hash-structure that file to get that down to 1 probe at substantial code baggage. (https://github.com/c-blake/suggest/ has a fully worked out example of a much more complex persistent hash-structure store along those lines.) You could try using Nim's marshal module to save your Table to disk and load it back whenever you want it, but in this case I think the loading of the table would be no faster than just building it from scratch. You could also "save in memory" by having a long-running server that processes each dictionary just once and then answers anagram queries over a local network or pipe or something.

Personally, I think all of the above are good coding exercises.

cdome (orginal) [2020-02-11T12:37:13+01:00] view original

-d:release -d:danger might give extra speed

cblake (orginal) [2020-02-11T13:30:44+01:00] view original

To clarify how @cdome's comment differs from @SolitudeSF's and @lagerratrobe's own observation, the intent was always for -d:danger to imply -d:release, but I guess there are several versions of Nim out there for which this implication does not hold. You have to define both for those Nim versions to disable all checking. Going forward (from devel and probably from version 1.2) that will be redundant upon simply -d:danger.

Another thing you can do along the lines of just how you build like -d:release -d:danger, though is using GCC's profile-guided optimization in the C backend. I've seen up to 2X speed boosts for Nim generated code with that.

I suspect there is much more improvement to be had by the various algorithmic improvements I mentioned, though, and some are pretty easy. They do start to seem more like "cheating" in cross-programming language comparisons, especially moving all the hard work to compile-time. So, it depends if you're actually doing anything real with this code.

cblake (orginal) [2020-02-14T17:18:02+01:00] view original

Incidentally, in the small alphabet case (and realistically anagrams are usually related to word games over small alphabets) there is a neat trick for rapid signature computation: map the 26 letters to the first 26 primes (2, 3, 5, 7, .., 97, 101). Then make the signature of a word the product of the primes. If you never overflow the product this is guaranteed to be a valid anagram signature by the uniqueness of prime factorization and commutativity of multiplication. Note that this removes sorting from the equation entirely.

In the worst case, that product can start to overflow a 64-bit integer past about 9 letters (since 101^9 =~ 2^60 and 101^10 > 2^66). However, neither worst nor average cases really matter since we have a static dictionary of every possible case. So, it's actually easy to just test a given dictionary to see if any overflows collide in our exact case. While it is easy to defeat by an attacker, for "organic/natural" dictionaries this is likely to be very rare because 2^64 is much more than the square of the number of words in most dictionaries. So, we are in a regime very far from birthday paradox collisions. Really, we aren't random, but also really the "load on the address space" is much less than the square of the number of words - it is the product of the number of non-product overflowing words and product overflowing words which is much smaller (assuming most words do not overflow). So, this is really a somewhat empirical question about words and dictionaries and letters.

With just a naive 'A' .. 'Z' --> 2 .. 101 mapping and this dictionary: https://github.com/jesstess/Scrabble/blob/master/scrabble/sowpods.txt I got 15_294 words that overflowed (and no collisions!) for 267_751 words. One can do better, though. At least in English, letters have a very skewed usage distribution. So, we can make overflows even less likely. You can just look that frequency up on Wikipedia (https://en.wikipedia.org/wiki/Letter_frequency) and sort the first 26 primes by that frequency. Doing that with the above SOWPODS dictionary, we reduce the number of overflows to 223.

You can do better still with a pre-pass to measure this frequency in the dictionary, sort the primes list by that measured frequency. I get 175 overflows that way. You can do better still by measuring only the frequency of letter usage in words longer than 9 letters (where it actually matters to contain overflow). I get only 136 overflows (over 100x better than the initial 15,000) that way. You might be able to do slightly better by measuring the frequency of letters in just the 136 overflow words and adjusting, but that is more work and upsets the apple cart of the prior measurement perhaps requiring iteration. I think that 136 yields some a priori probability of any collision of under 2e-12 (under certain not quite right randomness assumptions).

So, while you might worry this "product signature" technique is not applicable due to overflow, it probably is in a language where anagrams are interesting Applying all these ideas (except the measurement which I did in a 15 line Python script for its convenient arbitrary precision arithmetic) in Nim we get:

import strutils,tables,sets #toUpperAscii,Table,HashSet

const prime: array[26, uint] = [  # >.len==9 frequency
  7'u, 59, 29, 31, 2, 67, 47, 53, 5, 101, 73, 23, 43,
  13, 17, 41, 97, 11, 3, 19, 37, 71, 79, 89, 61, 83 ]

proc product(word: string): uint =
  result = 1       #You might worry about overflow, BUT
  for ch in word:  #..long anagrams are so rare it's ok!
    result *= prime[ord(ch) - ord('A')] #223/267751oflow

proc sortByLetter(word: string): string =
  var cnt: array[26, int16]
  for ch in word: cnt[ord(ch) - ord('A')].inc
  for i, c in cnt:
    for n in 0 ..< c: result.add chr(ord('A') + i)

proc pBuildQry(dict="words", query: seq[string]) =
  var anas = initTable[uint, seq[string]]()
  for line in lines(dict):
    let word = line.toUpperAscii
    anas.mgetOrPut(word.product, @[]).add word
  for word in query:
    let word = word.toUpperAscii
    echo word, ":"
    try:
      for ana in anas[word.product]: echo "  ", ana
    except: discard

proc sBuildQry(dict="words", query: seq[string]) =
  var anas = initTable[string, seq[string]]()
  for line in lines(dict):
    let word = line.toUpperAscii
    anas.mgetOrPut(word.sortByLetter, @[]).add word
  for word in query:
    let word = word.toUpperAscii
    echo word, ":"
    try:
      for ana in anas[word.sortByLetter]: echo "  ", ana
    except: discard

proc collisions(dict="words") =
  var t = initTable[uint, seq[string]]()
  for line in lines(dict):
    let word = line.toUpperAscii
    t.mgetOrPut(word.product, @[]).add word.sortByLetter
  for product, sigs in t:
    if sigs.len > 1:
      let set = sigs.toHashSet()
      if set.len > 1: echo "collision: ", set

when isMainModule:  #You need to nimble install cligen
  import cligen     #..for this CLI to work.
  dispatchMulti([pBuildQry], [sBuildQry], [collisions])

With that same SOWPODS dictionary, I get build times about 1.6x faster with the product signature than the string-sorted-by-letters-with-counting-sort signature (129 milliseconds vs 200 milliseconds and 230 ms for algorithm.sort instead of counting sort). So, at a round figure about 2X as fast as @lagerratrobe's algorithm. Query times are presumably about a similar speed-up

I don't count the measurement time to decide the order of primes in that 129 ms. Honestly, that is kind of an unneeded optimization. I bet you'd have a hard time finding an organic dictionary defeating this. While no overflows guarantee no collisions, you almost certainly can get no collisions with many overflows. It's easy to check as shown above, and you absolutely should. I was mostly just curious if I could get down to zero overflows.

Of course, these same "measure the data" ideas could also be used on the Unicode case. If the "working set" of letters over the used dictionary is smallish like the 26 here and words are short-ish like the at most 15 letters here. I suspect in any language where anagrams are interesting to people would have those traits and it would be the fastest approach (at least at query time). You should, of course, always measure those collisions as above against your dictionaries. And it would still be faster in some system-wide sense to pre-compute/save the signature table somewhere, maybe right in the executable itself with const and Nim compile-time execution or otherwise in a data file. If you're doing this in a more pre-calculation separated way, that also makes it easier to check for collisions and/or avoid them with an optimal order of the primes array.

cblake (orginal) [2020-02-18T12:10:23+01:00] view original

Incidentally, @lagerratrobe, once you save a lot of time by replacing sorting by a perfect commutative hash, the next optimization is making the IO part faster (you need cligen-0.9.43 for this):

import strutils, tables, cligen/[mfile, mslice]

const prime: array[26, uint64] = [ #9/267751 oflow
  7'u64, 61, 41, 53, 2, 71, 47, 29, 3, 97, 89, 17, 59,
  19, 5, 31, 101, 11, 13, 23, 37, 79, 73, 67, 43, 83 ]

proc product(word: MSlice): uint64 =
  result = 1'u64      #Assumes whole file is uppercase
  for ch in word:     #iterator needs cligen>=0.9.43
    result *= prime[ord(ch) - ord('A')]

proc pBuildQry(dict="words", query: seq[string]) =
  var anas = initTable[uint64, seq[MSlice]]()
  let mf = mopen(dict)
  if mf == nil: return
  defer: mf.close
  for word in mf.mSlices():
    anas.mgetOrPut(word.product, @[]).add word
  for word in query:
    let word = word.toUpperAscii
    let prod = word.toMSlice.product
    echo word, ":"
    try:
      for ana in anas[prod]: echo "  ", ana
    except: discard
import cligen; dispatch(pBuildQry)

With that my 129 ms goes down by 57ms to 72 ms (1.8x better) on that SOWPODS dictionary. Then if you Nim compile with --gc:arc, it goes down to about 47 ms (1.53x better). Then if you use gcc's PGO it goes down to 42 ms (1.12x better).

I suspect with your dictionary and CPU that you would see a similar overall 4x to 5x improvement taking you from 286 ms down to more like <70ms. Language comparison-wise, I do not believe R or CPython would be able to be sped up similarly. (Perhaps in PyPy the product might be able to become fast, but perhaps not if they are worrying about overflow switching to arbitrary precision ints.)

cblake (orginal) [2020-07-26T13:52:30+02:00] view original

Incidentally, if anyone cares perhaps finding this in some search, there are three more optimizations one can do here: 1) a faster integer hash(), 2) pre-sizing the table based on a guessed average word length (language/dictionary dependent), and 3) using Nim's unusual "duplicate keys" features in Table to save about 50% memory (staying much more L3 cache resident in this case).

All together this runs in about 25 ms for me, or about 16x faster using 5x less memory than the original on the same SOWPODS dictionary:

import strutils, tables, cligen/[mfile, mslice], bitops, hashes

proc hash(x: int64|uint64|Hash): Hash {.inline.} = # OPTIM 1
  Hash(rotateLeftBits(uint64(x) * 15241094284759029579'u64, 27))

const prime: array[26, uint64] = [ #9/267751 oflow
  7'u64, 61, 41, 53, 2, 71, 47, 29, 3, 97, 89, 17, 59,
  19, 5, 31, 101, 11, 13, 23, 37, 79, 73, 67, 43, 83 ]

proc product(word: MSlice): uint64 =
  result = 1'u64
  for ch in word: result *= prime[ord(ch) - ord('A')]

proc pBuildQry(dict="words", query: seq[string]) =
  let mf = mopen(dict)
  if mf == nil: return
  var ana = initTable[uint64, MSlice](mf.len div 10) # OPTIM 2
  for word in mf.mSlices:
    ana.add word.product, word            # OPTIM 3a
  for word in query:
    let word = word.toUpperAscii
    echo word, ":"
    for ana in ana.allValues(word.toMSlice.product):  #OPTIM 3b
      echo "  ", ana
  mf.close
import cligen; dispatch(pBuildQry)

Of course, Table.add/allValues are becoming deprecated. So, you will only be able to do that add/allValues optimization while the features remain merely deprecated, not removed.

Oh, and at least in that SOWPODS dictionary the largest number of anagrams for a given word is like 13 for "parse", but the average number is much less than 2. So, the duplicate values don't really impact the collision cluster size much. This is actually a perfect application of the duplicate values feature which I plan to keep alive in https://github.com/c-blake/adix.

Mirror of forum.nim-lang.org

5917 :: Suggestions for optimization?