nimforum mirror - Slow strings

andrea (orginal) [2015-10-23T18:51:07+02:00] view original

I have a small benchmark for the Levenshtein distance. For reasons that are explained in the README, I want to benchmark the naive recursive definition of the Levenshtein distance, with memoization.

Now, it turns out that the results are not so encouraging, with respect to a Scala implementation (naive as well). I suspect that this is because in Nim strings are mutable. This is because of speed reasons, but in this particular case - and certainly others as well - it turns out to be an anti-optimization.

In the Levenshtein algorithm, a very common operation is taking the tail of a string. Because of the way strings are designed, this entails a copy. I can only speculate that this is the reason for the slowness.

Couldn't it be handy to have an immutable string type where substrings are just slices with no copy involved?

OderWat (orginal) [2015-10-23T21:31:59+02:00] view original

I rewrote it like this:

import streams, times, tables

proc readWords(path: string): seq[string] =
  result = @[]
  var
    input = newFileStream(path, fmRead)
    s = ""
  defer:
    input.close()
  while readLine(input, s):
    result.add(s)

iterator couples[A](xs: seq[A]): auto =
  for i in 0 .. < xs.len:
    for j in (i + 1) .. < xs.len:
      yield (xs[i], xs[j])

var cache: ref Table[int, int]

proc lev(at: string, ao: int, bt: string, bo: int): int =
  let hash = ao*10000+bo
  if cache.hasKey(hash):
    return cache[hash];
  
  if at.len == 0 or ao == at.len: return bt.len-bo
  if bt.len == 0 or bo == bt.len: return at.len-ao
  let
    d1 = lev(at, ao + 1, bt, bo) + 1
    d2 = lev(at, ao, bt, bo + 1) + 1
    d3 = lev(at, ao + 1, bt, bo + 1) + (if at[ao] == bt[bo]: 0 else: 1)
  result = min(min(d1, d2), d3)
  cache[hash] = result

proc levenshtein(a, b: string): int =
  cache = newTable[int,int]()
  lev(a, 0, b, 0)

when isMainModule:
  let
    words = readWords("./words1000.txt")
    numCouples = words.len * (words.len - 1) / 2
  var total = 0'i64
  let start = epochTime()
  for a, b in couples(words):
    total += levenshtein(a, b)
  let time = ((epochTime() - start) * 1000).int
  echo "The average Levenshtein distance is ", total.float / numCouples.float
  echo "The time to compute this was ", time, " ms"

This needs 1793 ms while your original code needs 29765 ms on my system.

I also "just" changed the parameters in my proc to a tuple proc lev(t: tuple[at: string, ao: int, bt: string, bo: int]): int which runs in 9725 ms. So just shifting the arguments into the tuple is a major slow down.

Doing it "correct" with a tuple as hash on the int's is also still slower 3260 ms for

var cache: ref Table[tuple[a,b: int], int]

proc lev(at: string, ao: int, bt: string, bo: int): int =
  let hash = (ao, bo)
  if cache.hasKey(hash):
    return cache[hash];
  
  if at.len == 0 or ao == at.len: return bt.len-bo
  if bt.len == 0 or bo == bt.len: return at.len-ao
  let
    d1 = lev(at, ao + 1, bt, bo) + 1
    d2 = lev(at, ao, bt, bo + 1) + 1
    d3 = lev(at, ao + 1, bt, bo + 1) + (if at[ao] == bt[bo]: 0 else: 1)
  result = min(min(d1, d2), d3)
  cache[hash] = result

proc levenshtein(a, b: string): int =
  cache = newTable[tuple[a,b: int], int]()
  lev(a, 0, b, 0)

andrea (orginal) [2015-10-24T11:31:34+02:00] view original

Thank you! Unfortunately, I wouldn't accept such an entry in the benchmark.

Maybe I was just not that clear in the README, but the whole point of the exercise is to gauge how difficult (both to write and to use) is a generic memoization function, that handles cases such as recursion and clearing the cache. This disqualifies manually memoizing the function by adding a hashtable specialized for your use case. You see, the ultimate form of such specialization is using a bidimensional array, which is just the dynamic programming technique.

I expect to get a speedup by using something like

type StringSlice = tuple[s: string, start: int]

as parameters for the memoized function. I did not have the time to try it, and now I am a little in a hurry.

My question is whether we should have such a type, or something equivalent, in the stdlib, to deal with the rather common case of taking substrings of immutable strings in a more efficient way

Jehan (orginal) [2015-10-24T12:49:12+02:00] view original

The JDK 7 abandoned the "no copy" behavior for substring construction. Not only could it leak memory (a two character string could keep a multi-MB string alive), but it doesn't really buy you anything for short strings, where there is no real difference between allocating the memory for the string or allocating the memory for the object referencing the original string.

Note in particular that even though this way you could make substring creation constant time, hashing still takes time proportional to the length of the string. Thus, you're turning an O(n^2) algorithm into an O(n^4) algorithm (with a high constant factor). As an alternative algorithm, this looks like a dead end to me. You could in theory get it back down to a more reasonable asymptotic cost by having your own string implementation with a rolling hash that's stored with the string and do substrings by reference (this way, the amortized cost of hashing would remain constant rather than being linear), but you'd still be stuck with a high constant factor.

According to some quick profiling, differences between the Scala and the Nim version seem to primarily come down to hash table implementation (hashing strings alone is where the Nim implementation spends around 1/6th of its time) and the cost of allocation.

OderWat (orginal) [2015-10-24T15:50:23+02:00] view original

@jehan could you give me a short intro on how to run such a profiling with Nim? I was starting to check out valgrind/callgrind stuff next but you seem to know how to do it :)

OderWat (orginal) [2015-10-24T17:59:29+02:00] view original

@andrea I wasn't proposing an entry for your benchmark in the first place! I also did not want to show you that I can do better :)

I just was thinking on how to simulate a immutable string type (like your string slice). That was the original idea at least.

Using that with your memo lib did not improve the timing at all. So I removed your general memo stuff and replaced it with a much simpler form to test if the string slicing actually had an effect. It looked like that while even just using the tuple parameters already slowed it down again.

Jehan (orginal) [2015-10-24T18:07:57+02:00] view original

OderWat: could you give me a short intro on how to run such a profiling with Nim?

I usually just run code through Instruments on OS X (with the "Invert Call Tree" option checked). If you're familiar with Nim's name-mangling mechanism, it's generally pretty easy to tell where the performance hotspots are. On Linux, Google performance tools should give you similar functionality.

mora (orginal) [2015-10-24T18:08:10+02:00] view original

Let me share how I can run a profiling tool with gcc.


nim c -d:release -t="-pg" -l="-pg" test.nim
./test # this run will create gmon.out
gprof test gmon.out

The profiled version runs slower, and the result is hard to read, numbers don't match up to 100%. However, it usually helps me to find out what happens.

Peter

OderWat (orginal) [2015-10-24T22:55:38+02:00] view original

Thanks... I am experimenting with Instruments now. Looks good so far!

Jehan (orginal) [2015-10-25T17:37:46+01:00] view original

@mora: OS X Instruments and Google performance tools allow you to do the same without the slowdown.

cblake (orginal) [2015-10-27T18:41:27+01:00] view original

@Andrea...Yeah..Honestly, with 2.7X (21873 ms/"8000 ms") that means whatever your Scala is doing it's 73% of the run-time (73/27=2.7). Almost 3/4 of program run-time is "waste". However you profile your Scala code, I would imagine that it would pretty rapidly lead to a conclusion as to whether it's easy to eliminate what's costing time. But again, somewhat off topic. Tricks to make Nim code faster: on topic. :-) General tricks related to strings, on-thread.

Jehan's hashCode idea from the JVM world seems to have been a great idea in this case with this particular access pattern, but it's obviously not something always worth the trade off. That makes one wonder if a general purpose "fatter" string type in Nim that had a hashCode slot would be helpful..maybe "hashedString". Just borrow most other things but taking care to re-hash with operations that mutate strings, etc. Then users could wrap their inputs in hashedString(inputs) or something to maybe easily see if it helped in their use case/algorithm. At the same time it would be backward compatible and people could just ignore the type if they don't care.

How much it helps may vary quite a bit just depending on inputs, even. For example, `==` and `!=` could be much accelerated, depending on how often results are false/true. Anyway, in light of these thoughts, you might consider trying to package that code you did there for more general purpose consumption.

@Jyelon - yeah. Bob has some good ones. I actually helped him on this topic 20 years ago. Property-wise, though I've found bit avalanche to be somewhat overrated. I'm not alone. For example, Knuth volume 3 has a lot of material trying to be better than random/uniform - achieving a property that hash("abc") and hash("abd") would not usually collide. That property conflicts with avalanche as a property. That "spread out sequential keys" is under a theory of inputs collections having many keys distinguished only at their tails. What is best ultimately depends on your "typical" key sets, as with almost all things hashing.

Personally, on modern 64-bit platforms, my profiling has generally suggested to me that 8-bytes-at-a-time/screaming fast string hashers can make up for an awful lot of extra probes here and there on collision chains that arise from weaker distributions. Really awful distros can truly kill performance, though. So, what level of and style of scrambling is best is a judgement call that people differ on.

OderWat (orginal) [2015-10-27T18:42:40+01:00] view original

Or using the Hash function PHP is using for "everything" ;-)

ulong zend_inline_hash_func(char *arKey, uint nKeyLength)
{
        ulong $h = 5381;
        char *arEnd = arKey + nKeyLength;
        
        while (arKey < arEnd) {
                $h += ($h << 5);
                $h += (ulong) *arKey++;
        }
        return $h;
}

cblake (orginal) [2015-10-27T19:11:55+01:00] view original

I'm not sure of the proposal is entirely serious, but hash functions that merely shift off the most significant bits are wastful of the entropy accumulated from the key. That entropy/information/randomness is (probably) greatest in the top 5 bits. The ELF/pjw style also have this trouble as does OderWat's PHP example. Bit-rotation of the hash register would be better, especially because Nim reduces hash codes to table addresses via low order bit extraction.

OderWat (orginal) [2015-10-27T19:30:08+01:00] view original

Well, what I wanted to point out is that a super simple hash algorithm does a good job in a language which uses "only hashes" for everything from associative arrays to object-members, function names and other language constructs.

I looked that up some months ago (esp: https://nikic.github.io/2012/03/28/Understanding-PHPs-internal-array-implementation.html) just because I could not believe that this is really the case.

I personally find PHP associative arrays and what you can do with them (index / iterate / sort) a very nice rapid production tool and the implementation seems to be good enough for some really big projects relying on this foundation.

This is not saying that I believe that they have the best hash function ... just that it works :)

P.S.: I may be ignorant but they do add each byte so shifting stuff off will not totally waste the high bits. It will also wrap after 64 bits which is quite some way for those bits to walk before they vanish. In addition I do not believe that the entropy for ASCII is highest in the top 5 bits. This may be different for "Binary" or even UTF-8 encoding. But again, I may be totally wrong, I did not test it.

cblake (orginal) [2015-10-27T20:09:20+01:00] view original

What I meant is that h <<= 5 sends the top 5 bits to oblivion and makes the bottom 5 bits 0 - it was about entropy in the hash value not in the inputs (though they obviously relate). What I'm saying is that it's better to do h = h << 5 xor h >> 59 { or maybe >> (32-5) instead) to not discard that information. Then the high order bits are moved to the low order side.

It doesn't usually cost any/much more to rotate. For example, such constructs usually compile into just a single x86/x86_64 ROT instruction with identical execution expense to a bit shift.

You're right that ASCII/UTF entropy will all differ..Some of my comment was surely oversimplified. It's unquestionably a subtle art to find good bit mixers, but other things being equal not discarding bits when shuffling them is just as cheap is likely to be better at coaxing distinct keys to different table addresses. :-)

I, too, favor simplicity. I would personally rather stick with the current simple 13 lines of `!&` and `!$` in hashes.nim than I would move to CityHash/FarmHash/Google's next hash. Trying to prevent collision attacks solves a much harder problem, but it also makes the code much harder to figure out (edit: and it's also much more open-ended subject to frequent revision).

OderWat (orginal) [2015-10-27T20:27:12+01:00] view original

Yeah I understand. Still it needs 12 bytes for the first 5 bits to vanish and the "new" bits are filled by the low 5 bits of the next byte. If the input are identifiers of a programming-language there is also a high probability that exactly the last characters are different which my be enough for PHPs domain. But having those bits wrapping would change a lot for them as they also use that bit mask for the table indexes. It would be fun to have some big PHP application run and count overall collisions with and without wrapping. They probably did that...

Having a string type with embedded hashes could be nice. Especially if the hash is only calculated on first use and therefor has only a small memory overhead. Those hash could even be just 16 bits and work like "for sure not equal" in comparisons. Updating on mutation could be a "dirty-flag" which only updates if used again.

cblake (orginal) [2015-10-27T20:50:14+01:00] view original

Yeah. I think we agree that the mixing needn't be fabulous to work "ok". I wouldn't be so sure they did any careful checking on shift vs rotate, though. I have a prior of less caution in play than that.

Anyway, you save space with small hash codes, but it also means you have to be careful to not just copy it into a Table hash code slot..(if the table is small enough you can, but otherwise you need to call hash()). Pros & cons.

dataman (orginal) [2017-05-20T17:38:43+02:00] view original

Levenshtein distance, based on my old Pascal source. In some cases, faster than a implementation in strutils module by 10%.

(Enj|Destr)oy! :)

import strutils, os
import nimbench

proc editDistance2*(a, b: string): int = #{.noSideEffect.} =
  ## Returns the edit distance between `a` and `b`.
  ##
  ## This uses the `Levenshtein`:idx: distance algorithm with only a linear
  ## memory overhead.  This implementation is highly optimized!
  var len1 = a.len
  var len2 = b.len
  if len1 > len2:
    # make `b` the longer string
    return editDistance2(b, a)
  
  # strip common prefix:
  var s = 0
  while a[s] == b[s] and a[s] != '\0':
    inc(s)
    dec(len1)
    dec(len2)
  
  # strip common suffix:
  while len1 > 0 and len2 > 0 and a[s+len1-1] == b[s+len2-1]:
    dec(len1)
    dec(len2)
  
  # trivial cases:
  if len1 == 0: return len2
  if len2 == 0: return len1
  
  # another special case:
  if len1 == 1:
    for j in s..s+len2-1:
      if a[s] == b[j]: return len2 - 1
    
    return len2
  
  inc(len1)
  inc(len2)
  var row: seq[int]
  newSeq(row, len2)
  
  for i in 0..len2 - 1: row[i] = i
  
  for i in 1 .. len1- 1:
    var char1 = a[s + i - 1]
    var prevCost = i - 1;
    var newCost = i;
    
    for j in 1 .. len2 - 1:
      var char2 = b[s + j - 1]
      
      if char1 == char2:
        newCost = prevCost
      else:
        newCost = min(newCost, min(prevCost, row[j])) + 1
      
      prevCost = row[j]
      row[j] = newCost
  
  result = row[len2 - 1]

var s1: string = "0123456789"
var s2: string = "0123455779"

if paramCount() > 1:
  if fileExists(paramStr(1)):
    s1 = readFile(paramStr(1))
  else:
    s1 = paramStr(1)
  
  if fileExists(paramStr(2)):
    s2 = readFile(paramStr(2))
  else:
    s2 = paramStr(2)

echo "editDistance:  ", editDistance(s1, s2)
echo "editDistance2: ", editDistance2(s1, s2)

bench(editDistance, m):
  var d = 0
  for i in 1..m:
    d = editDistance(s1, s2)
  
  doNotOptimizeAway(d)

benchRelative(editDistance2, m):
  var d = 0
  
  for i in 1..m:
    d = editDistance2(s1, s2)
  
  doNotOptimizeAway(d)


runBenchmarks()

cdunn2001 (orginal) [2017-05-21T01:35:16+02:00] view original

@dataman, Yours is significantly more readable too. Maybe if you provide a pull-request, plus evidence of correctness, it could be integrated into the standard library.

https://github.com/nim-lang/Nim/blob/master/lib/pure/strutils.nim#L1717

Meanwhile, @andrea's memoized macro is a work of art:

https://github.com/andreaferretti/memo

(Memoization is a bit slower than optimized pre-calculation, of course.)

Mirror of forum.nim-lang.org

1730 :: Slow strings