I have a small benchmark for the Levenshtein distance. For reasons that are explained in the README, I want to benchmark the naive recursive definition of the Levenshtein distance, with memoization.
Now, it turns out that the results are not so encouraging compared to a Scala implementation (naive as well). I suspect this is because strings in Nim are mutable. They are mutable for speed reasons, but in this particular case - and certainly others as well - it turns out to be an anti-optimization.
In the Levenshtein algorithm, a very common operation is taking the tail of a string. Because of the way strings are designed, this entails a copy. I can only speculate that this is the reason for the slowness.
Wouldn't it be handy to have an immutable string type where substrings are just slices, with no copy involved?
I rewrote it like this:
import streams, times, tables

proc readWords(path: string): seq[string] =
  result = @[]
  var
    input = newFileStream(path, fmRead)
    s = ""
  defer:
    input.close()
  while readLine(input, s):
    result.add(s)

iterator couples[A](xs: seq[A]): auto =
  for i in 0 ..< xs.len:
    for j in (i + 1) ..< xs.len:
      yield (xs[i], xs[j])

var cache: ref Table[int, int]

proc lev(at: string, ao: int, bt: string, bo: int): int =
  let hash = ao * 10000 + bo
  if cache.hasKey(hash):
    return cache[hash]
  if at.len == 0 or ao == at.len: return bt.len - bo
  if bt.len == 0 or bo == bt.len: return at.len - ao
  let
    d1 = lev(at, ao + 1, bt, bo) + 1
    d2 = lev(at, ao, bt, bo + 1) + 1
    d3 = lev(at, ao + 1, bt, bo + 1) + (if at[ao] == bt[bo]: 0 else: 1)
  result = min(min(d1, d2), d3)
  cache[hash] = result

proc levenshtein(a, b: string): int =
  cache = newTable[int, int]()
  lev(a, 0, b, 0)

when isMainModule:
  let
    words = readWords("./words1000.txt")
    numCouples = words.len * (words.len - 1) / 2
  var total = 0'i64
  let start = epochTime()
  for a, b in couples(words):
    total += levenshtein(a, b)
  let time = ((epochTime() - start) * 1000).int
  echo "The average Levenshtein distance is ", total.float / numCouples.float
  echo "The time to compute this was ", time, " ms"
This needs 1793 ms while your original code needs 29765 ms on my system.
I also "just" changed the parameters of my proc to a tuple, proc lev(t: tuple[at: string, ao: int, bt: string, bo: int]): int, which runs in 9725 ms. So merely shifting the arguments into a tuple is a major slowdown.
Doing it "correctly", with a tuple of the two ints as the hash key, is still slower: 3260 ms for
var cache: ref Table[tuple[a, b: int], int]

proc lev(at: string, ao: int, bt: string, bo: int): int =
  let hash = (ao, bo)
  if cache.hasKey(hash):
    return cache[hash]
  if at.len == 0 or ao == at.len: return bt.len - bo
  if bt.len == 0 or bo == bt.len: return at.len - ao
  let
    d1 = lev(at, ao + 1, bt, bo) + 1
    d2 = lev(at, ao, bt, bo + 1) + 1
    d3 = lev(at, ao + 1, bt, bo + 1) + (if at[ao] == bt[bo]: 0 else: 1)
  result = min(min(d1, d2), d3)
  cache[hash] = result

proc levenshtein(a, b: string): int =
  cache = newTable[tuple[a, b: int], int]()
  lev(a, 0, b, 0)
Thank you! Unfortunately, I wouldn't accept such an entry in the benchmark.
Maybe I was just not clear enough in the README, but the whole point of the exercise is to gauge how difficult it is (both to write and to use) to have a generic memoization function that handles cases such as recursion and clearing the cache. This disqualifies manually memoizing the function by adding a hash table specialized for your use case. You see, the ultimate form of such specialization is using a bidimensional array, which is just the dynamic programming technique.
I expect to get a speedup by using something like
type StringSlice = tuple[s: string, start: int]
as parameters for the memoized function. I did not have the time to try it, and now I am a little in a hurry.
My question is whether we should have such a type, or something equivalent, in the stdlib, to deal with the rather common case of taking substrings of immutable strings in a more efficient way.
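For illustration only, here is a minimal sketch of what such a slice type might look like (the len, head, tail and hash procs below are hypothetical helpers, not stdlib API):

import hashes

type StringSlice = tuple[s: string, start: int]

proc len(x: StringSlice): int = x.s.len - x.start
proc head(x: StringSlice): char = x.s[x.start]
proc tail(x: StringSlice): StringSlice = (s: x.s, start: x.start + 1)  # O(1), no copy

proc hash(x: StringSlice): Hash =
  # still O(n): it has to walk all remaining characters
  var h: Hash = 0
  for i in x.start ..< x.s.len:
    h = h !& hash(x.s[i])
  result = !$h

A memoized lev could then take two StringSlice values and recurse on a.tail and b.tail without any copying.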
JDK 7 abandoned the "no copy" behavior for substring construction. Not only could it leak memory (a two-character string could keep a multi-MB string alive), but it also doesn't really buy you anything for short strings, where there is no real difference between allocating the memory for the string and allocating the memory for the object referencing the original string.
Note in particular that even though this way you could make substring creation constant time, hashing still takes time proportional to the length of the string. Thus, you're turning an O(n^2) algorithm into an O(n^4) algorithm (with a high constant factor). As an alternative algorithm, this looks like a dead end to me. You could in theory get it back down to a more reasonable asymptotic cost by having your own string implementation with a rolling hash that's stored with the string and do substrings by reference (this way, the amortized cost of hashing would remain constant rather than being linear), but you'd still be stuck with a high constant factor.
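To make that concrete, one standard way to get constant-time substring hashing is to precompute polynomial prefix hashes once per string. The type and proc names below are hypothetical, just a sketch of the idea, not an existing library:

const P = 1099511628211'u64   # arbitrary odd multiplier

type HashedStr = object
  s: string
  prefix: seq[uint64]   # prefix[i] = polynomial hash of s[0 ..< i]
  pw: seq[uint64]       # pw[i] = P^i (mod 2^64)

proc initHashedStr(s: string): HashedStr =
  result.s = s
  result.prefix = newSeq[uint64](s.len + 1)
  result.pw = newSeq[uint64](s.len + 1)
  result.pw[0] = 1
  for i in 0 ..< s.len:
    result.prefix[i + 1] = result.prefix[i] * P + uint64(ord(s[i]))
    result.pw[i + 1] = result.pw[i] * P

proc suffixHash(h: HashedStr, start: int): uint64 =
  # hash of h.s[start .. ^1] in O(1), using the precomputed tables
  h.prefix[h.s.len] - h.prefix[start] * h.pw[h.s.len - start]

Even then, as noted above, the constant factor stays high and equality checks on collisions are still linear.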
According to some quick profiling, differences between the Scala and the Nim version seem to primarily come down to hash table implementation (hashing strings alone is where the Nim implementation spends around 1/6th of its time) and the cost of allocation.
@andrea I wasn't proposing an entry for your benchmark in the first place! I also did not want to show you that I can do better :)
I was just thinking about how to simulate an immutable string type (like your string slice). That was the original idea, at least.
Using that with your memo lib did not improve the timing at all. So I removed your general memo stuff and replaced it with a much simpler form to test whether the string slicing actually had an effect. It looked like it did, although even just using the tuple parameters slowed it down again.
OderWat: could you give me a short intro on how to run such profiling with Nim?
I usually just run code through Instruments on OS X (with the "Invert Call Tree" option checked). If you're familiar with Nim's name-mangling mechanism, it's generally pretty easy to tell where the performance hotspots are. On Linux, Google performance tools should give you similar functionality.
Let me share how I run a profiling tool with gcc.
nim c -d:release -t="-pg" -l="-pg" test.nim
./test # this run will create gmon.out
gprof test gmon.out
The profiled version runs slower, and the result is hard to read; the numbers don't add up to 100%. However, it usually helps me find out what happens.
Peter
@Andrea... Yeah. Honestly, with 2.7X (21873 ms / "8000 ms"), the Nim version takes only about 37% of the Scala run-time, so whatever extra your Scala version is doing accounts for roughly 63% of its run-time. Almost two thirds of the program's run-time is "waste". However you profile your Scala code, I would imagine it would pretty rapidly lead to a conclusion as to whether what's costing that time is easy to eliminate. But again, somewhat off topic. Tricks to make Nim code faster: on topic. :-) General tricks related to strings: on-thread.
Jehan's hashCode idea from the JVM world seems to have been a great one in this case, with this particular access pattern, but it's obviously not always worth the trade-off. That makes one wonder if a general-purpose "fatter" string type in Nim that had a hashCode slot would be helpful... maybe "hashedString". It would borrow most other behavior, taking care to re-hash on operations that mutate the string, etc. Then users could wrap their inputs in hashedString(inputs) or something, to easily see if it helps in their use case/algorithm. At the same time it would be backward compatible, and people could just ignore the type if they don't care.
How much it helps may vary quite a bit depending on the inputs, even. For example, `==` and `!=` could be much accelerated, depending on how often the results are false/true. Anyway, in light of these thoughts, you might consider packaging the code you wrote there for more general-purpose consumption.
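A rough sketch of what such a wrapper might look like (the HashedString type and procs below are hypothetical, not an existing Nim API):

import hashes

type HashedString = object
  s: string
  h: Hash            # cached hash, computed once at construction

proc hashedString(s: string): HashedString =
  HashedString(s: s, h: hash(s))

proc hash(x: HashedString): Hash = x.h   # O(1) for Table lookups

proc `==`(a, b: HashedString): bool =
  # reject cheaply on the cached hash before comparing characters
  a.h == b.h and a.s == b.s

Users could then wrap their keys, e.g. let key = hashedString(word), and use them in a Table without touching the rest of their code.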
@Jyelon - yeah. Bob has some good ones. I actually helped him on this topic 20 years ago. Property-wise, though, I've found bit avalanche to be somewhat overrated. I'm not alone. For example, Knuth volume 3 has a lot of material on trying to be better than random/uniform - achieving the property that hash("abc") and hash("abd") would not usually collide. That property conflicts with avalanche. The "spread out sequential keys" idea rests on a theory of input collections having many keys distinguished only at their tails. What is best ultimately depends on your "typical" key sets, as with almost all things hashing.
Personally, on modern 64-bit platforms, my profiling has generally suggested that 8-bytes-at-a-time, screaming-fast string hashers can make up for an awful lot of extra probes here and there on collision chains arising from weaker distributions. Really awful distributions can truly kill performance, though. So what level and style of scrambling is best is a judgement call that people differ on.
Or use the hash function PHP uses for "everything" ;-)
ulong zend_inline_hash_func(char *arKey, uint nKeyLength)
{
    ulong h = 5381;
    char *arEnd = arKey + nKeyLength;

    while (arKey < arEnd) {
        h += (h << 5);
        h += (ulong) *arKey++;
    }
    return h;
}
Well, what I wanted to point out is that a super simple hash algorithm does a good job in a language which uses "only hashes" for everything from associative arrays to object-members, function names and other language constructs.
I looked that up some months ago (esp: https://nikic.github.io/2012/03/28/Understanding-PHPs-internal-array-implementation.html) just because I could not believe that this is really the case.
I personally find PHP associative arrays and what you can do with them (index / iterate / sort) a very nice rapid production tool and the implementation seems to be good enough for some really big projects relying on this foundation.
This is not to say that I believe they have the best hash function ... just that it works :)
P.S.: I may be ignorant, but they do add each byte, so shifting stuff off will not totally waste the high bits. The value also only wraps after 64 bits, which is quite some way for those bits to walk before they vanish. In addition, I do not believe that the entropy of ASCII is highest in the top 5 bits. This may be different for binary or even UTF-8 encoded input. But again, I may be totally wrong; I did not test it.
What I meant is that h <<= 5 sends the top 5 bits to oblivion and makes the bottom 5 bits 0 - it was about entropy in the hash value, not in the inputs (though the two obviously relate). What I'm saying is that it's better to do h = (h << 5) xor (h >> 59) (or >> (32 - 5) for a 32-bit hash) so as not to discard that information. Then the high-order bits are moved to the low-order side.
It doesn't usually cost any more to rotate. For example, such constructs usually compile into a single x86/x86_64 rotate instruction (ROL/ROR) with execution cost identical to a plain bit shift.
You're right that ASCII/UTF-8 entropy will differ; some of my comment was surely oversimplified. It's unquestionably a subtle art to find good bit mixers, but, other things being equal, not discarding bits (when shuffling them is just as cheap) is likely to be better at coaxing distinct keys to different table addresses. :-)
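As a small illustration in Nim (assuming a 64-bit hash; the proc names are made up for the example):

proc mixShift(h: uint64, c: char): uint64 =
  # PHP-style step, h = h*33 + c: the top 5 bits are eventually shifted out
  (h shl 5) + h + uint64(ord(c))

proc mixRotate(h: uint64, c: char): uint64 =
  # same cost, but the top 5 bits wrap around to the low end instead of vanishing
  ((h shl 5) xor (h shr 59)) + h + uint64(ord(c))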
I, too, favor simplicity. I would personally rather stick with the current simple 13 lines of `!&` and `!$` in hashes.nim than move to CityHash/FarmHash/Google's next hash. Trying to prevent collision attacks solves a much harder problem, but it also makes the code much harder to figure out (edit: and it's also much more open-ended, subject to frequent revision).
Yeah, I understand. Still, it takes 12 bytes of input for the first 5 bits to vanish, and the "new" bits are filled by the low 5 bits of the next byte. If the inputs are identifiers of a programming language, there is also a high probability that exactly the last characters differ, which may be enough for PHP's domain. But having those bits wrap around would change a lot for them, as they also use that bit mask for the table indexes. It would be fun to run some big PHP application and count overall collisions with and without wrapping. They probably did that...
Having a string type with embedded hashes could be nice. Especially if the hash is only calculated on first use and therefore has only a small memory overhead. The hash could even be just 16 bits and work as a "for sure not equal" check in comparisons. Updating on mutation could use a "dirty" flag, recomputing only when the hash is used again.
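A tiny sketch of that lazily-computed 16-bit variant (all names hypothetical):

import hashes

type LazyHashed = object
  s: string
  fp: uint16       # small fingerprint of s
  dirty: bool      # true until fp has been (re)computed

proc initLazyHashed(s: string): LazyHashed =
  LazyHashed(s: s, dirty: true)

proc fingerprint(x: var LazyHashed): uint16 =
  if x.dirty:
    x.fp = uint16(hash(x.s) and 0xFFFF)   # computed lazily, on first use
    x.dirty = false
  x.fp

proc certainlyDifferent(a, b: var LazyHashed): bool =
  # cheap "for sure not equal" test; equal fingerprints still need a full compare
  a.fingerprint != b.fingerprint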
Yeah. I think we agree that the mixing needn't be fabulous to work "ok". I wouldn't be so sure they did any careful checking on shift vs. rotate, though; my prior is that less caution than that was in play.
Anyway, you save space with small hash codes, but it also means you have to be careful not to just copy them into a Table's hash-code slot (if the table is small enough you can, but otherwise you need to call hash()). Pros & cons.
Levenshtein distance, based on my old Pascal source. In some cases it is faster than the implementation in the strutils module by 10%.
(Enj|Destr)oy! :)
import strutils, os
import nimbench

proc editDistance2*(a, b: string): int = #{.noSideEffect.} =
  ## Returns the edit distance between `a` and `b`.
  ##
  ## This uses the `Levenshtein`:idx: distance algorithm with only a linear
  ## memory overhead. This implementation is highly optimized!
  var len1 = a.len
  var len2 = b.len
  if len1 > len2:
    # make `b` the longer string
    return editDistance2(b, a)
  # strip common prefix:
  var s = 0
  while a[s] == b[s] and a[s] != '\0':
    inc(s)
    dec(len1)
    dec(len2)
  # strip common suffix:
  while len1 > 0 and len2 > 0 and a[s+len1-1] == b[s+len2-1]:
    dec(len1)
    dec(len2)
  # trivial cases:
  if len1 == 0: return len2
  if len2 == 0: return len1
  # another special case:
  if len1 == 1:
    for j in s..s+len2-1:
      if a[s] == b[j]: return len2 - 1
    return len2
  inc(len1)
  inc(len2)
  var row: seq[int]
  newSeq(row, len2)
  for i in 0..len2 - 1: row[i] = i
  for i in 1 .. len1 - 1:
    var char1 = a[s + i - 1]
    var prevCost = i - 1
    var newCost = i
    for j in 1 .. len2 - 1:
      var char2 = b[s + j - 1]
      if char1 == char2:
        newCost = prevCost
      else:
        newCost = min(newCost, min(prevCost, row[j])) + 1
      prevCost = row[j]
      row[j] = newCost
  result = row[len2 - 1]

var s1: string = "0123456789"
var s2: string = "0123455779"

if paramCount() > 1:
  if fileExists(paramStr(1)):
    s1 = readFile(paramStr(1))
  else:
    s1 = paramStr(1)
  if fileExists(paramStr(2)):
    s2 = readFile(paramStr(2))
  else:
    s2 = paramStr(2)

echo "editDistance: ", editDistance(s1, s2)
echo "editDistance2: ", editDistance2(s1, s2)

bench(editDistance, m):
  var d = 0
  for i in 1..m:
    d = editDistance(s1, s2)
  doNotOptimizeAway(d)

benchRelative(editDistance2, m):
  var d = 0
  for i in 1..m:
    d = editDistance2(s1, s2)
  doNotOptimizeAway(d)

runBenchmarks()
Meanwhile, @andrea's memoized macro is a work of art:
(Memoization is a bit slower than optimized pre-calculation, of course.)