I have the following commonPrefix() function (for whole paths), which works fine on Linux (and Windows), but I'm wondering if it could be shorter and/or more efficient?
import strutils  # startsWith, rfind

proc commonPrefix*(paths: openArray[string], sep = "/"): string =
  if len(paths) == 0:
    return ""
  result = paths[0]
  for path in paths[1 .. ^1]:
    while not path.startsWith(result) and len(result) > 0:
      result = result[0 .. ^2]
  result = result[0 .. result.rfind(sep)]
Tests:

test "commonPrefix":
  check(commonPrefix(["/tmp/test1", "/tmp/data.dat"]) == "/tmp/")
  check(commonPrefix(["/home/mark/test1", "/tmp/data.dat"]) == "/")
  check(commonPrefix(["/home/mark/test1", "local/data.dat"]) == "")
Something like this, using setLen to shrink in place instead of allocating a new slice each time:
import strutils

proc commonPrefix*(paths: openArray[string], sep = "/"): string =
  if len(paths) > 0:
    result = paths[0]
    for i in 1 .. paths.high:
      let path = paths[i]
      while not path.startsWith(result) and len(result) > 0:
        result.setLen(result.high)
    result.setLen(result.rfind(sep) + 1)
When your commonPrefix takes 2 paths, both of length n, sharing little or no common prefix, it has O(n^2) time complexity: the while loop can run up to n times, and each iteration does O(n) work, since startsWith rescans whatever prefix is shared and result[0 .. ^2] copies the string.
The relativePath proc in the os module calculates the common prefix in O(n) time: https://github.com/nim-lang/Nim/blob/devel/lib/pure/os.nim
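To make the worst case concrete (sizes here are illustrative), a pair of paths with a long shared-then-diverging prefix forces every shrink step to rescan the shared part:

import strutils  # repeat

# ~n/2 shrink iterations, each rescanning ~n/2 matching chars in
# startsWith (and re-copying result in the slicing version): ~n^2/4 work.
let n = 10_000
let worst = ["a".repeat(n), "a".repeat(n div 2) & "b".repeat(n div 2)]
echo commonPrefix(worst)  # "" after the sep trim, but quadratic to get there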
This version walks forward char by char. Is there a Nim equivalent to Python's timeit module for timing snippets?
import strutils  # rfind

proc commonPrefix*(paths: openArray[string], sep = "/"): string =
  if len(paths) == 0:
    return ""
  result = paths[0]
  var index = len(result)
  for path in paths[1 .. ^1]:
    var i = 0  # rescan from 0 so a later path can shrink the prefix
    while i < len(path) and i < index and result[i] == path[i]:
      inc i
    index = i
  result =
    if index == 0: ""
    else: result[0 .. result.rfind(sep, 0, index - 1)]
Is there a Nim equivalent to Python's timeit module for timing snippets?
There are probably about 100 variants in people's local code, but I happened to just see your message and also just added this to cligen/osUt the other day (it needs requires "cligen#head" in your .nimble file, or you could just copy this tiny impl and adapt it as you like):
import strutils, times

template timeIt*(label: string, unit: float, places = 3, sep = "\n", body: untyped) =
  let t0 = epochTime()
  body
  let dt = epochTime() - t0
  stdout.write label, " ", formatFloat(dt * unit, ffDecimal, places), sep

timeIt("write1", 1e6, 3, " "): stderr.write "hi"
timeIt("write2", 1e6, 3, "\n"): stderr.write "there"
You may want to add loops to repeat and use the stats.RunningStat stdlib API to do something nicer, more like Python's version. There is also a package called golden (https://github.com/disruptek/golden) that tries to iterate until timing noise has been suppressed. And probably various other solutions, too. (And sometimes noise is the data and you want cold-cache/system-competition effects included... benchmarking is obviously a larger topic than a snippet thing.)

This is the fastest and best so far:
import strutils  # rfind

proc commonPrefix*(paths: openArray[string], sep = "/"): string =
  if len(paths) == 0:
    return ""
  let first = paths[0]
  var index = -1
  block loop:
    for i in 0 ..< len(first):
      for j in 1 .. paths.high():
        let path = paths[j]
        if i >= len(path) or first[i] != path[i]:  # stop at a short path, too
          break loop
      index = i  # only count a column once every path has matched it
  if index == -1:
    return ""
  else:
    return first[0 .. first.rfind(sep, 0, index)]
E.g., in Nim it might look like this:
proc minLen(strs: openArray[string]): int =
  result = int.high
  for s in strs: result = min(result, s.len)

proc check(strs: openArray[string]; j0, j1: int): bool =
  let srcSlc = strs[0][j0..j1]
  for s in strs:
    if s[j0..j1] != srcSlc: return false
  return true

proc longestCommonPfx*(strs: openArray[string]): string =
  if strs.len < 1: return ""
  let src = strs[0]
  if strs.len < 2: return src
  var lo = 0                    # Binary search
  var hi = strs.minLen - 1
  while lo <= hi:
    let mid = lo + (hi - lo) div 2
    if check(strs, lo, mid):    # All strs match this
      result.add src[lo .. mid] # Append to answer
      lo = mid + 1              # Go higher
    else:                       # Go lower
      hi = mid - 1

when isMainModule:
  import os
  echo longestCommonPfx(commandLineParams())
There are also a lot of variants at http://rosettacode.org/wiki/Longest_common_prefix
For paths you would just need to add something to backup to the most recent directory separator, if any.
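For instance, a small wrapper along those lines (commonPathPfx is an illustrative name, not from the post; rfind returning -1 conveniently makes the slice empty):

import strutils  # rfind

proc commonPathPfx*(paths: openArray[string], sep = "/"): string =
  let pfx = longestCommonPfx(paths)
  # keep everything up to and including the last separator, else ""
  result = pfx[0 .. pfx.rfind(sep)]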
And before someone corrects me, an even less embarrassing variant might be:
proc rangeAt*[T](xs: openArray[T]): tuple[minAt, maxAt: int] =
  if xs.len < 1: return (-1, -1)  # raise?
  for i in 1 ..< xs.len:          # note tuple ints init to zero
    if xs[i] < xs[result.minAt]: result.minAt = i
    if xs[i] > xs[result.maxAt]: result.maxAt = i  # fixed: originally updated minAt twice; see the catch below
with, then, let (minAt, maxAt) = strs.rangeAt and tracking strs[minAt]/strs[maxAt] changes in the string comparison for the prefix-length algo. Some paired-up, value-oriented interface for those who want it would also make sense:
proc range*[T](xs: openArray[T]): tuple[min, max: T] {.inline.} =
  if xs.len < 1: return  # raise?
  let (minAt, maxAt) = xs.rangeAt
  result.min = xs[minAt]
  result.max = xs[maxAt]
in which case maybe no change to the prefix len algo is needed.
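The reason comparing just two strings suffices perhaps deserves a sentence: the lexicographic minimum and maximum bracket every other string, so any prefix those two share must be shared by all. So the prefix-length algo on top of the value interface is just (lcpLenViaRange is an illustrative name):

proc lcpLenViaRange(strs: openArray[string]): int =
  if strs.len == 0: return 0
  let (mn, mx) = strs.range   # lexicographic extremes bound the rest
  for i, c in mn:             # the first mn/mx difference ends the
    if c != mx[i]: return i   # prefix shared by every string
  return mn.len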
While off the topic of the prefix calculation, it perhaps bears noting that SSE/AVX can sometimes speed up the value-oriented versions of these range computations over arrays of CPU-supported types like openArray[float32] by large factors, even 24x. Basically, branchless, batching vector min/max instructions can be leveraged, but I am unaware of CPUs supporting "minAt"-type vector instructions. Recent versions of gcc can recognize these min/max value sweeps and correctly generate optimized code, but such pattern recognition can often fail as the code grows just a little more complex.
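For instance, a value-only float32 sweep of the kind the vectorizer can recognize (range32 is an illustrative name):

proc range32(xs: openArray[float32]): tuple[min, max: float32] =
  # no index tracking, so -O3-class compilers can often map this
  # onto branchless vector min/max instructions
  if xs.len == 0: return
  result = (min: xs[0], max: xs[0])
  for x in xs:
    result.min = min(result.min, x)
    result.max = max(result.max, x)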
Using your for j in 1 .. paths.high() is much faster than my for s in paths[1 .. ^1], but for me my foldl is much faster than your maxLen. (My test data isn't great, though: a dozen calls, eleven with common prefixes, repeated 100K times.)
Maybe a common prefix algorithm should be in the std. lib., perhaps with one version that is "raw" and another which does prefixes only at a given separator?
It's a bit tricky to explain why your foldl would be much faster than my minLen (note "min", not "max"). Perhaps you aren't compiling with max optimization/-d:danger or something? Or perhaps it's some peculiarity of your test data. Total data size, number of strings, and size of the found prefixes are more salient statistics than how many times you repeat, though another possible explanation is that 100K repeats in a tight loop does something weird with a warmed-up CPU branch predictor. My timings come from a Unix shell loop doing 10..100 repeated runs of the whole program. As with much benchmarking, it's debatable what is most representative.
I'll include the whole program below for reference. I called @marks' algorithm lcpVertical for its vertically-oriented scanning and swapped i & j to have more traditional j=column index notation and a slightly earlier exit. Because cligen allows unique prefixes, ./lcp -alcpb /usr/bin/\* is a valid way to run things in binary-search mode. I don't do the final rfind(sep) work in any of those, but it's the same work for all the algorithms anyway and just a couple-line wrapper proc.
Anyway, I've run with a variety of inputs and things track my stated expectations perfectly (compiled with the devel version of Nim, -d:danger, gcc-9.2, on Linux, i7-6700k). For really large inputs the range algo works best; for really small ones, the binary search. So, an optimum at all scales would probably switch from binary search to range at some machine/system-load dependent threshold.
stdlib-wise, the range LCP algo is never worse than about 2X the run time on my various /usr/xyz/* type tests, though. So, from a non-tuning/performance-robustness point of view, it would be a better candidate for stdlib inclusion. Beyond the "one stop shopping" algo-wise and the almost trivial implementation, its rangeAt & range helpers are much more basic/common operations than longest common prefix. Beyond that basicness, as mentioned, the value-based range may afford strong speed-ups for specialized types. So, it might be helpful to have a single point of reference that could be optimized/assumed optimized the way certain standard C library functions are, such as memchr.
when defined(foldl):
  import sequtils               # For me this is slower
  proc minLen(strs: openArray[string]): int {.inline.} =
    foldl(strs.mapIt(len(it)), min(a, b))
else:
  proc minLen(strs: openArray[string]): int {.inline.} =
    result = int.high
    for s in strs: result = min(result, s.len)

proc check(strs: openArray[string]; j0, j1: int): bool =
  let first = strs[0]
  for s in strs:
    for j in j0 .. j1:
      if s[j] != first[j]: return false
  return true

proc lcpLenBinSearch(strs: openArray[string]): int =
  if strs.len == 0: return 0    # maybe -1 instead?
  if strs.len == 1: return strs[0].len
  var lo = 0                    # Binary search
  var hi = strs.minLen - 1
  while lo <= hi:
    let mid = lo + (hi - lo) div 2
    if check(strs, lo, mid):    # All strs match this
      result += mid + 1 - lo    # Append to answer
      lo = mid + 1              # Go higher
    else:                       # Go lower
      hi = mid - 1

proc rangeAt*[T](xs: openArray[T]): tuple[minAt, maxAt: int] =
  if xs.len == 0: return (-1, -1)  # raise?
  for i in 1 ..< xs.len:
    if xs[i] < xs[result.minAt]: result.minAt = i
    if xs[i] > xs[result.maxAt]: result.maxAt = i  # fixed per the catch below

proc range*[T](xs: openArray[T]): tuple[min, max: T] =
  if xs.len == 0: return  # raise?
  let (minAt, maxAt) = xs.rangeAt
  result.min = xs[minAt]
  result.max = xs[maxAt]

proc lcpLenRange(strs: openArray[string]): int =
  if strs.len == 0: return -1  # raise?
  let (minAt, maxAt) = strs.rangeAt
  for i, c in strs[minAt]:
    if c != strs[maxAt][i]: return i
  return strs[minAt].len

proc lcpLenVertical(strs: openArray[string]): int =
  if strs.len == 0: return 0   # simplified marks algo
  let first = strs[0]
  for j in 0 ..< first.len:
    for i in 1 ..< strs.len:
      if j >= strs[i].len or first[j] != strs[i][j]:
        return j
  return first.len

type lcpAlgo* = enum lcpBinSearch, lcpRange, lcpVertical

proc lcpLen*(strs: openArray[string], algo = lcpVertical): int =
  case algo
  of lcpBinSearch: return strs.lcpLenBinSearch
  of lcpRange: return strs.lcpLenRange
  of lcpVertical: return strs.lcpLenVertical

proc lcp*(strs: openArray[string], algo = lcpBinSearch): string =
  if strs.len < 1: return ""
  result = strs[0][0 ..< strs.lcpLen(algo)]

when isMainModule:  # asserts are the Rosetta Code tests
  assert lcp(["interspecies", "interstellar", "interstate"]) == "inters"
  assert lcp(["throne", "throne"]) == "throne"
  assert lcp(["throne", "dungeon"]) == ""
  assert lcp(["throne", "", "dungeon"]) == ""
  assert lcp(["cheese"]) == "cheese"
  assert lcp([""]) == ""
  assert lcp(@[]) == ""
  assert lcp(["prefix", "suffix"]) == ""
  assert lcp(["foo", "foobar"]) == "foo"

  import cligen, cligen/osUt
  proc timeAlgo(algo = lcpRange, strs: seq[string]) =
    timeIt(" in", 1e6, 3, " microseconds via " & $algo & "\n"):
      stdout.write strs.lcpLen(algo)
  dispatch(timeAlgo)
In rangeAt you have

  if xs[result.minAt] < xs[i]: result.minAt = i
  if xs[result.minAt] > xs[i]: result.minAt = i

whereas I suspect you meant to use maxAt in the second line. Does this affect your timing results?

Oops. Didn't survive some edits. Good catch as always, @e. Will fix in the prior post with a note. It does not affect my results, which are like this:
$ echo /usr/bin/*|wc
1 3077 58306
$ (repeat 50; lcp -alcpV /usr/bin/*)|mnsd
9 in 53.97 +- 0.30 microseconds via lcpVertical
$ (repeat 50; lcp -alcpR /usr/bin/*)|mnsd
9 in 51.33 +- 0.42 microseconds via lcpRange
$ (repeat 50; lcp -alcpB /usr/bin/*)|mnsd
9 in 39.29 +- 0.78 microseconds via lcpBinSearch
(I use Zsh, which has repeat, and mnsd, a little mean-sdev computer.) Then for larger inputs:
$ echo /usr/lib/python3.6/**|wc
1 39036 3108640
$ (repeat 10 lcp -alcpV /usr/lib/python3.6/**)|mnsd
19 in 2388. +- 14. microseconds via lcpVertical
$ (repeat 10 lcp -alcpR /usr/lib/python3.6/**)|mnsd
19 in 1131.1 +- 1.5 microseconds via lcpRange
$ (repeat 10 lcp -alcpB /usr/lib/python3.6/**)|mnsd
19 in 1523. +- 17. microseconds via lcpBinSearch
The L3 on an i7-6700k is 8 MB. So, this is showing the range method pulling ahead while still in L3 (3.1 MB). At one pass, it really has to be approximately the fastest at large scale. Reading from DIMMs, the difference should be even more pronounced in favor of range.
It's possible that the marks/vertical scan style might be faster if the answer is very small. E.g., for a zero common prefix length, the answer will be concluded after just one pass down the first column via lcpVertical.
For explicitness:

proc lcpLenVRange(strs: openArray[string]): int =
  if strs.len == 0: return -1  # raise?
  var minAt, maxAt: int
  for i in 1 ..< strs.len:
    if strs[i] < strs[minAt]: minAt = i
    if strs[i] > strs[maxAt]: maxAt = i
    if strs[i].len > 0 and strs[i][0] != strs[0][0]:
      return 0
  for i, c in strs[minAt]:
    if c != strs[maxAt][i]: return i
  return strs[minAt].len
Integrating that into the test harness appropriately, running on my two running example inputs (extended with a case having a zero-length common prefix), and switching to mnmx (which gives the range of times instead of mean +- sdev) then gives:
$ for a in lcpVe lcpR lcpB lcpVR; { (repeat 20; lcp -a$a /usr/lib/python3.6/**)|mnmx }
19 in 2300.262..2448.32 microseconds via lcpVertical
19 in 1122.236..1162.767 microseconds via lcpRange
19 in 1523.256..1558.781 microseconds via lcpBinSearch
19 in 1189.47..1224.756 microseconds via lcpVRange
$ for a in lcpVe lcpR lcpB lcpVR; { (repeat 20; lcp -a$a /usr/bin/*)|mnmx }
9 in 51.737..76.532 microseconds via lcpVertical
9 in 48.637..68.903 microseconds via lcpRange
9 in 36.24..40.054 microseconds via lcpBinSearch
9 in 46.253..54.598 microseconds via lcpVRange
$ for a in lcpVe lcpR lcpB lcpVR; { (cd /usr/bin; repeat 20; lcp -a$a *)|mnmx }
0 in 3.099..5.722 microseconds via lcpVertical
0 in 43.392..80.109 microseconds via lcpRange
0 in 8.583..13.113 microseconds via lcpBinSearch
0 in 3.099..5.245 microseconds via lcpVRange
$ for a in lcpVe lcpR lcpB lcpVR; { (cd /usr/lib/python3.6; repeat 20; lcp -a$a **)|mnmx }
0 in 7.868..11.683 microseconds via lcpVertical
0 in 911.713..959.158 microseconds via lcpRange
0 in 174.522..190.735 microseconds via lcpBinSearch
0 in 9.537..14.782 microseconds via lcpVRange
So, you can get the early exit of the vertical method for zero common prefixes in the range method without very much extra cost. E.g., slightly negative/in-the-noise cost for the L2 prefixLen==9 case and 6% slower for the prefixLen==19 L3 case, with the early exits basically matching times in the two prefixLen==0 cases.
(And, yeah, yeah... the Nim program should import stats, do the repeated loops internally updating a RunningStat, and report. It could even take an lcpAll enum value to report on all algorithms, and -- if one wants to commit to only file-tree inputs -- maybe even take just a directory as a command parameter, import oswalkdir, and use walkDir and walkDirRec instead of * and ** to remove the dependency on shell features.)
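That internal-repeats idea might look roughly like this (a sketch, assuming the harness's lcpAlgo now includes lcpVRange; benchLcpLen and its reps default are mine, not from the thread):

import stats, times

proc benchLcpLen(strs: seq[string], algo: lcpAlgo, reps = 20): RunningStat =
  var sink = 0                             # keep the work observable
  for _ in 1 .. reps:
    let t0 = epochTime()
    sink += strs.lcpLen(algo)
    result.push((epochTime() - t0) * 1e6)  # one sample, in microseconds
  doAssert sink >= 0                       # defeat dead-code elimination

# usage: let st = benchLcpLen(strs, lcpVRange)
#        echo st.mean, " +- ", st.standardDeviation, " usec"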
Things in that last version weren't robust/efficient with regard to zero-length strings being in the mix. This fixes that:
proc lcpLenVRange(strs: openArray[string]): int =
  if strs.len == 0: return -1    # raise?
  if strs[0].len == 0: return 0  # ANY "" anywhere => 0
  let byte0 = strs[0][0]
  var minAt, maxAt: int
  for i in 1 ..< strs.len:
    if strs[i].len == 0: return 0
    if strs[i] < strs[minAt]: minAt = i
    if strs[i] > strs[maxAt]: maxAt = i
    if strs[i][0] != byte0: return 0
  for i, c in strs[minAt]:
    if c != strs[maxAt][i]: return i
  return strs[minAt].len
In the interests of being mostly self-contained, I also pushed a better timeIt to cligen#head and updated my driver to use the new timeIt and to not time the stdout.write parts:
proc timeAlgo(algo = lcpVRange, reps = 1, strs: seq[string]) =
  var x: int
  timeIt(stdout.write, "", 1e-6, 3, " usec via " & ($algo)[3..^1], reps):
    x += strs.lcpLen(algo)
  echo " ", int(x.float / reps.float + 0.5)
Finally, I created a benchLcp.zsh script:
#!/bin/zsh
lcp=`pwd`/lcp  # We need Zsh only for its **
echo "L3 data, 19 answer"
for a in lcpVe lcpR lcpB lcpVR; do $lcp -r10 -a$a /usr/lib/python3.6/**; done
echo "L2 data, 9 answer"
for a in lcpVe lcpR lcpB lcpVR; do $lcp -r10 -a$a /usr/bin/*; done
echo "L3 data, 0 answer"
for a in lcpVe lcpR lcpB lcpVR; do (cd /usr/lib/python3.6; $lcp -r10 -a$a **); done
echo "L2 data, 0 answer"
for a in lcpVe lcpR lcpB lcpVR; do (cd /usr/bin; $lcp -r10 -a$a *); done
and used a little nim-pgo wrapper I have that applies gcc's profile-guided optimization (-fprofile-generate and -fprofile-use): compiling, running a test/bench, then re-compiling. With all that, I get:
L3 data, 19 answer
1910.448..2253.771 1959.491 +- 32.812 usec via Vertical 19
686.884..1150.131 777.125 +- 48.653 usec via Range 19
922.680..1522.064 996.113 +- 58.569 usec via BinSearch 19
693.560..1163.244 786.591 +- 49.932 usec via VRange 19
L2 data, 9 answer
41.246..63.181 45.848 +- 2.480 usec via Vertical 9
37.193..54.359 40.936 +- 1.778 usec via Range 9
22.650..25.511 23.413 +- 0.353 usec via BinSearch 9
36.716..55.075 41.509 +- 2.292 usec via VRange 9
L3 data, 0 answer
0.000..0.954 0.167 +- 0.094 usec via Vertical 0
535.488..1017.570 627.780 +- 48.740 usec via Range 0
58.651..205.994 82.874 +- 15.300 usec via BinSearch 0
0.715..3.576 1.097 +- 0.278 usec via VRange 0
L2 data, 0 answer
0.000..0.000 0.000 +- 0.000 usec via Vertical 0
32.663..38.862 33.927 +- 0.650 usec via Range 0
3.099..4.768 3.290 +- 0.166 usec via BinSearch 0
0.000..0.238 0.048 +- 0.032 usec via VRange 0
which makes total sense to me. The @marks performance discrepancies (except his mysterious foldl bit) probably just come from the "mix"/averaging of how many zero-length answers are in his inputs.
So, this new variant of the VRange hybrid method would make the most sense to recommend for general usage (of the above selection, anyway). It's never very slow, is fastest at large scale when performance matters most, and can leverage early exit in easy cases.
The binary-search way may be best for smaller-scale data if the caller already knows the scale and can pass it in as a parameter. If we must measure the data size, that takes a whole pass through the length fields. That measurement eats into the 36/22 = 1.6x L2 non-zero speed ratio, probably reducing it to more like 1.3x, which isn't much of a boost for all the complexity of measurement + algorithm switching. I got about a 1.15x boost from PGO, and most would probably skip that entirely to avoid build complexity.
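In code, such a switch might look like this (a sketch; the 64 KiB threshold is invented and machine/load dependent, per the above):

proc lcpLenAuto(strs: openArray[string]): int =
  var total = 0
  for s in strs: total += s.len    # the measuring pass discussed above
  if total < 64 * 1024:            # invented threshold; tune per machine
    result = strs.lcpLenBinSearch  # small data: binary search wins
  else:
    result = strs.lcpLenVRange     # large data: one-pass range wins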