nimforum mirror - Recursively replace pair of bytes with one byte, within given byte array

drkameleon (orginal) [2022-08-23T18:41:24+02:00] view original

So... what I'm trying to do is pretty much this: we have a seq[byte] and given two consecutive byte values, we have to replace them with one.

Like:


A B C D E F G A B F (replace A B with W)

W C D E F G W F

I've been experimenting with different algorithms.

This is my original implementation:

proc substitute2to1r*(a: ByteArray, subA: OpCode, subB: OpCode, replacement: OpCode): ByteArray =
    var i = 0
    let aLen = a.len
    while i < aLen-2+1:
        let diffA = a[i] - (Byte)(subA)
        if diffA >= 0 and diffA <= 29 and diffA == a[i+1] - (Byte)(subB):
            result.add((Byte)(replacement) + diffA)
            i.inc(2)
        else:
            result.add(a[i])
            i.inc(1)
    result.add(a[i..^1])

This is a better version of the above (the most performant of the 3 here):

proc substitute2to1r2*(a: ByteArray, subA: OpCode, subB: OpCode, replacement: OpCode): ByteArray =
    var i = 0
    var cntr = 0
    let aLen = a.len
    newSeq(result, aLen)
    while i < aLen-2+1:
        let diffA = a[i] - (Byte)(subA)
        if diffA >= 0 and diffA <= 29 and diffA == a[i+1] - (Byte)(subB):
            result[cntr] = (Byte)(replacement) + diffA
            cntr.inc()
            i.inc(2)
        else:
            result[cntr] = a[i]
            cntr.inc()
            i.inc(1)
    for j,it in a[i..^1]:
        result[cntr+j] = it
    result.setLen(cntr + a[i..^1].len)

And this is an attempt with in-place modification:

proc sub2to1r2*(a: var ByteArray, subA: OpCode, subB: OpCode, replacement: OpCode) =
    var i = 0
    var cntr = 0
    let aLen = a.len
    while i < aLen-2+1:
        let diffA = a[i] - (Byte)(subA)
        if diffA >= 0 and diffA <= 29 and diffA == a[i+1] - (Byte)(subB):
            a[i] = (Byte)(replacement) + diffA
            a[i+1] = (Byte)255
            cntr.inc()
            i.inc()
        i.inc()
    a.keepIf((item) => item != (Byte)255)

I thought by doing it in-place, there would be something to gain, but I guess I'm still not doing it in the best possible way.

Given that what I'm aiming for is speed, any suggestions?

auxym (orginal) [2022-08-23T19:42:12+02:00] view original

This reminds of an AOC problem for last year: https://adventofcode.com/2021/day/14

IIRC the trick for that one might have been memoization/dynamic programming. You could try that? You can find a bunch of hints and discussion on reddit: https://www.reddit.com/r/adventofcode/search/?q=2021%20day%2014&restrict_sr=1&sr_nsfw=

xigoi (orginal) [2022-08-23T22:55:34+02:00] view original

Memoization helped with the AoC problem because it was applying the same transformation several times to the same string. It doesn't look like that's the case here.

shirleyquirk (orginal) [2022-08-24T02:29:32+02:00] view original

please tell me if my assumptions are off, but your functions aren't quite working the way you say they should.

type
  ByteArray = seq[byte]
  OpChar = byte
  Byte = byte

let
  input = "ABCDEFGBF".mapIt(it.byte)
  output = input.substitute2to1r('A'.byte,'B'.byte,'W'.byte)

echo output.mapIt(it.char)
#@['W', 'Y', '[', 'G', 'W', 'F']

xigoi (orginal) [2022-08-24T09:35:59+02:00] view original

As he said:

drkameleon (orginal) [2022-08-24T10:07:03+02:00] view original

Exactly!

I just didn't want to bother you with too many details while still maintaining the code roughly as it looks.

What I'm experimenting with (and it's still an experiment - at least its implementation - that's why it looks so draft-like) is a bytecode optimizer for Arturo.

Basically, Arturo's evaluator produced a bytecode object, which is nothing other than a list of constants (=data) and a list of opcodes (=code).

https://github.com/arturo-lang/arturo/blob/master/src/vm/opcodes.nim

What I'm experimenting with here is a way of going through these opcodes and performing some simple transformations.

For example, the above code could be used for something like Store+Load->StoreLoad transformations.

Take this as an example (in Arturo):


x: 1
print 1 + x

This produces the following bytecode:


        consti1   -> push 1 to the stack
        store0    -> pop stack and store it as 'x'
        load0      -> load variable 'x' onto the stack
        consti1    -> push 1 to the stack
        call2         -> call function ('add')
        call1          -> call function ('print')
        end

Given that... a very simple peephole optimization, but still: store0 + load0 could be reduced to 1 single opcode (storl0) which stores the value to 'x' but does not pop it from the stack.

So in that case, we would have to go through the bytecode and replace all sequences of StoreX-LoadX with StorlX (where X=0..29).

That's exactly what the code above does. ;-)

radsoc (orginal) [2022-08-24T13:25:18+02:00] view original

I don't know if you have to deal with one large ByteArray or many small ones, but this part of your code (the fastest one) is quite expensive:

for j,it in a[i..^1]:
        result[cntr+j] = it
    result.setLen(cntr + a[i..^1].len)

My naive implementation is faster mainly because of this.

import std/times

type
  ByteArray = seq[byte]
  OpCode = byte
  Byte = byte

proc substitute2to1naive*(a: ByteArray, subA: OpCode, subB: OpCode,
    replacement: OpCode): ByteArray =
  newSeq(result, a.len)
  var resultIdx = 0
  var lastByte = a[0]
  var currentByte: Byte
  var skipNext: bool
  for i in 1..<a.len:
    currentByte = a[i]
    if currentByte == subB and lastByte == subA:
      result[resultIdx] = replacement
      inc resultIdx
      lastByte = byte('\x00')
      skipNext = true
    elif skipNext:
      skipNext = false
      lastByte = currentByte
    else:
      result[resultIdx] = lastByte
      inc resultIdx
      lastByte = currentByte
  result[resultIdx] = lastByte
  result.setLen(resultIdx + 1)

proc substitute2to1r2*(a: ByteArray, subA: OpCode, subB: OpCode,
    replacement: OpCode): ByteArray =
  var i = 0
  var cntr = 0
  let aLen = a.len
  newSeq(result, aLen)
  while i < aLen-2+1:
    if a[i] == subA and a[i+1] == subB:
      result[cntr] = replacement
      cntr.inc()
      i.inc(2)
    else:
      result[cntr] = a[i]
      cntr.inc()
      i.inc(1)
  for j, it in a[i..^1]:
    result[cntr+j] = it
  result.setLen(cntr + a[i..^1].len)

func toString(bytes: openArray[byte]): string =
  result = newString(bytes.len)
  copyMem(result[0].addr, bytes[0].unsafeAddr, bytes.len)

let input = cast[seq[byte]]("AABBCDEFGABHIJKLMABNOPQRSTABUVWXYZ")
var output1, output2: seq[byte]

var t1 = epochTime()
for i in 1..1_000_000:
  output1 = input.substitute2to1naive('A'.byte, 'B'.byte, '.'.byte)
echo output1.toString
# A.BCDEFG.HIJKLM.NOPQRST.UVWXYZ
echo (epochTime() - t1)

var t2 = epochTime()
for i in 1..1_000_000:
  output2 = input.substitute2to1r2('A'.byte, 'B'.byte, '.'.byte)
echo output2.toString
# A.BCDEFG.HIJKLM.NOPQRST.UVWXYZ
echo (epochTime() - t2)

If only I knew how to use SIMD...

emre (orginal) [2022-08-28T05:25:04+02:00] view original

To be perfectly precise, I'm not even substituting A+B with C; what is actually going on is substituting a range around A + B, with a range around C.

Can you demonstrate what this means with an example, otherwise it seems you could you use a string search algorithm.

dailyak77 (orginal) [2022-08-28T09:02:44+02:00] view original

The first one was when taking input from user, the second was when I had input specifically set to "Hello World" without bothering with asking for input. In both cases, there are extra characters, in the first case where it asks for input, it really messes up. Why?

Secondly, is this the right way to convert a char literal into an array of bytes?

dlesnoff (orginal) [2022-08-28T12:29:23+02:00] view original

I'll give you some ideas for the algorithm in pseudo-code, I have no implementation yet. I don't know if these ideas are perfect for your setting, this ranges might change what we optimize.

Instead of moving either from one/two byte whether you have done a replacement or not, you could always read your input data two by two.

Doing so, forces you to special case the last read and to distinguish at least the 4 following cases:

Your two bytes are AB, you replace them with C

Your two bytes are XB or AX with X another byte, you do not change anything and you move forward

Your two bytes are XA, with X different from B. You use a boolean to remember at the next step, that if you have BX, you will have to replace your string, since you read XABX.

Your two bytes are of the form BX. Check your boolean to know if you have a change to do. If it is BA, you will have to change your boolean too.

Now, I believe you want to do it recursively. The problem is if you have a string with AABB. You will not detect it in the first pass, if you do not check for this. You might have problematic strings with lets say: "XXXXXAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBXXXXXX" You will do a lot of pass, and read all the X many times for each pass.

To circumvent this, the best approach (without benchmark, so I can't tell for sure) would be to split your string in blocks.

If you have done a pass on two strings, what can we tell about the replacement we have to do in the concatenation of both strings ? Well, we have to check for the end of the first string and the beginning of the second string. We can have an approach where we do something like:

for s in substrings:
  substitute2to1(s)
for s1, s2 in [(1,2), (2,3), (3,4), ...]:
  checkForSubstitutionInConcatenation(s1, s2)

I do not think that you will benefit from fast pattern matching algorithms like https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm

The limitation in the inplace algorithm is probably more the write operation than the read operation. I think that if you want to change your bytes in place, you are better off beginning from the end of your seq. I think that every time you change two bytes by one, the sequence will have to move all the end of your seq to keep data consecutive.

I do not think Nim forces data to be consecutive in memory, but it is best to try to do the modifications from the end to the start of the sequence.

kobi (orginal) [2022-08-28T17:02:32+02:00] view original

I am not sure I understand exactly the request, but this seems like a string searching algorithm, then replacing all the found positions. look into boyer-moore and similar. I think Nim has one in the stdlib (Sunday algorithm?). once you get a seq of all the index positions, and you know the length of the original opcodes, then compute the slices needed, and join them with the target opcode. (or iterating... whatever is more performant)

Mirror of forum.nim-lang.org

9407 :: Recursively replace pair of bytes with one byte, within given byte array