nimforum mirror - Bug in PEG or my fault?

HiPhish (orginal) [2022-03-10T20:05:18+01:00] view original

Hello,

I have been doint last winter's Advent of Code puzzles in order to practice Nim, and I am stuck at the day 5 puzzle. I use a PE grammar to parse the input; I could have used regular expressions but since the puzzles are going to become progressively harder I figured I might as well learn PEG while the input is still simple enough.

Here is the sample input:

0,9 -> 5,9
8,0 -> 0,8
9,4 -> 3,4
2,2 -> 2,1
7,0 -> 7,4
6,4 -> 2,0
0,9 -> 2,9
3,4 -> 1,4
0,0 -> 8,8
5,5 -> 8,2

Each line of text describes a line: first come the x- and y coordinate of the first point, then an arrow, and finally the coordinates of the last point. My grammar is:

let
  grammar = peg"""
    lines      <- line (\n line)* \n?
    line       <- first \s+ '->' \s+ last
    first      <- point
    last       <- point
    point      <- x ',' y
    x          <- coordinate
    y          <- coordinate
    coordinate <- number
    number     <- \d+
  """

The problem is that if the text file contains a trailing line feed Nim will try to parse the line feed as a number as well. I don't know why, because the last newline of the grammar is optional. The grammar does match, but the event parser throws an error because it cannot parse the last "number", which is just an empty string. Here is my event parser:

type
  Coordinate* = distinct Natural
  Point* = object
    x*, y*: Coordinate
  Line* = object
    first*, last*: Point

converter to_coordinate*(n: int): Coordinate =
  Coordinate(n)

proc parse_lines*(input: string): seq[Line] =
  var
    lines: seq[Line]
  let
    parser = block:
      var
        line: Line
        point: Point
        x, y, coordinate: Coordinate
      grammar.eventParser:
        pkNonTerminal:
          leave:
            let matchstr = s.substr(start, start + length - 1)
            case p.nt.name
            of "line":
              lines.add(line)
            of "first":
              line.first = point
            of "last":
              line.last = point
            of "point":
              point = Point(x: x, y: y)
            of "x":
              x = coordinate
            of "y":
              y = coordinate
            of "number":
              debugEcho "[", matchstr, "]"
              coordinate = matchstr.parseInt()
  if parser(input) != len(input):
    raise newException(ValueError, "Error parsing input string")
  lines

discard parse_lines(input)  # where input is a text string read from a file

The debugEcho is there for me to verify that all the numbers have been parsed, and it does indeed print all 40 numbers, plus an extra empty pair of brackets. I cannot see anything wrong here, the grammar is pretty straight forward. I solved the previous day's challenge without issue in a similar way. Any idea what is going on here?

Araq (orginal) [2022-03-11T05:59:48+01:00] view original

Your grammar is wrong, PEGs are deterministic and this rule


lines      <- line (\n line)* \n?

can only cause problems.

HiPhish (orginal) [2022-03-12T02:36:33+01:00] view original

I don't understand how my grammar is not deterministic: one or more line`s separated by a `\n, terminated by an optional \n. Maybe my understanding of determinism is wrong, but I used a similar grammar for the previous day without issues.

let
  grammar* = peg"""
    spec      <- draws (separator board)* \n?
    separator <- \n \n
    draws     <- draw (',' draw)*
    draw      <- number
    board     <- row \n row \n row \n row \n row
    row       <- \s* entry \s+ entry \s+ entry \s+ entry \s+ entry
    entry     <- number
    number    <- \d+
  """

Do you have a resource to learn from? I don't want to waste your time with non-Nim topics.

Araq (orginal) [2022-03-12T07:02:06+01:00] view original

PEGs are a strict representation of the simple imperative code that you would write if you were writing a parser by hand.

https://pest.rs/book/grammars/peg.html#non-backtracking

gemath (orginal) [2022-03-12T13:05:51+01:00] view original

The documentation of eventParser says that leave handlers are called with length set to -1 for an unsuccessful match. Your code tries to parse an unsuccessful line match in the (\n line)* expression, which cannot work. If the body of the leave block is wrapped in a if length > -1:, it works.

This is not strictly backtracking, since the parser only tries to peek one character ahead after the \n in the expression to find out that the match is unsuccessful, there is no buffer to be rewound and re-read. So the grammar works, but it is pretty regex-y. An IMHO more idiomatic rule would be


lines      <- (line (\n / $))+

HiPhish (orginal) [2022-03-13T01:07:03+01:00] view original

Thank you, that solved the problem. So if I understand correctly, the parser is trying to match pattern \n line, finds the \n, then tries matching line, which means it tries matching first, point, x, coordinated and finally number, which fails and causes the parser to leave that node.

But then, why does the previous day's code work? Is it because there are no false matches there since the separator is \n \n?

Araq (orginal) [2022-03-13T06:40:05+01:00] view original

But then, why does the previous day's code work? Is it because there are no false matches there since the separator is n n?

I think so.

gemath (orginal) [2022-03-13T21:08:17+01:00] view original

it tries matching first, point, x, coordinated and finally number, which fails and causes the parser to leave that node.

Yes, but to be precise: every entered node is left (i.e. its leave handler is executed) eventually, regardless of whether the match succeeded or not. The handler just blows up when parseInt is called on an empty string because of the unsuccessful match.

Mirror of forum.nim-lang.org

8998 :: Bug in PEG or my fault?