nimforum mirror - String expression parsing

Stefan_Salewski [2021-08-25T21:34:28+02:00]

For the SDT tool in PCB mode we allow users to create pads not only with the mouse, but also by entering the data in the form of a string, which is a kind of batch mode. A PCB pad is basically a rectangle with an attached number and a name. The text input string has this shape:

Pad x1, y1, +x2, +y2, Radius, dx, dy, N, Number, Name

The first parameter, "Pad", is an optional string with default value "Pad".

x1 and y1 are floats in string form, as are x2 and y2. But x2 and y2 can each be prefixed with a "+", indicating that they are not the opposite corner of the pad but its width and height. Radius is an optional float parameter with default value zero. dx and dy are optional floats with default zero; they describe the translation if we have to create not just a single pad but several. N is an optional non-negative integer which tells how many pads should be created. Finally, Number and Name are plain optional strings.
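To make the meaning of the "+" prefix and of dx, dy and N concrete, here is a minimal sketch of how the parsed values could be turned into rectangles. The names `Rect` and `padRects` are made up for this illustration, and treating N = 0 as "one pad" is an assumption, not something stated above.

```nim
type Rect = object
  x1, y1, x2, y2: float

proc padRects(x1, y1, x2, y2: float; x2IsWidth, y2IsHeight: bool;
              dx, dy: float; n: int): seq[Rect] =
  ## Hypothetical helper: expands one pad description into corner rectangles.
  # With a '+' prefix the value is a size, otherwise the opposite corner.
  let right = if x2IsWidth: x1 + x2 else: x2
  let bottom = if y2IsHeight: y1 + y2 else: y2
  # Assumption: a missing or zero N still yields a single pad.
  for k in 0 ..< max(n, 1):
    let ox = dx * k.float
    let oy = dy * k.float
    result.add Rect(x1: x1 + ox, y1: y1 + oy, x2: right + ox, y2: bottom + oy)
```

With these assumptions, `padRects(100, 100, 20.5, 40.5, true, true, 100, 0, 2)` gives two pads, the second shifted by 100 in x.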

Unfortunately it gets more complicated because the values do not have to be separated by commas; semicolons or arbitrary whitespace should also be allowed. Example input:

Pad 100, 100 120.5;140.5 7.5 100 0 8 0 PAD

I tried using parseutils, but with that the proc fills the whole screen. strscans may work when the values are all separated by just one separator character, but with comma or semicolon plus arbitrary whitespace it is difficult.

So use PEGS or RegEx? Or preprocess the string to something like

Pad,100,100,120.5,140.5,7.5,100,0,8,0,PAD

first, and then use strscan?
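The preprocessing step could look roughly like the sketch below. It is only one possible reading of the rule "whitespace around at most one comma or semicolon"; `normalizeSeparators` is a name invented here, and two delimiters in a row are deliberately kept as an empty field so a later stage can reject them.

```nim
import std/strutils

const
  Blank = {' ', '\t'}
  Delim = {',', ';'}

proc normalizeSeparators(input: string): string =
  ## Rewrites every separator (blanks around at most one ',' or ';')
  ## as a single ','. Two delimiters in a row leave an empty field.
  let s = input.strip()
  var i = 0
  while i < s.len:
    if s[i] in Blank + Delim:
      # Consume blanks, then at most one delimiter, then blanks again.
      while i < s.len and s[i] in Blank: inc i
      if i < s.len and s[i] in Delim: inc i
      while i < s.len and s[i] in Blank: inc i
      result.add ','
    else:
      result.add s[i]
      inc i

echo normalizeSeparators("Pad 100, 100 120.5;140.5 7.5 100 0 8 0 PAD")
# prints: Pad,100,100,120.5,140.5,7.5,100,0,8,0,PAD
```

After this step the fields can be obtained with `split(',')` or fed to a scanner that only has to deal with a single separator character.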

Of course we could use 10 different input fields in the GUI instead. A toy program may do it that way. But for a real application 10 input fields waste too much space, and entering all that data into one entry field is faster and simpler -- users will generally separate the data by spaces.

I have a similar, simpler proc for entering the size of the drawing area, and that one is already ugly:

proc worldActivate(entry: Entry; pda: PDA) =
  var
    d: array[4, float] = [NaN, NaN, NaN, NaN]
    s: array[4, bool]
    t = entry.text
    i, j, k: int
    f: float
  i = 1
  entry.setIconFromIconName(EntryIconPosition.secondary, nil)
  for c in mitems(t):
    if c in {';', ','}:
      inc(i)
      c = ' '
    if c in {'0' .. '9'}:
      i = 0
    if i > 1:
      entry.setIconFromIconName(EntryIconPosition.secondary, "dialog-error")
      return
  while k < d.len:
    i = t.skipWhitespace(j)
    j += i
    if j == t.len:
      break
    s[k] = t[j] == '+'
    i = t.parseFloat(f, j)
    j += i
    if i > 0:
      d[k] = f
    inc(k)
  if k == 1:
    d[1] = d[0]
  elif k == 3:
    d[3] = d[2]
    s[3] = s[2]
  case k
  of 0:
    d = DefaultWorldRange
  of 1, 2:
    d[3] = d[1]
    d[2] = d[0]
    d[0] = 0
    d[1] = 0
  of 3, 4:
    if not s[2]:
      d[2] -= d[0]
    if not s[3]:
      d[3] -= d[1]
  else:
    discard
  t.setLen(0)
  for f in d:
    t.add(fmt("{f:g}, "))
  t.setlen(t.len - 2)
  entry.setText(t)
  (pda.dataX, pda.dataY, pda.dataWidth, pda.dataHeight) = d # (d[0], d[1], d[2], d[3])
  pda.fullScale = min(pda.darea.allocatedWidth.float / pda.dataWidth, pda.darea.allocatedHeight.float / pda.dataHeight)
  updateAdjustments(pda, 0, 0)
  pda.paint

sky_khan [2021-08-26T04:37:40+02:00]

I've used your problem for trying npeg out. I guess I have too much time :)

Disclaimer: This is my first try using npeg. Use at your own risk.

Here you go:

import npeg, strutils, tables

type Dict = Table[string, string]

let parser = peg("pad", d: Dict):
  pad <- ?padstr * sep * x1 * sep * y1 * sep * x2 * sep * y2 * sep * ?radius * sep * ?dx * sep * ?dy * sep * ?n * sep * ?number * sep * ?name
  padstr <- >*Alnum:
    d["pad"] = $1
  sep <- *' ' * ?{',',';'} * *' '
  integer <- +Digit
  real <- +Digit * '.' * +Digit
  numeric <- real | integer
  x1 <- >numeric:
      d["x1"] = $1
  y1 <- >numeric:
      d["y1"] = $1
  x2 <- >(?'+' * numeric ):
      d["x2"] = $1
  y2 <- >(?'+' * numeric):
      d["y2"] = $1
  radius <- >numeric:
      d["radius"] = $1
  dx <- >numeric:
      d["dx"] = $1
  dy <- >numeric:
      d["dy"] = $1
  n <- >numeric:
      d["n"] = $1
  number <- >*Alnum:
      d["number"] = $1
  name <- >*Alnum:
      d["name"] = $1

var data: Table[string, string]
data["pad"] = "pad"
data["radius"] = "0.0"
data["dx"] = "0.0"
data["dy"] = "0.0"

doAssert parser.match("Pad 100, 100 120.5,140.5 7.5 100 0 8 0 PAD", data).ok
echo data

Araq [2021-08-26T07:39:14+02:00]

first, and then use strscan?

First write a tokenizer and then a parser operating on tokens, that keeps things clean. I don't know which tools to use for these, I usually code them manually. NPeg and strscans can be helpful but avoid regular expressions.
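A hand-written version of that split could look like the sketch below. It is an illustration, not code from this thread: `Token`, `PadSpec`, `classify`, `tokenize` and `parsePad` are all invented names, and the rule that optional numeric fields are filled left to right while numeric tokens remain is an assumption about how the ambiguity between N and a numeric Number should be resolved.

```nim
import std/parseutils

type
  TokenKind = enum
    tkNumber, tkWord
  Token = object
    kind: TokenKind
    text: string   # raw field text
    value: float   # parsed value, valid when kind == tkNumber
    plus: bool     # true when a number carried a leading '+'
  PadSpec = object
    kind, number, name: string
    x1, y1, x2, y2, radius, dx, dy: float
    x2IsWidth, y2IsHeight: bool
    n: int

const
  Blank = {' ', '\t'}
  Delim = {',', ';'}

proc classify(text: string): Token =
  ## A field is a number when an optional '+' followed by a float spans
  ## the whole field; anything else is a word.
  result = Token(kind: tkWord, text: text)
  let start = if text.len > 0 and text[0] == '+': 1 else: 0
  var f: float
  if text.len > start and parseFloat(text, f, start) == text.len - start:
    result = Token(kind: tkNumber, text: text, value: f, plus: start == 1)

proc tokenize(input: string): seq[Token] =
  ## Each gap between fields is blanks around at most one ',' or ';'.
  var i = 0
  while i < input.len:
    while i < input.len and input[i] in Blank: inc i
    if i < input.len and input[i] in Delim: inc i
    while i < input.len and input[i] in Blank: inc i
    let first = i
    while i < input.len and input[i] notin Blank + Delim: inc i
    if i > first:
      result.add classify(input[first ..< i])
    elif i < input.len:
      raise newException(ValueError, "empty field at position " & $i)

proc parsePad(input: string): PadSpec =
  ## Works on tokens only; raises ValueError on malformed input.
  let toks = tokenize(input)
  var pos = 0
  result.kind = "Pad"
  if pos < toks.len and toks[pos].kind == tkWord:  # optional keyword
    result.kind = toks[pos].text
    inc pos
  if toks.len - pos < 4:
    raise newException(ValueError, "expected x1, y1, x2, y2")
  for k in 0 .. 3:  # only x2 and y2 may carry a '+'
    let t = toks[pos + k]
    if t.kind != tkNumber or (k < 2 and t.plus):
      raise newException(ValueError, "bad coordinate: " & t.text)
  result.x1 = toks[pos].value
  result.y1 = toks[pos + 1].value
  result.x2 = toks[pos + 2].value
  result.y2 = toks[pos + 3].value
  result.x2IsWidth = toks[pos + 2].plus
  result.y2IsHeight = toks[pos + 3].plus
  pos += 4
  var opt: array[4, float]  # radius, dx, dy, n; all default to zero
  for k in 0 .. 3:
    if pos < toks.len and toks[pos].kind == tkNumber:
      opt[k] = toks[pos].value
      inc pos
    else:
      break
  result.radius = opt[0]
  result.dx = opt[1]
  result.dy = opt[2]
  result.n = int(opt[3])
  if pos < toks.len:  # Number is free-form text, even if it looks numeric
    result.number = toks[pos].text
    inc pos
  if pos < toks.len:
    result.name = toks[pos].text
    inc pos
  if pos < toks.len:
    raise newException(ValueError, "unexpected field: " & toks[pos].text)
```

For the example input from the first post, `parsePad` yields radius 7.5, dx 100, dy 0, N 8, Number "0" and Name "PAD". The separator handling lives entirely in `tokenize`, so the parser never sees commas, semicolons or whitespace.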

Stefan_Salewski [2021-08-26T08:33:19+02:00]

@sky_khan I've used your problem for trying npeg out.

Thank you very much for that example -- I have wanted to try npeg for a long time already, and this gives me a nice starting point.

@Araq In Ruby I have generally done such things with regular expressions, so that was my first choice until a few years ago.

Tokenizer and parser -- sounds a bit complicated and I never really did it before, but maybe I should try it.

Late yesterday evening I studied the strscans module in more detail, and I got the feeling that it may work too with user-definable matchers.

Thanks.

Stefan_Salewski [2021-08-26T23:26:35+02:00]

I had some success with the strscans module.

One issue is that the user-definable matchers which deliver a result do not work with optional arguments. They always have to return a value greater than zero to indicate a successful match, otherwise all matching stops. So the string always has to start with "pad" in the example below:

import strscans # smartscan

proc jecho(x: varargs[string, `$`]) =
  for el in x:
    stdout.write(el & " ")
  stdout.write('\n')
  stdout.flushfile

proc stt(input: string; strVal: var string; start: int; n: int): int =
  if start + "pad".len <= input.len and input[start .. start + "pad".high] == "pad":
    strVal = "pad"
    result = "pad".len

proc pls(input: string; plusVal: var int; start: int; n: int): int =
  if start < input.len and input[start] == '+':
    plusVal = 1 # bool
    result = 1

proc sep(input: string; start: int; seps: set[char] = {' ',',',';'}): int =
  while start + result < input.len and input[start + result] in {' ','\t'}:
    inc(result)
  if start + result < input.len and input[start + result] in {';',','}:
    inc(result)
  while start + result < input.len and input[start + result] in {' ','\t'}:
    inc(result)

proc plus(input: string; plusVal: var int; start: int; n: int): int =
  result = sep(input, start)
  if start + result < input.len and input[start + result] == '+':
    plusVal = 1 # bool
    result += 1

var st: string
var x1, y1, x2, y2, dx, dy: float
var px2, py2: int # bool
var n: int
var number, name: string

(st, x1, y1, px2, x2, py2, y2, dx, dy, n, number, name) = ("pad", NaN, NaN, 0, NaN, 0, NaN, NaN, NaN, 0, "", "") # defaults

var res: bool
var input = "pad 10.0, 10   12 +12.0 ;20 0 8 Num Name"
input = "pad 10.0 10 12 +12.0 20 0 8 Num Name"

# unfortunately the input must start with "pad" for unpatched strscans!

# using the pls matcher, this fails when there is no '+'
res = scanf(input, "${stt(0)}$[sep]$f$[sep]$f$[sep]${pls(0)}$f$[sep]${pls(0)}$f$[sep]$f$[sep]$f$[sep]$i$[sep]$w$[sep]$w", st, x1, y1, px2, x2, py2, y2, dx, dy, n, number, name)
jecho(res, st, x1, y1, px2, x2, py2, y2, dx, dy, n, number, name)

# using the plus matcher, so the '+' is optional
res = scanf(input, "${stt(0)}$[sep]$f$[sep]$f${plus(0)}$f${plus(0)}$f$[sep]$f$[sep]$f$[sep]$i$[sep]$w$[sep]$w", st, x1, y1, px2, x2, py2, y2, dx, dy, n, number, name)
jecho(res, st, x1, y1, px2, x2, py2, y2, dx, dy, n, number, name)

input = "pad 10.0, 10   12 +12.0" # test with missing optional values
(st, x1, y1, px2, x2, py2, y2, dx, dy, n, number, name) = ("pad", NaN, NaN, 0, NaN, 0, NaN, NaN, NaN, 0, "", "") # defaults
# using the plus matcher, so the '+' is optional
res = scanf(input, "${stt(0)}$[sep]$f$[sep]$f${plus(0)}$f${plus(0)}$f$[sep]$f$[sep]$f$[sep]$i$[sep]$w$[sep]$w", st, x1, y1, px2, x2, py2, y2, dx, dy, n, number, name)
jecho(res, st, x1, y1, px2, x2, py2, y2, dx, dy, n, number, name)

Output is


$ ./k
false pad 10.0 10.0 0 nan 0 nan nan nan 0
true pad 10.0 10.0 0 12.0 1 12.0 20.0 0.0 8 Num Name
false pad 10.0 10.0 0 12.0 1 12.0 nan nan 0

which is not that bad. I wonder if the user definable matchers could not just support optional arguments? (The matchers without result do that already)

A plain patch


$ diff ~/Nim/lib/pure/strscans.nim smartscan.nim
301c301
< proc notZero(x: NimNode): NimNode = newCall(bindSym"!=", x, newLit 0)
---
> proc notZero(x: NimNode): NimNode = newCall(bindSym"!=", x, newLit -1)
429c429
<           conds.add newCall(bindSym"!=", resLen, newLit 0)
---
>           conds.add newCall(bindSym"!=", resLen, newLit -1)

seems to make it work that way. So a return value of zero from the user-defined matcher would indicate that the optional argument is absent, and -1 could be returned to indicate an error.

auxym [2021-08-27T00:37:51+02:00]

FWIW, a good resource for writing a relatively simple tokenizer and recursive descent parser from scratch is https://craftinginterpreters.com/

I used it to write a parser for arithmetic expressions in last year's Advent of Code, which was massive overkill but interesting.

Araq [2021-08-27T05:08:30+02:00]

Maybe scanf can be patched to either accept int or Option[int] for the custom matcher's return type. This would be backwards compatible.
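To illustrate what such a return type would express, here is a small stand-alone sketch. It does not use scanf at all, and strscans does not accept matchers like these today; `optionalPlus`, `digits`, `step` and `matchSignedDigits` are invented names. The point is only the convention: `none` means the match failed, `some(0)` means an optional element was absent, and `some(n)` means n characters were consumed.

```nim
import std/options

proc optionalPlus(input: string; start: int): Option[int] =
  ## Always matches: some(1) when a '+' is present, some(0) otherwise.
  if start < input.len and input[start] == '+':
    result = some(1)
  else:
    result = some(0)

proc digits(input: string; start: int): Option[int] =
  ## Mandatory element: none(int) when no digit is found.
  var n = 0
  while start + n < input.len and input[start + n] in {'0' .. '9'}:
    inc n
  result = if n > 0: some(n) else: none(int)

proc step(r: Option[int]; pos: var int): bool =
  ## Advances pos on any match, including an empty one; false on none.
  if r.isSome:
    pos += r.get
    result = true

proc matchSignedDigits(input: string): bool =
  ## Hypothetical driver chaining the two matchers over the whole input.
  var pos = 0
  result = step(optionalPlus(input, pos), pos) and
           step(digits(input, pos), pos) and
           pos == input.len
```

Here `matchSignedDigits("12")` succeeds even though `optionalPlus` consumed nothing, while `matchSignedDigits("+")` fails because `digits` returns `none`.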

Mirror of forum.nim-lang.org

8368 :: String expression parsing