Theoretically, PEG captures could be used to get matches for sub-expressions, but captures are limited to 20 in the current implementation, and they are no replacement for a proper parser anyway. Luckily, the main matching routine of the pegs module can easily be converted into a simple interpreting event parser by adding some callbacks. Also, the object generated by the peg proc is already a complete AST of the PEG; the node's fields are just not accessible and only need some exported getters.
So I just copied peg.nim, made these changes, and now there's something to work with here. Just clone it and type nimble develop in the top directory to use it instead of the original pegs module. I will lobby the powers that be to include the changes (at least the PEG AST accessors) in the official pegs module to make this thing obsolete.
Both the event parser and the PEG AST could be used for a parser generator. The parser would need a PEG of the PEG grammar itself, but the one from the doc of pegs doesn't really work. So the PEG AST is the best bet as it is. Since PEG is unambiguous, it should always be possible to generate parser code from a specific PEG AST. The pegs module (and hence xpegs) doesn't work in Nim's VM, so we can't use macros to generate parser code, but have to output actual source code. At least at first, because as soon as someone generates parser code for the PEG grammar itself that does run in the VM, we could then use that parser from there on. But again, a fully correct PEG of the PEG grammar itself would be needed for that.
I'm taking the liberty to shamelessly mention my recent project here, as this seems the appropriate thread to do so: NPeg is a PEG-style parser generator which allows free mixing of grammar and Nim code, which should be suitable for the task of lexing and parsing.
It can collect simple string captures, complex captures as a JSON tree, or run arbitrary Nim code at match time.
NPeg is available in nimble, the manual and project page are at https://github.com/zevv/npeg
But you can write BNF that is equivalent to these.
I will support EBNF later (https://github.com/loloiccl/nimly/issues/21)
Unfortunately, it has two missing features that prevent me from using it instead of spending my time trying to debug my grammar for Nim's pegs:
If you were to make it API compatible with Nim's pegs, it would be a great replacement for the pegs module.
Hi @spip,
Sorry, only noticed your post just now - for future communication feel free to post into the NPeg issues at github so I get properly notified.
It has its own grammar syntax for rules that does not follow (E)BNF like Nim's pegs
This is a design choice: having a grammar parseable by the Nim compiler has a number of advantages:
That said, it would probably not be too hard to create an (E)BNF compatible parser with everything that is now in place. I do see some problems with this though: (E)BNF and PEG grammars may look similar, but they are not trivially compatible (for example, because of ordered choice in PEGs). You simply cannot parse an arbitrary (E)BNF grammar with a PEG; there are always things that need some reordering or rewriting to make them PEG compatible, or at least more efficient by limiting backtracking. (On the other hand, the current syntax is not too far from (E)BNF. For example, take a look at src/npeg/lib/uri.nim for a PEG translation of RFC 3986.)
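To make the ordered-choice point concrete, here is a minimal sketch in Python (not NPeg or Nim's pegs; the helper names `lit`, `seq`, `choice` and `end_of_input` are made up for illustration). It shows how a rule that is fine in BNF can silently reject valid input when taken over verbatim as a PEG, and how reordering the alternatives fixes it.

```python
# Minimal PEG-style combinators (hypothetical helpers, not a real library).
# Each parser takes (text, pos) and returns the new position, or None on failure.

def lit(s):
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def seq(*parsers):
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def choice(*parsers):
    # PEG ordered choice: commit to the first alternative that succeeds
    # and never come back to try the others.
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse

def end_of_input(text, pos):
    return pos if pos == len(text) else None

def matches(parser, text):
    return parser(text, 0) is not None

# The BNF rule  S ::= "a" | "ab"  copied as-is into PEG form.
naive = seq(choice(lit("a"), lit("ab")), end_of_input)

# The same rule reordered for PEG: the longer alternative goes first.
reordered = seq(choice(lit("ab"), lit("a")), end_of_input)

print(matches(naive, "ab"))      # the "a" branch wins, then end-of-input fails
print(matches(reordered, "ab"))
```

In BNF the two alternatives are unordered, so "ab" is in the language either way; in the PEG version the order of the alternatives decides whether it is accepted.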
Also, the (E)BNF syntax would need a number of extensions in order to specify captures or other actions to perform at parse time, which kind of defeats the purpose of having a compatible grammar to start with. Last but not least: I see no clean way to mix grammar and Nim code. I'm very much open to any ideas and experiments, so let me hear if you have any practical suggestions!
It does not support Unicode, meaning being able to parse UTF-8 strings using Runes
Like I said in the manual: there is rudimentary UTF-8 support available, and I'm not sure what exactly would be needed to make NPeg really "UTF-8 compatible". Over the last few days I added proper library support for NPeg, and started a bare minimum utf8 lib. The same applies here: I'd be glad to hear any ideas you might have and I'm happy to see if we can make NPeg suit your needs!
As I understand it, Nim has seamless interop with C libraries and code, so for the lexer you can use Ragel. It produces readable and compact code with the -G2 option (I use it on low-end microcontrollers for command parsing).
The more interesting question is whether Nim is able to do backtracking in order to implement DCG parsing for really complex, context-sensitive, and arbitrary syntaxes.
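As a rough illustration of what DCG-style backtracking involves, here is a small Python sketch (Python only because generators make the idea compact; nothing here is Nim or an existing library, and all names are hypothetical). Each parser yields every position at which it could end, so a caller can fall back to the next candidate when a later step fails. The parameterised rule mimics the classic DCG `s --> as(N), bs(N), cs(N)` for the context-sensitive language a^n b^n c^n.

```python
# Backtracking parsers as generators: each yields all possible end positions.

def lit(s):
    def parse(text, pos):
        if text.startswith(s, pos):
            yield pos + len(s)
    return parse

def seq(*parsers):
    def parse(text, pos):
        if not parsers:
            yield pos
            return
        first, rest = parsers[0], seq(*parsers[1:])
        for mid in first(text, pos):       # try every way `first` can match
            yield from rest(text, mid)     # and continue from each of them
    return parse

def alt(*parsers):
    # Unordered choice: all alternatives stay available for backtracking.
    def parse(text, pos):
        for p in parsers:
            yield from p(text, pos)
    return parse

def run_of(ch):
    # Yield (end, n) for every n >= 0 such that n copies of ch match at pos.
    def parse(text, pos):
        n = 0
        yield pos, n
        while text.startswith(ch, pos):
            pos += len(ch)
            n += 1
            yield pos, n
    return parse

def anbncn(text, pos):
    # s --> as(N), bs(N), cs(N): the count N is carried between sub-rules.
    for end_a, n in run_of("a")(text, pos):
        yield from seq(lit("b" * n), lit("c" * n))(text, end_a)

def accepts(text):
    return any(end == len(text) for end in anbncn(text, 0))

print(accepts("aabbcc"), accepts("aabbc"))
# Unlike PEG ordered choice, the second alternative is still tried here:
print(list(seq(alt(lit("a"), lit("ab")), lit("c"))("abc", 0)))
```

The same shape can be expressed in any language with iterators or closures; whether it is fast enough for real grammars is a separate question.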