Question/food-for-thought:
What would be the most sensible way to go about supporting Unicode values in a lexer that uses lexbase / BaseLexer as its basis?
I mean, as far as I can tell, using BaseLexer we can actually load up our string/input into a buffer and move through it byte-by-byte:
while true:
  setLen(p.value, 0)
  case p.buf[p.bufpos]
  of someChar:
    ...
But the whole thing becomes quite a bit more complicated when the "char" we are after, even if it is logically a single character, is a Unicode character. In that case I end up testing a series of bytes (like p.buf[p.bufpos], p.buf[p.bufpos+1], etc.)
Here's an example of what I'm talking about: https://github.com/arturo-lang/arturo/blob/master/src/vm/parse.nim#L945-L967
...which looks rather ugly, and is not very easy to debug or reason about.
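For context, the byte-by-byte approach I mean looks roughly like this (a minimal sketch; the chosen character and the consuming logic are just for illustration, not taken from the linked code):

```nim
# UTF-8 encodes "∅" (U+2205) as the three bytes E2 88 85,
# so matching that one "character" means chaining byte comparisons:
if p.buf[p.bufpos] == '\xE2' and
   p.buf[p.bufpos+1] == '\x88' and
   p.buf[p.bufpos+2] == '\x85':
  # matched "∅": consume all three bytes of the sequence
  inc(p.bufpos, 3)
```

Multiply that by every Unicode symbol the lexer recognizes and the case branches get hard to read quickly.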
So, how would you go about it?
I just now noticed this (as I was typing the question, as a matter of fact): https://github.com/nim-lang/RFCs/issues/388
and its corresponding: https://github.com/nim-lang/Nim/commit/c2b20516d33520b1d339b447ece32ade8625fefc
So, I guess the answer is I'm already doing it pretty much like it should be done...
You should really do it with a macro or a series of elif buf.continuesWith("⊞", pos) checks.
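A sketch of what such an elif chain could look like, using strutils.continuesWith (the proc and token names here are hypothetical, just to illustrate the shape):

```nim
import std/strutils

# Hypothetical matcher: compare whole UTF-8 byte sequences against
# the buffer instead of comparing individual bytes by hand.
# `buf` is the lexer's buffer, `pos` the current byte offset.
proc matchSymbol(buf: string, pos: var int): string =
  if buf.continuesWith("⊞", pos):
    inc(pos, "⊞".len)    # advance past the multi-byte sequence
    result = "boxplus"
  elif buf.continuesWith("∅", pos):
    inc(pos, "∅".len)
    result = "empty"
  else:
    result = ""           # no known symbol at this position
```

Since Nim string literals are UTF-8 encoded bytes, `"⊞".len` gives the number of bytes to skip, so the lexer keeps working at the byte level while the source stays readable.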
I didn't use a macro but I used a code generator for reasons that have to do with bootstrapping.
Makes sense, that's along the lines of what I was thinking...
Thanks for the input! :)