Question/food-for-thought:
What would be the most sensible way to go about supporting Unicode values in a lexer that uses lexbase / BaseLexer as its basis?
I mean, as far as I can tell, using BaseLexer we can actually load up our string/input into a buffer and move through it byte-by-byte:
while true:
  setLen(p.value, 0)
  case p.buf[p.bufpos]
  of someChar:
    ...
But the whole thing becomes quite a bit more complicated when the "char" we are after, even if it is logically a single character, is a Unicode character. In that case I end up testing a series of bytes (like p.buf[p.bufpos], p.buf[p.bufpos+1], etc.)
Here's an example of what I'm talking about: https://github.com/arturo-lang/arturo/blob/master/src/vm/parse.nim#L945-L967
...which looks rather ugly, and is not very easy to debug or reason about.
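For context, the byte-by-byte approach I mean looks roughly like this (a minimal sketch; the chosen character and the consuming logic are just for illustration, not taken from the linked code):

```nim
# UTF-8 encodes "∅" (U+2205) as the three bytes E2 88 85,
# so matching that one "character" means chaining byte comparisons:
if p.buf[p.bufpos] == '\xE2' and
   p.buf[p.bufpos+1] == '\x88' and
   p.buf[p.bufpos+2] == '\x85':
  # matched "∅": consume all three bytes of the sequence
  inc(p.bufpos, 3)
```

Multiply that by every Unicode symbol the lexer recognizes and the case branches get hard to read quickly.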
So, how would you go about it?
I just now noticed this (as I was typing the question, as a matter of fact): https://github.com/nim-lang/RFCs/issues/388
and its corresponding: https://github.com/nim-lang/Nim/commit/c2b20516d33520b1d339b447ece32ade8625fefc
So, I guess the answer is I'm already doing it pretty much like it should be done...
You should really do it with a macro or a series of elif buf.continuesWith("⊞", pos) checks.
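A sketch of what such an elif chain could look like, using strutils.continuesWith (the proc and token names here are hypothetical, just to illustrate the shape):

```nim
import std/strutils

# Hypothetical matcher: compare whole UTF-8 byte sequences against
# the buffer instead of comparing individual bytes by hand.
# `buf` is the lexer's buffer, `pos` the current byte offset.
proc matchSymbol(buf: string, pos: var int): string =
  if buf.continuesWith("⊞", pos):
    inc(pos, "⊞".len)    # advance past the multi-byte sequence
    result = "boxplus"
  elif buf.continuesWith("∅", pos):
    inc(pos, "∅".len)
    result = "empty"
  else:
    result = ""           # no known symbol at this position
```

Since Nim string literals are UTF-8 encoded bytes, `"⊞".len` gives the number of bytes to skip, so the lexer keeps working at the byte level while the source stays readable.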
I didn't use a macro but I used a code generator for reasons that have to do with bootstrapping.
Makes sense, that's along the lines of what I was thinking...
Thanks for the input! :)