So, I've been experimenting a bit with a handwritten lexer using lexbase, in order to convert an existing project of mine that uses Flex/Bison.
For example, when we encounter a character that is in PermittedIdentifiers_Start, the appropriate code is executed:
template parseIdentifier(p: var Parser) =
  var pos = p.bufpos
  add(p.value, p.buf[pos])
  inc(pos)
  while p.buf[pos] in PermittedIdentifiers_In:
    add(p.value, p.buf[pos])
    inc(pos)
  p.bufpos = pos
where:
const
  Letters = {'a'..'z', 'A'..'Z'}
  PermittedIdentifiers_Start = Letters
  PermittedIdentifiers_In = Letters + {'0'..'9', '?'}
As you can see, it allows identifiers starting with a letter, followed by letters, digits, or the question mark symbol.
How do I make it support non-ASCII characters as well (e.g. Chinese, Cyrillic, Greek, etc.)?
This is the relevant part from my Flex lexer:
ASCII [A-Za-z\?\!\#\@\~]
DIGIT [0-9]
U [\x80-\xbf]
U2 [\xc2-\xdf]
U3 [\xe0-\xef]
U4 [\xf0-\xf4]
PERMITTED {ASCII}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
%%
Like so:
const PermittedIdentifiers_In = {'a'..'z', 'A'..'Z', '0'..'9', '?', '\128'..'\255'}
You can check for "valid Unicode" or "only some Unicode classes are allowed" in a postprocessing step; unicode.nim should be helpful for that.
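For that postprocessing step, the standard library's unicode module already provides validateUtf8. A minimal sketch (the checkIdentifier name is just illustrative):

```nim
import std/unicode

proc checkIdentifier(ident: string): bool =
  # validateUtf8 returns -1 for well-formed UTF-8,
  # otherwise the index of the first invalid byte.
  validateUtf8(ident) == -1

assert checkIdentifier("résumé")
assert not checkIdentifier("\xc3")  # truncated two-byte sequence
```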
Note that lexbase doesn't really have much to do with lexing (scanning)--it just provides buffering that guarantees that lines don't cross buffer boundaries, so that the lexer doesn't have to check whether a refill is needed on every single character, only when it consumes the EOL. (Note also that it has an off-by-one error whose fix was not backported from nimlexbase, so the offsetBase value is incorrect.)
When you say "support non-ASCII characters as well", presumably you mean UTF-8 -- there are a gazillion other encodings, but Nim itself doesn't support them so you have an excuse for not doing so (this will make your tool useless for some users, but they already won't be using Nim).
So, you could do what the Nim compiler does, which is treat every single unicode character as if it were an identifier character and none are operator characters, digits, separators, etc. This is cheap, but wrong. The right way (given support only for UTF-8) is that, whenever your lexer encounters a character with the high bit set, it splits off the unicode character and determines its category.
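That dispatch can be sketched with runeAt and runeLenAt from the standard unicode module (the nextRune name is just illustrative):

```nim
import std/unicode

# Sketch: when the high bit is set, consume the whole UTF-8
# sequence as one Rune instead of one byte.
proc nextRune(buf: string, pos: var int): Rune =
  if ord(buf[pos]) < 0x80:
    result = Rune(ord(buf[pos]))    # plain ASCII: one byte
    inc(pos)
  else:
    result = runeAt(buf, pos)       # decode the multi-byte sequence
    inc(pos, runeLenAt(buf, pos))   # skip its full length
```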
You could use runeAt(buf, pos) and runeLenAt(buf, pos) from the unicode module (in the standard library) to get and skip the unicode character (per the unicode module this is a Rune, which is represented by an int). You really want a combination of the two of those, which you can get via var rune: Rune; fastRuneAt(buf, pos, rune, doInc=true) (fastRuneAt is not documented). Note that the unicode module thinks that a character can be composed of up to 6 UTF-8 bytes, whereas the standard only allows 4-- the additional encodings are illegal and should be treated as such. Since fastRuneAt isn't documented and does that extra work for the 5th and 6th byte, you might want to roll your own, based on its code--this is simple, and perfectly safe as the UTF-8 standard won't be changing before civilization collapses.
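A hand-rolled decoder capped at 4 bytes might look like the following. This is only a sketch: it rejects the obsolete 5- and 6-byte lead bytes but, like fastRuneAt, does not validate continuation bytes or reject overlong forms.

```nim
import std/unicode

# Minimal 4-byte-max UTF-8 decoder, modeled on fastRuneAt but
# treating the obsolete 5/6-byte encodings as illegal.
proc decodeRune(buf: string, pos: var int): Rune =
  let b = ord(buf[pos])
  if b < 0x80:                       # 1 byte: plain ASCII
    result = Rune(b); inc(pos)
  elif b shr 5 == 0b110:             # 2-byte sequence
    result = Rune((b and 0x1F) shl 6 or
                  (ord(buf[pos+1]) and 0x3F))
    inc(pos, 2)
  elif b shr 4 == 0b1110:            # 3-byte sequence
    result = Rune((b and 0x0F) shl 12 or
                  (ord(buf[pos+1]) and 0x3F) shl 6 or
                  (ord(buf[pos+2]) and 0x3F))
    inc(pos, 3)
  elif b shr 3 == 0b11110:           # 4-byte sequence
    result = Rune((b and 0x07) shl 18 or
                  (ord(buf[pos+1]) and 0x3F) shl 12 or
                  (ord(buf[pos+2]) and 0x3F) shl 6 or
                  (ord(buf[pos+3]) and 0x3F))
    inc(pos, 4)
  else:                              # 5/6-byte lead bytes: illegal UTF-8
    result = Rune(0xFFFD)            # emit the replacement character
    inc(pos)

var p = 0
assert decodeRune("é", p) == Rune(0xE9) and p == 2  # "é" is C3 A9
```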
Once you have a single unicode character in hand in the form of a Rune, you can categorize it as a letter, digit, operator, etc. with the "unicodedb" module from nimble.
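For example, testing whether a Rune may start an identifier might look like this (API per the unicodedb package's README--check its docs for the exact names; isIdentStart is just illustrative):

```nim
import std/unicode
import unicodedb/properties  # from the "unicodedb" nimble package

# Sketch: classify a decoded Rune by its Unicode general category.
proc isIdentStart(r: Rune): bool =
  # ctgL is the combined letter category set (Lu, Ll, Lt, Lm, Lo)
  r.unicodeCategory() in ctgL
```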
Note: this approach doesn't deal with normalization--many different unicode character sequences represent the same thing. To be complete, once you have an identifier or operator in hand and want to look it up in a symbol table, you should apply one of the unicode normalization forms (I think NFKC is the right one for this application) to hash and compare the strings. You can do that with the "normalize" module from nimble. I would keep a flag indicating whether the symbol contains any non-ASCII characters, and only do the normalization if so, otherwise just doing a straight string compare. (And if the symbol in hand has the flag but the one in the symbol table doesn't, you don't need to do the comparison at all.)
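The "normalize only when needed" comparison could be sketched like so. The Sym type and sameSym are illustrative, and toNFKC is assumed to come from the nimble "normalize" package--check its docs for the exact proc name.

```nim
# Sketch of symbol comparison that skips normalization for ASCII-only names.
type Sym = object
  name: string
  hasNonAscii: bool   # set while lexing, when any byte >= 0x80 is seen

proc sameSym(a, b: Sym): bool =
  if a.hasNonAscii != b.hasNonAscii:
    return false              # flags differ: no comparison needed
  if not a.hasNonAscii:
    return a.name == b.name   # plain byte compare for ASCII-only names
  toNFKC(a.name) == toNFKC(b.name)  # normalize, then compare
```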