I may have missed something. Actually I'm looking for some bible that describes the rules in more detail and in a more consistent way than the empirical ones above. But maybe it's too early for that.
Anyway, if I have to distill a commandment from all this, it would be "Stick to the alphabet and numbers, or thou wilt be driven out of paradise."
Here are my two cents on this topic. I think these rules are too complicated. I would prefer to focus on my problem rather than on whether two identifiers are the same when they don't appear to be.
But yes, I really do think this process should be documented. I would also add information on the normalized identifier representation, which can be used to check whether two identifiers are equal just by comparing two strings.
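To make the "normalized representation" idea concrete, here is a minimal sketch in Python. It assumes the identifier equality rule as I understand it from the Nim manual: the first character is compared as-is, and the rest is compared with underscores removed and ASCII letters lowercased. The function names `normalize_ident` and `same_ident` are made up for illustration, the en-dash handling discussed in this thread is not modelled, and this is not how the compiler itself does it.

```python
def ascii_lower(text: str) -> str:
    """Lowercase ASCII letters only; leave every other character untouched."""
    return "".join(ch.lower() if ch.isascii() else ch for ch in text)


def normalize_ident(name: str) -> str:
    """Hypothetical helper: build a comparison key for an identifier.

    Assumed rule: keep the first character exactly as written, then drop
    underscores and lowercase ASCII letters in the remainder.
    """
    if not name:
        return name
    rest = ascii_lower(name[1:].replace("_", ""))
    return name[0] + rest


def same_ident(a: str, b: str) -> bool:
    """Two identifiers count as equal when their normalized keys match."""
    return normalize_ident(a) == normalize_ident(b)


if __name__ == "__main__":
    for left, right in [("fooBar", "foo_bar"), ("Foo", "foo"), ("echo_echo", "echoecho")]:
        print(left, right, same_ident(left, right))
```

With a key like this, the equality check really is a plain string comparison, which is all the documentation would need to state.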
@Araq: Simply by trying things out and attempting to make out some patterns in the observed behaviour. And having a peek now and then at the resulting C source.
As you can see, I gave up at rule twelve to be more specific. It became too cluttered when I experimented with exclamation marks and dollars, especially in combination with underscores and en-dashes. So it would be better indeed to build up the rules from the source, in my opinion. On the other hand: tests have the last word.
And having a peek now and then at the resulting C source.
Which is wrong. The generated C code is irrelevant for identifier equality. The compiler could emit NIM_$ID for everything instead and yet it wouldn't affect Nim.
That's 4 rules and the backtick rules are mostly irrelevant in practice. For example, in Java you can either write π or \u03C0. Does that mean I need to worry all the time about my hypothetical Java code becoming unreadable anytime soon? Hardly.
@Araq 19:17:16: That makes the distinction between rule 9 ('underscores are removed') and a part of rule 11 ('en-dashes are not removed but ignored') irrelevant indeed. Most of my rules are based on the behaviour of the executables or the compilability of the source, however.
@Araq 19:26:26: Those four rules don't explain everything. I'm not quite sure what you mean by your last paragraph. In general, different representations of essentially the same character don't make life easier when you want to search through the source for occurrences of them, but that topic has been discussed elsewhere. I was just overwhelmed by the complexity of the whole thing, that's all. Why are {`}!_:!{`} and {`}!:_!{`} okay and {`}!_:_!{`} not okay? It makes me curious about the underlying mechanisms, in case I want to play with it. In most cases, the alphabet and numbers would suffice for me, and I would certainly avoid pathological ones like this example.
This is the grammar rule:
symbol = '`' (KEYW|IDENT|literal|(operator|'('|')'|'['|']'|'{'|'}'|'=')+)+ '`'
':' is not an operator. The manual says: "=, :, :: are not available as general operators; they are used for other notational purposes." (From http://nim-lang.org/docs/manual.html#lexical-analysis-operators )
Hence ! :! is valid (operator followed by operator), ! : ! is not (operator followed by colon followed by operator). Simple. :P
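The explanation above can be mimicked with a toy model in Python. It is only a sketch under stated assumptions: it handles nothing but operator characters and single underscores, it treats each underscore as its own token (as suggested later in this thread), and it rejects a backticked symbol if any maximal run of operator characters is exactly `:` or `::`. The names `tokenize` and `accepted` are hypothetical, and the real lexer certainly does more than this.

```python
# Operator characters as listed in the Nim manual's lexical analysis section.
OP_CHARS = set("=+-*/<>@$~&%|!?^.:\\")

# Runs that are not general operators. '=' is left out here because the
# grammar rule for backticked symbols allows it explicitly.
NOT_GENERAL_OPERATORS = {":", "::"}


def tokenize(body: str) -> list:
    """Split the text between backticks into operator runs and '_' tokens."""
    tokens = []
    run = ""
    for ch in body:
        if ch in OP_CHARS:
            run += ch          # extend the current operator run
        elif ch == "_":
            if run:
                tokens.append(run)
                run = ""
            tokens.append("_")  # assumption: each underscore is one token
        else:
            raise ValueError("toy model only covers operators and '_'")
    if run:
        tokens.append(run)
    return tokens


def accepted(body: str) -> bool:
    """A symbol passes if no operator run is a bare ':' or '::'."""
    return all(tok not in NOT_GENERAL_OPERATORS for tok in tokenize(body))


if __name__ == "__main__":
    for body in ["!_:!", "!:_!", "!_:_!"]:
        print(body, tokenize(body), accepted(body))
```

In this model `!_:!` splits into `!`, `_`, `:!` and `!:_!` splits into `!:`, `_`, `!`, so both pass, while `!_:_!` leaves a bare `:` in the middle and fails.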
@Araq: I guess that explains most of the apparently whimsical behaviour pointed at in rule 12, if not all. Have to rethink {`}echo___echo{`} (wrong) vs {`}echo_echo{`} (right, collapses to echoecho) vs {`}!___!{`} (right, collapses to {`}!!{`}). Still don't see that one.
Only 11 rules to go. :-)
Still don't see that one.
That's 3x the valid token _.
Only 11 rules to go.
It's really only 4 rules if you don't describe them in the most convoluted manner possible and at the same time mix them with codegen choices to make a point.
Please don't get me wrong about my intentions. If there's any point I want to make, it's about my ignorance.
I simply started with the link mentioned by moigagoo and the explanation in the first chapter of Nim in Action, did some experiments, found some behaviour that didn't seem to be described by those texts, and tried to find more patterns in what I had observed. This is not a straightforward process. Hence the somewhat unorganized abundance of 12+ rules. I don't want to fight for them. Quite the contrary, I want them to be reduced to a clear and compact system from which nothing can be taken out without losing completeness. Any help with that is appreciated.
In spite of the grammar rule that Araq has mentioned, which indeed reveals a lot more about the underlying mechanisms, I don't think we are at that stage of completeness yet as far as documentation goes. I still fail to understand why {`}!___!{`} or {`}___{`} is 3 times a valid _ token (collapsing to {`}!!{`} or {`}_{`} respectively), and {`}echo___echo{`} is not, for instance. I've no idea why x_–y, x––y and –xy are good, and x–_y, x__y and _xy are bad, without adding more rules than I have read so far in the documentation (the dash-like symbols here are en-dashes). Yes, these are edge cases, and probably of no practical value in daily affairs. But what's wrong with that in the context of understanding a language's design?
@moigagoo: Conventions and restrictions are two different things. I was just aiming at the restrictions (or to put it otherwise: degrees of freedom) for now. Some Nim conventions can be found here. I've read them and I do know conventions are a subset of restrictions. If I have used some confusing terms in this thread, I apologize for that.
@Araq: Are naming rules a good starting point for learning a new language? In this case, I'm inclined to think not. I've noticed the structure of Nim is reflected in the naming to some extent. I can see the beauty of that, but it's also a complication in mastering (and documenting) Nim. Maybe I should study Nim for a year or so before trying to really understand the production rules for names.
Your newly found edge cases have all been fixed, thanks. And the 4 rules that I gave still apply. ;-)
PS: I really dislike this en-dash special casing and might remove it from the language again. I never understood why the fonts cannot be patched instead so that the underscore looks more like a dash...