Contrary to popular (Western) belief, Unicode is not the only character encoding that matters. In the CJK languages in particular, Unicode has serious issues.
On top of that, there are the obvious concerns when one of the fundamental types of a programming language depends too much on culture-specific assumptions that properly belong in a library that can evolve separately from the language itself (not even considering the fact that a systems programming language is frequently only interested in abstract sequences of octets). For example, writing lowercase/uppercase functions for Unicode raises some serious issues (e.g., what is uppercase("ß")?). Plus, there are performance concerns (e.g., the speed of String.StartsWith in C#).
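A quick illustration of the ambiguity (a minimal sketch, assuming the unicode module's simple one-to-one case mappings; a full or locale-aware mapping would answer differently):

import unicode

# Unicode's simple case mapping leaves "ß" unchanged; the full (special)
# mapping produces "SS"; German orthography also permits "ẞ" (U+1E9E).
echo unicode.toUpper("ß")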
Nim is encoding-agnostic: it is fine with any kind of string that can be represented as a sequence of octets, and it relies on encoding-specific modules to deal with encoding-specific matters. For example, to delete the Unicode character "π" from a string:
import unicode, strutils

proc main() =
  var s = "ɑβπɣ"
  const pi = "π"
  echo s
  # find (from strutils) locates the octet offset of pi's byte sequence in s
  var p = find(s, pi)
  # runeLenAt gives the byte length of the UTF-8 rune starting at offset p
  s.delete(p, p + s.runeLenAt(p) - 1)
  echo s

main()
Jehan, you say Nim is encoding-agnostic, but it really depends on UTF-8 or at least some of its properties (not to mention it's forced as the source encoding).
runeLenAt is obviously UTF-8 specific, and find could also give you a big surprise by matching a byte sequence that straddles the boundary of two characters (UTF-8 actually has awesome self-synchronizing properties that don't let this happen).
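To illustrate the kind of surprise find can spring in an encoding without that property (a sketch using the well-known Shift-JIS 0x5C problem; the octets below are the Shift-JIS encoding of "表"):

import strutils

let sjis = "\x95\x5C"   # the single character "表" in Shift-JIS
echo find(sjis, "\\")   # 1 -- a false match: 0x5C is also the byte for "\"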
BlaXpirit: Jehan, you say Nim is encoding-agnostic, but it really depends on UTF-8 or at least some of its properties (not to mention it's forced as the source encoding).
Nim -- the language -- is. The unicode module (with runeLenAt) obviously isn't; that's its point. But you can, e.g., put Shift-JIS characters in a Nim string just fine.
And yes, find is normally for octets. I used it precisely because it happens to work for Unicode too, so Unicode doesn't require a separate search facility, and I wasn't going to write up something big for a quick example.
Actually I wanted strings to handle random inserts and deletes of ranges of characters =). And UTF-8 is definitely not the best choice for that. I'm pretty OK with Nim's strings being treated as UTF-8 strings, but again, UTF-8 just doesn't have the right performance characteristics when it comes to modifying those strings.
Also, allowing operations such as delete, as currently implemented in the strutils module, and the [index] operator may lead to lots of mistakes among Nim newbies and those unfamiliar with UTF-8 specifics. And that may be bad advertising for Nim, answered with "well, Nim is great, but no language does everything right =)". In this case I think it's possible to get it right before people start shooting themselves in the foot =).
Actually I wanted strings to handle random inserts and deletes of ranges of characters =).
That doesn't even work with UTF-32 thanks to combining characters. Unicode cannot be abstracted over and every attempt to do so is futile. (Btw we really need a 'glyphLen' too, 'runeLen' is not enough.) I had many more problems with Python's or C#'s "proper unicode handling" than with Nim's kinda non-solution to this problem.
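A small sketch of what combining characters do to the "one item per character" idea ("\u0301" is the combining acute accent):

import unicode

let decomposed = "e\u0301"   # "é" built from a base letter plus a combining accent
echo decomposed.runeLen      # 2 -- two code points for one user-perceived glyph
echo "\u00E9".runeLen        # 1 -- the precomposed "é"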
Glyphs in Unicode can be arbitrarily complicated, as things like https://twitter.com/glitchr_ will demonstrate. I think the question is, what are you trying to accomplish by having a representation where one glyph is one item? What does your software need to actually do, and how does that fit into a world of varying complex languages?
I don't see any reason why Nim itself would need to mark strings with their encoding. If you want an Asian encoding such as Shift-JIS, you just put the Shift-JIS-encoded bytes into your string, and don't use UTF-8-related functions on it.
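For example (a sketch; the octets below are the Shift-JIS encoding of "日本"):

let sjis = "\x93\xFA\x96\x7B"  # "日本" as raw Shift-JIS octets in a plain Nim string
echo sjis.len                  # 4 -- just octets; byte-level procs keep working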
I think that @yglukhov should really have a look at this (scroll down to glyphs vs code-points) to understand the "real" problem and what @Araq is talking about.
But one has to realize that this is a level above the multi-byte encoding problem. UTF-8 vs. code points is something different from code points vs. glyphs.
And the sad truth is: you can't handle strings that are one octet per char in some single-byte encoding with the same efficiency in UTF-8.
The solution here is clearly not to move to 2, 4, 16, or 128 octets per "char" (which I deliberately put in quotes), because then you get bitten by the "code point != glyph" problem, or something like "one char needs 1024 bytes".
The solution for speed and simplicity is: Use the encoding which fits your needs and choose a tight one.
And this is what Nim offers: char-"sized" strings which can be interpreted in whatever encoding you like. One option is UTF-8, and that works pretty well for most stuff.
@Jehan & @OderWat, thank you for those links to those articles [ Encodings, Unabridged & Understanding characters, keystrokes, codepoints and glyphs ]. I found both articles very informative.
Here are two more (short) articles on this topic that I find interesting: rants by Python programmer Armin Ronacher (the developer of the widely-used Python libraries Flask, Jinja, Werkzeug, and others):
In brief, Armin is not a fan of Python 3's "Everything Is Unicode" approach, instead preferring Python 2's byte-string type str.
So, to summarize my understanding:
- @yglukhov wants versions of insert and delete that are Unicode-correct for ranges of Unicode characters.
- Nim's strings are encoding-agnostic byte-strings / octet-strings (as I understand Python 2's str type to be).
- @yglukhov would prefer to work in a space of larger-sized "character" types to have more-predictable performance (at the cost of increased memory usage).
[By the way, @yglukhov, there is an insert proc for strings, in the system module.]
How does the unicode module's Rune type not meet @yglukhov's needs? (Insofar as they can be met, keeping in mind @Araq's point that you can't have an "any character can always be represented by one item" guarantee even in UTF-32, thanks to combining characters.)
Could @yglukhov not just convert a string that contains UTF-8 encoded text to a seq[Rune] and work with that?
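Something along those lines does seem to work (a sketch; sequtils.toSeq collects the runes iterator into a seq, standing in for the missing proc asked about below):

import unicode, sequtils

var runes = toSeq("ɑβπɣ".runes)     # decode once into a seq[Rune]
let p = runes.find("π".runeAt(0))   # the generic system.find now applies
if p >= 0:
  runes.delete(p)                   # generic seq delete, with O(1) indexing
var back = ""
for r in runes: back.add r.toUTF8   # re-encode to a UTF-8 string
echo back                           # ɑβɣ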
Which leads me to ask:
- Why is there iterator runes(s: string): Rune but no proc runes(s: string): seq[Rune]?
- Why not add a comment at the top of the unicode module that forestalls this question in the future by linking readers to the generic seq[T] versions of:
- find[T, S](a: T; item: S): int
- contains[T](a: openArray[T]; item: T): bool
- insert[T](x: var seq[T]; item: T; i = 0)
- delete[T](x: var seq[T]; i: int)
- Why not add to the unicode module a proc insert(dest: var seq[Rune]; src: seq[Rune]; i = 0) as the Rune-seq equivalent of inserting one string into another? (A minimal sketch follows below.)
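A minimal sketch of such a proc (hypothetical; this insert is my illustration, not an existing unicode-module API), built on the generic system.insert:

import unicode, sequtils

proc insert(dest: var seq[Rune]; src: seq[Rune]; i: Natural = 0) =
  ## Hypothetical: inserts all runes of `src` into `dest`, starting at position `i`.
  for k in 0 ..< src.len:
    system.insert(dest, src[k], i + k)

var runes = toSeq("ɑβɣ".runes)
runes.insert(toSeq("π".runes), 2)   # rune-level insert, like inserting one string into another
var back = ""
for r in runes: back.add r.toUTF8
echo back                           # ɑβπɣ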