Contrary to popular (Western) belief, Unicode is not the only character encoding that matters. In the CJK languages in particular, Unicode has serious issues.
On top of that, there are the obvious concerns when one of the fundamental types of a programming language depends too much on culture-specific assumptions that properly belong in a library that can evolve separately from the language itself (not even considering the fact that a systems programming language is frequently only interested in abstract sequences of octets). For example, writing lowercase/uppercase functions for Unicode raises some serious issues (e.g., what is uppercase("ß")?). Plus, there are performance concerns (e.g., the speed of String.StartsWith in C#).
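A quick illustration of the ambiguity (a minimal sketch, assuming the unicode module's simple one-to-one case mappings; a full or locale-aware mapping would answer differently):

import unicode

# Unicode's simple case mapping leaves "ß" unchanged; the full (special)
# mapping produces "SS"; German orthography also permits "ẞ" (U+1E9E).
echo unicode.toUpper("ß")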
Nim is encoding-agnostic: it is fine with any kind of string that can be represented as a sequence of octets, and it relies on encoding-specific modules to deal with encoding-specific matters. For example, to delete the Unicode character "π" from a string:
import unicode, strutils

proc main() =
  var s = "ɑβπɣ"
  const pi = "π"
  echo s
  # find (from strutils) locates the octet offset of pi's byte sequence in s
  var p = find(s, pi)
  # runeLenAt gives the byte length of the UTF-8 rune starting at offset p
  s.delete(p, p + s.runeLenAt(p) - 1)
  echo s

main()
Jehan, you say Nim is encoding-agnostic, but it really depends on UTF-8 or at least some of its properties (not to mention it's forced as the source encoding).
runeLenAt is obviously UTF-8 specific, and find could also give you a big surprise by matching a byte sequence that straddles the boundary of two characters (UTF-8 actually has awesome self-synchronizing properties that don't let this happen).
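To illustrate the kind of surprise find can spring in an encoding without that property (a sketch using the well-known Shift-JIS 0x5C problem; the octets below are the Shift-JIS encoding of "表"):

import strutils

let sjis = "\x95\x5C"   # the single character "表" in Shift-JIS
echo find(sjis, "\\")   # 1 -- a false match: 0x5C is also the byte for "\"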
BlaXpirit: Jehan, you say Nim is encoding-agnostic, but it really depends on UTF-8 or at least some of its properties (not to mention it's forced as the source encoding).
Nim -- the language -- is. The unicode module (with runeLenAt) obviously isn't; that's its point. But you can, e.g., put Shift-JIS characters in a Nim string just fine.
And yes, find is normally for octets. I used it precisely because it happens to work for Unicode too, so Unicode doesn't require a separate search facility, and I wasn't going to write up something big for a quick example.
Actually I wanted strings to handle random inserts and deletes of ranges of characters =). And UTF-8 is definitely not the best choice for that. I'm pretty OK with Nim's strings being treated as UTF-8 strings, but again, UTF-8 just doesn't have the right performance characteristics when it comes to modifying those strings.
Also, allowing operations such as delete, as currently implemented in the strutils module, and the [index] operator may lead to lots of mistakes among Nim newbies and those unfamiliar with UTF-8 specifics. And that may be bad advertising for Nim, answered with "well, Nim is great, but no language does everything right =)". In this case I think it's possible to get it right before people start shooting themselves in the foot =).
Actually I wanted strings to handle random inserts and deletes of ranges of characters =).
That doesn't even work with UTF-32 thanks to combining characters. Unicode cannot be abstracted over and every attempt to do so is futile. (Btw we really need a 'glyphLen' too, 'runeLen' is not enough.) I had many more problems with Python's or C#'s "proper unicode handling" than with Nim's kinda non-solution to this problem.
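A small sketch of what combining characters do to the "one item per character" idea ("\u0301" is the combining acute accent):

import unicode

let decomposed = "e\u0301"   # "é" built from a base letter plus a combining accent
echo decomposed.runeLen      # 2 -- two code points for one user-perceived glyph
echo "\u00E9".runeLen        # 1 -- the precomposed "é"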
Glyphs in Unicode can be arbitrarily complicated, as things like https://twitter.com/glitchr_ will demonstrate. I think the question is, what are you trying to accomplish by having a representation where one glyph is one item? What does your software need to actually do, and how does that fit into a world of varying complex languages?
I don't see any reason why Nim itself would need to mark strings with their encoding. If you want an Asian encoding such as Shift-JIS, you just put the Shift-JIS-encoded bytes into your string, and don't use UTF-8-related functions on it.
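For example (a sketch; the octets below are the Shift-JIS encoding of "日本"):

let sjis = "\x93\xFA\x96\x7B"  # "日本" as raw Shift-JIS octets in a plain Nim string
echo sjis.len                  # 4 -- just octets; byte-level procs keep working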
I think that @yglukhov should really have a look at this (scroll down to glyphs vs code-points) to understand the "real" problem and what @Araq is talking about.
But one has to realize that this is a level above the multi-byte encoding problem. UTF-8 vs. code points is something different from code points vs. glyphs.
And the sad truth is: you can't handle strings that are one octet per char in some single-byte encoding with the same efficiency in UTF-8.
The solution here is clearly not to move to 2, 4, 16, or 128 octets per "char" (which I deliberately put in quotes), because then you get bitten by the "code point != glyph" problem, or something like "one char needs 1024 bytes".
The solution for speed and simplicity is: Use the encoding which fits your needs and choose a tight one.
And this is what Nim offers: char-"sized" strings which can be interpreted in whatever encoding you like. One option is UTF-8, and that works pretty well for most stuff.
@Jehan & @OderWat, thank you for those links to those articles [ Encodings, Unabridged & Understanding characters, keystrokes, codepoints and glyphs ]. I found both articles very informative.
Here are two more (short) articles on this topic that I find interesting: rants by Python programmer Armin Ronacher (the developer of the widely-used Python libraries Flask, Jinja, Werkzeug, and others):
In brief, Armin is not a fan of Python 3's "Everything Is Unicode" approach, instead preferring Python 2's byte-string type str.
So, to summarize my understanding:
- @yglukhov wants versions of insert and delete that are Unicode-correct for ranges of Unicode characters.
- Nim's strings are encoding-agnostic byte-strings / octet-strings (as I understand Python 2's str type to be).
- @yglukhov would prefer to work in a space of larger-sized "character" types to have more-predictable performance (at the cost of increased memory usage).
[By the way, @yglukhov, there is an insert proc for strings, in the system module.]
How does the unicode module's Rune type not meet @yglukhov's needs? (Insofar as they can be met, keeping in mind @Araq's point that you can't have an "any character can always be represented by one item" guarantee even in UTF-32, thanks to combining characters.)
Could @yglukhov not just convert a string that contains UTF-8 encoded text to a seq[Rune] and work with that?
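Something along those lines does seem to work (a sketch; sequtils.toSeq collects the runes iterator into a seq, standing in for the missing proc asked about below):

import unicode, sequtils

var runes = toSeq("ɑβπɣ".runes)     # decode once into a seq[Rune]
let p = runes.find("π".runeAt(0))   # the generic system.find now applies
if p >= 0:
  runes.delete(p)                   # generic seq delete, with O(1) indexing
var back = ""
for r in runes: back.add r.toUTF8   # re-encode to a UTF-8 string
echo back                           # ɑβɣ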
Which leads me to ask:
- Why is there iterator runes(s: string): Rune but no proc runes(s: string): seq[Rune]?
- Why not add a comment at the top of the unicode module that forestalls this question in the future by linking readers to the generic seq[T] versions of:
- find[T, S](a: T; item: S): int
- contains[T](a: openArray[T]; item: T): bool
- insert[T](x: var seq[T]; item: T; i = 0)
- delete[T](x: var seq[T]; i: int)
- Why not add to the unicode module a proc insert(dest: var seq[Rune]; src: seq[Rune]; i = 0) as the Rune-seq equivalent of inserting one string into another? (A minimal sketch follows below.)
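A minimal sketch of such a proc (hypothetical; this insert is my illustration, not an existing unicode-module API), built on the generic system.insert:

import unicode, sequtils

proc insert(dest: var seq[Rune]; src: seq[Rune]; i: Natural = 0) =
  ## Hypothetical: inserts all runes of `src` into `dest`, starting at position `i`.
  for k in 0 ..< src.len:
    system.insert(dest, src[k], i + k)

var runes = toSeq("ɑβɣ".runes)
runes.insert(toSeq("π".runes), 2)   # rune-level insert, like inserting one string into another
var back = ""
for r in runes: back.add r.toUTF8
echo back                           # ɑβπɣ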