nimforum mirror - Is String Datatype in Nim Broken?

Jayadevan (orginal) [2018-06-08T00:30:26+02:00] view original

A string should be ideally a sequence of Runes. But here it is a sequence of UTF-8 Encoded bytes.

file:///home/appu/Pictures/Screenshot%20from%202018-06-07%2018-22-42-01.png

Jayadevan (orginal) [2018-06-08T00:41:37+02:00] view original

A small comment: Python handles the equivalent correctly:


print(a[0:2])

DeletedUser (orginal) [2018-06-08T01:03:57+02:00] view original

Use the unicode module with runeAt and runeLenAt for that. Or just convert the string to a seq of runes and then operate on the seq.

Jayadevan (orginal) [2018-06-08T01:47:21+02:00] view original

I have seen the unicode module, which in reality is not about Unicode, but about UTF-8.

I have Java and Python background, where everything is in Unicode. Java, Python (in Narrow mode), Google Dart etc have UTF-16 natively supported. But encoding doesn't usually matter - the characters are Unicode.

Does this mean that Strings in Nim are not sequences of characters, but are sequences of UTF-8 bytes?

Here is my Python code:


a = "ജയദേവൻ"
for x in range(1, 8):
    print(x, "\t", a[0:x])

The above code works perfectly in Python 3.

Is Unicode not natively supported in Nim?

GULPF (orginal) [2018-06-08T06:29:07+02:00] view original

I have Java and Python background, where everything is in Unicode. Java, Python (in Narrow mode), Google Dart etc have UTF-16 natively supported. But encoding doesn't usually matter - the characters are Unicode.

That's not really true. A "character" is a complex concept in unicode that most languages completely ignores for performance reasons. In most languages, strings are sequences of UTF-16 values, which is not the same thing as characters (or even the same thing as a unicode code point, because of surrogate pairs).

Does this mean that Strings in Nim are not sequences of characters, but are sequences of UTF-8 bytes?

Yes.

Is Unicode not natively supported in Nim?

Depends on how you define natively. If you just want UTF-16 (which is what it sounds like), it would be possible to define a custom string type that uses it instead of UTF-8. You could then easily have length, slice, and index operations that almost operate on characters (as long as you never use any surrogate pairs, or combining characters, or probably something else that I'm missing). Personally I think Nims design is better, since it's easier to understand.

Araq (orginal) [2018-06-08T06:29:33+02:00] view original

Does this mean that Strings in Nim are not sequences of characters, but are sequences of UTF-8 bytes?

Yeah and I'm sure the tutorials and manual mentions it...

Is Unicode not natively supported in Nim?

Well it's in a library. :-)

The above code works perfectly in Python 3.

IME nothing works "perfectly" when it comes to Unicode, Unicode is quite complex and you need to be aware of it in order to write correct code. You cannot ignore Unicode semantics in any language, be it Python or C#.

JohnAD (orginal) [2018-06-08T06:31:49+02:00] view original

I'll allow others to cover more of the internals of Nim.

As to Python, ever since version 3.3, it uses a hybrid UCS-1, UCS-2, and UCS-4 array; changing the internal array representation on-the-fly. So string that happens to only have ascii will be an 8-bit array. But, if you add a code point requiring 16-bit or 32-bit, the whole array flips to 16 or 32-bits. I don't recall it ever using UTF-16 however, but I'm not a Python historian. :)

I believe the Windows version of Python will output 'print' to UTF-16, of course. To html, it outputs UTF-8 (Jinja2 in Flask, django.utils.encoding in Django.) In Linux, it honors the shell's locale settings.

Nim's "rune" (using the unicode module) can use UTF-32 or UTF-8. I see support for UTF-16, but I don't know how that is supposed to work. I'll leave that answer for others.

Jayadevan (orginal) [2018-06-08T08:45:38+02:00] view original

Thanks all for the reply.

@JohnAD Yes - I forgot about the PEP 393.

So, from what I have understood from @Araq's and @GULPF's answers, "broken" was too strong a word.

By "Native Support", I mean the difference between https://rosettacode.org/wiki/Reverse_a_string#Nim and https://rosettacode.org/wiki/Reverse_a_string#Python .

But, I understand that what Araq said is true that nothing is perfect when it comes to Unicode. Both the codes (Python and Nim) don't take the semantics of the given script into account, and therefore produces incorrect output. "കാപ്പി" must be reversed to "പ്പികാ", but is wrongly reversed to "ിപ്പാക" in both the language's Rosetta example.

The libraries and languages also fail equally for "Regional Indicator Symbol" characters. "🇮🇳" gets reversed to "🇳🇮". That is, India gets reversed to Nicaragua - a semantically meaningless reversal!

The crux of the story is that we need a much more sophisticated library, taking into account the semantics of each script and other intricacies.

Thanks again for enlightening me about this topic.

zahary (orginal) [2018-06-08T13:36:45+02:00] view original

@nitely has put a lot of effort into making a comprehensive set of unicode libraries for Nim:

https://github.com/nitely/nim-unicodeplus

https://github.com/nitely/nim-unicodedb

https://github.com/nitely/nim-graphemes

https://github.com/nitely/nim-strunicode

He'll probably enjoy your quest to break all unicode implementations out there :)

Jehan (orginal) [2018-06-08T16:07:49+02:00] view original

As in some other languages (such as OCaml or D), strings in Nim are encoding-agnostic. While there are pros to having Unicode strings as the default, there are also definite cons.

Some of these are:

Dealing with other encodings (such as legacy data in Latin-1 or supporting Shift-JIS in Japan) becomes more complicated.

It adds either complexity or affects performance if you want to write code that does not care about encodings (such as sending strings between processes).

You will still need a type for "sequence of octets", possibly requiring you to duplicate functionality.

Full Unicode semantics can greatly affect performance. A well-known example is the String.StartsWith() function in C#, where you have to explicitly use the ordinal culture or string comparison function to not take a significant performance hit.

Serious Unicode support – i.e. where you actually need the full functionality to support multiple language and don't just use it as a sort of ASCII++ thing – has some non-trivial design and implementation complexity that you don't necessarily want in the core language.

Jayadevan (orginal) [2018-06-09T19:14:28+02:00] view original

When other modern languages support this well, why not Nim?

Code strings = ["ജയദേവൻ","കാപ്പി", "🇮🇳"]

strings.each do |s|: puts "#{s} -> #{s.reverse}"

end

Output ജയദേവൻ -> ൻവദേയജ കാപ്പി -> പിപ്കാ 🇮🇳 -> 🇮🇳

Speaking about library support, even assembly language supports Unicode (non-natively): https://www.nasm.us/doc/nasmdoc3.html#section-3.4.5

@Araq From your vast experience in designing the language, can you answer this question: Is having a native "Unicode Rune" Type built into the language going to have any impact on performance of other types, like the present "Byte String" Type?

I am asking this, because other than this, I like the language a lot.

dom96 (orginal) [2018-06-09T20:07:56+02:00] view original

So far all you've shown are dynamically typed languages. How do statically typed languages handle this?

If you really want we could introduce a UnicodeString type which has the appropriate slicing operators defined on it. Perhaps that would have made more sense instead of this toUpperAscii vs. toUpper convention we have going on :P

Araq (orginal) [2018-06-09T20:42:16+02:00] view original

import unicode

let strings = ["ജയദേവൻ","കാപ്പി", "🇮🇳"]
for x in strings:
  echo reversed(x)

The fact that reversed does not handle it well is just a bug, in fact, the unicode module is about to get a large improvement. We could do type UnicodeString = distinct string with the overloads as dom suggested but it makes little sense as then the same questions come up -- "what does items mean? code points, glyphs or graphemes?"

Jayadevan (orginal) [2018-06-09T22:47:37+02:00] view original

@Araq If the unicode module becomes a built-in, after the bugs are solved, it would be a native support built into the language. Python is "batteries included", because of built-ins (https://docs.python.org/3/library/functions.html). Since the compiler developers are aware of the built-ins, speed can be easily improved. Javascript, SQL etc all have built-ins. Static Language Java has its UTF-16 based String class (and many others) built-in. Java.lang package is built-in.

@dom96 The previous example was of Crystal (Statically typed). Java, C#, Dart2 etc are all UTF-16 based statically typed languages which support Unicode directly. When I use Java, I can use String.charAt(). That is because String belongs to Java.lang which is automatically imported.

Araq (orginal) [2018-06-09T23:08:45+02:00] view original

I assume you mean "builtin" as in "merge the unicode module into system.nim". That's not gonna happen because it has no benefits and plenty of downsides. (system.nim is already too big anyway and it's a burden for the embedded targets).

Optimizations have little to do with this "builtin" quality, a Unicode library can be as optimized as a "builtin" Unicode library. Especially in Nim with its meta programming superpowers.

mratsim (orginal) [2018-06-11T20:59:54+02:00] view original

I think it's builtin as in the standard lib and you don't need to nimble install an extra package.

treeform (orginal) [2018-06-11T22:42:40+02:00] view original

I support treating all strings as utf-8 like nim does. I feel that is the right way to do things. If you want Runes you use a library to run through them. If you need to support skin color on emojis, again library. Most of the time you are not dealing with this. So its fine and preferred to pass utf-8 bytes around. It is also fast.

Python tried to do too much and created endless unicode issues. It looks easy at start but is hard in a large code base. Don't do it the python way.

honhon (orginal) [2018-06-12T04:20:25+02:00] view original

I agree with treeform

mratsim (orginal) [2018-06-12T10:00:27+02:00] view original

In my opinion, a distinct type or object should be introduced called utf8string.

Changing the default string is very breaking. A lot of code is using string already to represent binary blob (though they should use seq[byte] but that's another story as range/span/buffer type for binary blobs is missing). Examples from the tiny world of nim crypto libraries here (nimSHA2 and libsodium). And I'm pretty sure I saw string used in Nim graphics/OpenGL libraries.

In many cases ASCII strings are plenty sufficient and I would like to avoid overhead to deal with utf-8. This can be done if different types are used and with proc overloading.

Mirror of forum.nim-lang.org

3898 :: Is String Datatype in Nim Broken?