nimforum mirror - utf-8 string for nim

rku [2016-02-21T14:13:53+01:00]

I suppose I'll be slapped for the unorthodox thing I have done here, but here it is: a utf-8 string for nim based on libutf8rewind.

https://github.com/rokups/nim-ustring

The most evil thing is Python-style string slicing. Pretty sure plenty of people would enjoy s[1 ~ -1] instead of s[1..^2].

Edit: Note that I had to make a ~ operator so I can create slices with negative indices. I really hate making a new operator for this, but the negative index restrictions on .. forced my hand.

runs for cover
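
For illustration, here is a minimal sketch of how a ~ operator can carry negative indices into a slice. It is not taken from nim-ustring: the PySlice type, the ~ and [] overloads, and the Python-style semantics (negative values count from the end, the end index is exclusive, indices are counted in runes) are all assumptions made for this example.

```nim
import unicode

# Hypothetical slice type; both ends may be negative.
type PySlice = object
  a, b: int

proc `~`(a, b: int): PySlice =
  ## Builds a slice without going through `..`.
  PySlice(a: a, b: b)

proc `[]`(s: string, sl: PySlice): string =
  ## Python-style slicing over runes: negative indices count from the end,
  ## the end index is exclusive, out-of-range values are clamped.
  let n = s.runeLen
  var first = if sl.a < 0: sl.a + n else: sl.a
  var last = if sl.b < 0: sl.b + n else: sl.b
  first = max(0, min(n, first))
  last = max(0, min(n, last))
  if last <= first:
    result = ""
  else:
    result = s.runeSubStr(first, last - first)

echo "hello"[1 ~ -1]   # ell
echo "αβγδ"[1 ~ -1]    # βγ
```

Because ~ is an ordinary user-defined operator, it is not subject to whatever checks the built-in .. slices apply, which is presumably why a separate operator was the way out.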

Varriount [2016-02-21T16:39:06+01:00]

As much as I like utf-8 compatibility, breaking conventions like these is a big minus for the library - it imposes additional mental overhead and makes generics/templates for multiple string types more complicated than it needs to be.

OderWat [2016-02-21T17:28:19+01:00]

The same could be said for 1..^2 doing something really strange ;-)

Varriount [2016-02-21T19:17:57+01:00]

@Oderwat Yes, but my point is that a library which provides a variant on a builtin type and defies certain standards is making interoperability needlessly hard.

OderWat [2016-02-21T20:34:12+01:00]

Sure. It should implement both notations. I thought the negative index one was an alternative, but that seems not to be the case.

rku [2016-02-22T10:19:27+01:00]

Yeah, these are valid points. To benefit the most people I'll make it behave like the default string implementation, although I'm still keeping -d:ustringPythonic for how it works now.

What is bothering me is the restriction on negative indexes for the .. operator. Is that limitation ever going away? I mean, I understand why it's there and all, but it prevents people from reusing the .. operator for their own purposes the way they want. I am pretty sure the world would be a happier place if there was one operator less to use.

rku [2017-02-04T13:36:40+01:00]

Just dropping a note that the lib was tidied up a bit after a long hiatus. The Nimble package is properly set up. It also includes utf8rewind and builds it automatically, so if you wish to try it, it's just a nimble install ustring away.

By the way, as for the old discussion regarding string slicing: before the hiatus I reworked it to support standard nim behavior and added python-like behavior as a custom addon. A [-3..^2] slice produces the same result as [-3..-1]. Enjoy.

Krux02 [2017-02-04T18:01:58+01:00]

I am a bit puzzled here. Aren't Nim strings utf8 already? I mean, slicing is another thing, but I don't necessarily like the idea of yet another string type. So why did you create a distinct string type?

rku [2017-02-05T18:56:57+01:00]

Nim strings are blobs of bytes. Yes, text is encoded to utf8 when it is stored in those strings, but they are still blobs of bytes. I needed a real string type that handles text the way one would expect, instead of corrupting text on manipulation or requiring the use of obscure modules. Besides, I believe utf8rewind is better tested, more proven and more correct than nim's unicode module. I wish we did not need ustring, but that is not the case, so here it is. People who do not need good handling of text outside the Latin family will surely not find this useful.
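
To make the "blobs of bytes" point concrete, here is a small sketch using only the standard library's unicode module. It shows the general byte-versus-rune difference and is not a description of how ustring works internally; the variable names are made up for the example.

```nim
import unicode

let s = "αβγ"                 # three Greek letters, two bytes each in UTF-8

echo s.len                    # 6: length in bytes
echo s.runeLen                # 3: length in code points

# Slicing by byte index can cut a letter in half.
let broken = s[0..0]          # only the first byte of "α"
echo validateUtf8(broken)     # 0: index of the first invalid byte (-1 means valid)

# The unicode module slices by rune instead.
echo s.runeSubStr(1, 2)       # βγ
```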

Krux02 [2017-02-05T21:27:42+01:00]

If you can improve the utf8 support for Nim strings, I would strongly support you in getting that improvement into the standard library, especially when it is about fixing bugs. But to me it feels like you assume your audience knows the problems of utf8 strings in Nim. I don't. I don't know the problems with Nim strings. If you would introduce me to the problems with normal strings that you encountered, I could argue whether I agree with your solution or not. Or I could give ideas for further improvements. But I simply don't understand where normal strings fail, and what you did to improve them.

I know that when you treat utf8 strings as arrays of bytes, there are indices that fall inside a multibyte character, and therefore you are not allowed to use such an index as the start or end of a range/substring. But that alone I do not see as a problem, when you know what you are dealing with.

So the short version of my post would be: What is the actual problem you are trying to solve?
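
The "indices inside a multibyte character" mentioned above can be detected from the byte alone, since UTF-8 continuation bytes always have the bit pattern 10xxxxxx. A minimal sketch; isRuneStart is a hypothetical helper name, not a standard library proc.

```nim
proc isRuneStart(s: string, i: int): bool =
  ## Hypothetical helper: true if byte `i` begins a UTF-8 sequence.
  ## Continuation bytes have the bit pattern 10xxxxxx.
  (ord(s[i]) and 0xC0) != 0x80

let s = "αβ"   # bytes 0-1 encode "α", bytes 2-3 encode "β"
for i in 0 ..< s.len:
  echo i, " ", s.isRuneStart(i)
```

Only indices for which this returns true are safe as the start of a byte range.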

rku [2017-02-07T15:52:23+01:00]

Swift has an elegant solution to this ... a string has properties that provide different views of the same sequence of bytes, so you can process it as a sequence of bytes, a sequence of utf16 code units, a sequence of utf32 code units, or a sequence of Character (the default, which is an extended grapheme cluster), just by specifying the property, e.g., mystring, mystring.utf8, mystring.utf16, etc.

Truth is that I did not read that swift page initially (long read). But now that you explicitly mention it, it indeed sounds elegant. And it is not too far from what nim provides out of the box. It's just that this stuff is a bit hidden away from the user in a separate module behind cryptic proc names. Ideally I would love to see standard string manipulation procs doing the right thing based on the selected string view. That would be very user-friendly and probably the best compromise we could come up with. Bonus points for not breaking backwards compatibility.
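
As a rough sketch of what "procs doing the right thing based on the selected view" could look like in Nim, distinct types let the same proc names dispatch differently per view. ByteView, RuneView, asBytes and asRunes are hypothetical names invented here; only runeLen and runeAtPos come from the standard unicode module. A grapheme-cluster view is not covered.

```nim
import unicode

type
  ByteView = distinct string   # hypothetical: counts and indexes raw bytes
  RuneView = distinct string   # hypothetical: counts and indexes code points

proc asBytes(s: string): ByteView = ByteView(s)
proc asRunes(s: string): RuneView = RuneView(s)

# Same proc names, different behavior depending on the view type.
proc len(v: ByteView): int = string(v).len
proc len(v: RuneView): int = string(v).runeLen

proc `[]`(v: ByteView, i: int): char = string(v)[i]
proc `[]`(v: RuneView, i: int): Rune = string(v).runeAtPos(i)

let text = "αβγ"
echo text.asBytes.len     # 6
echo text.asRunes.len     # 3
echo text.asRunes[1]      # β
```

Since the views are distinct types over the same string, existing code that takes a plain string keeps its byte semantics.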

andrea [2017-02-07T16:02:44+01:00]

Isn't this what ustring does?

It defines a type

type ustring = distinct string

and operations over it use UTF-8 units, so that user can do something like

someStringEncodedInUtf8.ustring.substring(12, 24)

and UTF-8 conventions are used

rku [2017-02-07T16:38:31+01:00]

Yes indeed, I did not even realize. This flexibility of nim... Now we just need a type working at the grapheme cluster level, and good names for them. someStringEncodedInUtf8.utf8 is nice, someStringEncodedInUtf8.ustring not so much.

andrea [2017-02-07T17:50:25+01:00]

Uh? Now I am confused.

Aren't you the author of ustring? I think you are in a position to change the name if you want :-)

rku [2017-02-07T19:59:57+01:00]

Sure sure, but I am not the best at thinking up proper names ;)

Krux02 [2017-02-08T14:41:05+01:00]

Not sure if this is the right place to mention it, but something I did notice with the default nim strings is that upper bounds are inclusive for substring operations. I don't like that, because it forces me to use invalid indices as bounds for utf8 strings:

import strutils

let str = "αβγδεφ"
let lower = str.find("β", 0)
let upper = str.find("ε", lower)
# upper - 1 is not a valid index on its own (it falls inside a rune),
# but passing upper itself would be wrong here: it would include the first byte of "ε".
let sub = str.substr(lower, upper - 1)

echo sub

Basically I prefer exclusive upper bounds for everything. But for strings I have an argument that is not purely about preference.
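
For comparison, Nim's slice syntax also has a half-open form, ..<, which takes the index returned by find directly as the excluded end. The sketch below reuses the example above; the names inclusive and halfOpen are made up for the illustration.

```nim
import strutils

let str = "αβγδεφ"
let lower = str.find("β")          # byte index 2
let upper = str.find("ε", lower)   # byte index 8

# Inclusive upper bound: the last byte wanted is upper - 1.
let inclusive = str.substr(lower, upper - 1)

# Half-open range: upper itself is the excluded end.
let halfOpen = str[lower ..< upper]

doAssert inclusive == halfOpen
echo halfOpen   # βγδ
```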

Araq [2017-02-08T14:53:06+01:00]

I don't see any problem here, so what if "upper-1 alone is invalid". Of course it's invalid, it's the last byte that still belongs to the rune.

But for strings I have an argument that is not purely about preference.

It never was about preference, exclusive upper bounds do not work. ;-)

Krux02 [2017-02-08T16:09:41+01:00]

Maybe I am a bit biased here, big majority of languages that I used had exclusive upper bounds, but in my experience it is the inclusive upper bounds that do not work 🤔.

Mirror of forum.nim-lang.org