I suppose I'll be slapped for the unorthodox thing I have done here, but here it is: a UTF-8 string for Nim based on libutf8rewind.
https://github.com/rokups/nim-ustring
The most evil thing is Python-style string slicing. Pretty sure plenty of people would enjoy s[1 ~ -1] instead of s[1..^2].
Edit: Note that I had to make a ~ operator so I can create slices with negative indices. I really hate making a new operator for this, but the negative index restrictions on .. forced my hand.
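For anyone wondering what such an operator could look like, here is a minimal sketch. It is not ustring's actual code: the PySlice type and the byte-based [] overload are made up for illustration, the upper bound is assumed to be exclusive as in Python, and it only slices ASCII text correctly.

```nim
# Hypothetical sketch, not ustring's implementation. Byte-based, ASCII only.
type PySlice = object
  first, last: int  # `last` is exclusive, as in Python (assumption)

proc `~`(first, last: int): PySlice =
  PySlice(first: first, last: last)

proc `[]`(s: string, sl: PySlice): string =
  # Negative indices count back from the end of the string.
  var a = if sl.first < 0: s.len + sl.first else: sl.first
  var b = if sl.last < 0: s.len + sl.last else: sl.last
  # Clamp out-of-range bounds instead of raising, like Python does.
  a = clamp(a, 0, s.len)
  b = clamp(b, a, s.len)
  result = s[a ..< b]

echo "hello"[1 ~ -1]   # ell
echo "hello"[1..^2]    # ell
```

Since ~ sits at the same precedence level as + and -, anything more complicated than plain literals on either side probably wants parentheses.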
runs for cover
Yeah, these are valid points. To benefit the most people, I'll make it behave like the default string implementation, although I'm still keeping -d:ustringPythonic for how it works now.
What is bothering me is the restriction on negative indexes for the .. operator. Is that limitation ever going away? I mean, I understand why it's there and all, but it prevents people from reusing the .. operator for their own purposes the way they want. I am pretty sure the world would be a happier place if there was one operator less to use.
Just dropping a note that the lib was tidied up a bit after a long hiatus. The Nimble package is properly set up. It also includes utf8rewind and builds it automatically, so if you wish to try it, it's just a nimble install ustring away.
By the way, as for the old discussion regarding string slicing: before the hiatus I reworked it to support standard Nim behavior and added the Python-like behavior as a custom addon. The [-3..^2] slice produces the same result as [-3..-1]. Enjoy.
If you can improve the UTF-8 support for Nim strings, I would strongly support getting that improvement into the standard library, especially when it is about fixing bugs. But to me it feels like you assume your audience knows the problems of UTF-8 strings in Nim, and I don't. If you introduced me to the problems with normal strings that you encountered, I could say whether I agree with your solution or not, or I could give ideas for further improvements. As it is, I simply don't understand where normal strings fail and what you did to improve them.
I know that when you treat UTF-8 strings as arrays of bytes, there are indices that fall inside a multibyte character, and therefore you are not allowed to use such an index as the start or end of a range/substring. But that alone I do not see as a problem, as long as you know what you are dealing with.
So the short version of my post would be: What is the actual problem you are trying to solve?
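To make that concrete, here is a small sketch using only the standard unicode module; the sample string is arbitrary. It shows that byte indices and character counts disagree, and that cutting inside a two-byte character yields bytes that are no longer valid UTF-8.

```nim
import unicode

let s = "αβγ"                  # three Greek letters, two bytes each in UTF-8

doAssert s.len == 6            # len counts bytes
doAssert s.runeLen == 3        # runeLen counts code points
doAssert s[0..1] == "α"        # both bytes of the first rune
# Index 0 alone is only half of "α"; validateUtf8 returns -1 for valid input
# and the position of the first bad byte otherwise.
doAssert validateUtf8(s[0..0]) != -1
doAssert validateUtf8(s[0..1]) == -1
```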
Swift has an elegant solution to this ... a string has properties that provide different views of the same sequence of bytes, so you can process it as a sequence of bytes, a sequence of utf16 code units, a sequence of utf32 code units, or a sequence of Character (the default, which is an extended grapheme cluster), just by specifying the property, e.g., mystring, mystring.utf8, mystring.utf16, etc.
Truth is that I did not read that Swift page initially (long read). But now that you explicitly mention it, it indeed sounds elegant, and not too far from what Nim provides out of the box. It's just that this stuff is a bit hidden away from the user in a separate module behind cryptic proc names. Ideally I would love to see the standard string manipulation procs doing the right thing based on the selected string view. That would be very user-friendly and probably the best compromise we could come up with. Bonus points for not breaking backwards compatibility.
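For reference, this is roughly what the "out of the box" version looks like today: a sketch that walks the same string once as bytes and once as code points using the standard unicode module. It only covers bytes and code points; as far as I know there is no grapheme cluster view comparable to Swift's Character in that module.

```nim
import unicode

let s = "αβγδ"

var byteCount = 0
for c in s:                 # byte view: each item is a char
  inc byteCount

var runeCount = 0
for r in s.runes:           # code point view: each item is a Rune
  inc runeCount

doAssert byteCount == 8 and runeCount == 4
doAssert s.runeSubStr(1, 2) == "βγ"        # position and length in runes
doAssert s.toRunes[3].toUTF8 == "δ"        # random access via seq[Rune]
```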
Isn't this what ustring does?
It defines a type
type ustring = distinct string
and operations over it use UTF-8 units, so that user can do something like
someStringEncodedInUtf8.ustring.substring(12, 24)
and UTF-8 conventions are used
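A minimal sketch of that idea, with loud caveats: Utf8Str and this substring proc are hypothetical names written for this post, not ustring's API, and I am assuming that both arguments are character positions with an inclusive upper bound. It delegates to runeSubStr from the standard unicode module rather than to utf8rewind.

```nim
import unicode

# Hypothetical type for illustration; ustring's real definition may differ.
type Utf8Str = distinct string

proc substring(s: Utf8Str, first, last: int): Utf8Str =
  # first and last are rune positions, last inclusive (an assumption).
  Utf8Str(string(s).runeSubStr(first, last - first + 1))

let u = Utf8Str("αβγδεφ")
doAssert string(u.substring(1, 3)) == "βγδ"
# The same numbers applied to the plain string are byte offsets instead.
doAssert string(u).substr(1, 3) != "βγδ"
```

Because the type is distinct, none of the byte-based string procs apply to it by accident; each one has to be borrowed or reimplemented on purpose.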
Uh? Now I am confused.
Aren't you the author of ustring? I think you are in a position to change the name if you want :-)
Not sure if this is the right place to mention it, but something I did notice with the default nim strings, is that upper bounds are inclusive for substring operations. I don't like that, because that forces me to use invalid indices as bounds for utf8 strings:
import strutils
let str = "αβγδεφ"
let lower = str.find("β", 0)
let upper = str.find("ε", lower)
let sub = str.subStr(lower, upper - 1) # upper - 1 on its own is not a valid rune boundary, but passing upper here would include the first byte of "ε"
echo sub
Basically I prefer exclusive upper bounds for everything. But for strings I have an argument that is not purely about preference.
I don't see any problem here. So what if "upper-1 alone is invalid"? Of course it's invalid; it's the last byte that still belongs to the rune.
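To spell out the byte arithmetic for that example (each of these Greek letters takes two bytes in UTF-8), here is a quick check. The last line uses the ..< operator from the system module, which gives an exclusive upper bound for slices where one is wanted.

```nim
import strutils

let str = "αβγδεφ"
let lower = str.find("β")          # byte offset of the first byte of "β"
let upper = str.find("ε", lower)   # byte offset of the first byte of "ε"

doAssert lower == 2 and upper == 8
# Inclusive bound: byte 7 is the second (last) byte of "δ".
doAssert str.substr(lower, upper - 1) == "βγδ"
# Exclusive bound written with ..< yields the same bytes.
doAssert str[lower ..< upper] == "βγδ"
```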
But for strings I have an argument that is not purely about preference.
It never was about preference, exclusive upper bounds do not work. ;-)