Don't you think a rune-position proc is missing?
We have runeAt, but it is indexed by byte position, so:
let s = "az€rt"
echo runeAt(s,3) -> has no meaning
It would be great to have something like runeAtPos, where:
echo runeAtPos(s, 3) -> is r
Here is a naive and slow demo implementation to show what I mean:
import unicode

## Get the Rune at a rune position (not a byte offset), as a UTF-8 string.
## Returns "" when pos is out of range.
proc runeAtPos*(s: string, pos: int): string =
  var i = -1
  for r in runes(s):
    inc(i)
    if i == pos:
      return $r

when isMainModule:
  let s = "az€r™tλγq"
  assert runeAtPos(s, 0) == "a"
  assert runeAtPos(s, 1) == "z"
  assert runeAtPos(s, 2) == "€"
  assert runeAtPos(s, 3) == "r"
  assert runeAtPos(s, 4) == "™"
  assert runeAtPos(s, 5) == "t"
  assert runeAtPos(s, 6) == "λ"
  assert runeAtPos(s, 7) == "γ"
  assert runeAtPos(s, 8) == "q"
Any comments?
Hi @nodrygo,
A few quick comments:
I assume you have read the previous thread on Unicode in Nim?
No, but I have now, thanks.
Maybe I am wrong, but mixing UTF-8 and ASCII in the same byte array can be confusing.
My dream would be to have a module and type (UTF8?) similar to Elixir's powerful strings ;-)
That said, your proposal (#2353) is interesting; maybe what is missing is a Unicode slice?
I just created this Pull-Request
It adds a proc to find the byte offset of a rune (efficiently) and two others that return the rune, or a UTF-8 string of the rune (also efficiently). I bet @def can do it better though :)
I had no time for a test, but if @Araq accepts it I will add one (based on your example, @nodrygo). I think the naming is good, like @jboy said.
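To make the "byte offset of a rune" idea concrete, here is a minimal sketch. It is not the code from the PR; the name byteOffsetOfRune and the -1 return value are assumptions made only for illustration. It relies on runeLenAt from the unicode module to hop from one rune start to the next without decoding each rune.

```nim
import unicode

# Hypothetical helper (not the PR's code): byte offset of the rune at
# rune position `pos`, or -1 when `pos` is past the last rune.
proc byteOffsetOfRune(s: string, pos: Natural): int =
  var
    offset = 0   # current byte index into s
    seen = 0     # number of runes skipped so far
  while offset < s.len:
    if seen == pos:
      return offset
    # runeLenAt reports how many bytes the rune starting at `offset` takes
    offset += s.runeLenAt(offset)
    inc seen
  return -1

let s = "az€rt"
let off = byteOffsetOfRune(s, 3)
assert off == 5                  # '€' occupies bytes 2..4
assert $(s.runeAt(off)) == "r"   # runeAt takes the byte offset
```

Once the byte offset is known, the existing byte-based runeAt can be used as is.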
@Araq: I have no obsession with Runes ;-) and until your answer I had not realized they were UTF-32.
@OderWat: thanks for your PR
As I said, my only request was for powerful UTF-8 functions like those I have in most other languages. People seem obsessed with fast ASCII, but we now live in a worldwide context, so, IMHO, good UTF-8 functions are not optional, and we should not have to wonder whether an index is byte- or code-point-oriented.
But btw, this is neither major nor urgent, because at the moment I am not working on a real project with Nim. (I have a small plan/need for a simple Android/iOS app, but I will probably use HAXE / openFL* )
And in fact, for the Web I should stay with Elixir/Phoenix.
The overall advantage of Nim is its performance with the C back end, its support for other back ends (JavaScript), and how easy it is to bind C. So it is a more general-purpose language.
The HAXE/C++ back end produces slower code and needs a shared lib, but I have never benchmarked it against Nim.
To conclude, Nim remains a very interesting language and an excellent piece of work, and I will keep following its evolution.
OderWat: I just created this Pull-Request
This needs clearly defined semantics for when pos is not valid (i.e. negative or greater than or equal to the number of runes in a string).
I'm also not quite sure what the point of an O(len(s)) operation is in this context (i.e. to access a single character, but no others). Is there any actual use case for this? It is almost always saner to convert the string to a seq[Rune], which you can index randomly.
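For reference, the seq[Rune] approach can be sketched like this, using toRunes from the unicode module. The conversion is a single O(len(s)) pass; every lookup afterwards is a plain seq index.

```nim
import unicode

let s = "az€r™tλγq"
let rs = s.toRunes          # seq[Rune], built in one pass over the string

assert rs.len == 9          # 9 runes, although s.len (bytes) is larger
assert $(rs[2]) == "€"
assert $(rs[4]) == "™"
assert $(rs[8]) == "q"
```

This pays off as soon as more than one position is accessed, at the cost of four bytes per rune.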
@def / @araq: Of course it is not ideal to traverse a Unicode string. But that is not what this should be used for.
If I want to test some supposedly fixed-width data which comes as UTF-8, it will be faster than using the iterator. I have to admit that the use case may be rare and that adding such functions may let lazy users ignore the iterator approach.
But I also think that you sometimes just need to do something "quick and lazy", and for this you won't need to access the iterator. Maybe adding a "Warning: This is a lazy way of handling a Unicode string, consider using the iterator for maximum performance." would not just help people see that there is another way, but also stop people from thinking that something is missing, and let them be lazy if they want.
@Jehan: it runs out of index when pos is not valid. I don't see a problem with that?
It is more efficient to access "rune at position 10" than the iterator or any other way afais.
The use case is to check for a "rune / character" at a fixed position in a UTF-8 string. Like checking if there is a colon at position 20 in a fixed-width Unicode string.
Where "fixed width" means "it would be 20 chars in latin-1". We have such cases and I have some (non Nim) code which just checks on index positions in UTF-8 strings where nobody cares if that could be faster if implemented differently.
EDIT: Actually, we convert a lot of UTF-8 into Latin-1 before we continue processing.
P.S.: I won't die if that does not go into stdlib. But I rather like the idea of putting it in and writing a big warning in the docs to help point people in the right direction when they search for their "10th char in a UTF-8 string" problems.
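The "colon at position 20" check described above might look roughly like the sketch below. The helper name hasRuneAt and the sample record are made up for illustration; the point is only that the byte index and the rune index disagree as soon as the data contains non-ASCII characters.

```nim
import unicode

# Hypothetical helper: true if the rune at rune position `pos` equals
# `expected`; false if it differs or if `pos` is past the end.
proc hasRuneAt(s: string, pos: Natural, expected: Rune): bool =
  var i = 0
  for r in runes(s):
    if i == pos:
      return r == expected
    inc i
  return false

let colon = ":".runeAt(0)
# Invented sample: 20 characters, then a colon ("it would be 20 chars in latin-1").
let record = "Überschrift für Test: wert"

assert hasRuneAt(record, 20, colon)   # rune position 20 is the colon
assert record[20] != ':'              # byte 20 is not, because Ü and ü take 2 bytes each
```

It stops at the requested position instead of walking the whole string, which is the "quick and lazy" trade-off under discussion.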
OderWat: it runs out of index when pos is not valid. I don't see a problem with that?
Well, rune and runeStr return the first Unicode character of the string for a negative index; they return the Unicode null character when pos equals the length of the string (measured in Unicode characters), and result in an error when it's bigger. That isn't consistent in any way, shape, or form (the semantics of runeOffset aren't consistent, either).
I'd also recommend a less generic procedure name than rune.
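One possible way to get consistent semantics, sketched here only as an illustration (it is not what the PR does, and the name tryRuneAtPos is hypothetical), is to treat every out-of-range position the same way and return an Option[Rune]:

```nim
import unicode, options

# Hypothetical name. Any position outside 0 ..< runeLen(s) yields none(Rune),
# whether it is negative, equal to the rune count, or larger.
proc tryRuneAtPos(s: string, pos: int): Option[Rune] =
  if pos < 0:
    return none(Rune)
  var i = 0
  for r in runes(s):
    if i == pos:
      return some(r)
    inc i
  return none(Rune)

let s = "az€rt"
assert tryRuneAtPos(s, 2).get == "€".runeAt(0)
assert tryRuneAtPos(s, -1).isNone
assert tryRuneAtPos(s, 5).isNone    # pos == number of runes
assert tryRuneAtPos(s, 6).isNone    # pos > number of runes
```

Raising an index error for all three invalid cases would be just as consistent; what matters is that they are not handled three different ways.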
OderWat: It is more efficient to access "rune at position 10" than the iterator or any other way afais.
I understand that. What I was wondering about is under what circumstances you actually need that feature because I couldn't think of any off-hand.
I made a lot of adjustments, changed names and used a scheme to optimize this "non-iterator" way of handling UTF-8. Then I added a substr()-like proc as a use case for the rune-position-related helper functions.
The updated PR also shows examples of use cases which are not too artificial in my eyes.
I also added a warning (probably too wordy) to the documentation, so that "lazy" people find a solution along with a hint that there are more options to solve their problem.
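As a rough idea of what a substr()-like proc over rune positions can look like, here is a minimal sketch. It is not the PR's implementation; the name runeSlice and its parameters are assumptions, and it presumes the input is valid UTF-8.

```nim
import unicode

# Hypothetical helper: the substring made of up to `count` runes,
# starting at rune position `first`. Assumes `s` is valid UTF-8.
proc runeSlice(s: string, first, count: Natural): string =
  var
    offset = 0   # current byte index
    seen = 0     # runes consumed in the current phase
  # Phase 1: skip `first` runes to find the starting byte.
  while offset < s.len and seen < first:
    offset += s.runeLenAt(offset)
    inc seen
  let startByte = offset
  # Phase 2: advance over up to `count` runes to find the end byte.
  seen = 0
  while offset < s.len and seen < count:
    offset += s.runeLenAt(offset)
    inc seen
  result = s[startByte ..< offset]

let s = "az€r™tλγq"
assert runeSlice(s, 2, 3) == "€r™"
assert runeSlice(s, 7, 10) == "γq"   # count is clamped at the end of the string
assert runeSlice(s, 20, 2) == ""     # start past the end gives an empty string
```

Both the start and the end are found in one forward walk, so the cost stays proportional to the bytes up to the end of the slice.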