Hi,
My first message here, as a new user. I discovered Nim a while ago, but just started "playing seriously" with it a few days ago. So let me start with a big THANK YOU to Andreas and the whole community.
To the point...
I wrote a simple test with the unicode library :
import unicode
const word = "Méthode"
echo word
echo "-------"
for i in 0..word.len-1 :
  echo i , " : " , word[i]
echo "-----"
for i in 0..word.runeLen-1 :
  echo i , " : " , word.runeAt(i)
And the result surprises me :
Méthode
-------
0 : M
1 :
2 : �
3 : t
4 : h
5 : o
6 : d
7 : e
-----
0 : M
1 : é
2 : ©
3 : t
4 : h
5 : o
6 : d
For the last part, I was expecting :
0 : M
1 : é
2 : t
3 : h
4 : o
5 : d
6 : e
Where's my mistake ?
Thanks...
In UTF-8, a rune occupies a variable number of bytes. runeAt takes the byte position as its parameter, not the character index. You could combine runeLenAt with runeAt, or use fastRuneAt or the runes iterator. This may be closer to what you want:
var i = 0 # i is bytepos
while i < word.len:
  echo i , " : " , word.runeAt(i)
  i += word.runeLenAt(i)
or
var i = 0 # i is bytepos
var r: Rune
while i < word.len:
  let start = i # fastRuneAt advances i past the rune it decodes
  word.fastRuneAt(i, r)
  echo start , " : " , r
or
var i = 0 # i is runepos
for rune in word.runes:
  echo i , " : " , rune
  inc i
The documentation for runeAt says: "returns the unicode character in s at byte index i"
This makes sense and is as it should be -- runeAt is for traversing the UTF-8 encodings of a string by adding the length of each one, which is an O(N) operation. If runeAt took a character index, traversal would be O(N*N).
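To make the cost argument concrete, here is a minimal sketch of what indexing by character position would involve. The helper name runeAtCharIndex is hypothetical (it is not part of the unicode module) and it assumes n is a valid character index. Each call has to walk the string from the start, which is why calling it in a loop would be quadratic. If you really need random access by character, decoding once with toRunes is probably the simpler route.

```nim
import unicode

const word = "Méthode"

# Hypothetical helper, not part of the unicode module:
# finds the rune at character index n by skipping n runes from the start.
# Assumes 0 <= n < s.runeLen; each call costs O(n).
proc runeAtCharIndex(s: string, n: int): Rune =
  var bytePos = 0
  for _ in 0 ..< n:
    bytePos += s.runeLenAt(bytePos)  # byte length of the rune at bytePos
  s.runeAt(bytePos)

# Alternative: decode the whole string once, then index the seq directly.
let runes = word.toRunes
for i in 0 ..< runes.len:
  echo i, " : ", runes[i]
```

With the word above, runeAtCharIndex(word, 2) and runes[2] should both give t, which is the output the original post was expecting.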
Indeed the documentation is quite clear, but my English is less so :-)
That's very logical.
Thanks def and jibal, you were both very helpful.