nimforum mirror - Rune at position

nodrygo [2015-03-25T10:08:29+01:00]

Don't you think a proc for getting the Rune at a given rune position is missing?

We have runeAt, but it is based on byte position, so:


import unicode
let s = "az€rt"
echo runeAt(s, 3)      # has no meaning: byte 3 is in the middle of the 3-byte '€'

It would be great to have something like runeAtPos, where:


echo runeAtPos(s, 3)   # -> r
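To make the byte/rune distinction concrete, here is a small sketch that uses only len, runeLen and runeAt. The byte offsets in the comments assume the string is UTF-8 encoded, where '€' takes three bytes.

```nim
import unicode

let s = "az€rt"

# len counts bytes, runeLen counts runes (code points).
echo s.len         # 7: 'a', 'z', three bytes for '€', 'r', 't'
echo s.runeLen     # 5

# runeAt takes a byte offset, so rune number 3 ('r') starts at byte 5.
echo runeAt(s, 2)  # €  (rune position 2, byte offset 2)
echo runeAt(s, 5)  # r  (rune position 3, byte offset 5)
```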

Here is a naive and slow demo implementation to show what I mean:


import unicode

proc runeAtPos*(s: string, pos: int): string =
  ## Get the Rune at rune position `pos` (not at a byte offset), as a UTF-8 string.
  ## Naive O(n) scan; returns "" if `pos` is out of range.
  var i = 0
  for r in runes(s):
    if i == pos:
      return $r
    inc(i)

when isMainModule:
  let s = "az€r™tλγq"
  assert  runeAtPos(s, 0)  ==  "a"
  assert  runeAtPos(s, 1)  ==  "z"
  assert  runeAtPos(s, 2)  ==  "€"
  assert  runeAtPos(s, 3)  ==  "r"
  assert  runeAtPos(s, 4)  ==  "™"
  assert  runeAtPos(s, 5)  ==  "t"
  assert  runeAtPos(s, 6)  ==  "λ"
  assert  runeAtPos(s, 7)  ==  "γ"
  assert  runeAtPos(s, 8)  ==  "q"

Any comments?

jboy [2015-03-25T11:07:13+01:00]

Hi @nodrygo,

A few quick comments:

I assume you have read the previous thread on Unicode in Nim?

Would it suffice for your needs to:
1. Convert a string to a seq[Rune] using a new proc runes(s: string): seq[Rune], then
2. Use normal seq indexing to obtain the i-th Rune in that seq[Rune]?
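The two steps above can be sketched as follows. This assumes the conversion proc is available as toRunes, which is the name the unicode module in current Nim uses; the post itself proposes calling it runes.

```nim
import unicode

let s = "az€r™tλγq"

# Decode once (O(n) time, one seq allocation), then index in O(1).
let rs: seq[Rune] = toRunes(s)

echo rs.len  # 9
echo rs[3]   # r
echo rs[4]   # ™
```

The trade-off is memory: each Rune is stored as a full integer, regardless of how many bytes it took in the UTF-8 string.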

If a new proc were added to get the i-th Rune in a string, I would suggest calling it just rune(s: string; i: int): Rune rather than runeAtPos(s: string, pos:int): string. Observe that runeLenAt and runeAt both expect i to be a byte-index, while runeLen (without an At suffix) talks about the number of runes rather than the number of bytes. Hence, rune (without any suffix) by analogy.

Whether to return a Rune or a string? There's a mini-explosion of combinations here: Position in bytes or runes? Return a string or a Rune? etc. I would return a Rune rather than a string, just to be similar to runeAt. Also, since a Rune is implemented as an int, this avoids any unnecessary memory allocation (if you actually wanted the Rune rather than a string).
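A minimal sketch of the Rune-returning variant, to show where the allocation happens. The name nthRune and the choice to raise ValueError on an out-of-range index are assumptions made for illustration, not an existing stdlib API.

```nim
import unicode

# Hypothetical helper: name and error behaviour are assumptions.
proc nthRune(s: string; i: Natural): Rune =
  var n = 0
  for r in runes(s):
    if n == i:
      return r
    inc n
  raise newException(ValueError, "rune index out of range: " & $i)

let r = nthRune("az€rt", 2)  # a Rune is an integer code point, no heap allocation
echo int32(r)                # 8364, i.e. U+20AC
echo r.toUTF8                # €  (a string is only allocated when asked for)
```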

nodrygo [2015-03-25T11:46:22+01:00]

jboy: I assume you have read the previous thread on Unicode in Nim?

No, but I have now, thanks.

Maybe I am wrong, but mixing both Unicode (UTF-8) and ASCII in the same byte array can be confusing.

My dream would be to have a module and type (UTF8?) similar to Elixir's powerful strings ;-)

That said, your proposal (#2353) is interesting. Maybe what is missing is a Unicode slice?

OderWat [2015-03-25T14:32:49+01:00]

I just created this Pull-Request

It adds procs to find the byte offset of a rune (efficiently) and two others to return the rune or a UTF-8 string of the rune (also efficiently). I bet @def can do it better though :)

I had no time for a test, but if @Araq accepts it I will add one (based on your example, @nodrygo). I think the naming is good, like @jboy said.
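As a rough illustration of the idea (not the code from the pull request), a byte-offset lookup can skip runes with runeLenAt instead of decoding each one. The name runeByteOffset and the -1 return value for an out-of-range position are assumptions.

```nim
import unicode

# Hypothetical sketch, not the proc from the pull request.
# Returns the byte offset where rune number `pos` starts,
# or -1 if `s` contains `pos` runes or fewer.
proc runeByteOffset(s: string; pos: Natural): int =
  var offset = 0
  var seen = 0
  while offset < s.len:
    if seen == pos:
      return offset
    offset += runeLenAt(s, offset)  # skip one rune without decoding it
    inc seen
  return -1

let s = "az€rt"
let off = runeByteOffset(s, 3)
echo off                   # 5
echo runeAt(s, off)        # r
echo runeByteOffset(s, 5)  # -1
```

Once the byte offset is known, runeAt gives the Rune and toUTF8 (or `$`) gives its string form.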

def [2015-03-25T14:58:54+01:00]

I don't think this should be used to traverse a unicode string. Having to traverse the entire string for each position is not ideal. If you really need that you could create a seq of runes and index them instead.
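For comparison, a single pass with the runes iterator decodes every rune exactly once, so visiting all positions costs O(n) in total rather than O(n) per position. The example task (collecting the positions of non-ASCII runes) is made up for illustration.

```nim
import unicode

let s = "az€r™tλγq"

# One pass over the string; `pos` is the rune position, not the byte offset.
var nonAscii: seq[int] = @[]
var pos = 0
for r in runes(s):
  if int32(r) > 127:
    nonAscii.add pos
  inc pos

echo nonAscii  # @[2, 4, 6, 7]
```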

Araq [2015-03-25T15:47:29+01:00]

I still don't get the obsession with Runes tbh. A Rune is a UTF-32 character, but again, Unicode is fundamentally a variable length encoding and there is not much you can do with a single rune.

nodrygo [2015-03-25T17:27:48+01:00]

@Araq: I have no obsession with Runes ;-) and until your answer I had not realized that a Rune was UTF-32.

@OderWat: thanks for your PR

As I said, the only request was for powerful UTF-8 functions like the ones I can have in most other languages. People seem obsessed with fast ASCII, but we now live in a worldwide context, so, IMHO, good UTF-8 functions are not optional and we should not have to wonder whether an index is byte or code-point oriented.

But btw, this is neither major nor important, because at the moment I am not working on a real project with Nim. (I have a small plan/need for a simple Android/iOS app, but I will probably use HAXE / openFL* )

And in fact for Web I should stay with Elixir/Phoenix.

The overall advantage of Nim is its performance with the C back-end, the ability to target other back-ends (JavaScript), and how easy it is to bind to C. So it's a more general-purpose language.

The HAXE/C++ backend produces code that is not so fast and needs a shared lib, but I have never benchmarked it against Nim.

To conclude, Nim remains a very interesting language and an excellent piece of work, and I will follow its evolution.

Jehan [2015-03-25T17:43:09+01:00]

OderWat: I just created this Pull-Request

This needs clearly defined semantics for when pos is not valid (i.e. negative or greater than or equal to the number of runes in a string).
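One possible way to pin the semantics down, sketched here with Option from the options module: every invalid position (negative, equal to the rune count, or larger) yields the same result. The name tryRune and the choice of Option over an exception are assumptions for illustration only.

```nim
import unicode, options

# Hypothetical helper: "none for every invalid position" is just one
# consistent rule; raising an exception would be another.
proc tryRune(s: string; pos: int): Option[Rune] =
  if pos < 0:
    return none(Rune)
  var n = 0
  for r in runes(s):
    if n == pos:
      return some(r)
    inc n
  return none(Rune)

let s = "az€rt"
echo tryRune(s, 2).isSome   # true
echo tryRune(s, -1).isNone  # true
echo tryRune(s, 5).isNone   # true (pos equals the number of runes)
echo tryRune(s, 99).isNone  # true
```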

I'm also not quite sure what the point of an O(len(s)) operation is in this context (i.e. to access a single character, but no others). Is there any actual use case for this? It is almost always saner to convert the string to a seq[Rune], which you can index randomly.

OderWat [2015-03-25T17:48:24+01:00]

@def / @araq: Of course it is not ideal to traverse a Unicode string this way. But that is not what this should be used for.

If I want to test some supposedly fixed-width data that arrives as UTF-8, it will be faster than using the iterator. I have to admit that the use case may be rare and that adding such functions may let lazy users ignore the iterator approach.

But I also think that you sometimes just need to do something "quick and lazy", and for this you won't need to access the iterator. Maybe adding a "Warning: This is a lazy way of handling a Unicode string, consider using the iterator for maximum performance." would not just help people see that there is another way, but also stop people from thinking that something is missing, and let them be lazy if they want.

OderWat [2015-03-25T17:57:26+01:00]

@Jehan .. it runs out of index when pos is not valid. I don't see a problem with that?

It is more efficient to access "rune at position 10" than the iterator or any other way afais.

The use case is to check for a "rune / character" at a fixed position in a UTF-8 string, like checking if there is a colon at position 20 in a fixed-width Unicode string.

Here "fixed width" means "it would be 20 chars in Latin-1". We have such cases, and I have some (non-Nim) code which just checks index positions in UTF-8 strings, where nobody cares whether it could be faster if implemented differently.
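A sketch of that kind of check, assuming a made-up record layout (a 20-rune name field followed by a colon). The helper name hasRuneAt is hypothetical; it skips runes with runeLenAt and decodes only the one at the target position.

```nim
import unicode

# Hypothetical helper for the fixed-position check; the name is an assumption.
proc hasRuneAt(s: string; pos: Natural; expected: Rune): bool =
  var offset = 0
  var n = 0
  # Skip `pos` runes without decoding them.
  while offset < s.len and n < pos:
    offset += runeLenAt(s, offset)
    inc n
  result = offset < s.len and runeAt(s, offset) == expected

# Made-up record: pad the name to 20 runes, then append ":42".
var record = "Jörg Müller"
while record.runeLen < 20:
  record.add ' '
record.add ":42"

echo hasRuneAt(record, 20, Rune(ord(':')))  # true
echo record[20] == ':'                      # false: byte 20 is still padding
```

The byte index is off by two here because 'ö' and 'ü' each take two bytes in UTF-8, which is exactly the case where a Latin-1 style position check goes wrong.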

EDIT: Actually, we convert a lot of UTF-8 into Latin-1 before we continue processing.

P.S.: I won't die if that does not go into the stdlib. But I rather like the idea of putting it in and writing a big warning in the docs to help point people in the right direction when they search for their "10th char in a UTF-8 string" problems.

Jehan [2015-03-25T18:53:42+01:00]

OderWat: it runs out of index when pos is not valid. I don't see a problem with that?

Well, rune and runeStr return the first Unicode character of the string for a negative index; they return the Unicode null character when pos equals the length of the string (measured in Unicode characters), and result in an error when it's bigger. That isn't consistent in any way, shape, or form (the semantics of runeOffset aren't consistent, either).

I'd also recommend a less generic procedure name than rune.

OderWat: It is more efficient to access "rune at position 10" than the iterator or any other way afais.

I understand that. What I was wondering about is under what circumstances you actually need that feature because I couldn't think of any off-hand.

OderWat [2015-03-25T19:48:38+01:00]

@jehan I see. I only thought of the "too many" case (I was pretty busy with other stuff). The negative index could be handled by using "Natural" for the position. I can fix the other problems later tonight. But I have a feeling that the PR is probably obsolete anyway.

OderWat [2015-03-29T20:56:38+02:00]

I made a lot of adjustments, changed names and used a scheme to optimize this "non-iterator" way of handling UTF-8. Then I added a substr()-like proc as a use case for the rune-position related helper functions.

The updated PR also shows examples of use cases which are not too artificial in my eyes.

I also added a warning (probably too wordy) to the documentation, so that "lazy" people find a solution along with a hint that there are more options to solve their problem.
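To give an idea of what a substr()-like proc over rune positions can look like, here is a rough sketch. It is not the proc from the pull request: the name runeSlice, its parameters and its clamping behaviour for out-of-range arguments are all assumptions.

```nim
import unicode

# Hypothetical sketch, not the proc from the pull request.
# Returns up to `count` runes of `s`, starting at rune position `first`.
# Out-of-range arguments are clamped, so the result may be shorter or empty.
proc runeSlice(s: string; first, count: Natural): string =
  var offset = 0
  var n = 0
  # Skip `first` runes to find the starting byte offset.
  while offset < s.len and n < first:
    offset += runeLenAt(s, offset)
    inc n
  let startByte = min(offset, s.len)
  # Advance over `count` more runes to find the end of the slice.
  n = 0
  while offset < s.len and n < count:
    offset += runeLenAt(s, offset)
    inc n
  let endByte = min(offset, s.len)
  result = substr(s, startByte, endByte - 1)

let s = "az€r™tλγq"
echo runeSlice(s, 2, 3)   # €r™
echo runeSlice(s, 7, 10)  # γq
echo runeSlice(s, 20, 2)  # (empty string)
```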

Mirror of forum.nim-lang.org
