nimforum mirror - Nim code to Remove Accented Letters

alfrednewman (orginal) [2016-10-02T23:10:41+02:00] view original

Hello,

I am trying to replace the accented letters in a given text with their unaccented counterparts.

What is the best way to write the following C code in Nim?

char p_RemoveAccent(char C)
{
    #define ACCENT_CHARS    "ÁÀÃÂÇáàãâçÉÊéêÍíÑÓÔÕñóôõÚÜúü"
    #define UNACCENT_CHARS  "AAAACaaaacEEeeIiNOOOnoooUUuu"
    
    const char *p_Char = memchr(ACCENT_CHARS, C, sizeof(ACCENT_CHARS));
    
    return (p_Char ? UNACCENT_CHARS[(p_Char - ACCENT_CHARS)] : C);
}

Is there another, more optimized way to do this in Nim?

Cheers

OderWat (orginal) [2016-10-03T00:00:40+02:00] view original

I think one could import memchr() into Nim and do the same as in the C version. Besides that, there is the naive implementation:

proc translate(c: char): char =
  const
    a = "abc"
    b = "123"
  
  let o = a.find(c)
  if o >= 0:
    return b[o]
  else:
    return c

echo translate('a')
echo translate('x')

Beware of UTF-8, though. Your example and system.find() only work if the encoding maps every letter to a single char.
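To make the caveat concrete, here is a minimal sketch. It assumes the source file is saved as UTF-8, so that string literals hold UTF-8 bytes; the variable names are only for illustration.

```nim
# Assumption: this file is saved as UTF-8.
let ascii = "e"
let accented = "é"

echo ascii.len      # 1 byte
echo accented.len   # 2 bytes, although it is a single letter

# Iterating a string yields bytes (chars), not letters.
for c in accented:
  echo ord(c)       # 195, then 169
```

A char-based lookup therefore sees two separate chars for "é" and can never match the letter as a whole.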

alfrednewman (orginal) [2016-10-03T16:19:01+02:00] view original

Yep, I know. What's the best way to use your proc translate?

Say I have an input string like "Cédille Français". How can I use the proc with it?

Sorry for my basic questions, but I am pretty new to Nim... Cheers

OderWat (orginal) [2016-10-03T17:02:53+02:00] view original

Well it could work like this:

proc translate(c: char): char =
  const
    a = "abcdef"
    b = "123456"
  
  let o = a.find(c)
  if o >= 0:
    return b[o]
  else:
    return c

proc main() =
  var s = "that is the abc"
  for c in s.mitems():
    c = c.translate
  
  echo s

main()

Others may tell you to use map() or similar but I think the following would be a bit more "Nim"-esque:

template translate(c: char): char =
  const
    a = "abcdef"
    b = "123456"
  
  let o = a.find(c)
  if o >= 0:
    b[o]
  else:
    c

proc main() =
  var s = "that is the abc"
  for c in s.mitems():
    c = c.translate
  
  echo s

main()

The mitems() iterator is inlined into a plain loop. The template inlines the 'translation', and the C compiler will (hopefully) optimise away the clumsy code that Nim's code generator produces (look at the generated source in nimcache to see what I mean).

But you would still want to rewrite this so that nothing is written when a character does not need a translation. How that works should be fairly easy to work out from the examples.
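One possible way to do that, as a sketch only: move the lookup into the loop and assign only when a replacement exists. The proc name translateInPlace is made up for this example; the tables are the ones from the code above.

```nim
import strutils

proc translateInPlace(s: var string) =
  const
    a = "abcdef"
    b = "123456"
  for c in s.mitems():
    let o = a.find(c)
    if o >= 0:      # write only when a replacement exists
      c = b[o]

var s = "that is the abc"
translateInPlace(s)
echo s  # th1t is th5 123
```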

alfrednewman (orginal) [2016-10-03T17:24:55+02:00] view original

OderWat, I guess there is something wrong with the UTF-8 handling. I made the following test...

proc translate(c: char): char =
  const
    a = "ÁÀÃÂÇáàãâçÉÊéêÍíÑÓÔÕñóôõÚÜúü"
    b = "AAAACaaaacEEeeIiNOOOnoooUUuu"
  
  let o = a.find(c)
  if o >= 0:
    return b[o]
  else:
    return c

proc main() =
  var s = "that is the abc - Cédille Français"
  for c in s.mitems():
    c = c.translate
  
  echo s

main()

I got the following result: that is the abc - CAUdille FranAOais

What am I missing?

OderWat (orginal) [2016-10-03T18:25:55+02:00] view original

As I said, this and your original C code will not work with UTF-8 at all.
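A small sketch of where "CAUdille" comes from, assuming the source file is saved as UTF-8: every accented letter in the table occupies two bytes, so the byte index returned by find() no longer lines up with the replacement string.

```nim
import strutils

const
  a = "ÁÀÃÂÇáàãâçÉÊéêÍíÑÓÔÕñóôõÚÜúü"
  b = "AAAACaaaacEEeeIiNOOOnoooUUuu"

echo a.len  # 56 bytes for 28 letters
echo b.len  # 28

let e = "é"            # two bytes: 0xC3 0xA9
let i0 = a.find(e[0])  # 0, because every letter in a starts with 0xC3
let i1 = a.find(e[1])  # 25, the second byte of the "é" inside a
echo b[i0], b[i1]      # AU
```

Each byte of "é" is translated on its own, which is why one letter turns into the two chars "AU".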

To make it work for UTF-8 you can use the unicode.nim module. But switching to UTF-8, and therefore Unicode, will make all of that much more complex.

For example, your input string will change its length (byte-wise), because each accented character takes more than one byte in UTF-8 while its replacement takes only one.

There are different ways to solve your problem, including some that do not use the unicode module at all and do not even "really" know about UTF-8, which may even be the best solution for your task.

The most efficient solution for your exact problem would IMHO be an array of UTF-8 strings that are searched for in your string, using two indices into the string. The first index is the read position, where you check whether one of the array entries matches; the second is the write position in the resulting string. You could also just create a new string, but that would be slightly less efficient. If you do, make sure you preallocate the maximum space and call setLen on it afterwards.

The procedure is pretty simple: every time you find a match, you advance the first index by the (raw) byte length of the matched string (aka its RuneLen) and write the replacement char at the second index. The replacement chars come from a string that is indexed by the matched entry's position in the array.

If nothing is found, you just copy one char and search again. As the result is never longer than the original string, this works in place and ends with the second index giving the new length of your string.

Scanning byte-wise like this is slightly inefficient, because you also search for matches starting inside other UTF-8 encoded sequences.
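The two-index procedure described above could be sketched roughly like this. It is one possible reading of the description, not a tuned implementation; the names accented, plain and removeAccents are made up for the example, and the source file is assumed to be UTF-8.

```nim
import strutils

const
  # UTF-8 strings to search for, in the same order as the replacement chars.
  accented = ["Á", "À", "Ã", "Â", "Ç", "á", "à", "ã", "â", "ç",
              "É", "Ê", "é", "ê", "Í", "í", "Ñ", "Ó", "Ô", "Õ",
              "ñ", "ó", "ô", "õ", "Ú", "Ü", "ú", "ü"]
  plain = "AAAACaaaacEEeeIiNOOOnoooUUuu"

proc removeAccents(s: var string) =
  var r = 0  # first index: read position
  var w = 0  # second index: write position, never ahead of r
  while r < s.len:
    var matched = false
    for i, acc in accented:
      if s.continuesWith(acc, r):
        s[w] = plain[i]   # write the one-byte replacement
        r += acc.len      # skip the raw bytes of the match
        matched = true
        break
    if not matched:
      s[w] = s[r]         # nothing found: copy one char
      inc r
    inc w
  s.setLen(w)             # the write index is the new length

var s = "Cédille Français"
removeAccents(s)
echo s  # Cedille Francais
```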

To avoid that you could use the same technique as the utf8 iterator uses:

iterator utf8*(s: string): string =
  var o = 0
  while o < s.len:
    let n = runeLenAt(s, o)
    yield s[o .. (o+n-1)] # <- this is what you need to search for and replace with your unaccented chars
    o += n

I hope this is of some help, even without writing down a working version :)
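For readers who want a starting point, here is a rough sketch of the iterator-based idea. The names utf8Chunks, unaccent and removeAccents are invented for this example, the case statement is just one of several possible lookups, and the input is assumed to be valid UTF-8.

```nim
import unicode

# Same technique as the iterator above, under a different (made-up) name.
iterator utf8Chunks(s: string): string =
  var o = 0
  while o < s.len:
    let n = runeLenAt(s, o)
    yield s[o ..< o + n]
    o += n

# Maps one UTF-8 encoded character to its unaccented form.
proc unaccent(ch: string): string =
  case ch
  of "Á", "À", "Ã", "Â": "A"
  of "á", "à", "ã", "â": "a"
  of "Ç": "C"
  of "ç": "c"
  of "É", "Ê": "E"
  of "é", "ê": "e"
  of "Í": "I"
  of "í": "i"
  of "Ñ": "N"
  of "ñ": "n"
  of "Ó", "Ô", "Õ": "O"
  of "ó", "ô", "õ": "o"
  of "Ú", "Ü": "U"
  of "ú", "ü": "u"
  else: ch

proc removeAccents(s: string): string =
  result = newStringOfCap(s.len)  # the result is never longer than the input
  for ch in s.utf8Chunks:
    result.add unaccent(ch)

echo removeAccents("Cédille Français")  # Cedille Francais
```

This version builds a new string instead of rewriting in place, which trades a little efficiency for simpler code.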

flyx (orginal) [2016-10-03T19:44:30+02:00] view original

Here is how you can do it with premature optimization:

import unicode, strutils

proc translationTable(src, dest: string): array[0x1f00, char] {.compileTime.} =
  for i in 0..<0x1f00: result[i] = '\0'
  var
    srcIndex = 0
    destIndex = 0
  while srcIndex < src.len and destIndex < dest.len:
    let srcCh = src.runeAt(srcIndex)
    if srcCh.int32 >= 0x1f00:
      echo "Cannot translate this character: " & escape(srcCh.toUTF8())
      quit(1)
    let destCh = dest.runeAt(destIndex)
    if destCh.int32 >= 0x80:
      echo "Cannot translate to non-ASCII character: " & escape(destCh.toUTF8())
      quit(1)
    result[srcCh.int] = destCh.char
    srcIndex.inc(src.runeLenAt(srcIndex))
    destIndex.inc(dest.runeLenAt(destIndex))

proc translate(input: string): string =
  const ttable = translationTable("ÁÀÃÂÇáàãâçÉÊéêÍíÑÓÔÕñóôõÚÜúü",
                                  "AAAACaaaacEEeeIiNOOOnoooUUuu")
  result = newStringOfCap(input.len)
  var i = 0
  while i < input.len:
    let r = input.runeAt(i)
    let t = ttable[r.int]
    i.inc(input.runeLenAt(i))
    result.add(toUTF8(Rune(
        (t.int8 == 0).int32 * r.int32 +
        (t.int8 != 0).int32 * t.int32)))

echo translate("Cédille Français")

Points of interest:

Uses a table that translates Unicode characters up to 0x1eff to ASCII characters. This covers everything up to Latin Extended Additional, and I don't believe there are characters beyond that which can be meaningfully translated into ASCII characters (but YMMV).

Table is constructed at compile time and is not sparse, so it occupies some 8KB regardless of how many characters you want to translate. On the other hand, there is no contains check necessary which makes the code faster. This is probably only useful if you use the translation code many times during one run.

Implementation uses no branching except for the loop, so it's very fast.

Edit: Yeah, actually, the input must not contain any Unicode characters beyond 0x1eff for this to work. Fixing that is left as an exercise for the reader.
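One possible way to approach that exercise, as a hedged sketch: check the code point against the table size before indexing, and pass everything else through unchanged. To keep the example short it gives up the branch-free trick and builds the table at program start rather than at compile time; tableSize, buildTable and the module-level ttable are names made up here.

```nim
import unicode

const tableSize = 0x1f00

# Assumption: dest is pure ASCII and has one char per rune in src.
proc buildTable(src, dest: string): array[tableSize, char] =
  var i = 0
  for r in src.runes:
    result[int(r.int32)] = dest[i]
    inc i

# Built once at startup in this sketch (the version above uses a const).
let ttable = buildTable("ÁÀÃÂÇáàãâçÉÊéêÍíÑÓÔÕñóôõÚÜúü",
                        "AAAACaaaacEEeeIiNOOOnoooUUuu")

proc translate(input: string): string =
  result = newStringOfCap(input.len)
  for r in input.runes:
    let code = int(r.int32)
    if code < tableSize and ttable[code] != '\0':
      result.add ttable[code]   # known character: use the ASCII replacement
    else:
      result.add r.toUTF8       # out of range or unmapped: copy unchanged

echo translate("Cédille Français €")  # Cedille Francais €
```

The euro sign (U+20AC) lies beyond 0x1eff and is simply copied through.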

alfrednewman (orginal) [2016-10-03T21:33:55+02:00] view original

@all, thanks a lot.

Mirror of forum.nim-lang.org

2563 :: Nim code to Remove Accented Letters