Hello,
I am trying to replace the accented letters in a given text with their unaccented counterparts.
What is the best way to write the following C code in Nim?
char p_RemoveAccent(char C)
{
    #define ACCENT_CHARS "ÁÀÃÂÇáàãâçÉÊéêÍíÑÓÔÕñóôõÚÜúü"
    #define UNACCENT_CHARS "AAAACaaaacEEeeIiNOOOnoooUUuu"
    const char *p_Char = memchr(ACCENT_CHARS, C, sizeof(ACCENT_CHARS));
    return (p_Char ? UNACCENT_CHARS[(p_Char - ACCENT_CHARS)] : C);
}
Is there a more optimized way to do that in Nim?
Cheers
I think one could import memchr() into Nim and do the same as in the C version. Besides that, there is the naive implementation:
proc translate(c: char): char =
  const
    a = "abc"
    b = "123"
  let o = a.find(c)
  if o >= 0:
    return b[o]
  else:
    return c

echo translate('a')
echo translate('x')
Beware of UTF-8, though. Your example and system.find() only work if the encoding maps every letter to exactly one char.
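To see the "one letter, one char" caveat in practice, here is a tiny sketch. It assumes the source file is saved as UTF-8 and uses runeLen from the unicode module:

```nim
import unicode

# Assumes this source file is saved as UTF-8.
let s = "Cédille"
echo s.len      # number of bytes (chars) in the string
echo s.runeLen  # number of Unicode code points
```

The first number is larger than the second because 'é' takes two bytes in UTF-8, so a char-by-char lookup sees it as two separate chars.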
Yep, I know. What's the best way to implement your proc translate?
Say I have an input string like "Cédille Français". How can I use the proc with it?
Sorry for my basic questions, but I am pretty new to Nim... Cheers
Well it could work like this:
proc translate(c: char): char =
  const
    a = "abcdef"
    b = "123456"
  let o = a.find(c)
  if o >= 0:
    return b[o]
  else:
    return c

proc main() =
  var s = "that is the abc"
  for c in s.mitems():
    c = c.translate
  echo s

main()
Others may tell you to use map() or similar but I think the following would be a bit more "Nim"-esque:
template translate(c: char): char =
  const
    a = "abcdef"
    b = "123456"
  let o = a.find(c)
  if o >= 0:
    b[o]
  else:
    c

proc main() =
  var s = "that is the abc"
  for c in s.mitems():
    c = c.translate
  echo s

main()
The mitems() iterator will be rewritten into a plain loop. The template inlines the 'translation', and the C compiler will (hopefully) optimise away all the stupidity in what the Nim compiler's code generator produces (look at the generated source in nimcache to see what I mean by that).
But you would still want to rewrite this so that nothing is done if a character doesn't need translation. How that works should be fairly easy to work out from the examples.
OderWat, I guess there is something wrong with the UTF or so. I made the following test...
proc translate(c: char): char =
  const
    a = "ÁÀÃÂÇáàãâçÉÊéêÍíÑÓÔÕñóôõÚÜúü"
    b = "AAAACaaaacEEeeIiNOOOnoooUUuu"
  let o = a.find(c)
  if o >= 0:
    return b[o]
  else:
    return c

proc main() =
  var s = "that is the abc - Cédille Français"
  for c in s.mitems():
    c = c.translate
  echo s

main()
I got the following result: that is the abc - CAUdille FranAOais
What am I missing?
As I said, this and your original C code will not work with UTF-8 at all.
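To make the failure concrete, the following sketch shows what the byte-wise lookup does with a single 'é'. It assumes the source file is saved as UTF-8 and reuses the two strings from the test above:

```nim
const
  a = "ÁÀÃÂÇáàãâçÉÊéêÍíÑÓÔÕñóôõÚÜúü"
  b = "AAAACaaaacEEeeIiNOOOnoooUUuu"

# Iterating a string yields its bytes, so "é" is seen as two separate chars.
for ch in "é":
  let o = a.find(ch)  # byte index of the first matching byte in a
  echo ord(ch), " found at byte index ", o, " -> ", b[o]
```

Each of the two bytes of 'é' matches some byte in a, and the byte index is then used as if it were a letter index into b. That is where the "AU" in "CAUdille" comes from.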
To make it work with UTF-8 you can use the unicode module. But switching to UTF-8, and therefore Unicode, makes all of this much more complex.
For example, your input string will change its length (in bytes), because the accented characters take more than one byte in UTF-8 while their replacements take one.
There are different ways to solve your problem, including some that do not use the unicode module at all and do not even "really" know about UTF-8. One of those may even be the best solution for your task.
The most efficient solution for your exact problem would IMHO be an array of UTF-8 strings which are searched for in your string, using two indices into the string. The first index is the read position, where you check whether one of the array entries matches; the second is the write position in the "resulting" string. You could also just create a new string, but that would be slightly less efficient. If you do, make sure you preallocate the maximum space and call setLen on it afterwards.
The procedure is pretty simple: every time you find a match, you advance the first index by the (raw) byte length of the matched string (aka its rune length) and write the replacement char at the second index. The replacement chars come from a string that is indexed by the position of the match in the array.
If nothing is found, you just copy one char and search again. As the result is never longer than the original string, this works in place, and at the end the second index gives the new length of your string.
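The two-index procedure described above could be sketched roughly as follows. The names accented, plain and removeAccentsInPlace are made up for illustration, and the source file is assumed to be UTF-8; continuesWith comes from strutils:

```nim
import strutils

const
  # Hypothetical names; each entry is one accented letter as a UTF-8 string.
  accented = ["Á", "À", "Ã", "Â", "Ç", "á", "à", "ã", "â", "ç",
              "É", "Ê", "é", "ê", "Í", "í", "Ñ", "Ó", "Ô", "Õ",
              "ñ", "ó", "ô", "õ", "Ú", "Ü", "ú", "ü"]
  # plain[k] is the replacement char for accented[k].
  plain = "AAAACaaaacEEeeIiNOOOnoooUUuu"

proc removeAccentsInPlace(s: var string) =
  var src = 0  # first index: read position
  var dst = 0  # second index: write position
  while src < s.len:
    var matched = false
    for k, acc in accented:
      if s.continuesWith(acc, src):
        s[dst] = plain[k]  # write the one-byte replacement
        src += acc.len     # skip the whole UTF-8 sequence
        matched = true
        break
    if not matched:
      s[dst] = s[src]      # nothing found: copy one char
      inc src
    inc dst
  s.setLen(dst)            # the write position is the new length

var text = "Cédille Français"
removeAccentsInPlace(text)
echo text
```

The linear scan over the array on every byte is the naive part; it is only meant to show how the two indices move.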
It is slightly inefficient to do that scanning byte-wise, because you will also try to match at positions inside other UTF-8 encoded sequences.
To avoid that you could use the same technique as the utf8 iterator uses:
iterator utf8*(s: string): string =
  var o = 0
  while o < s.len:
    let n = runeLenAt(s, o)
    yield s[o.. (o+n-1)] # <- this is what you need to search for and replace with your unaccented chars
    o += n
I hope this is of some help, even without writing down a working version :)
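For readers who want something to start from, one possible way to apply the utf8 iterator technique is sketched below. It builds a new string rather than working in place, and it swaps the array search for a Table lookup, which is my own choice and not part of the suggestion above. The names accentMap, buildMap and removeAccents are hypothetical, and the source file is assumed to be UTF-8:

```nim
import unicode, tables

const
  accented = "ÁÀÃÂÇáàãâçÉÊéêÍíÑÓÔÕñóôõÚÜúü"
  plain = "AAAACaaaacEEeeIiNOOOnoooUUuu"

proc buildMap(): Table[string, char] =
  ## Maps each accented letter (as a UTF-8 string) to its plain char.
  result = initTable[string, char]()
  var k = 0
  for ch in accented.utf8:
    result[ch] = plain[k]
    inc k

let accentMap = buildMap()

proc removeAccents(input: string): string =
  # The result is never longer than the input, so preallocate once.
  result = newStringOfCap(input.len)
  for ch in input.utf8:  # ch is one complete UTF-8 sequence
    if ch in accentMap:
      result.add accentMap[ch]
    else:
      result.add ch

echo removeAccents("Cédille Français")
```

Because the iterator yields whole UTF-8 sequences, a match can never start in the middle of another character.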
Here is how you can do it with premature optimization:
import unicode, strutils

# Builds a lookup table at compile time: index = code point of the accented
# character, value = replacement char ('\0' means "no replacement").
proc translationTable(src, dest: string): array[0x1f00, char] {.compileTime.} =
  for i in 0..<0x1f00: result[i] = '\0'
  var
    srcIndex = 0
    destIndex = 0
  while srcIndex < src.len and destIndex < dest.len:
    let srcCh = src.runeAt(srcIndex)
    if srcCh.int32 >= 0x1f00:
      echo "Cannot translate this character: " & escape(srcCh.toUTF8())
      quit(1)
    let destCh = dest.runeAt(destIndex)
    if destCh.int32 >= 0xff:
      echo "Cannot translate to non-ASCII character: " & escape(destCh.toUTF8())
      quit(1)  # abort here too, as in the source-character check above
    result[srcCh.int] = destCh.char
    srcIndex.inc(src.runeLenAt(srcIndex))
    destIndex.inc(dest.runeLenAt(destIndex))

proc translate(input: string): string =
  const ttable = translationTable("ÁÀÃÂÇáàãâçÉÊéêÍíÑÓÔÕñóôõÚÜúü",
                                  "AAAACaaaacEEeeIiNOOOnoooUUuu")
  result = newStringOfCap(input.len)
  var i = 0
  while i < input.len:
    let r = input.runeAt(i)
    let t = ttable[r.int]
    i.inc(input.runeLenAt(i))
    # Branchless select: keep r if there is no table entry, else use t.
    result.add(toUTF8(Rune(
      (t.int8 == 0).int32 * r.int32 +
      (t.int8 != 0).int32 * t.int32)))

echo translate("Cédille Français")
Points of interest:
Edit: Yeah, actually, the input must not contain any Unicode characters beyond 0x1eff for this to work. Fixing that is left as an exercise for the reader.
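One way that exercise might be approached is to bounds-check the code point before indexing the table. The sketch below is only an illustration: lookup, tableSize and demo are hypothetical names, and it uses an ordinary branch instead of the branchless select from the post:

```nim
import unicode

const tableSize = 0x1f00  # same size as the table in the post above

proc lookup(ttable: array[tableSize, char], r: Rune): Rune =
  ## Hypothetical helper: returns the replacement for r, or r itself when
  ## r lies outside the table or has no entry.
  let code = r.int
  if code >= 0 and code < tableSize and ttable[code] != '\0':
    Rune(ttable[code].ord)
  else:
    r

var demo: array[tableSize, char]  # zero-initialized, so every entry is '\0'
demo[0xE9] = 'e'                  # U+00E9 (é) -> e

echo lookup(demo, "é".runeAt(0)).toUTF8  # inside the table, has an entry
echo lookup(demo, "€".runeAt(0)).toUTF8  # U+20AC is beyond the table
```

Inside translate, the result of such a helper could be passed to toUTF8 in place of the arithmetic on r and t.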