I'm trying to create sorted lowercase representations of strings and am seeing some unexpected behavior. Below is an example of the code I'm using.
import unicode
import algorithm
import strutils
import encodings
let word = "Ångström"
var lower = unicode.toLower(word)
echo lower
# ångström
var sorted_word = algorithm.sorted(lower, system.cmp)
echo sorted_word
# @['g', 'm', 'n', 'r', 's', 't', '\xA5', '\xB6', '\xC3', '\xC3']
var joined_word = sorted_word.join()
echo joined_word
# gmnrst����
let current = getCurrentEncoding()
echo current
# UTF-8
I would like to preserve some sort of human-readable output, so at the moment I am doing this prior to sorting:
strutils.multiReplace(lower, ("'", ""), ("å", "a"), ("ö", "o"), ("é", "e"))
While this works for the very specific data set I'm using now, you can imagine how brittle this is. Ideally I would like something like the Python Unidecode, but would be happy with some other alternative that retains the original characters as well.
Thank you.
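# Compare two runes by their Unicode code point.
proc runeCmp(x, y: Rune): int = system.cmp(int(x), int(y))
# Sort runes (whole code points) rather than bytes, then convert back to a string.
var sorted_word = $algorithm.sorted(unicode.toRunes(lower), runeCmp)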
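For completeness, here is a minimal self-contained sketch of the same idea. Sorting the string directly compares individual bytes, which splits the two-byte UTF-8 sequences for å and ö; converting to runes first keeps each character whole. The helper name sortedLower is made up for this example and is not part of the standard library. Note that this orders by code point, not by any locale-aware collation, so accented letters land after the plain ASCII ones.

```nim
import algorithm, unicode

# Compare two runes by their Unicode code point.
proc runeCmp(x, y: Rune): int = cmp(int32(x), int32(y))

# Hypothetical helper: lowercase a UTF-8 string, then sort it rune by rune
# so multi-byte characters are never split apart.
proc sortedLower(word: string): string =
  let runes = toRunes(toLower(word))
  result = $sorted(runes, runeCmp)

echo sortedLower("Ångström")
# gmnrståö
```

If you need a different ordering (for example, å sorted next to a), runeCmp is the place to change it.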
Wow! Amazingly fast and useful feedback. Thank you! I had no idea the unidecode module existed and the _runeCmp()_ example sheds useful light on how things work.
Really appreciate the help.