nimforum mirror - How to sort UTF-8 string?

lagerratrobe [2020-02-10T21:56:03+01:00]

I'm trying to create sorted lowercase representations of strings and am seeing some unexpected behavior. Below is an example of the code I'm using.


import unicode
import algorithm
import strutils
import encodings


let word = "Ångström"

var lower = unicode.toLower(word)
echo lower
# ångström

var sorted_word = algorithm.sorted(lower, system.cmp)
echo sorted_word
# @['g', 'm', 'n', 'r', 's', 't', '\xA5', '\xB6', '\xC3', '\xC3']

var joined_word = sorted_word.join()
echo joined_word
# gmnrst����

let current = getCurrentEncoding()
echo current
# UTF-8

I would like to preserve some sort of human-readable output, so at the moment I am doing this prior to sorting:


strutils.multiReplace(lower, ("'", ""), ("å", "a"), ("ö", "o"), ("é", "e"))

While this works for the very specific data set I'm using now, you can imagine how brittle this is. Ideally I would like something like Python's Unidecode, but I would be happy with some other alternative that retains the original characters as well.

Thank you.

bluemax75 [2020-02-10T22:26:51+01:00]

https://nim-lang.org/docs/unidecode.html
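
A rough sketch of how that module might be used for the sorting case in the question. The helper name `sortedAsciiKey` is made up for illustration. This assumes the transliteration table is available to the program; depending on the Nim version, that may require calling `loadUnidecodeTable()` first or compiling with `-d:embedUnidecodeTable`, so check the module documentation for the version in use.

```nim
import unicode
import algorithm
import strutils
import unidecode

# Hypothetical helper (not part of any library): lowercase the word,
# transliterate it to ASCII, then sort the resulting characters.
proc sortedAsciiKey(word: string): string =
  let ascii = unidecode(unicode.toLower(word))
  # After transliteration the string is expected to be plain ASCII,
  # so a byte-wise sort no longer splits multi-byte UTF-8 sequences.
  result = algorithm.sorted(ascii, system.cmp[char]).join()

echo sortedAsciiKey("Ångström")
```

Note that this approach discards the original accented characters, so it only fits if an ASCII approximation is acceptable.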

def [2020-02-10T22:32:22+01:00]


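# Compare runes by code point rather than comparing raw UTF-8 bytes.
proc runeCmp(x, y: Rune): int = system.cmp(int(x), int(y))
# `lower` is the lowercased string from the original post.
var sorted_word = $algorithm.sorted(unicode.toRunes(lower), runeCmp)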
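
For context, a Nim `string` is a sequence of bytes, so sorting it directly reorders the individual bytes of multi-byte UTF-8 characters, which is where the stray `\xC3` and `\xA5` values in the question come from. Below is a self-contained version of the same idea as a minimal sketch; the wrapper name `sortedRunes` is made up for illustration.

```nim
import unicode
import algorithm

# Order runes by their numeric code point.
proc runeCmp(x, y: Rune): int = system.cmp(int(x), int(y))

# Hypothetical wrapper: lowercase the word, split it into runes
# (whole code points), sort those, and join them back into a string.
proc sortedRunes(word: string): string =
  let lower = unicode.toLower(word)
  result = $algorithm.sorted(unicode.toRunes(lower), runeCmp)

echo sortedRunes("Ångström")
# gmnrståö
```

This orders characters by code point, which is not the same as language-aware collation, so the position of accented letters may differ from what a particular alphabet expects.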

lagerratrobe [2020-02-10T23:57:58+01:00]

Wow! Amazingly fast and useful feedback. Thank you! I had no idea the unidecode module existed and the _runeCmp()_ example sheds useful light on how things work.

Really appreciate the help.

Mirror of forum.nim-lang.org