nimforum mirror - Regex and capture unicode text

stbalbach [2021-01-21T19:05:28+01:00]

I'm working with Wikipedia text in various languages and would like to capture text that contains Unicode, for example:

This works (plain ascii):

import re
let t = "{{Cite book|test=}}"
echo $(findBounds(t, re("(*UTF8)[{]{2}Cite book[|][^}]+}}", {}) ))

This does not work (Unicode):

import re
let t = "{{Сite book|ссылка=|автор=Виноградов В. Б., Бараниченко Н. Н.}}"
echo $(findBounds(t, re("(*UTF8)[{]{2}Cite book[|][^}]+}}", {}) ))

No luck with nre either:

import nre
let t = "{{Сite book|ссылка=|автор=Виноградов В. Б., Бараниченко Н. Н.}}"
for found in t.findIter(re("(*UTF8)(?s)[{]{2}Cite book[|][^}]+}}")):
  echo $found

How can I capture Unicode text?

treeform [2021-01-21T21:08:24+01:00]

Use unicode mode in the regex:

re"(*UTF)..."

See: https://forum.nim-lang.org/t/7399#46881
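To illustrate what UTF mode changes, here is a minimal sketch, assuming the standard `re` module backed by PCRE and the `(*UTF8)` verb used in the question. Without the verb the pattern is matched byte by byte, so `.` consumes a single byte; with it, `.` consumes a whole code point. The sample string is an assumption chosen for illustration.

```nim
import std/re

# "я" is one code point but two bytes in UTF-8.
let s = "я"

# Byte mode: `.` matches one byte, so `^.$` cannot cover the whole string.
let byteMode = s.match(re"^.$")

# UTF mode: `.` matches one code point, so `^.$` covers the whole string.
let utfMode = s.match(re"(*UTF8)^.$")

echo "byte mode: ", byteMode
echo "UTF mode:  ", utfMode
```

Since the patterns in the question already start with `(*UTF8)`, UTF mode alone may not be what makes the second sample fail.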

treeform [2021-01-21T21:17:20+01:00]

You might want to perform some sort of Unicode normalization first, to map Unicode "C" to ASCII "C", etc.

Maybe? https://github.com/nitely/nim-normalize
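A minimal sketch of the idea, assuming the first letter of the template name in the failing sample is the Cyrillic capital Es (U+0421) rather than the Latin "C" (U+0043). The standard normalization forms (NFC, NFKC and so on) do not fold Cyrillic letters into Latin ones, so this sketch uses a small hand-written table instead of a normalization library. The table `lookalikes` and the proc `foldLookalikes` are hypothetical names for illustration only.

```nim
import std/[strutils, unicode]

# Hypothetical lookalike table; extend it with whatever pairs your data needs.
const lookalikes = [
  ("\u0421", "C"),  # CYRILLIC CAPITAL LETTER ES -> LATIN CAPITAL LETTER C
  ("\u0441", "c"),  # CYRILLIC SMALL LETTER ES   -> LATIN SMALL LETTER C
]

proc foldLookalikes(s: string): string =
  ## Replaces every lookalike listed in the table with its ASCII counterpart.
  s.multiReplace(lookalikes)

# Template name with an (assumed) Cyrillic first letter.
let templateName = "\u0421ite book"

# Inspect the first code point to see which letter it really is.
echo int32(templateName.runeAt(0))

# After folding, the name compares equal to the plain ASCII spelling.
echo foldLookalikes(templateName) == "Cite book"
```

Folding should probably be limited to the template name: applied to the whole string, it would also rewrite legitimate Cyrillic text in the parameter values. An alternative under the same assumption is to leave the input untouched and accept both letters in the pattern with a character class.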

stbalbach [2021-01-21T23:02:09+01:00]

That's great, thank you. I did not know this. I now understand the issue and the possible solution.

Mirror of forum.nim-lang.org

7409 :: Regex and capture unicode text