nimforum mirror - Unicode support for Windows 10 console

ktamp (orginal) [2020-02-01T14:16:56+01:00] view original

I've tried Nim programs in Windows 10 console and found unicode support to have some serious issues.

I've used Nim v1.0.6 with either MinGW-W64 v4.3.0 or MinGW-W64 v8.1.0 as the backend.

Before running a Nim program, chcp returns "Active code page: 737" (that is "OEM Greek").
After running a Nim program, chcp returns "Active code page: 65001" (that is "UTF-8").
So we know for sure that a Nim program successfully sets the code page to "UTF-8".

Now take the following short code as en example:

import strutils
import unicode

let s = stdin.readLine
echo s
for cp in s.utf8:
  echo cp.toHex, ' ', if cp == "\e": "ESC" else: cp

With Latin character input, the code works as expected:

Input:

test

Output:

test
74 t
65 e
73 s
74 t

With Greek character input, the code produces some interesting results:

Input:

τεστ

Output:

If SetConsoleCP() and SetConsoleOutputCP() are used to set the code page to "737", the code works somehow, even though the utf8() iterator does not really return UTF-8 code points:

Input:

τεστ

Output:

τεστ
AB τ
9C ε
A9 σ
AB τ

On Linux the code works as expected.

What are your suggestions?

demotomohiro (orginal) [2020-02-02T15:53:34+01:00] view original

> On Windows console applications the code-page is set at program startup to UTF-8. Use the new switch -d:nimDontSetUtf8CodePage to disable this feature. https://github.com/nim-lang/Nim/blob/devel/changelogs/changelog_1_0_0.md

ktamp (orginal) [2020-02-02T16:32:43+01:00] view original

Any chance of having working UTF-8 support?
If it is enabled by default, at least it should be working.

ktamp (orginal) [2020-02-04T21:31:25+01:00] view original

I did some of my own testing using plain old C (GCC) to get more insight.

Windows 10 console really likes UTF-16 encoded text. It simply cannot get UTF-8 input.

Forcing a program to get UTF-8 input [ _setmode(_fileno(stdin), _O_U8TEXT) ] combined with UTF-8 code page results in NUL characters, as is also the case with Nim.

Enabling the experimental UTF-8 support in Regional Settings -> Advanced does not help.

On the other hand UTF-16 input and UTF-8 output can be enabled without touching the code page, but then strings would really be UTF-16 encoded.

If only Nim would be more like Go on this matter.

ktamp (orginal) [2020-05-01T18:46:04+02:00] view original

Back to this subject (and to Nim).

To demotomohiro: Thanks, -d:nimDontSetUtf8CodePage does indeed half the job needed, by creating an executable that does not change Windows console code-page.

The other half of the job is to have a piece of software that internally uses UTF-8, so that the same source code works the same in Windows and Linux.

In this context echo should be overloaded, so that source code using it does not have to be changed, but io module functions (like readLine) should be left unchanged, since no function used for file I/O should be tampered with.

So I came up with the following solution:

#console.nim

when system.hostOS == "windows":
  import encodings
  import rdstdin
  import strutils
  
  let consoleEncoding = getCurrentEncoding()
  let consoleEncoder = encodings.open(consoleEncoding, "UTF-8")  #stdout
  let consoleDecoder = encodings.open("UTF-8", consoleEncoding)  #stdin
  
  template echo(x: varargs[string, `$`]) {.used.} =
    echo consoleEncoder.convert(x.join(""))
  
  template readLineFromStdin(prompt: string): TaintedString {.used.} =
    consoleDecoder.convert(readLineFromStdin(prompt).string).TaintedString
  
  template readLineFromStdin(prompt: string; line: var TaintedString): bool {.used.} =
    var ts: TaintedString
    result = readLineFromStdin(prompt, ts)
    line = consoleDecoder.convert(ts.string).TaintedString
  
  proc readLine(): string {.used.} = consoleDecoder.convert(stdin.readLine.string)
else:
  proc readLine(): string {.used.} = stdin.readLine.string

and

import strutils
import unicode
include console

for cp in readLine().utf8:
  echo cp.toHex, ' ', if cp == "\e": "ESC" else: cp

console has to be included, not imported, for overloading to work.

Now things work on Windows as they should have had from the beginning (that is like on Linux).

Input:

τεστ

Output:

CF84 τ
CEB5 ε
CF83 σ
CF84 τ

I wonder why things are not already done this way in Nim, since the way things are done does not work anyway.

doongjohn (orginal) [2020-05-03T17:48:19+02:00] view original

@ktamp I have a same issue. (can't read unicode input from windows console)

I used your console.nim but it doesn't work.

i've encountered a very similar issue with zig. https://github.com/ziglang/zig/issues/5148 maybe it's relevant?

Araq (orginal) [2020-05-03T18:01:29+02:00] view original

I wonder why things are not already done this way in Nim, since the way things are done does not work anyway.

The way things are done does work for me, on every computer I tried, that's why. And when we change it to your solution, it won't work for others. Been there, done that.

ktamp (orginal) [2020-05-03T20:32:09+02:00] view original

I suppose you use code-page 949?

I did "chcp 949" in both cmd and Windows PowerShell, run the above code, pasted text "안녕" from zig issue page as input and the output was:

EC9588 안
EB8595 녕

Just being curious...

What is your "Current language for non-Unicode programs" in Windows? (Settings -> Time & Language -> Date, time, & regional formatting -> Additional date, time , & regional settings -> Change date, time, or number formats -> Administrative)

I hope you do not have "Beta: Use Unicode UTF-8 for worldwide language support" enabled (also in Administrative tab).

After running nim to compile your code, did you close and reopen cmd? Because running nim sets code-page to 65001.

I'm really trying to find a solution that works for everyone.
Every computer I've tried with 65001 cannot get input in Greek (every character is replaced by 0x00).
And sources around the internet state that Windows console has many issues with UTF-8 support, essentially supporting only 7 bits in various places.

ktamp (orginal) [2020-05-03T20:38:36+02:00] view original

What is your "Current language for non-Unicode programs" in Windows?
I suppose it is German, but want a confirmation to do some more investigations here.

ktamp (orginal) [2020-05-03T23:41:42+02:00] view original

Please forget the part about language settings for non-Unicode programs.

I've just installed support for Korean, using Microsoft IME and "2 Beolsik". Using code-page 949 and typing "dkssud" (which yields "안녕") I got again:

EC9588 안
EB8595 녕

doongjohn (orginal) [2020-05-31T12:33:08+02:00] view original

Thanks for the detailed explanation!

stdin.readLine() # <- Error: unhandled exception: EOF reached [EOFError]

However this error still occurs when using the fgetws version. (win api version works as expected.)

ktamp (orginal) [2020-05-31T13:22:08+02:00] view original

I replaced _O_BINARY with _O_TEXT in the fgetws version. Everything should be fine now.

ktamp (orginal) [2020-06-08T20:12:26+02:00] view original

Hello again. Both console.nim and console2.nim above have been updated to the latest version.

Input of strings of any length is supported. This also means that there is no chance of broken input when surrogate pairs of UTF-16 characters are needed to describe a glyph.

A clearError procedure has been added.

console2.nim now properly handles Ctrl+Z.

doongjohn (orginal) [2020-06-13T17:38:09+02:00] view original

> Input of strings of any length is supported.

I can't add more characters beyond this.

ktamp (orginal) [2020-06-15T00:13:46+02:00] view original

I think we are here: https://stackoverflow.com/a/5558123

Depending on your OS, the command line input will only accept 8191 characters for XP and 2047 characters for NT and Windows 2000.

Under Windows 10, console input using fgets/fgetws is still restricted to 2046 characters + '\r' + '\0', amounting to 2048 characters in total.

On the other hand we cannot set 2048 as a hard limit in fgets/fgetws because we do not know what the future will bring. And perhaps there is still a possibility for an executable to be run under a quite outdated Windows XP installation?

As for readConsoleW its hard limit is set by us to whatever value we put in numberOfCharsToRead. And the read operation is completed at once; no extra calls to readConsoleW are useful.

Should we set numberOfCharsToRead = 2048 to be in unison with fgets/fgetws? Most probably yes...

Araq (orginal) [2020-06-15T09:26:24+02:00] view original

Nice to see progress here. Maybe we can improve the stdlib's solution too, PRs are welcome.

ktamp (orginal) [2020-06-15T23:43:36+02:00] view original

Codes have been updated again.

In console.nim initial buffer length was set to 2048 Utf16Chars in order to avoid unnecessary relocations.

In console2.nim, consoleReadLine was trimmed down to what readConsoleW actually supports.

@Araq Do you prefer the fgetws way or the readConsoleW way?

The first one fits in with standard C library, but is a little messy.

The second one is cleaner, but adds an extra variable to mark EOF and does not fit in well with the feof way, so it may not be a good candidate if the relevant functionality is to be merged with that of readLine.

ktamp (orginal) [2020-06-16T22:40:36+02:00] view original

After all it wasn't that difficult to make the readConsoleW version work the feof way.

This means now EOF is propagated between console2.nim and io.nim!

ktamp (orginal) [2020-06-20T22:19:12+02:00] view original

Since WideCStrings are UTF16 encoded and standard strings are UTF8 encoded, I wonder why newWideCString(s) allocates 4 * len(s) + 2 bytes.

The most expensive case would be when each rune is s is standard ASCII. In this case each rune would take up one byte in s but would require two bytes in a WideCString.

And the least expensive case would be when each rune in s takes up four bytes to be represented in UTF8. In this case each rune would again require four bytes in a WideCString.

And len(s) returns the size of s in bytes, not runes. So, allocating more than 2 * len(s) + 2 bytes in newWideCString(s) seems to be a waste of memory.

Accessing directly stdin->_flag in console2.nim in order to check/set EOF may not be such a good idea when multithreading is involved.

For checking EOF, feof can be used; io.endOfFile does not work when stdin is a terminal, because it is based on fgetc and waits for input + ENTER.

And, for setting EOF, a simple close(stdin) does the trick.

Anyway, both console.nim and console2.nim have been updated.

Both have been simplified as far as possible. And, as far as I can tell, they are both Nim v2 ready.

Also, in accordance to setting EOF by closing stdin in console2.nim, clearError procedure has been removed.

PR seems to be getting close...

Mirror of forum.nim-lang.org

5874 :: Unicode support for Windows 10 console