I've tried Nim programs in Windows 10 console and found unicode support to have some serious issues.
I've used Nim v1.0.6 with either MinGW-W64 v4.3.0 or MinGW-W64 v8.1.0 as the backend.
Before running a Nim program, chcp returns "Active code page: 737" (that is "OEM Greek").
After running a Nim program, chcp returns "Active code page: 65001" (that is "UTF-8").
So we know for sure that a Nim program successfully sets the code page to "UTF-8".
Now take the following short code as an example:
import strutils
import unicode

let s = stdin.readLine
echo s
for cp in s.utf8:
  echo cp.toHex, ' ', if cp == "\e": "ESC" else: cp
With Latin character input, the code works as expected:
Input:
test
Output:
test
74 t
65 e
73 s
74 t
With Greek character input, the code produces some interesting results:
Input:
τεστ
Output:
00
00
00
00
If SetConsoleCP() and SetConsoleOutputCP() are used to set the code page back to "737", the code more or less works, even though the utf8() iterator then yields single-byte OEM characters rather than UTF-8 code points:
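For reference, a minimal sketch of how those two calls can be made from Nim. The importc bindings below are hand-written (not taken from winlean), so treat them as an illustration rather than the stdlib's way of doing it:

```nim
# Hypothetical sketch: restoring the original OEM code page (737 here)
# before doing any console I/O, using hand-written Win32 bindings.
when defined(windows):
  proc setConsoleCP(codePage: cuint): cint
    {.importc: "SetConsoleCP", stdcall, dynlib: "kernel32".}
  proc setConsoleOutputCP(codePage: cuint): cint
    {.importc: "SetConsoleOutputCP", stdcall, dynlib: "kernel32".}

  # Both return 0 on failure and non-zero on success.
  discard setConsoleCP(737)        # code page used for stdin
  discard setConsoleOutputCP(737)  # code page used for stdout/stderr
```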
Input:
τεστ
Output:
τεστ
AB τ
9C ε
A9 σ
AB τ
On Linux the code works as expected.
What are your suggestions?
If only Nim would be more like Go on this matter.
Back to this subject (and to Nim).
To demotomohiro: Thanks, -d:nimDontSetUtf8CodePage does indeed do half the job, by creating an executable that does not change the Windows console code page.
The other half of the job is to have a piece of software that internally uses UTF-8, so that the same source code works the same in Windows and Linux.
In this context, echo should be overloaded so that source code using it does not have to be changed. Functions in the io module (like readLine) should be left unchanged, since no function used for file I/O should be tampered with.
So I came up with the following solution:
# console.nim
when system.hostOS == "windows":
  import encodings
  import rdstdin
  import strutils

  let consoleEncoding = getCurrentEncoding()
  let consoleEncoder = encodings.open(consoleEncoding, "UTF-8") # stdout
  let consoleDecoder = encodings.open("UTF-8", consoleEncoding) # stdin

  template echo(x: varargs[string, `$`]) {.used.} =
    system.echo consoleEncoder.convert(x.join(""))

  template readLineFromStdin(prompt: string): TaintedString {.used.} =
    consoleDecoder.convert(rdstdin.readLineFromStdin(prompt).string).TaintedString

  template readLineFromStdin(prompt: string; line: var TaintedString): bool {.used.} =
    var ts: TaintedString
    let ok = rdstdin.readLineFromStdin(prompt, ts)
    line = consoleDecoder.convert(ts.string).TaintedString
    ok

  proc readLine(): string {.used.} =
    consoleDecoder.convert(stdin.readLine.string)
else:
  proc readLine(): string {.used.} =
    stdin.readLine.string
else:
proc readLine(): string {.used.} = stdin.readLine.string
and
import strutils
import unicode
include console

for cp in readLine().utf8:
  echo cp.toHex, ' ', if cp == "\e": "ESC" else: cp
console has to be included, not imported, for overloading to work.
Now things work on Windows as they should have from the beginning (that is, like on Linux).
Input:
τεστ
Output:
CF84 τ
CEB5 ε
CF83 σ
CF84 τ
I wonder why things are not already done this way in Nim, since the way things are done does not work anyway.
@ktamp I have the same issue (can't read Unicode input from the Windows console).
I tried your console.nim, but it doesn't work.
I've encountered a very similar issue with Zig: https://github.com/ziglang/zig/issues/5148. Maybe it's relevant?
I wonder why things are not already done this way in Nim, since the way things are done does not work anyway.
The way things are done does work for me, on every computer I tried, that's why. And when we change it to your solution, it won't work for others. Been there, done that.
I suppose you use code-page 949?
I did "chcp 949" in both cmd and Windows PowerShell, ran the above code, pasted the text "안녕" from the Zig issue page as input, and the output was:
EC9588 안
EB8595 녕
Just being curious...
I'm really trying to find a solution that works for everyone.
On every computer I've tried with code page 65001, Greek input cannot be read (every character is replaced by 0x00).
And sources around the internet state that Windows console has many issues with UTF-8 support, essentially supporting only 7 bits in various places.
Please forget the part about language settings for non-Unicode programs.
I've just installed support for Korean, using Microsoft IME and "2 Beolsik". Using code-page 949 and typing "dkssud" (which yields "안녕") I got again:
EC9588 안
EB8595 녕
Thanks for the detailed explanation!
stdin.readLine() # <- Error: unhandled exception: EOF reached [EOFError]
However, this error still occurs when using the fgetws version. (The Win API version works as expected.)
I can't add more characters beyond this.
I think we are here: https://stackoverflow.com/a/5558123
Depending on your OS, the command line input will only accept 8191 characters for XP and 2047 characters for NT and Windows 2000.
Under Windows 10, console input using fgets/fgetws is still restricted to 2046 characters + '\r' + '\0', amounting to 2048 characters in total.
On the other hand, we cannot hard-code 2048 as a limit in fgets/fgetws, because we do not know what the future will bring. And perhaps an executable may still be run under a quite outdated Windows XP installation?
As for readConsoleW, its hard limit is whatever value we pass as numberOfCharsToRead. And the read operation completes at once; no extra calls to readConsoleW are needed.
Should we set numberOfCharsToRead = 2048 to be in unison with fgets/fgetws? Most probably yes...
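To make the comparison concrete, here is a rough sketch of what a readConsoleW-based line reader could look like. The bindings are hand-written importc declarations (parameter names mirror the Win32 signature), the 2048 cap matches the fgets/fgetws limit above, and the ref cast at the end is an unchecked shortcut for the UTF-16-to-UTF-8 conversion; this is an illustration, not the actual implementation under discussion:

```nim
# Hypothetical sketch of a ReadConsoleW-based line reader (Windows only).
when defined(windows):
  proc getStdHandle(nStdHandle: int32): int
    {.importc: "GetStdHandle", stdcall, dynlib: "kernel32".}
  proc readConsoleW(hConsoleInput: int; lpBuffer: pointer;
                    numberOfCharsToRead: int32; numberOfCharsRead: var int32;
                    pInputControl: pointer): int32
    {.importc: "ReadConsoleW", stdcall, dynlib: "kernel32".}

  const STD_INPUT_HANDLE = -10'i32

  proc readLineW(): string =
    # 2048 wide chars + room for a terminating NUL,
    # in unison with the fgets/fgetws limit.
    var buf: array[2049, Utf16Char]
    var charsRead: int32
    let hIn = getStdHandle(STD_INPUT_HANDLE)
    if readConsoleW(hIn, addr buf[0], 2048'i32, charsRead, nil) == 0 or
        charsRead == 0:
      raise newException(EOFError, "EOF reached")
    # Strip the trailing CR/LF, NUL-terminate, then convert the
    # UTF-16 buffer to Nim's native UTF-8 string.
    while charsRead > 0 and int16(buf[charsRead - 1]) in [13'i16, 10'i16]:
      dec charsRead
    buf[charsRead] = Utf16Char(0)
    result = $cast[WideCString](addr buf[0])  # unchecked ptr-to-ref cast
```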
@Araq Do you prefer the fgetws way or the readConsoleW way?
After all it wasn't that difficult to make the readConsoleW version work the feof way.
This means now EOF is propagated between console2.nim and io.nim!
Since WideCStrings are UTF16 encoded and standard strings are UTF8 encoded, I wonder why newWideCString(s) allocates 4 * len(s) + 2 bytes.
The most expensive case would be when each rune in s is standard ASCII. In this case each rune would take up one byte in s but would require two bytes in a WideCString.
And the least expensive case would be when each rune in s takes up four bytes to be represented in UTF8. In this case each rune would again require four bytes in a WideCString.
And len(s) returns the size of s in bytes, not runes. So, allocating more than 2 * len(s) + 2 bytes in newWideCString(s) seems to be a waste of memory.
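The 2 * len(s) + 2 bound can be checked with a small experiment. The utf16Bytes proc below is my own illustration (not Nim's actual allocation code): runes above U+FFFF need a surrogate pair (4 bytes) in UTF-16, all others need 2 bytes:

```nim
# Compare the UTF-8 byte length of a string with its exact UTF-16 byte length.
import unicode

proc utf16Bytes(s: string): int =
  for r in s.runes:
    result += (if r.int32 > 0xFFFF'i32: 4 else: 2)

for s in ["test", "τεστ", "𝕏"]:
  echo s.len, " UTF-8 bytes -> ", s.utf16Bytes, " UTF-16 bytes"
```

In every case the UTF-16 size stays at or below 2 * len(s): "test" goes from 4 to 8 bytes (the ASCII worst case), "τεστ" from 8 to 8, and "𝕏" (a 4-byte rune) from 4 to 4, which matches the argument above.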
Directly accessing stdin->_flag in console2.nim in order to check/set EOF may not be such a good idea when multithreading is involved.
For checking EOF, feof can be used; io.endOfFile does not work when stdin is a terminal, because it is based on fgetc and waits for input + ENTER.
And, for setting EOF, a simple close(stdin) does the trick.
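For completeness, the feof/close combination looks roughly like this; c_feof is my own importc binding (Nim's File is a C FILE* underneath), so take it as a sketch of the approach rather than the code in console2.nim:

```nim
# Check EOF without consuming input, by calling C's feof directly.
proc c_feof(f: File): cint {.importc: "feof", header: "<stdio.h>".}

if c_feof(stdin) != 0:
  echo "stdin is already at EOF"

# Setting EOF: after closing stdin, every later read reports EOF.
close(stdin)
```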
Anyway, both console.nim and console2.nim have been updated.
Both have been simplified as far as possible. And, as far as I can tell, they are both Nim v2 ready.
Also, in accordance with setting EOF by closing stdin in console2.nim, the clearError procedure has been removed.
PR seems to be getting close...