Which Unicode characters can be printed in a Windows console from Python?
The following code
for code in range(1114112):
    print(chr(code), end=",")
gives unimpressive results, including an error:
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
Yet the docs for str
claim values up to 0x110000
are allowed.
Is there a way to get some more characters to display?
To answer your question, we need to look at several layers of Unicode.
Valid Unicode code points run from 0 to U+10FFFF. You can find the category of a Unicode code point with unicodedata.category(char).
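For example (the specific characters are just illustrative, and the exact categories depend on the Unicode database shipped with your Python):
import unicodedata

print(unicodedata.category('A'))           # 'Lu' -- uppercase letter
print(unicodedata.category('\n'))          # 'Cc' -- control character
print(unicodedata.category('\ud800'))      # 'Cs' -- surrogate
print(unicodedata.category('\U0010FFFF'))  # 'Cn' -- unassigned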
The values from U+D800 to U+DFFF are surrogates; they should not be used on their own (lone surrogates cannot be encoded to UTF-8 or UTF-16). [They are used by UTF-16 to extend the old UCS-2 (old Unicode, with code points only up to U+FFFF) to the full range up to U+10FFFF. Old programs/languages (such as JavaScript) may expose the two surrogate code units instead of one code point.]
Note: Python allows them because of the surrogateescape error handler (mostly used to read sys.argv), but you should only use them internally and convert them properly before output.
So do not try to print such code points.
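A small sketch of how lone surrogates behave with CPython's standard codecs and the surrogateescape handler:
# Lone surrogates cannot be encoded with the default error handler...
try:
    '\ud800'.encode('utf-8')
except UnicodeEncodeError as e:
    print(e)                      # ... surrogates not allowed

# ...but surrogateescape smuggles undecodable bytes through str and back.
raw = b'\xff\xfe'                 # not valid UTF-8
text = raw.decode('utf-8', errors='surrogateescape')
print([hex(ord(c)) for c in text])                     # ['0xdcff', '0xdcfe']
print(text.encode('utf-8', errors='surrogateescape'))  # b'\xff\xfe'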
There are also the noncharacters: U+FDD0–U+FDEF, and the code points ending in FFFE or FFFF (i.e. U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, … U+10FFFE, U+10FFFF) [from Wikipedia, Unicode]. These should not be used either, except possibly the BOM (U+FEFF), and in that case only as the first character. Reason: for the first block, see "What's the purpose of the noncharacters U+FDD0 to U+FDEF?"; the others help with encoding autodetection, so we should never see such confusing code points in real text: if you decode one of them, you know that you are using the wrong encoding, and you change encoding until the first code point is valid.
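If you want to filter these out explicitly, a small helper (the name is mine, not a standard function) could look like this:
def is_noncharacter(code):
    # U+FDD0..U+FDEF, plus the last two code points of every plane
    # (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ..., U+10FFFE/U+10FFFF).
    return 0xFDD0 <= code <= 0xFDEF or (code & 0xFFFF) >= 0xFFFE

print(is_noncharacter(0xFFFE))   # True
print(is_noncharacter(0xFEFF))   # False (the BOM is a regular character)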
Now, with unicodedata.category(char), you can also get the category of a code point (see Unicode character categories). Code points up to U+1F, and from U+7F to U+9F, are control characters; do not print them.
There are also formatting characters, which can modify nearby characters.
So you may want to exclude the C* categories (note: this will discard all the characters above) and maybe also the Z* (white space) categories; see the sketch below.
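As a rough sketch of what such a filter keeps (the counts will vary with the Unicode database in your Python):
import unicodedata
from collections import Counter

# Count code points per top-level category (L, M, N, P, S, Z, C).
counts = Counter(unicodedata.category(chr(code))[0] for code in range(0x110000))
print(counts)

printable = sum(1 for code in range(0x110000)
                if unicodedata.category(chr(code))[0] not in 'CZ')
print(printable, 'code points survive the C*/Z* filter')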
So you have the printable characters known to the unicodedata standard module. Use unicodedata.unidata_version to check up to which Unicode version the database is updated. You may possibly want to allow the Cn category (unassigned): some of those code points may have been assigned in a newer Unicode version.
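A quick way to see both pieces of information (the printed values depend on your Python build):
import unicodedata

# Which Unicode version this Python's database was built against.
print(unicodedata.unidata_version)

# How many code points are still unassigned (Cn) in that database.
unassigned = sum(1 for code in range(0x110000)
                 if unicodedata.category(chr(code)) == 'Cn')
print(unassigned, 'code points are Cn (unassigned) in this database')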
But this is not enough. You need a font to display such characters. Google has the Noto fonts (from "no tofu", tofu being the empty box shown for a missing glyph), which are (I think) the most complete.
But this is also not enough. You only get the standard representation of each character (and probably not even that: you should add a U+200C (ZWNJ) after each character to stop fonts from joining adjacent characters, e.g. in the Indic scripts). And you miss all the characters that are represented by a combination of code points: e.g. many accented characters, characters enclosed in circles or squares, country flags (you need two regional indicator characters in the correct order), etc.
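A few examples of such multi-code-point characters (whether they render correctly depends entirely on your console and font):
# Characters built from several code points; the code points are standard,
# the rendering is up to the console and font.
flag = '\U0001F1EE\U0001F1F9'   # regional indicators I + T -> Italian flag
e_acute = 'e\u0301'             # 'e' + COMBINING ACUTE ACCENT
keycap = '1\ufe0f\u20e3'        # digit + variation selector + COMBINING ENCLOSING KEYCAP
print(flag, e_acute, keycap)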
Note: I'm curious about how to get all glyphs from a font file, but that is not your question.
ADDENDUM:
I forgot to say: combining characters cannot be displayed alone, so you need to precede them with a base character, e.g. U+25CC (DOTTED CIRCLE); you can check for them with unicodedata.combining(char).
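For instance (assuming your font can render the mark at all):
import unicodedata

acute = '\u0301'                     # COMBINING ACUTE ACCENT
print(unicodedata.combining(acute))  # 230 -- non-zero means combining mark
print(unicodedata.combining('a'))    # 0 -- not a combining mark
print('\u25cc' + acute)              # show it on a dotted-circle base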
So you may use code like this:
# If your console is not using UTF-8 (or another Unicode encoding) and
# Python does not detect that, you will get garbage.
import unicodedata

dotted_circle = '\u25cc'   # base to make combining marks visible
placeholder = '\ufffd'     # REPLACEMENT CHARACTER, shown for non-printable code points
zwnj = '\u200c'            # ZERO WIDTH NON-JOINER, keeps fonts from joining neighbours

line = ''
for code in range(0x10FFFF + 1):
    c = chr(code)
    cat = unicodedata.category(c)
    if cat.startswith('C'):        # non-printable; add "and cat != 'Cn'" to keep unassigned code points
        r = placeholder
    elif cat.startswith('Z'):      # separators (white space)
        r = ' '
    elif unicodedata.combining(c) > 0:
        r = dotted_circle + c + zwnj
    else:
        r = c + zwnj
    line += r
    if code % 256 == 255:          # flush one line per block of 256 code points
        print(line)
        line = ''
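If the Windows console still trips over the encoding, one option (a sketch, assuming Python 3.7+ where io.TextIOWrapper.reconfigure and UTF-8 mode are available) is to force UTF-8 output with a replacement fallback before printing:
import sys

# Python 3.7+: force UTF-8 on stdout and replace anything the console
# still cannot handle, instead of raising UnicodeEncodeError.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')
Running the script with python -X utf8, or setting PYTHONIOENCODING=utf-8, has a similar effect; this avoids the UnicodeEncodeError, although characters missing from the console font will still show up as boxes.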