
Can DuckDB produce proper UTF-8 output in duckbox mode in the Windows console?


Edit:
Since the question is off-topic here, I flagged it for closing/migration to SuperUser. It has not been migrated so far, so I recreated the question and answer at SuperUser.


I cannot get non-ASCII characters to display properly in the DuckDB console, even when the console application supports UTF-8. I have a sample CSV file, encoded in UTF-8, containing a few test strings:

Language Code,Greeting
pl,Cześć
de,Grüß dich
el,Γειά σου
ru,Привет
ar,مرحبا
he,שלום
ja,こんにちは
ko,안녕하세요
hi,नमस्ते

After starting duckdb.exe from the Windows console (chcp reports code page 852), I run the query

SELECT * FROM read_csv('hello_in_languages.csv');

and the response (in duckbox output mode) shows the non-Latin characters as question marks, which is expected with this code page:

┌───────────────┬────────────┐
│ Language Code │  Greeting  │
│    varchar    │  varchar   │
├───────────────┼────────────┤
│ pl            │ Cześć      │
│ de            │ Grüß dich  │
│ el            │ ???? ???   │
│ ru            │ ??????     │
│ ar            │ ?????      │
│ he            │ ????       │
│ ja            │ ????? │
│ ko            │ ????? │
│ hi            │ ??????       │
├───────────────┴────────────┤
│ 9 rows           2 columns │
└────────────────────────────┘
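
The question marks are consistent with the text being converted to code page 852, which covers the Polish and German letters but not Greek, Cyrillic, or the other scripts. The following Python sketch reproduces that pattern. It assumes the conversion substitutes "?" for every unmappable character; that is an assumption about what happens on the way to the console, not confirmed DuckDB behavior, and the helper name to_code_page is made up for illustration.

```python
# Hypothetical helper: round-trip text through a legacy code page.
# Assumption: unmappable characters are replaced with "?".
def to_code_page(text: str, code_page: str = "cp852") -> str:
    encoded = text.encode(code_page, errors="replace")
    return encoded.decode(code_page)


polish = "Cze\u015b\u0107"  # the Polish greeting from the CSV
greek = "\u0393\u03b5\u03b9\u03ac \u03c3\u03bf\u03c5"  # the Greek greeting

# Code page 852 contains the Polish letters, so the text survives intact.
print(to_code_page(polish) == polish)  # prints True

# Greek letters are not in code page 852, so each one becomes "?".
print(to_code_page(greek))  # prints ???? ???
```

One "?" appears per code point, which matches the table above: four and three for the Greek greeting, five for the Japanese one.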

Then I switch the shell's code page to UTF-8 using the Windows command chcp 65001:

.shell chcp 65001

and I see a different issue:

����������������������������Ŀ
� Language Code �  Greeting  �
�    varchar    �  varchar   �
����������������������������Ĵ
� pl            � Cze��      �
� de            � Gr�� dich  �
� el            � ???? ???   �
� ru            � ??????     �
� ar            � ?????      �
� he            � ????       �
� ja            � ????? �
� ko            � ????? �
� hi            � ??????       �
����������������������������Ĵ
� 9 rows           2 columns �
������������������������������
Language Code,Greeting
pl,Cześć
de,Grüß dich
el,Γειά σου
ru,Привет
ar,مرحبا
he,שלום
ja,こんにちは
ko,안녕하세요
hi,नमस्ते
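
The garbled table looks like bytes that are still encoded in code page 852 being interpreted by the console as UTF-8. The Python sketch below checks this reading against three fragments of the output. It is only a diagnostic under that assumption (the helper name misread_as_utf8 is made up), and it says nothing about where in DuckDB the conversion happens.

```python
# Hypothetical helper: encode text in a legacy code page, then decode
# the resulting bytes as if they were UTF-8 (invalid bytes become U+FFFD).
def misread_as_utf8(text: str, code_page: str = "cp852") -> str:
    raw = text.encode(code_page)
    return raw.decode("utf-8", errors="replace")


# Horizontal line + top-right corner: bytes C4 BF form valid UTF-8 for U+013F.
print(misread_as_utf8("\u2500\u2510") == "\u013f")  # prints True

# Horizontal line + right T-junction: bytes C4 B4 form valid UTF-8 for U+0134.
print(misread_as_utf8("\u2500\u2524") == "\u0134")  # prints True

# The two Polish letters become stray bytes 98 86, each shown as U+FFFD.
print(misread_as_utf8("Cze\u015b\u0107") == "Cze\ufffd\ufffd")  # prints True
```

This accounts for the stray Ŀ and Ĵ at the right edge of the table borders and for the two replacement characters in the Polish row, while the "?" characters remain because they were already substituted before the bytes reached the console.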

Does this mean that DuckDB cannot work properly with UTF-8 data through the console? Or is there a fix?


Solution

  • Issue the .output command.

    Even though .output without parameters (output to stdout) is the default, its settings seem to be improperly initialized in DuckDB v1.4.2. Issuing the command explicitly fixes the issue.

    Then DuckDB prints UTF-8 as expected:

    .output
    select * from read_csv('hello_in_languages.csv');
    ┌───────────────┬────────────┐
    │ Language Code │  Greeting  │
    │    varchar    │  varchar   │
    ├───────────────┼────────────┤
    │ pl            │ Cześć      │
    │ de            │ Grüß dich  │
    │ el            │ Γειά σου   │
    │ ru            │ Привет     │
    │ ar            │ مرحبا      │
    │ he            │ שלום       │
    │ ja            │ こんにちは │
    │ ko            │ 안녕하세요 │
    │ hi            │ नमस्ते       │
    ├───────────────┴────────────┤
    │ 9 rows           2 columns │
    └────────────────────────────┘
    

    Note:

    During earlier testing, I found that the .binary command can also have an effect on this: .binary on seemed to fix the issue. In a later round of testing, however, the .binary setting did not seem to matter, and the .output command alone was sufficient. I mention .binary only for the record, in case someone runs into the same issue.