c text character-encoding

Combining two code pages in one program in C


I am making a program that reads from a file that has characters from two different alphabets (Cyrillic and German). However, when printed to the terminal, ö, ä and ü come out as ?.

So far, I have tried:

Is there any way for the program to read the characters from both alphabets? Is there some 'mixed code page' I have missed out on?

Code:

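#include <stdio.h>
#include <stdlib.h>

void readG(void){
    char str[10000];
    FILE *fptr;

    system("cls");

    // open the file in read mode
    fptr = fopen("C:\\Users\\pl\\projects\\sources\\lernwortschatz.txt", "r");
    if (fptr == NULL) {
        // report why the file could not be opened and bail out
        perror("fopen");
        return;
    }

    // print title
    printf("LERNWORTSCHATZ\n");
    printf("A1\n");
    printf("-------------------------------------------------\n");

    // read and print the file's contents, one line at a time
    while(fgets(str, sizeof str, fptr))
    {
        printf("%s", str);
    }

    // close the file
    fclose(fptr);
}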

Example:

What I am trying to do with the file: I want to get all of its contents and then immediately print it without saving.

Part of the result in bytes:

30 20 2d 20 4d 65 69 6e 65 20 57 3f 72 74 65 72 
20 69 6d 20 4b 75 72 73 0a 2d 2d 2d 2d 2d 2d 2d 
2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 
2d 2d 2d 2d 2d 2d 2d 2d 2d 2d 0a 61 6e 73 65 68 
65 6e 20 2d 20 e2 e8 e6 0a 64 61 73 20 42 69 6c 
64 2c 2d 65 72 20 2d 20 ea e0 f0 f2 e8 ed ea e0

Here is a part of the file the program reads:

ansehen - виж
das Bild,-er - картинка
hören - слушам
noch einmal - още един път
ankreuzen - зачерквам/попълвам

Here is this part of the file in hex representation:

00000000: 2d2d 2d2d 2d2d 2d2d 2d2d 2d0d 0a61 6e73  -----------..ans
00000010: 6568 656e 202d 20e2 e8e6 0d0a 6461 7320  ehen - .....das
00000020: 4269 6c64 2c2d 6572 202d 20ea e0f0 f2e8  Bild,-er - .....
00000030: edea e00d 0a68 3f72 656e 202d 20f1 ebf3  .....h?ren - ...
00000040: f8e0 ec0d 0a6e 6f63 6820 6569 6e6d 616c  .....noch einmal
00000050: 202d 20ee f9e5 20e5 e4e8 ed20 effa f20d   - ... .... ....
00000060: 0a61 6e6b 7265 757a 656e 202d 20e7 e0f7  .ankreuzen - ...

Solution

  • So, the file you have does indeed use two different single-byte character encodings on each line. That's quite the technical feat to have managed with any regular text editor! :)

    Let's take the hören line (68 3f 72 65 6e 20 2d 20 f1 eb f3 f8 e0 ec) as an example. I'm going to modify it slightly, because the file in your hex dump is already damaged: the byte 3F is a literal question mark, not F6, which is what ö would be in ISO-8859-1. In other words, the ö was lost when the file was saved, before your program ever read it.

    I'm going to use Python to illustrate the problems you'll face because it's good at dealing with various encodings.

    >>> x = '68 f6 72 65 6e 20 2d 20 f1 eb f3 f8 e0 ec'
    >>> b = bytes.fromhex(x)
    >>> b
    b'h\xf6ren - \xf1\xeb\xf3\xf8\xe0\xec'
    

    If we just decode the hexadecimal encoding of those bytes into a bytestring, we can see its Python representation, where all of the printable 7-bit ASCII bytes are shown as themselves, but everything else is shown as an escape sequence. Don't be fooled, this is not human-readable text, it's just a sequence of bytes that partially looks readable.

    Alright, so let's try to decode this into text as ISO-8859-1 (aka latin-1), which is close to the Windows code page 1252.

    >>> b.decode("latin-1")
    'hören - ñëóøàì'
    

    We can see that the ö for hören was decoded well, but the Cyrillic is unreadable mojibake.

    Let's do it the other way, then:

    >>> b.decode("cp1251")
    'hцren - слушам'
    

    The German comes out wrong, because the byte \xf6 is interpreted as ц in CP1251, but the Cyrillic half checks out (according to Google Translate anyway).

    So, if we were using Python, we'd decode this by splitting the line at the " - " separator and decoding each half with its own encoding:

    >>> de_bytes, _, cyr_bytes = b.partition(b" - ")
    >>> (de_bytes.decode("latin-1"), cyr_bytes.decode("cp1251"))
    ('hören', 'слушам')
    

    (and this indeed prints out just fine on my Mac's terminal, and would also do so in Python UTF-8 Mode on Windows).

    Now, back to C land: the issue is that fgets() and friends don't give a darn about encodings – they're all just bytes (though fgets() knows that the byte 0x0a (10 in decimal) is the newline character in ASCII encoding, and stops reading there).

    When you read those bytes, you get exactly those bytes, and it's up to your app to interpret them. When you output those bytes using printf() on your regular Windows terminal, it will use the current console output codepage to translate the bytes into glyphs.
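
    To see for yourself that fgets() hands you raw bytes, you can dump each line as hex before printing it. The following is only a minimal sketch; bytes_to_hex is a hypothetical helper name, not something from the standard library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format `len` raw bytes as space-separated lowercase
   hex into `out`. Returns how many bytes were formatted; it stops early if
   `out` runs out of room. */
size_t bytes_to_hex(const char *bytes, size_t len, char *out, size_t out_size)
{
    assert(bytes != NULL && out != NULL && out_size > 0);
    size_t done = 0;
    size_t pos = 0;
    out[0] = '\0';
    /* Each byte needs at most 3 characters (" f6") plus the terminator. */
    while (done < len && pos + 4 <= out_size) {
        /* Go through unsigned char: plain char may be signed, and 0xf6
           would otherwise be sign-extended when promoted. */
        unsigned value = (unsigned char)bytes[done];
        pos += (size_t)snprintf(out + pos, out_size - pos,
                                done ? " %02x" : "%02x", value);
        done++;
    }
    return done;
}
```

    Inside the fgets() loop you could call bytes_to_hex(str, strlen(str), hex, sizeof hex) and print hex with puts(). An undamaged hören line would start with 68 f6 72 65 6e; if you see 68 3f instead, the ö was already gone in the file.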

    Technically, you could output these files correctly in your Windows terminal with something like this:

    1. read a line
    2. switch to codepage 1252 (SetConsoleOutputCP(1252);)
    3. write out each Latin byte until you find space-dash-space
    4. switch to codepage 1251 (SetConsoleOutputCP(1251);)
    5. write out each Cyrillic byte until you're out of this line

    ... rinse and repeat.
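
    The steps above could be sketched roughly as follows. This assumes every line has the German part first and that the first " - " marks the boundary; german_part_length, write_with_codepage and print_mixed_line are hypothetical names, and I have not checked how every console host behaves when the code page changes mid-line. The Windows call is guarded so the sketch still compiles elsewhere, where the bytes are simply passed through.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: offset of the first " - " separator in `line`,
   or strlen(line) if the line has no separator. */
size_t german_part_length(const char *line)
{
    assert(line != NULL);
    const char *sep = strstr(line, " - ");
    return sep ? (size_t)(sep - line) : strlen(line);
}

/* Write `len` bytes to stdout while the console uses code page `cp`. */
static void write_with_codepage(const char *bytes, size_t len, unsigned cp)
{
#ifdef _WIN32
    SetConsoleOutputCP(cp);
#else
    (void)cp; /* no console code pages here; bytes pass through unchanged */
#endif
    fwrite(bytes, 1, len, stdout);
    /* Flush now, so buffered bytes are not interpreted under the next
       code page after a later switch. */
    fflush(stdout);
}

/* Print the part before the separator as CP1252 and the rest (separator,
   Cyrillic and newline) as CP1251. */
void print_mixed_line(const char *line)
{
    size_t german_len = german_part_length(line);
    write_with_codepage(line, german_len, 1252);
    write_with_codepage(line + german_len, strlen(line + german_len), 1251);
}
```

    In the read loop, print_mixed_line(str) would replace printf("%s", str). The fflush() matters: stdio buffers output, so without it the bytes may only reach the console after the code page has already been switched again.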

    Another option would be to convert your input to a Unicode encoding, e.g. UTF-8 or UTF-16. You'd still have to interpret each half of each line differently. UTF-8 in particular is a variable-width encoding, so you can't trust strlen() to give you the human-eyes length of a string anymore (it counts bytes, not characters). But at least your playing field would be level, and you could use some of the answers in "Properly print utf8 characters in windows console".
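
    A rough sketch of such a conversion to UTF-8, under some loud assumptions: the part left of the first " - " is Latin-1 (a line without a separator is treated as Latin-1 throughout), the rest is CP1251, and only the CP1251 bytes 0xC0..0xFF (А..я) plus Ё/ё are mapped; anything else above 0x7F on the Cyrillic side becomes '?'. All function names are hypothetical. Note that a heading such as "0 - Meine Wörter im Kurs" from your output has German on both sides of the dash and would need special handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append code point `cp` (always < 0x800 in this sketch) as UTF-8.
   Returns the number of bytes written: 1 or 2. */
static size_t put_utf8(unsigned cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    out[0] = (char)(0xC0 | (cp >> 6));
    out[1] = (char)(0x80 | (cp & 0x3F));
    return 2;
}

/* Partial CP1251 mapping (an assumption of this sketch, not a full table). */
static unsigned cp1251_to_unicode(unsigned char b)
{
    if (b < 0x80) return b;                      /* ASCII range */
    if (b >= 0xC0) return 0x0410u + (b - 0xC0u); /* U+0410..U+044F */
    if (b == 0xA8) return 0x0401u;               /* Ё */
    if (b == 0xB8) return 0x0451u;               /* ё */
    return '?';                                  /* not covered here */
}

/* Convert one mixed line to UTF-8. `out` must have room for at least
   2 * strlen(line) + 1 bytes. Returns the UTF-8 length in bytes. */
size_t mixed_line_to_utf8(const char *line, char *out)
{
    assert(line != NULL && out != NULL);
    const char *sep = strstr(line, " - ");
    size_t n = 0;
    for (const char *p = line; *p; p++) {
        unsigned char b = (unsigned char)*p;
        /* In Latin-1 the code point equals the byte value. */
        int latin1 = (sep == NULL) || (p < sep);
        n += put_utf8(latin1 ? b : cp1251_to_unicode(b), out + n);
    }
    out[n] = '\0';
    return n;
}

/* Count code points by skipping UTF-8 continuation bytes (10xxxxxx). */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```

    For the repaired hören line, the 14 input bytes become 21 bytes of UTF-8, which is what strlen() reports, while utf8_code_points() still counts 14 characters. To actually display the result on Windows you would additionally need the console to expect UTF-8, which is what the linked question covers.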