
Replacing byte in bytes array to fix encoding


I'm using ftfy to repair broken text (UTF-8 that shows up as CP1252) and convert it back to Cyrillic, but I've found that some letters can't be fixed.

I have the string Ð'010Ð¡Ð¡199, which I convert to bytes, b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199", and for which I define these pairs:

\xc3\x90' -> \xd0\x92 -> Cyrillic В
\xc3\x90\xc2\xa1 -> \xd0\xa1 -> Cyrillic С

As you can see, Ð' has length 2, so ord won't work in this case.

To use slicing I must know where each sequence starts and ends.

translate doesn't work here either.

Previously I used simple string replacement, but now I'd like to improve my method and rule out mistakes.

Original Ð'010Ð¡Ð¡199 -> conversion -> output В010СС199

EDIT:

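    text = "Ð'010Ð¡Ð¡199"  # renamed from `str` to avoid shadowing the built-in
    text_bytes = text.encode("UTF-8")
    print(text_bytes)
    # UTF-8 bytes
    # \xc3\x90\xc2\xa0 : \xd0\xa0 -> Cyrillic Р
    # \xc3\x90\xc2\xa1 : \xd0\xa1 -> Cyrillic С
    # \xc3\x90\xe2\x80\x94 : \xd0\x97 -> Cyrillic З
    # \xc3\x90' : \xd0\x92 -> Cyrillic В
    test_str = b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199"
    # Replacing one multi-byte sequence at a time works.
    t1 = test_str.replace(b'\xc3\x90\xc2\xa1', b'\xd0\xa1')
    print(t1)
    # Intended mapping (not used below).
    dict_cyr = {"Ð'": "В",
                "Ð¡": "С"}
    # bytes.translate() maps single bytes through a 256-byte table, so it
    # cannot replace multi-byte sequences; this call raises ValueError.
    try:
        t2 = test_str.translate(test_str)
        print(t2)
    except ValueError as exc:
        print(exc)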

I can explain how I obtained these results:

1. I used the 2cyr.com decoder, but even it failed in some cases.
2. I have manually translated strings, so I compared them and worked out which bytes correspond to which Cyrillic letter with the help of a UTF-8 chart.
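The byte pairs listed in the code comments can also be reproduced programmatically instead of being looked up by hand. The following is only a minimal sketch: it assumes the damage came from UTF-8 bytes being read as CP1252 and then saved again as UTF-8, and `mangle` is a hypothetical helper name, not part of ftfy or any library.

```python
def mangle(ch: str) -> bytes:
    """Return the UTF-8 bytes of `ch` after a UTF-8 -> CP1252 misread.

    Assumption: the text was encoded as UTF-8, decoded as CP1252,
    and then written out as UTF-8 again.
    """
    return ch.encode("utf-8").decode("cp1252").encode("utf-8")


# Print the original UTF-8 bytes next to the damaged byte sequence.
for ch in "РСЗВ":
    print(ch, ch.encode("utf-8"), "->", mangle(ch))
```

Under that assumption, В comes out as \xc3\x90\xe2\x80\x99 (with U+2019) rather than \xc3\x90 followed by a plain apostrophe. Python's cp1252 codec also leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so letters such as А (\xd0\x90) and Н (\xd0\x9d) raise UnicodeDecodeError in this sketch, which may be one reason some letters resist repair.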


Solution

  • A common problem in encoding/decoding is encoding a string as UTF-8 and later decoding the byte string as if it were cp1252 (often because of a stupid Windows app).

    That could be what happened here, because CYRILLIC CAPITAL LETTER VE ('В' or '\u0412') and CYRILLIC CAPITAL LETTER ES ('С' or '\u0421') respectively come out as:

    >>> '\u0412'.encode().decode('cp1252')
    'Ð’'
    >>> '\u0421'.encode().decode('cp1252')
    'Ð¡'
    

    This is close to your original string, except that my transformation produces a RIGHT SINGLE QUOTATION MARK (’ or U+2019), while your string contains an APOSTROPHE (' or U+0027).

    If the string actually contains an APOSTROPHE, it could be the result of an attempt to filter non-Latin characters out of a cp1252-encoded string. The downside is that it is hard to guess whether a given apostrophe is a true one or a filtered right single quotation mark.

    If it does contain a right single quotation mark, then it can be converted back as simply as:

    >>> 'Ð’010Ð¡Ð¡199'.encode('cp1252').decode()
    'В010СС199'
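The round trip above can be wrapped in a small helper. This is a sketch under stated assumptions, not a definitive fix: `fix_mojibake` is a hypothetical name, and the apostrophe handling is a guess that every `'` directly after `Ð` was originally U+2019. If the filtered character was actually a LEFT SINGLE QUOTATION MARK (U+2018), the correct letter would be Б rather than В.

```python
def fix_mojibake(text: str) -> str:
    """Undo a UTF-8 -> CP1252 misread; return `text` unchanged on failure."""
    # Heuristic (assumption): an apostrophe right after 'Ð' (U+00D0)
    # used to be a RIGHT SINGLE QUOTATION MARK (U+2019).
    candidate = text.replace("\u00d0'", "\u00d0\u2019")
    try:
        # Re-create the original bytes, then decode them as UTF-8.
        return candidate.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or not reversible): leave it alone.
        return text


# The question's string, written with escapes to avoid encoding surprises.
broken = "\u00d0'010\u00d0\u00a1\u00d0\u00a1199"
print(fix_mojibake(broken))
```

Text that is already correct Cyrillic cannot be encoded as cp1252, so it falls into the `except` branch and is returned as is.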