I'm using ftfy to fix broken UTF-8 text that shows up as CP1252 mojibake and to convert it back to UTF-8 Cyrillic, but I've found that some letters can't be fixed.
I have a string Ð'010Ð¡Ð¡199 that I convert to bytes, b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199", in which I identified these pairs:
\xc3\x90' -> \xd0\x92 -> Cyrillic В
\xc3\x90\xc2\xa1 -> \xd0\xa1 -> Cyrillic С
As you can see, Ð' has length 2, so ord won't work in this case. To use a slice I would have to know where each sequence starts and ends. translate doesn't work here either.
Previously I've used simple string replacement, but now I'd like to improve my method and exclude mistakes.
Original Ð'010Ð¡Ð¡199 -> conversion -> output В010СС199
EDIT:
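text = "Ð'010Ð¡Ð¡199"  # renamed from str to avoid shadowing the built-in
str_to_bytes = text.encode("UTF-8")
print(str_to_bytes)
# UTF-8 bytes of the mojibake : intended UTF-8 bytes -> Cyrillic letter
# \xc3\x90\xc2\xa0 : \xd0\xa0 -> Cyrillic Р
# \xc3\x90\xc2\xa1 : \xd0\xa1 -> Cyrillic С
# \xc3\x90\xe2\x80\x94' : \xd0\x97 -> Cyrillic З
# \xc3\x90' : \xd0\x92 -> Cyrillic В
test_str = b"\xc3\x90'010\xc3\x90\xc2\xa1\xc3\x90\xc2\xa1199"
# Byte-level replacement works, but needs one call per letter
t1 = test_str.replace(b'\xc3\x90\xc2\xa1', b'\xd0\xa1')
print(t1)
dict_cyr = {"Ð'": "P", "Ð¡": "C"}
# Raises ValueError: bytes.translate() expects a 256-byte table and maps single bytes only
t2 = test_str.translate(test_str)
print(t2)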
I can explain how I obtained these results: 1. I used the 2cyr.com decoder, but even it failed in some cases. 2. I have manually translated strings, so I compared them and worked out which bytes correspond to which Cyrillic letter with the help of a UTF-8 chart.
A common problem in encoding/decoding is encoding a string in UTF-8 and later decoding the byte string as if it were cp1252 (often because of a stupid Windows app).
That could be what happened here, because CYRILLIC CAPITAL LETTER VE ('В' or '\u0412') and CYRILLIC CAPITAL LETTER ES ('С' or '\u0421') respectively translate as:
>>> '\u0412'.encode().decode('cp1252')
'Ð’'
>>> '\u0421'.encode().decode('cp1252')
'С'
This is close to your original string, except that my transformation produces a RIGHT SINGLE QUOTATION MARK (’ or U+2019) while your string contains an APOSTROPHE (' or U+0027).
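To check which of the two characters a given string really contains, you can list its code points one by one. This is a minimal sketch using only the standard library; describe is a hypothetical helper name, not part of ftfy:

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# Escapes are used so the two look-alike strings cannot be confused.
filtered = "\u00d0\u0027"   # Ð followed by APOSTROPHE, as in the question
mojibake = "\u00d0\u2019"   # Ð followed by RIGHT SINGLE QUOTATION MARK

for row in describe(filtered) + describe(mojibake):
    print(row)
```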
If the string actually contains an APOSTROPHE, it could be the result of an attempt to filter non-Latin characters out of a cp1252-encoded string. The downside is that it is hard to tell whether an apostrophe is a genuine one or a filtered right single quotation mark.
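If you decide to treat such apostrophes as filtered quotation marks, one possible heuristic is to restore U+2019 only where the apostrophe directly follows Ð or Ñ, which is how the UTF-8 lead bytes 0xD0 and 0xD1 of Cyrillic letters appear once decoded as cp1252. This is only a sketch under that assumption: restore_quote and fix_cyrillic are illustrative names, and a genuine apostrophe that happens to follow Ð or Ñ would be mangled.

```python
import re

# Ð (U+00D0) and Ñ (U+00D1) are what the UTF-8 lead bytes 0xD0 and 0xD1
# look like after being decoded as cp1252.
_LEAD = "\u00d0\u00d1"

def restore_quote(text):
    """Turn an APOSTROPHE that follows Ð or Ñ back into U+2019.

    Assumption: that apostrophe was a filtered RIGHT SINGLE QUOTATION MARK.
    """
    return re.sub(f"([{_LEAD}])'", "\\1\u2019", text)

def fix_cyrillic(text):
    """Undo the UTF-8 / cp1252 mix-up after restoring the quotation mark."""
    return restore_quote(text).encode("cp1252").decode("utf-8")

broken = "\u00d0'010\u00d0\u00a1\u00d0\u00a1199"   # the string from the question
print(fix_cyrillic(broken))   # В010СС199
```

Apostrophes elsewhere in the text are left alone, since only the positions after Ð or Ñ are rewritten.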
If it does contain a RIGHT SINGLE QUOTATION MARK, then it can be transformed back as simply as:
>>> 'Ð’010Ð¡Ð¡199'.encode('cp1252').decode()
'В010СС199'
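The same round trip can be wrapped in a small function. The sketch below reflects one assumed way of handling failures, not ftfy's behaviour: undo_cp1252 is a hypothetical name, and it returns the input unchanged when the conversion is impossible. Note that Python's cp1252 codec leaves the bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so letters whose UTF-8 form contains one of them, such as А (D0 90) or Н (D0 9D), cannot survive a strict cp1252 round trip, which may be why some letters resist fixing.

```python
def undo_cp1252(text):
    """Re-encode text as cp1252 and decode the resulting bytes as UTF-8.

    Returns the input unchanged when the round trip is impossible, for
    example when the text holds characters that cp1252 cannot encode or
    when the bytes are not valid UTF-8.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# Escapes keep the mojibake unambiguous: Ð’010Ð¡Ð¡199
print(undo_cp1252("\u00d0\u2019010\u00d0\u00a1\u00d0\u00a1199"))   # В010СС199
```

Returning the input unchanged keeps already-correct strings safe, at the cost of silently skipping strings that would need a more tolerant codec.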