python utf-8 decode mojibake

Decode a UTF-8 string in Python


I have a problem with encoding and decoding in Python. I want to encode a Vietnamese plaintext with my own algorithm, but the algorithm can't handle Vietnamese characters. So I first convert the text to UTF-8 with plaintext.encode('utf-8'), then convert the resulting bytes to a string (because my algorithm only accepts strings). The problem is in the decoding step: after decoding with my algorithm, I get back a string containing UTF-8 escape sequences, and I want to turn it into readable Vietnamese text (right now it looks like mojibake). I can't call receiveString.decode('utf-8') because "'str' object has no attribute 'decode'". I know strings don't have this method, but how do I handle this?

This is the string I receive:

b'v\\xc3\\xb4 \\xc4\\x91\\xe1\\xbb\\x8bch thi\\xc3\\xaan h\\xe1\\xba\\xa1'

That's a UTF-8 string. I want to decode it, but I get:

'str' object has no attribute 'decode'
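
For context, a string of this shape is what str() returns when it is applied to a bytes object, so the situation can presumably be reproduced as in the sketch below. This is an assumption: the actual algorithm is not shown, and the variable names are only illustrative.

```python
# Hypothetical reproduction; the asker's actual algorithm is not shown.
plaintext = "vô địch thiên hạ"
utf8_bytes = plaintext.encode("utf-8")    # a real bytes object

# str() on bytes yields the printable repr, including the b'...' wrapper
# and literal backslash escapes, not the decoded text.
receive_string = str(utf8_bytes)

print(type(receive_string).__name__)      # str
print(receive_string.startswith("b'"))    # True
print(hasattr(receive_string, "decode"))  # False: only bytes has .decode()
```

In Python 3 only bytes objects have a decode method, which is why calling it on this string raises the error above.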

Solution

  • The question is rather unclear, but the following snippet may help (the inline comments show the intermediate result after each step):

    receive_string = "b'v\\xc3\\xb4 \\xc4\\x91\\xe1\\xbb\\x8bch thi\\xc3\\xaan h\\xe1\\xba\\xa1'"
    vietnamese_txt = (receive_string
      .encode()                      # b"b'v\\xc3\\xb4 \\xc4\\x91\\xe1\\xbb\\x8bch thi\\xc3\\xaan h\\xe1\\xba\\xa1'"
      .decode('unicode_escape')      #  "b'vÃ´ Ä\x91á»\x8bch thiÃªn háº¡'"
      .encode('latin1').decode()     #  "b'vô địch thiên hạ'" 
      .lstrip('b').strip("'"))       #    'vô địch thiên hạ'
    
    print(vietnamese_txt)            #     vô địch thiên hạ
    
    vô địch thiên hạ
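
An alternative sketch, assuming the received string is exactly the printed form of a bytes object (as in the snippet above): ast.literal_eval from the standard library can parse it back into real bytes, which can then be decoded as UTF-8. The helper name decode_bytes_repr is hypothetical and only used for illustration.

```python
import ast


def decode_bytes_repr(text: str) -> str:
    """Turn the printed form of a UTF-8 bytes object back into readable text.

    Hypothetical helper; assumes `text` looks like "b'...'" with escape
    sequences such as \\xc3 written out as literal characters.
    """
    value = ast.literal_eval(text)  # safely parses the literal, no eval()
    if not isinstance(value, bytes):
        raise ValueError("expected the repr of a bytes object")
    return value.decode("utf-8")


receive_string = "b'v\\xc3\\xb4 \\xc4\\x91\\xe1\\xbb\\x8bch thi\\xc3\\xaan h\\xe1\\xba\\xa1'"
print(decode_bytes_repr(receive_string))
```

This avoids the manual lstrip/strip step and the intermediate latin1 round trip. If the algorithm can be changed, it may be simpler to avoid str() on the bytes altogether: for example, decoding the bytes with 'latin-1' gives one character per byte, which .encode('latin-1').decode('utf-8') reverses without loss. Whether the algorithm accepts those characters is not known from the question.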