Tags: python, python-3.x, utf-8, cp1251

How to convert a string from cp1251 to UTF-8 in Python3?


I need help with a pretty simple Python 3.6 script.

First, it downloads an HTML file from an old-fashioned server which uses cp1251 encoding.

Then I need to put the file contents into a UTF-8 encoded string.

Here is what I'm doing:

import requests
import codecs

#getting the file
ri = requests.get('http://old.moluch.ru/_python_test/0.html')

#checking that it's in cp1251
print(ri.encoding)

#encoding using cp1251
text = ri.text
text = codecs.encode(text,'cp1251')

#decoding using utf-8 - ERROR HERE!
text = codecs.decode(text,'utf-8')

print(text)

Here is the error:

Traceback (most recent call last):
  File "main.py", line 15, in <module>
    text = codecs.decode(text,'utf-8')
  File "/var/lang/lib/python3.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 43: invalid continuation byte

I'd really appreciate any help with it.


Solution

  • Not sure what you are trying to do.

    .text is the text of the response, already decoded into a Python string. A Python 3 string is a sequence of Unicode characters and has no encoding of its own.

    Encodings only play a role when you have a stream of bytes that you want to convert to a string (or the other way around). And the requests module already does that for you.

    import requests
    
    ri = requests.get('http://old.moluch.ru/_python_test/0.html')
    print(ri.text)
    

    For example, assume you have a text file (i.e., bytes on disk). You must pick an encoding when you open() the file, because the choice of encoding determines how the bytes in the file are converted into characters. This manual step is necessary because open() cannot know which encoding the bytes of the file are in.

    HTTP, on the other hand, sends the encoding in the response headers (the charset in Content-Type), so requests can know this information. Being a high-level module, it helpfully looks at the HTTP headers and converts the incoming bytes for you. (If you used the much lower-level urllib, you would have to do the decoding yourself.)

    When you only read the .text of the response, the .encoding property is essentially informational: it tells you which encoding requests used to decode the bytes. It becomes relevant if you work with the undecoded bytes instead (the .content or .raw property). With servers that return regular text responses, that is seldom necessary.
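
To make the bytes-versus-string distinction concrete, here is a minimal offline sketch. It does not contact the server from the question. The sample text and all variable names are assumptions chosen for illustration, and the cp1251 bytes stand in for what the server would send over the wire.

```python
# Hypothetical sample text; the real page content is not shown in the thread.
sample = "Привет, мир"

# Stand-in for the raw bytes the cp1251 server would send.
cp1251_bytes = sample.encode("cp1251")

# Roughly what requests does to produce .text when the server declares cp1251:
# bytes -> str. The resulting str has no encoding attached to it.
text = cp1251_bytes.decode("cp1251")

# An encoding only comes into play again when the str is turned back into bytes.
utf8_bytes = text.encode("utf-8")

# The mistake from the question: producing cp1251 bytes and then
# decoding them as if they were UTF-8.
try:
    text.encode("cp1251").decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(len(cp1251_bytes), len(utf8_bytes), decode_failed)
```

With a real response, the same idea would presumably be `ri.text.encode('utf-8')` if UTF-8 bytes are needed, or `open(path, 'w', encoding='utf-8')` followed by writing `ri.text` if the goal is a UTF-8 file. If only a string is needed, `ri.text` is already the final result.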