python unicode encoding

How to open a text file that has emojis in it?


I'm trying to do the simplest thing: open a file, read it, and close it in Python. Simple. Well, this is the code:

name_file = open("Forever.txt", encoding='UTF-8')
data = name_file.read()
name_file.close()

print(data)

I know that this text has emojis in it, like hearts. The thing is that these emojis are not written in their Unicode notation, like U+2600; they appear as little images. I think the following error is caused by these little images:

return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f681' in position 2333: character maps to <undefined>

I tried the following, without specifying the encoding:

name_file = open("Forever.txt")

And the error changed to this:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2303: character maps to <undefined>

I have no idea why this is happening.

Maybe one solution would be to save everything that is text in a variable and delete the rest... hmm.


Solution

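  • You are getting a UnicodeEncodeError, most likely raised by the print call rather than by reading the file. With encoding='UTF-8' the file is read and decoded correctly, but you can only print characters that your console encoding and font actually support. The error says that '\U0001f681' has no mapping in the console's current encoding. The UnicodeDecodeError you got without encoding= is a separate problem: open() then falls back to the platform's default encoding instead of UTF-8, and that encoding cannot decode the file's bytes.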

    For example:

    Python 3.3.5 (v3.3.5:62cf4e77f785, Mar  9 2014, 10:35:05) [MSC v.1600 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print('\U0001F681')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f681' in position 0: character maps to <undefined>
    

    But print a character the terminal encoding supports, and it works:

    >>> print('\U000000E0')
    à
    

    My console encoding was cp437, but if I use a Python IDE that supports UTF-8 encoding, then it works:

    >>> print('\U0001f681')
    🚁
    

    You may or may not see the character correctly. You need to be using a font that supports the character; otherwise, you get some default replacement character.
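If the console encoding cannot be changed, one possible workaround is to escape the unsupported characters before printing instead of letting print raise. The sketch below is an illustration, not part of the original answer: `to_printable` is a hypothetical helper name, and it assumes that `sys.stdout.encoding` reflects the console encoding (falling back to ASCII when it is not set).

```python
import sys


def to_printable(text, encoding=None):
    """Return `text` with characters the target encoding lacks escaped.

    Hypothetical helper for illustration. `encoding` defaults to the
    encoding reported by sys.stdout, or ASCII if none is reported.
    """
    target = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # backslashreplace turns unencodable characters into escapes such as
    # \U0001f681 instead of raising UnicodeEncodeError.
    encoded = text.encode(target, errors="backslashreplace")
    # Decoding with the same encoding gives back a str that is safe to print.
    return encoded.decode(target)


# Characters the console supports pass through; the rest are escaped.
print(to_printable("\u00e0 \U0001f681"))
```

With the file from the question, this would be used as `print(to_printable(data))` after reading `data` with `encoding='UTF-8'`. Passing `errors="replace"` instead of `"backslashreplace"` would print `?` in place of each unsupported character. On Python 3.7 and later, `sys.stdout.reconfigure(encoding="utf-8")` is another option when the terminal itself can render UTF-8; it is not available on the Python 3.3 shown above.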