Tags: python, non-unicode

charmap error when encoding Japanese characters


I am making a program to translate specific Japanese characters to their English spelling from an external text file using the replace() function, but I am facing a strange error.

I made sure to read the whole text file as bytes into a variable, then run the replace process at the byte level on that variable; after that it gets decoded back into a string and written to a new text file.

path = input('Location: ').strip('"')
txt = ''
with open(path,'rb') as f:
    txt = f.read()

def convert(jchar,echar):
    ct = txt.replace(jchar.encode('utf-8'),echar.encode('utf-8'))
    return ct

txt = convert('ぁ','a')
txt = convert('っ','su')

with open('Translated.txt','w') as tf:   
    tf.write(txt.decode('utf-8'))

input('Done.')

If the text file includes only Japanese characters that are replaceable in the script, everything goes perfectly, but if it contains a Japanese character that isn't replaceable in the script, I get this error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u306e' in position 6: character maps to <undefined>

From that, Python seems unable to decode a Japanese character's bytes again after encoding them.

And worse, there are some other non-Unicode characters that cause the same error even when I make them replaceable in the script, which means Python cannot even encode them. But my main question is why Python refuses to decode the bytes of a Japanese character even though Python itself was able to encode it.


Solution

  • You need to set the correct encoding when opening the file you're writing to, like so:

    with open('Translated.txt','w', encoding='utf-8') as tf:
    

    Python defaults to a specific encoding based on the platform you're running it on. On Windows, it's typically a legacy code page such as cp1252, not UTF-8. When you try to write your characters to the file, Python encodes the string with that default codec, but the codec has no mapping for those characters, so it fails.
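
    A minimal sketch of this, runnable on any platform: `locale.getpreferredencoding(False)` is what `open()` consults when no `encoding` argument is given, and encoding a Japanese character with a Windows-style code page such as cp1252 fails, while UTF-8 handles it fine:

    ```python
    import locale

    # The encoding open() uses when none is given (platform-dependent;
    # on many Windows setups it is a legacy code page like cp1252).
    default = locale.getpreferredencoding(False)

    # '\u306e' (の) has no mapping in cp1252, so encoding raises there...
    try:
        'の'.encode('cp1252')
        cp1252_ok = True
    except UnicodeEncodeError:
        cp1252_ok = False

    # ...while UTF-8 can represent any Unicode character.
    utf8_bytes = 'の'.encode('utf-8')
    ```

    So the error has nothing to do with decoding: `txt.decode('utf-8')` succeeds, and the failure happens afterwards, when the resulting string is re-encoded for the output file.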

    The reason it works when every Japanese character gets substituted is that the Roman replacement characters can be encoded by that default codec; the error only surfaces when a remaining Japanese character is written to the file. If you take a look at the Traceback that's printed, you'd see exactly where it happened:

    Traceback (most recent call last):
      File ".\sandbox.py", line 61, in <module>
        tf.write(txt.decode('utf-8'))
      File "[...]\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u3041' in position 11: character maps to <undefined>
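
    Incidentally, once both files are opened with an explicit `encoding='utf-8'`, the bytes round-trip is no longer needed and `replace()` can work on the string directly. A minimal sketch of that flow, using the two substitutions from the question (file names here are just placeholders):

    ```python
    # Table of substitutions, taken from the question.
    replacements = {'ぁ': 'a', 'っ': 'su'}

    def convert(text: str) -> str:
        # Apply every substitution on the decoded string directly.
        for jchar, echar in replacements.items():
            text = text.replace(jchar, echar)
        return text

    # Reading and writing with an explicit encoding avoids the
    # platform-default 'charmap' codec entirely:
    # with open(path, encoding='utf-8') as f:
    #     txt = f.read()
    # with open('Translated.txt', 'w', encoding='utf-8') as tf:
    #     tf.write(convert(txt))
    ```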