python, unicode, utf-8, utf-16, file-encodings

How to convert the encoding of a text file (containing text in a language other than English) from "UTF-16 LE" to "UTF-8" in Python?


I have a few text files in a folder that contain text in Hindi. Those files are encoded as UTF-16 LE, and I want to change the encoding to UTF-8 without changing the text itself. How can I do that?

I wrote two Python scripts, but neither of them works properly. When I run either of them, they clear the file content instead of just converting the encoding. This is the code in my Python files:

File 1:

import os
for root, dirs, files in os.walk("."):  
    for filename in files:
        #print(filename[-4:])
        if(filename[-3:] == "txt"):
            f= open(filename,"w+")
            x = f.read()
            print(x)
            f.close()
            f1= open(filename, "w+", encoding="utf-8")
            f1.write(x)
            f1.close()

File 2:

import codecs
BLOCKSIZE = 1048576
with codecs.open("ee.txt", "r", "utf-16-le") as sourceFile:
    with codecs.open("ee.txt", "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            print(contents)
            if not contents:
                break
            targetFile.write(contents)

Solution

  • You are not specifying that the files are in UTF-16 LE when reading their contents - and, on top of that, there is the confusion of trying to read and write the same file at the same time, which won't work: opening a file in "w+" (or "w") mode truncates it immediately, so its contents are gone before you ever get to read them.
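
    As a rough sketch (assuming the files really are UTF-16 LE, and keeping the os.walk approach from your File 1), the fix is to finish reading with the correct encoding before reopening the file for writing:

    import os

    for root, dirs, files in os.walk("."):
        for filename in files:
            if filename.endswith(".txt"):
                # os.walk yields file names relative to "root", so join them back
                path = os.path.join(root, filename)
                with open(path, "r", encoding="utf-16-le") as source:
                    data = source.read()      # read everything *before* truncating
                with open(path, "w", encoding="utf-8") as target:
                    target.write(data)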

    Also, unless you are running this code on a server where an attack attempt may be made by sending you an inordinately big text file, you should not worry about file size and can simply read the whole file contents at once. (To give you an idea, the Bible - a big book - is on the order of 3 MB with an 8-bit encoding, and even small VPS servers have on the order of 200 MB of memory available to your program; that is, you could convert a book 30+ times the size of the Bible in one go. Typical desktop computers have several times this amount of memory.)

    Also, the relatively recent "pathlib" module in the standard library makes it easy to iterate through all your text files, and its Path.read_text and Path.write_text methods will open a file, read or write its contents in the given encoding, and close it again, all in a single expression. Since with these methods the reading is already finished by the time the file is written, the conversion boils down to two calls:

    import pathlib

    for filepath in pathlib.Path(".").glob("**/*.txt"):
        data = filepath.read_text(encoding="utf-16-le")
        filepath.write_text(data, encoding="utf-8")
    

    If you prefer to be on the safe side, against the very, very unlikely event of a catastrophic computer crash in the middle of a file conversion, you could write to a differently named file and do the deleting/renaming afterwards - so the code looks like this:

    import pathlib

    for filepath in pathlib.Path(".").glob("**/*.txt"):
        data = filepath.read_text(encoding="utf-16-le")
        tmp_name = filepath.name + ".tmp"
        filepath.with_name(tmp_name).write_text(data, encoding="utf-8")
        filepath.unlink()
        # rename using the full original path, so files in subfolders stay in place
        filepath.with_name(tmp_name).rename(filepath)
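
    As a side note, if your Python version has it, Path.replace (a thin wrapper over os.replace) overwrites the destination in a single call, so the explicit unlink would not be needed:

    filepath.with_name(tmp_name).replace(filepath)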