python csv character-encoding windows-1251

Why does Python string concatenation work with Russian text but string.format() does not?


I'm trying to parse (and escape) rows of a CSV file that is stored in the Windows-1251 character encoding. Following this excellent answer on dealing with this encoding, I've ended up with this one line to test the output. For some reason, this works:

print(row[0]+','+row[1])

Outputting:

Тяжелый Уборщик Обязанности,1 литр

While this line doesn't work:

print("{0},{1}".format(*row))

Outputting this error:

Name,Variant

Traceback (most recent call last):
  File "Russian.py", line 26, in <module>
    print("{0},{1}".format(*row))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)

Here are the first 2 lines of the CSV:

Name,Variant
Тяжелый Уборщик Обязанности,1 литр

and in case it helps, here is the full source of Russian.py:

import csv
import cgi
from chardet.universaldetector import UniversalDetector
chardet_detector = UniversalDetector()

def charset_detect(f, chunk_size=4096):
    global chardet_detector
    chardet_detector.reset()
    while 1:
        chunk = f.read(chunk_size)
        if not chunk: break
        chardet_detector.feed(chunk)
        if chardet_detector.done: break
    chardet_detector.close()
    return chardet_detector.result

with open('Russian.csv') as csv_file:
    cd_result = charset_detect(csv_file)
    encoding = cd_result['encoding']
    csv_file.seek(0)
    csv_reader = csv.reader(csv_file)
    for bytes_row in csv_reader:
        row = [x.decode(encoding) for x in bytes_row]
        if len(row) >= 6:
            #print(row[0]+','+row[1])
            print("{0},{1}".format(*row))

Solution

  • After the decode step, every item in row is a unicode object. Concatenating unicode with the byte string ',' makes Python 2 promote the whole result to unicode, so this line works:

    print(row[0]+','+row[1])
    Тяжелый Уборщик Обязанности,1 литр
    

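    Here, however, the format string is a plain byte string (str). In Python 2, str.format() converts each unicode argument to str using the default ASCII codec, which cannot represent Cyrillic characters. That's why you get the UnicodeEncodeError.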

    print("{0},{1}".format(*row))
    

    So make the format string unicode as well by adding the u prefix:

    print(u"{0},{1}".format(*row))
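
The difference can be reproduced without the CSV file. The following is a minimal sketch, not the asker's script: the sample bytes in raw_row are an assumption standing in for one row read from Russian.csv, and the explicit encode("ascii") call imitates the conversion that Python 2 performs implicitly when a byte-string format() receives unicode arguments. It deliberately avoids printing Cyrillic text so that the result does not depend on the console encoding.

```python
# -*- coding: utf-8 -*-

# Assumed sample data: one CSV row as raw Windows-1251 bytes.
raw_row = [
    u"Тяжелый Уборщик Обязанности".encode("cp1251"),
    u"1 литр".encode("cp1251"),
]

# Decode once at the input boundary so every cell is unicode text.
row = [cell.decode("cp1251") for cell in raw_row]

# Imitate what a byte-string "{0},{1}".format(*row) does in Python 2:
# each unicode argument is encoded with the ASCII codec, which fails.
try:
    row[0].encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# With a unicode format string no implicit ASCII encoding takes place.
line = u"{0},{1}".format(*row)

# Concatenation gives the same unicode result.
joined = row[0] + u"," + row[1]

print(ascii_ok)        # False
print(line == joined)  # True
```

The u prefix is accepted by Python 2 and by Python 3.3 or later, so the sketch runs on both. On Python 3 all str objects are already unicode, and the original "{0},{1}".format(*row) works unchanged.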