python python-2.7

Python: incorrect formatting of Cyrillic text


def inp(text):
    tmp = str()
    arr = ['.' for x in range(1, 40 - len(text))]
    tmp += text + ''.join(arr)
    print tmp

s=['tester', 'om', 'sup', 'jope']
sr=['тестер', 'ом', 'суп', 'жопа']
for i in s:
    inp(i)
for i in sr:
    inp(i)

Output:

tester.................................
om.....................................
sup....................................
jope...................................

тестер...........................
ом...................................
суп.................................
жопа...............................

Why does Python not handle Cyrillic properly? The ends of the lines are ragged instead of aligned. Using string formatting gives the same result. How can this be corrected? Thanks.


Solution

  • Read this:

    Basically, the text parameter of the inp function is a string, and in Python 2.7 strings are byte strings by default. Cyrillic characters do not map one-to-one to bytes when encoded in e.g. UTF-8; each one needs more than one byte (usually 2 in UTF-8). So len(text) returns the number of bytes, not the number of characters.

    To get the number of characters, you need to know your encoding. Assuming it's UTF-8, you can decode text from that encoding and measure the result; the output then lines up:

    #!/usr/bin/python
    # coding=utf-8
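    def inp(text):
        # text is a byte string (Python 2 str); decode it to count characters
        utext = text.decode('utf-8')
        l = len(utext)
        # range(1, 40 - l) yields 39 - l dots, so every line is 39 characters wide
        arr = ['.' for x in range(1, 40 - l)]
        # print the original bytes; only the length calculation needs unicode
        tmp = text + ''.join(arr)
        print tmp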
    
    s=['tester', 'om', 'sup', 'jope']
    sr=['тестер', 'ом', 'суп', 'жопа']
    for i in s:
        inp(i)
    for i in sr:
        inp(i)
    

    The important lines are these two:

        utext = text.decode('utf-8')
        l = len(utext)
    

    where you first decode the text, which results in a unicode string. After that, the built-in len gives you the length in characters, which is what you want.
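
As a rough illustration of the byte/character difference described above, here is a minimal sketch. It is not the code from the answer: the helper name pad_with_dots and the WIDTH constant are made up for this example, and it assumes the source file and any byte-string input are UTF-8. It uses unicode literals and the built-in ljust method instead of building the list of dots by hand, and it is written so the same file runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function  # same print() call on 2.7 and 3

WIDTH = 39  # line width produced by the original inp(); hypothetical name


def pad_with_dots(text, width=WIDTH, encoding='utf-8'):
    """Return text padded on the right with dots to `width` characters.

    Hypothetical helper: accepts a unicode string, or a byte string
    that is assumed to be encoded as `encoding` (UTF-8 by default).
    """
    if isinstance(text, bytes):
        # bytes -> unicode, so that len() and ljust() count characters
        text = text.decode(encoding)
    return text.ljust(width, u'.')


word = u'тестер'
print(len(word.encode('utf-8')))  # 12: six letters, two bytes each in UTF-8
print(len(word))                  # 6: length in characters

for item in [u'tester', u'om', u'тестер', u'ом']:
    print(pad_with_dots(item))
```

Because every padded value is a unicode string of the same length, the dotted lines end in the same column for Latin and Cyrillic words alike. On Python 2.7, printing unicode to a pipe or file rather than a terminal may require encoding the result explicitly, e.g. pad_with_dots(item).encode('utf-8').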