
How can I get sensible results from len(), str.format() and a zero-width space?


I'm trying to format text as a kind of table and write the result to a file, but the alignment breaks because my source sometimes contains the Unicode character 'ZERO WIDTH SPACE' (\u200b in Python). Consider the following code example:

str_list = ("a\u200b\u200b", "b", "longest entry\u200b")
format_str = "|{string:<{width}}| output of len(): {length}\n"

max_width = 0
for item in str_list:
    if len(item) > max_width:
        max_width = len(item)

with open("tmp", mode='w', encoding="utf-8") as file:
    for item in str_list:
        file.write(format_str.format(string=item,
                                     width=max_width,
                                     length=len(item)))

Content of 'tmp' after running the script above:

|a​​           | output of len(): 3
|b             | output of len(): 1
|longest entry​| output of len(): 14

So it looks like len() does not return the 'printed width' of the string, and str.format() does not know how to handle zero-width characters.

Or this behavior is intentional and I need to do something else.

To be clear, I'm looking for a way to get something like this result:

|a​​            | output of len(): 1
|b            | output of len(): 1
|longest entry​| output of len(): 13

I'd prefer a solution that doesn't require mangling my source.


Solution

  • The third-party wcwidth package (available from PyPI) has a function wcswidth() that returns the width of a string in character cells:

    from wcwidth import wcswidth
    
    length = len('sneaky\u200bPete')      # 11
    width = wcswidth('sneaky\u200bPete')  # 10
    

    str.format() pads according to len(s), the number of code points, rather than the displayed width. The difference between len(s) and wcswidth(s) can therefore be added to the field width to compensate. Modifying your code above:

    from wcwidth import wcswidth
    
    str_list = ("a\u200b\u200b", "b", "longest entry\u200b")
    format_str = "|{s:<{fmt_width}}| width: {width}, error: {fmt_error}\n"
    
    max_width = max(wcswidth(s) for s in str_list)
    
    with open("tmp", mode='w', encoding="utf-8") as file:
        for s in str_list:
            width = wcswidth(s)
            fmt_error = len(s) - width
            fmt_width = max_width + fmt_error
            file.write(format_str.format(s=s,
                                         fmt_width=fmt_width,
                                         width=width,
                                         fmt_error=fmt_error))
    

    … produces this output:

    |a​​            | width: 1, error: 2
    |b            | width: 1, error: 0
    |longest entry​| width: 13, error: 1
    

    It also produces correct output for strings including double-width characters:

    str_list = ("a\u200b\u200b", "b", "㓵", "longest entry\u200b")
    

    |a​​            | width: 1, error: 2
    |b            | width: 1, error: 0
    |㓵           | width: 2, error: -1
    |longest entry​| width: 13, error: 1
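
If installing wcwidth is not an option, the same correction can be roughly approximated with only the standard library. The following is a minimal sketch, not the wcwidth algorithm: the helper names display_width() and ljust_cells() are made up for illustration, and the width rule (0 cells for combining marks and format characters such as U+200B, 2 cells for East Asian wide or fullwidth characters, 1 cell otherwise) is a simplifying assumption that will be wrong for some scripts and emoji sequences.

```python
import unicodedata


def display_width(s):
    """Approximate the number of character cells needed to show s.

    Hypothetical helper: a simplified stand-in for wcwidth.wcswidth().
    """
    width = 0
    for ch in s:
        # Assumption: combining marks (Mn, Me) and format characters
        # (Cf, which includes U+200B ZERO WIDTH SPACE) occupy no cell.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East Asian wide (W) and fullwidth (F) characters occupy two cells.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def ljust_cells(s, cells):
    """Left-justify s to the given number of cells by appending spaces."""
    return s + " " * max(0, cells - display_width(s))


str_list = ("a\u200b\u200b", "b", "㓵", "longest entry\u200b")
max_width = max(display_width(s) for s in str_list)

for s in str_list:
    print(f"|{ljust_cells(s, max_width)}| width: {display_width(s)}")
```

Padding by hand instead of going through the format spec means no per-string correction of the field width is needed. For anything beyond simple cases, wcswidth() is likely the more reliable choice.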