python, bash, string-length

Get the printed length of a string in the terminal


It seems like a fairly simple task, yet I can't find a fast and reliable solution to it.

I have strings in bash, and I want to know the number of characters that will be printed on the terminal. The reason I need this is to nicely align the strings in three columns of n characters each. For that, I need to add as many spaces as necessary to make sure the second and third columns always start at the same location in the terminal.
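
For context, the padding itself is simple once I can get the printed length; this is roughly what I have in mind (printed_length is a hypothetical placeholder for whatever answers this question):

v='féé'
len="$(printed_length "${v}")"             # hypothetical helper: printed width of ${v}
printf '%s%*s' "${v}" "$((20 - len))" ''   # pad the first column to 20 cells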

Example of problematic string length:

v='féé'

echo "${#v1}"
 > # 5 (should be 3)

printf '%s' "${v1}" | wc -m
 > # 5 (should be 3)

printf '%s' "${v1}" | awk '{print length}'
 > # 5 (should be 3)

The best I have found is this, which works most of the time:

echo "${v}" | python3 -c 'v=input();print(len(v))'
 > # 3 (yeah!)

But sometimes I have characters that are modified by a combining sequence that follows them. I can't copy/paste that here, but this is what it looks like:

v="de\314\201tresse"
echo "${v}"
 > # détresse
echo "${v}" | python3 -c 'v=input();print(len(v))'
 > # 9 (should be 8)

I know it can be even more complicated with \r characters or ANSI escape sequences, but I only have to deal with "regular" strings of the kind commonly found in filenames, documents and other file content written by humans. Since the string IS printed in the terminal, I guess there must be some engine that knows, or can know, the printed length of the string.

I have also considered sending ANSI escape sequences to query the cursor position before and after printing the string, and using the difference to compute the length, but that looks like a rabbit hole I don't want to go down. Plus it would be very slow.


Solution

  • How about

    v='féé'
    echo "${v}" | python3 -c 'import unicodedata as ud;v=input();print(len(ud.normalize("NFC",v)))'
    

    Note that unicodedata is part of the Python standard library, so there is nothing to install. If you want a more recent Unicode database than the one bundled with your Python version, you can install the unicodedata2 package instead:

    pip install unicodedata2

    Additional Notes

    This normalizes the string to the NFC (canonical composition) Unicode normalization form, explained here. If you are working with Latin-script text, it should work fine. However, for text converted from pre-Unicode (ANSI) encodings of languages such as Arabic, Greek, Hebrew, Russian or Thai, NFC may keep the original formatting; although it is generally more advisable to use NFC, you could try NFKC in those cases.

    The reason for preferring NFC is to avoid normalizing characters that are compatibility-equivalent but not canonically equivalent, for example the single ligature character ﬀ (U+FB00): normalized with NFC it stays length 1, but normalized with NFKC it becomes "ff", length 2. Depending on your application that can create issues, but if you just want readable text, then NFKC is fine.
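
    To see the difference on that ligature, a quick check (the first number is the NFC length, the second the NFKC length):

    python3 -c 'import unicodedata as ud;s="\ufb00";print(len(ud.normalize("NFC",s)),len(ud.normalize("NFKC",s)))'
     > # 1 2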