
Length of string in bash


How do you get the length of a string stored in a variable and assign that to another variable?

myvar="some string"
echo ${#myvar}  
# 11

How do you set another variable to the output 11?


Solution

  • UTF-8 string length

    By using wc

    wc offers two relevant options (from man wc):

       -c, --bytes
              print the byte counts
    
       -m, --chars
              print the character counts
    

    So you could run:

    echo -n Généralité | wc -c
    
     13
    
    echo -n Généralité | wc -m
    
     10
    
    echo -n Généralité | wc -cm
    
     10      13
    
    for string in Généralités Language Théorème Février  "Left: ←" "Yin Yang ☯";do
        strlens=$(echo -n "$string"|wc -mc)
        chrs=$((${strlens% *}))
        byts=$((${strlens#*$chrs }))
        printf " - %-*s is %2d chars length, but uses %2d bytes\n" \
            $(( 14 + $byts - $chrs )) "$string" $chrs $byts
    done
    
     - Généralités    is 11 chars length, but uses 14 bytes
     - Language       is  8 chars length, but uses  8 bytes
     - Théorème       is  8 chars length, but uses 10 bytes
     - Février        is  7 chars length, but uses  8 bytes
     - Left: ←        is  7 chars length, but uses  9 bytes
     - Yin Yang ☯     is 10 chars length, but uses 12 bytes
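
    To come back to the original question (storing the length in another variable), the wc counts can be captured with command substitution. This is only a sketch: it uses printf %s in place of echo -n, an assumption made for portability, and the names bytlen and chrlen are chosen for illustration. The character count depends on the current locale.

```shell
#!/usr/bin/env bash
# Sketch: capture wc's counts in variables instead of printing them.
myvar='Généralité'

# printf %s writes the string without a trailing newline, like echo -n.
# $(( ... )) strips the padding some wc implementations put around the number.
bytlen=$(( $(printf %s "$myvar" | wc -c) ))
chrlen=$(( $(printf %s "$myvar" | wc -m) ))   # depends on the current locale

echo "$chrlen chars, $bytlen bytes"
```

    Each assignment still forks a pipeline, so this is convenient for a single string rather than for a loop.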
    

    See the section Useful printf correction tool, further down, for an explanation of this syntax.

    Under bash, you could split wc's output directly (the here-string appends a newline, which is why 1 is subtracted from each count):

    for string in Généralités Language Théorème Février  "Left: ←" "Yin Yang ☯";do
        read -r chrs byts < <(wc -mc <<<"$string")
        printf " - %-$((14+$byts-chrs))s is %2d chars length, but uses %2d bytes\n" \
            "$string" $((chrs-1)) $((byts-1))
    done
    

    But forking to wc for each string can consume a lot of system resources, so I prefer the pure bash way. Have a look at the bottom of this answer to see why!

    By using pure bash

    The first idea I had was to change the locale environment to force bash to treat each character as a byte:

    myvar='Généralités'
    chrlen=${#myvar}
    oLang=$LANG oLcAll=$LC_ALL
    LANG=C LC_ALL=C
    bytlen=${#myvar}
    LANG=$oLang LC_ALL=$oLcAll
    printf "%s is %d char len, but %d bytes len.\n" "${myvar}" $chrlen $bytlen
    

    will render:

    Généralités is 11 char len, but 14 bytes len.
    

    You can even have a look at the bytes actually stored:

    myvar='Généralités'
    chrlen=${#myvar}
    oLang=$LANG oLcAll=$LC_ALL
    LANG=C LC_ALL=C
    bytlen=${#myvar}
    printf -v myreal "%q" "$myvar"
    LANG=$oLang LC_ALL=$oLcAll
    printf "%s has %d chars, %d bytes: (%s).\n" "${myvar}" $chrlen $bytlen "$myreal"
    

    will answer:

    Généralités has 11 chars, 14 bytes: ($'G\303\251n\303\251ralit\303\251s').
    

    Note: Following Isabell Cowan's comment, I now set $LC_ALL along with $LANG.

    So the function could be:

    strU8DiffLen() {
        local chLen=${#1} LANG=C LC_ALL=C
        return $((${#1}-chLen))
    }
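
    A possible way to use this function is sketched below. The variable name extra is an assumption made for the example. The difference comes back through the exit status, so it has to be saved right away and cannot exceed 255.

```shell
#!/usr/bin/env bash
# Same function as above: the exit status carries (bytes - chars).
strU8DiffLen() {
    local chLen=${#1} LANG=C LC_ALL=C
    return $((${#1}-chLen))
}

myvar='Généralités'
extra=0
# A non-zero status is the expected result here, not an error;
# the || form also keeps the script alive under "set -e".
strU8DiffLen "$myvar" || extra=$?

chrlen=${#myvar}
bytlen=$(( chrlen + extra ))
echo "$chrlen chars, $bytlen bytes"
```

    Whatever the current locale, chrlen + extra adds up to the byte length of the string.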
    

    But surprisingly, this is not the quickest way:

    Same, but without having to play with locales

    I recently learned about the %n format of the printf command (builtin):

    myvar='Généralités'
    chrlen=${#myvar}
    printf -v _ %s%n "$myvar" bytlen
    printf "%s is %d char len, but %d bytes len.\n" "${myvar}" $chrlen $bytlen
    
     Généralités is 11 char len, but 14 bytes len.
    

    The syntax is a little counter-intuitive, but this is very efficient! (The strU8DiffLen function shown further down is about twice as fast using printf as the previous version using local LANG=C.)
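
    The same %n trick can be wrapped in a small helper that stores the byte count in a variable of your choice. This is a minimal sketch: strByteLen and its calling convention are hypothetical, not part of any standard tool.

```shell
#!/usr/bin/env bash
# strByteLen is a hypothetical helper name, used here for illustration only.
# Usage: strByteLen STRING VARNAME   (stores the byte count of STRING in VARNAME)
strByteLen() {
    # -v _ sends the formatted text to the throwaway variable $_ instead of stdout;
    # %n then stores the number of bytes written so far in the variable named by $2.
    printf -v _ %s%n "$1" "$2"
}

myvar='Généralités'
strByteLen "$myvar" bytlen
chrlen=${#myvar}
echo "$chrlen chars, $bytlen bytes"
```

    Because the result goes into a variable rather than an exit status, it is not capped at 255.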

    Length of an argument, working sample

    Arguments work the same as regular variables:

    showStrLen() {
        local -i chrlen=${#1} bytlen
        printf -v _ %s%n "$1" bytlen
        # From here on use the C locale, so that %q shows multibyte characters as octal escapes
        local LANG=C LC_ALL=C
        printf "String '%s' is %d bytes, but %d chars len: %q.\n" "$1" $bytlen $chrlen "$1"
    }
    

    will work as

    showStrLen théorème
    
    String 'théorème' is 10 bytes, but 8 chars len: $'th\303\251or\303\250me'.
    

    Useful printf correction tool:

    If you run:

    for string in Généralités Language Théorème Février  "Left: ←" "Yin Yang ☯";do
        printf " - %-14s is %2d char length\n" "'$string'"  ${#string}
    done
    
     - 'Généralités' is 11 char length
     - 'Language'     is  8 char length
     - 'Théorème'   is  8 char length
     - 'Février'     is  7 char length
     - 'Left: ←'    is  7 char length
     - 'Yin Yang ☯' is 10 char length
    

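    Not really pretty output: the width in %-14s is counted in bytes, so strings containing multibyte characters get too little padding.

    To correct this, here is a little function. It returns the difference between byte length and character length as its exit status, so that difference must stay below 256: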

    strU8DiffLen() {
        local -i bytlen
        printf -v _ %s%n "$1" bytlen
        return $(( bytlen - ${#1} ))
    }
    

    or written in one line:

    strU8DiffLen() { local -i _bl;printf -v _ %s%n "$1" _bl;return $((_bl-${#1}));}
    

    And now:

    for string in Généralités Language Théorème Février  "Left: ←" "Yin Yang ☯";do
        strU8DiffLen "$string"
        printf " - %-*s is %2d chars length, but uses %2d bytes\n" \
            $((14+$?)) "'$string'" ${#string} $((${#string}+$?))
    done
    
     - 'Généralités'  is 11 chars length, but uses 14 bytes
     - 'Language'     is  8 chars length, but uses  8 bytes
     - 'Théorème'     is  8 chars length, but uses 10 bytes
     - 'Février'      is  7 chars length, but uses  8 bytes
     - 'Left: ←'      is  7 chars length, but uses  9 bytes
     - 'Yin Yang ☯'   is 10 chars length, but uses 12 bytes
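
    An alternative to correcting the field width is to build the padded field by hand from the character count. The sketch below is one way to do that; padRight and the variable cell are hypothetical names, and like the correction above it assumes that every character occupies exactly one column.

```shell
#!/usr/bin/env bash
# padRight is a hypothetical helper: left-align STRING in WIDTH columns,
# padding by character count, and store the result in VARNAME.
# Usage: padRight WIDTH STRING VARNAME
padRight() {
    local -i pad=$(( $1 - ${#2} ))
    (( pad < 0 )) && pad=0            # string longer than WIDTH: no padding
    # %*s with an empty argument prints exactly $pad spaces.
    printf -v "$3" '%s%*s' "$2" "$pad" ''
}

for string in Généralités Language "Left: ←"; do
    padRight 14 "'$string'" cell
    printf ' - %s is %2d chars length\n' "$cell" "${#string}"
done
```

    Writing the result with printf -v avoids both a subshell and the 0 to 255 range of an exit status.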
    

    Unfortunately, this is not perfect!

    Some strange UTF-8 behaviours remain, such as double-width characters, zero-width characters, reverse displacement and others, that cannot be handled so simply...

    Have a look at diffU8test.sh or diffU8test.sh.txt for more limitations.

    Comparison: fork to wc vs pure bash:

    Making a little loop of 1'000 string length inquiries:

    string="Généralité"
    time for i in {1..1000};do strlens=$(echo -n "$string"|wc -mc);done;echo $strlens
    
    real    0m2.637s
    user    0m2.256s
    sys     0m0.906s
    10 13
    
    string="Généralité"
    time for i in {1..1000};do printf -v _ %s%n "$string" bytlen;chrlen=${#string};done;echo $chrlen $bytlen
    
    real    0m0.005s
    user    0m0.005s
    sys     0m0.000s
    10 13
    

    As expected, the result (10 13) is the same, but the execution times differ a lot: pure bash is something like 500x quicker!