How do you get the length of a string stored in a variable and assign that to another variable?
myvar="some string"
echo ${#myvar}
# 11
How do you set another variable to the output, 11?
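For the simple case in the question, the expansion can be assigned directly instead of echoed. A minimal sketch, where the name mylen is only illustrative and not from the original post:

```shell
myvar="some string"
# ${#myvar} expands to the length of the value; assign it like any other expansion.
mylen=${#myvar}
echo "$mylen"
```

This counts characters in the current locale, which is why the rest of this answer distinguishes characters from bytes.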
By using wc, you could (from man wc):
-c, --bytes print the byte counts
-m, --chars print the character counts
So, under a POSIX shell, you could:
echo -n Généralité | wc -c
13
echo -n Généralité | wc -m
10
echo -n Généralité | wc -cm
10 13
for string in Généralités Language Théorème Février "Left: ←" "Yin Yang ☯";do
strlens=$(echo -n "$string"|wc -mc)
chrs=$((${strlens% *}))
byts=$((${strlens#*$chrs }))
printf " - %-*s is %2d chars length, but uses %2d bytes\n" \
$(( 14 + $byts - $chrs )) "$string" $chrs $byts
done
- Généralités is 11 chars length, but uses 14 bytes
- Language is 8 chars length, but uses 8 bytes
- Théorème is 8 chars length, but uses 10 bytes
- Février is 7 chars length, but uses 8 bytes
- Left: ← is 7 chars length, but uses 9 bytes
- Yin Yang ☯ is 10 chars length, but uses 12 bytes
See Useful printf correction tool, further down, for an explanation of this syntax.
Reading wc's output directly:
for string in Généralités Language Théorème Février "Left: ←" "Yin Yang ☯";do
read -r chrs byts < <(wc -mc <<<"$string")
printf " - %-$((14+$byts-chrs))s is %2d chars length, but uses %2d bytes\n" \
"$string" $((chrs-1)) $((byts-1))
done
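The -1 corrections in this loop are needed because a here-string (<<<) appends a newline, which wc counts as well. A minimal bash sketch of that difference (the names s, piped and herestr are illustrative):

```shell
s="abc"
# printf '%s' sends exactly the three bytes of the string to wc.
piped=$(( $(printf '%s' "$s" | wc -c) ))
# A here-string adds a trailing newline, so wc sees one byte more.
herestr=$(( $(wc -c <<<"$s") ))
echo "$piped $herestr"
```

Wrapping the command substitution in $(( )) strips the leading padding that some wc implementations print.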
But having to fork to wc for each string could consume a lot of system resources, so I prefer to use the pure bash way! Have a look at the bottom of this answer to see why!!
The first idea I had was to change the locale environment to force bash to treat each character as a byte:
myvar='Généralités'
chrlen=${#myvar}
oLang=$LANG oLcAll=$LC_ALL
LANG=C LC_ALL=C
bytlen=${#myvar}
LANG=$oLang LC_ALL=$oLcAll
printf "%s is %d char len, but %d bytes len.\n" "${myvar}" $chrlen $bytlen
will render:
Généralités is 11 char len, but 14 bytes len.
You could even have a look at the stored chars:
myvar='Généralités'
chrlen=${#myvar}
oLang=$LANG oLcAll=$LC_ALL
LANG=C LC_ALL=C
bytlen=${#myvar}
printf -v myreal "%q" "$myvar"
LANG=$oLang LC_ALL=$oLcAll
printf "%s has %d chars, %d bytes: (%s).\n" "${myvar}" $chrlen $bytlen "$myreal"
will answer:
Généralités has 11 chars, 14 bytes: ($'G\303\251n\303\251ralit\303\251s').
Nota: Following Isabell Cowan's comment, I've added setting $LC_ALL along with $LANG.
So the function could be:
strU8DiffLen() {
local chLen=${#1} LANG=C LC_ALL=C
return $((${#1}-chLen))
}
But surprisingly, this is not the quickest way:
I recently learned about the %n format of the printf command (builtin):
myvar='Généralités'
chrlen=${#myvar}
printf -v _ %s%n "$myvar" bytlen
printf "%s is %d char len, but %d bytes len.\n" "${myvar}" $chrlen $bytlen
Généralités is 11 char len, but 14 bytes len.
printf -v _ tells printf to store the result into the variable _ instead of outputting it on STDOUT.
_ is a garbage variable in this use.
%n tells printf to store the byte count of the already processed string into the variable named at the corresponding place in the arguments.
The syntax is a little counter-intuitive, but this is very efficient! (The strU8DiffLen function further down is about 2 times quicker using printf than the previous version using local LANG=C.)
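To see what %n records, here is a minimal bash sketch with two %n directives in one format string. The variable names first and second are illustrative, and the strings are plain ASCII so that bytes and characters coincide:

```shell
# Each %n consumes one argument: the NAME of the variable to assign.
# It stores the number of bytes produced so far by this printf call.
printf -v _ '%s%n%s%n' "abc" first "defgh" second
echo "$first $second"
```

After "abc" the running count is 3, and after "defgh" it is 8, so the counts are cumulative rather than per-argument.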
Arguments work the same as regular variables:
showStrLen() {
local -i chrlen=${#1} bytlen
printf -v _ %s%n "$1" bytlen
printf "String '%s' is %d bytes, but %d chars len: %q.\n" "$1" $bytlen $chrlen "$1"
}
will work as
showStrLen théorème
String 'théorème' is 10 bytes, but 8 chars len: $'th\303\251or\303\250me'.
Useful printf correction tool:
If you:
for string in Généralités Language Théorème Février "Left: ←" "Yin Yang ☯";do
printf " - %-14s is %2d char length\n" "'$string'" ${#string}
done
- 'Généralités' is 11 char length
- 'Language' is 8 char length
- 'Théorème' is 8 char length
- 'Février' is 7 char length
- 'Left: ←' is 7 char length
- 'Yin Yang ☯' is 10 char length
Not really pretty output!
For this, here is a little function:
strU8DiffLen() {
local -i bytlen
printf -v _ %s%n "$1" bytlen
return $(( bytlen - ${#1} ))
}
or written in one line:
strU8DiffLen() { local -i _bl;printf -v _ %s%n "$1" _bl;return $((_bl-${#1}));}
Then now:
for string in Généralités Language Théorème Février "Left: ←" "Yin Yang ☯";do
strU8DiffLen "$string"
printf " - %-*s is %2d chars length, but uses %2d bytes\n" \
$((14+$?)) "'$string'" ${#string} $((${#string}+$?))
done
- 'Généralités' is 11 chars length, but uses 14 bytes
- 'Language' is 8 chars length, but uses 8 bytes
- 'Théorème' is 8 chars length, but uses 10 bytes
- 'Février' is 7 chars length, but uses 8 bytes
- 'Left: ←' is 7 chars length, but uses 9 bytes
- 'Yin Yang ☯' is 10 chars length, but uses 12 bytes
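One caveat with passing the difference through $?: a shell exit status is truncated to 8 bits, so a difference above 255 would wrap around. A hedged bash sketch of an alternative that stores the byte length in a named variable instead; strByteLen, overflow, long, status and nbytes are hypothetical names, not part of the original answer:

```shell
# Exit statuses keep only the low 8 bits, so 300 comes back as 44.
overflow() { return 300; }
overflow || status=$?
echo "$status"

# Hypothetical helper: store the byte length of $1 in the variable named by $2.
strByteLen() {
    printf -v _ '%s%n' "$1" "$2"
}

printf -v long '%300s' ''      # build a string of 300 spaces
strByteLen "$long" nbytes
echo "$nbytes"
```

For the short strings used in this answer the $? approach works fine; this variant only matters for long strings with many multi-byte characters.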
But some strange UTF-8 behaviours remain, like double-width chars, zero-width chars, reverse displacement and others that may not be as simple to handle...
Have a look at diffU8test.sh or diffU8test.sh.txt for more limitations.
wc vs pure bash:
Making a little loop of 1'000 string length inquiries:
string="Généralité"
time for i in {1..1000};do strlens=$(echo -n "$string"|wc -mc);done;echo $strlens
real 0m2.637s
user 0m2.256s
sys 0m0.906s
10 13
string="Généralité"
time for i in {1..1000};do printf -v _ %s%n "$string" bytlen;chrlen=${#string};done;echo $chrlen $bytlen
real 0m0.005s
user 0m0.005s
sys 0m0.000s
10 13
Fortunately, the result (10 13) is the same, but the execution times differ a lot: something like 500x quicker using pure bash!!