Length of string in bash

B

11

607

How do you get the length of a string stored in a variable and assign that to another variable?

myvar="some string"
echo ${#myvar}  
# 11

How do you set another variable to the output 11?

Benitobenjamen answered 28/6, 2013 at 15:14 Comment(0)

E

376

UTF-8 string length

By using `wc`

Using wc, two options are relevant (from man wc):

   -c, --bytes
          print the byte counts

   -m, --chars
          print the character counts

So, under a POSIX shell, you could run:

echo -n Généralité | wc -c

echo -n Généralité | wc -m

echo -n Généralité | wc -cm

 10      13
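Since the behaviour of echo -n varies between shells, a slightly more portable sketch feeds the string through printf '%s' and captures both counts in variables. The variable names bytes and chars are only illustrative, and the character count depends on the current locale:

```bash
string='Généralité'

# printf '%s' writes the string without a trailing newline in any POSIX shell.
bytes=$(printf '%s' "$string" | wc -c)
chars=$(printf '%s' "$string" | wc -m)

# Some wc implementations pad their output with spaces; arithmetic
# expansion strips that padding.
bytes=$((bytes)) chars=$((chars))

echo "$chars chars, $bytes bytes"
```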

for string in Généralités Language Théorème Février  "Left: ←" "Yin Yang ☯";do
    strlens=$(echo -n "$string"|wc -mc)
    chrs=$((${strlens% *}))
    byts=$((${strlens#*$chrs }))
    printf " - %-*s is %2d chars length, but uses %2d bytes\n" \
        $(( 14 + $byts - $chrs )) "$string" $chrs $byts
done

 - Généralités    is 11 chars length, but uses 14 bytes
 - Language       is  8 chars length, but uses  8 bytes
 - Théorème       is  8 chars length, but uses 10 bytes
 - Février        is  7 chars length, but uses  8 bytes
 - Left: ←        is  7 chars length, but uses  9 bytes
 - Yin Yang ☯     is 10 chars length, but uses 12 bytes

See Useful printf correction tool, further down, for an explanation of this syntax.

Under bash, you could split `wc`'s output directly:

for string in Généralités Language Théorème Février  "Left: ←" "Yin Yang ☯";do
    read -r chrs byts < <(wc -mc <<<"$string")
    printf " - %-$((14+$byts-chrs))s is %2d chars length, but uses %2d bytes\n" \
        "$string" $((chrs-1)) $((byts-1))
done

But forking to wc for each string can consume a lot of system resources, so I prefer the pure bash way. Have a look at the bottom of this answer to see why!

By using pure bash

The first idea I had was to change the locale environment to force bash to treat each byte as a character:

myvar='Généralités'
chrlen=${#myvar}
oLang=$LANG oLcAll=$LC_ALL
LANG=C LC_ALL=C
bytlen=${#myvar}
LANG=$oLang LC_ALL=$oLcAll
printf "%s is %d char len, but %d bytes len.\n" "${myvar}" $chrlen $bytlen

will render:

Généralités is 11 char len, but 14 bytes len.

you could even have a look at stored chars:

myvar='Généralités'
chrlen=${#myvar}
oLang=$LANG oLcAll=$LC_ALL
LANG=C LC_ALL=C
bytlen=${#myvar}
printf -v myreal "%q" "$myvar"
LANG=$oLang LC_ALL=$oLcAll
printf "%s has %d chars, %d bytes: (%s).\n" "${myvar}" $chrlen $bytlen "$myreal"

will answer:

Généralités has 11 chars, 14 bytes: ($'G\303\251n\303\251ralit\303\251s').

Note: following Isabell Cowan's comment, I've added setting $LC_ALL along with $LANG.

So the function could be:

strU8DiffLen() {
    local chLen=${#1} LANG=C LC_ALL=C
    return $((${#1}-chLen))
}
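Along the same lines, a function that only needs the byte length can scope the locale change with local, so nothing has to be saved and restored by hand. This is just a sketch, and the name strByteLen is hypothetical:

```bash
strByteLen() {
    # local confines the locale override to this function call.
    local LANG=C LC_ALL=C
    # In the C locale, ${#1} counts bytes rather than characters.
    echo "${#1}"
}

strByteLen 'Généralités'
```

Capturing the result with $(strByteLen ...) costs a fork, so this form is better suited to display than to tight loops.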

But surprisingly, this is not the quickest way:

Same, but without having to play with locales

I recently learned about the %n format of the printf command (builtin):

myvar='Généralités'
chrlen=${#myvar}
printf -v _ %s%n "$myvar" bytlen
printf "%s is %d char len, but %d bytes len.\n" "${myvar}" $chrlen $bytlen
Généralités is 11 char len, but 14 bytes len.

printf -v _ tells printf to store the result in variable _ instead of writing it to STDOUT.
_ is a throwaway variable in this use.
%n tells printf to store the number of bytes output so far into the variable named at the corresponding position in the arguments.

The syntax is a little counter-intuitive, but it is very efficient! (The strU8DiffLen function shown further down is about twice as fast using printf as the previous version using local LANG=C.)
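To illustrate how %n picks up the running count at its position in the format, here is a small sketch with two %n markers. The variable names first, second, after1 and after2 are made up for the example, and plain ASCII is used so the counts do not depend on the locale:

```bash
first='foo' second='barbaz'

# Each %n consumes one argument (a variable name) and stores the number of
# bytes produced up to that point in the format, literal text included.
printf -v _ '%s%n-%s%n' "$first" after1 "$second" after2

echo "$after1 $after2"
```

Here after1 covers only "foo", while after2 also includes the literal "-" and "barbaz".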

Length of an argument, working sample

Arguments work the same as regular variables:

showStrLen() {
    local -i chrlen=${#1} bytlen
    printf -v _ %s%n "$1" bytlen
    # Switch to the C locale (inside this function only) so %q shows the raw bytes
    local LANG=C LC_ALL=C
    printf "String '%s' is %d bytes, but %d chars len: %q.\n" "$1" $bytlen $chrlen "$1"
}

will work as

showStrLen théorème

String 'théorème' is 10 bytes, but 8 chars len: $'th\303\251or\303\250me'.

Useful `printf` correction tool:

If you run:

for string in Généralités Language Théorème Février  "Left: ←" "Yin Yang ☯";do
    printf " - %-14s is %2d char length\n" "'$string'"  ${#string}
done

 - 'Généralités' is 11 char length
 - 'Language'     is  8 char length
 - 'Théorème'   is  8 char length
 - 'Février'     is  7 char length
 - 'Left: ←'    is  7 char length
 - 'Yin Yang ☯' is 10 char length

Not really pretty output!

For this, here is a little function:

strU8DiffLen() {
    local -i bytlen
    printf -v _ %s%n "$1" bytlen
    return $(( bytlen - ${#1} ))
}

or written in one line:

strU8DiffLen() { local -i _bl;printf -v _ %s%n "$1" _bl;return $((_bl-${#1}));}

Then now:

for string in Généralités Language Théorème Février  "Left: ←" "Yin Yang ☯";do
    strU8DiffLen "$string"
    printf " - %-*s is %2d chars length, but uses %2d bytes\n" \
        $((14+$?)) "'$string'" ${#string} $((${#string}+$?))
  done

 - 'Généralités'  is 11 chars length, but uses 14 bytes
 - 'Language'     is  8 chars length, but uses  8 bytes
 - 'Théorème'     is  8 chars length, but uses 10 bytes
 - 'Février'      is  7 chars length, but uses  8 bytes
 - 'Left: ←'      is  7 chars length, but uses  9 bytes
 - 'Yin Yang ☯'   is 10 chars length, but uses 12 bytes

Unfortunately, this is not perfect!

Some odd UTF-8 behaviour remains, such as double-width characters, zero-width characters, reverse displacement and others, which are not so simple to handle...

Have a look at diffU8test.sh or diffU8test.sh.txt for more limitations.
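A separate limitation, raised in the comments, is that a function's exit status only covers 0 to 255, so return cannot report larger differences. A minimal sketch of the variant suggested there stores the result in a named variable instead. The function name strU8DiffLenVar is hypothetical, and OFF is simply the default variable name used in that comment:

```bash
strU8DiffLenVar() {
    local -i _bl
    # %n stores the byte count of "$1" in _bl; the output itself is discarded in _.
    printf -v _ %s%n "$1" _bl
    # Assign (bytes - chars) to the variable named by $2, or to OFF if none is given.
    printf -v "${2:-OFF}" %d $(( _bl - ${#1} ))
}

strU8DiffLenVar 'Généralités' diff
echo "$diff"
```

In a UTF-8 locale this should print 3 for 'Généralités'; in a single-byte locale the difference is always 0.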

Comparison: fork to `wc` vs pure bash:

A little loop of 1,000 string length queries:

string="Généralité"
time for i in {1..1000};do strlens=$(echo -n "$string"|wc -mc);done;echo $strlens

real    0m2.637s
user    0m2.256s
sys 0m0.906s
10 13

string="Généralité"
time for i in {1..1000};do printf -v _ %s%n "$string" bytlen;chrlen=${#string};done;echo $chrlen $bytlen

real    0m0.005s
user    0m0.005s
sys 0m0.000s
10 13

As expected, the result (10 13) is the same, but the execution times differ a lot: pure bash is roughly 500 times faster here!

Eda answered 23/6, 2015 at 17:50 Comment(20)

I appreciate this answer, as file systems impose name limitations in bytes and not characters. – Motteo 14/11, 2016 at 18:33

You may also need to set LC_ALL=C and perhaps others. – Surmullet 29/12, 2016 at 1:49

@IsabellCowan In which case? I think not! You may prefer to use LC_ALL, but if it is not set, this is not needed. And no other variable has to be set. – Eda 29/12, 2016 at 7:22

@F.Hauri try this code: /usr/bin/env -i LC_ALL=en_US.utf8 LANG=C bash -c 'v=€; echo ${#v}' LC_ALL might be unset by default on your system, but it is not on mine. – Surmullet 30/12, 2016 at 20:18

@IsabellCowan Yes, see man 7 locale, LC_ALL have precedence over all others. It's the reason I follow Debian rules, having LC_ALL= somewhere and change LANG only, by default (It could be very usefull to be able to just change LC_CTIME or LC_NUMERIC).. – Eda 31/12, 2016 at 0:23

@F.Hauri But, it none the less follows that on some systems your solution will not work, because it leaves LC_ALL alone. It might work fine on default installs of Debian and it's derivatives, but on others (like Arch Linux) it will fail to give the correct byte length of the string. – Surmullet 3/1, 2017 at 18:49

It didn't work for me and I couldn't find out why. I succeeded using iconv like this: STR=$(printf "$1" | iconv -f UTF-8 -t ISO-8859-15), and then ${#STR} worked well – Gross 20/4, 2017 at 11:52

@F.Hauri GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu) I don't have the admin rights on the server, i tried the examples you gave and i always got the byte length. I'm trying this from a .sh file encoding in UTF-8.. – Gross 20/4, 2017 at 13:45

thanks for taking something simple and convoluting it :) – Genipap 6/11, 2018 at 16:45

Just to note that UTF8 is a variable width encoding from 1 to 6 bytes cf. other encodings i.e. UTF16 which is a fixed width 2 byte per character. – Periphrastic 6/11, 2018 at 18:22

A UTF8 encoded Oracle DB instance allows nvarchar2(4000) data types (4000 bytes, each character stored on 1 to 6 bytes) whereas a UTF16 encoded instance only allows for nvarchar2(2000) data types (4000 bytes, 2 bytes per character). Ex. UTF8 string truncation depends on the number of bytes required to store the data which is not necessarily (and most often not the case when dealing with internationalised software) equal to the number of characters. – Periphrastic 6/11, 2018 at 18:37

@Periphrastic Yes ☯ and ← will require 3 bytes, where é and ô require 2 bytes and a or z only 1 byte... – Eda 6/11, 2018 at 20:39

@Genipap I'm sorry, 對不起 Sometime simple is just an idea. – Eda 6/11, 2018 at 20:43

@F.Hauri correct for UTF8. Encoded in UTF16 each character ("☯", "←", "é", "ô", "a" and "z") is encoded with a fixed 2 bytes. If assuming that all text is ASCII then any mention of UTF8 is "good to know" but not necessary for say as it's 8-bit ASCII and the code points are identical in UTF8. Having taken the time to delve into encodings then it's worth while imho to note that the byte count is encoding dependent and there exists a plethora of different encodings. – Periphrastic 6/11, 2018 at 22:5

@Genipap previous chinese post warn me about another problem. see posted test script about limitation (bug) of this: diffU8test.sh.txt or diffU8test.sh – Eda 3/4, 2019 at 14:52

You can't necessarily guarantee that the default locale is UTF-8. To make sure you get character length rather than byte length, you may want to set LC_ALL=C.UTF-8 and LANG=C.UTF-8. – Bower 21/8, 2020 at 15:18

@nyuszika7h You're right. Anyway, in most cases my strU8DiffLen will return the correct difference. If the current session uses a Latin encoding, strU8DiffLen will always return 0, which is correct too. – Eda 21/8, 2020 at 20:44

It's worth to mention that the function strU8DiffLen will fail if $(( bytlen - ${#1} )) is greater than 255. Why not just printf the result and call the function inside a sub-shell? Related: gnu.org/software/bash/manual/html_node/Exit-Status.html – Declare 7/3, 2021 at 19:27

@F8ER In order to prevent forks. For example: replacing return with echo, adding OFF=$(strU8DiffLen....) and replacing ? with OFF in the last sample takes 10ms on my host, whereas the published version does the job in 1ms (10x faster!). – Eda 7/3, 2021 at 19:38

@F8ER If you mind using return, you could replace it with printf -v ${2:-OFF} %d $(( bytlen - ${#1} )), then use $OFF or any other variable by passing its name as the second argument. – Eda 7/3, 2021 at 19:43

M

659

To get the length of a string stored in a variable, say:

myvar="some string"
size=${#myvar}

To confirm it was properly saved, echo it:

$ echo "$size"
11
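The stored length is an ordinary integer, so it can be used in tests and arithmetic. A short sketch (the variable limit and its value of 8 are arbitrary, chosen only for illustration):

```bash
myvar="some string"
size=${#myvar}
limit=8    # arbitrary threshold for this example

if [ "$size" -gt "$limit" ]; then
    echo "too long by $(( size - limit )) characters"
fi
```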

Mindamindanao answered 28/6, 2013 at 15:15 Comment(4)

With UTF-8 strings, you could have a string length and a byte length. See my answer – Eda 23/6, 2015 at 17:59

You can also use it directly in other parameter expansions - for example in this test I check that $rulename starts with the $RULE_PREFIX prefix: [ "${rulename:0:${#RULE_PREFIX}}" == "$RULE_PREFIX" ] – Evert 21/7, 2015 at 14:13

Could you please explain a bit the expressions of #myvar and {#myvar}? – Valenta 19/9, 2016 at 6:3

@lerneradams see Bash reference manual →3.5.3 Shell Parameter Expansion on ${#parameter}: The length in characters of the expanded value of parameter is substituted. – Mindamindanao 21/10, 2016 at 14:31
