Generate variable containing number of characters in a string variable

In a survey dataset I have a string variable (type: str244) with qualitative responses. I want to count the number of characters in each response/string and generate a new variable containing this number.

Using the egenmore package, I have already counted the number of words with nwords(), but I cannot find the counterpart for counting characters.

EXAMPLE:

egen countvar = nwords(stringvar)

where countvar is the new variable name and stringvar is the string variable.

Does such an egen function exist for counting characters?

Potiche answered 5/8, 2015 at 17:48 Comment(3)
The function wordcount() in Stata makes the older add-on nwords() redundant. Note egenmore is downloaded using ssc inst egenmore. - Deliquesce
The help for egenmore does point to wordcount(). N.B. nwords() (written for Stata 6) is very slow. - Deliquesce
Thank you for mentioning this. gen countvar = wordcount(stringvar) works like a charm. I wasn't aware that wordcount() was used with gen, not egen. Perfect! - Potiche
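
As a minimal sketch of the gen approach from the comments above, using the auto dataset that ships with Stata (the variable name nwords is only illustrative):

```stata
sysuse auto, clear

* wordcount() is a built-in string function, so it is used with generate, not egen
generate nwords = wordcount(make)

* inspect the first few results
list make nwords in 1/5
```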

There is no egen function because there has long been a function in the strict sense that does this. In recent versions of Stata, the function is called strlen(), but the older name length() continues to work:

. sysuse auto
(1978 Automobile Data)

. gen l1 = length(make)

. gen l2 = strlen(make)

. su l?

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          l1 |         74    11.77027    2.155257          6         17
          l2 |         74    11.77027    2.155257          6         17

See help functions and (e.g.) this tutorial column.
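
One caveat worth checking against your own Stata version: from Stata 14 onward strings are stored as UTF-8, and strlen() counts bytes, while ustrlen() counts Unicode characters. For plain ASCII responses the two agree. The sketch below uses made-up data and illustrative variable names (response, nbytes, nchars), and assumes the do-file is saved as UTF-8:

```stata
clear
input str20 response
"cafe"
"café"
end

* strlen() counts bytes: the accented e takes two bytes in UTF-8
generate nbytes = strlen(response)

* ustrlen() (Stata 14 or later) counts Unicode characters instead
generate nchars = ustrlen(response)

list
```

If the survey responses may contain accented letters or other non-ASCII text, ustrlen() is probably closer to what "number of characters" means.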

Deliquesce answered 5/8, 2015 at 18:0 Comment(4)
What about counting digits in a numeric variable? - Ballance
That's really a new question, as there are subtle differences. Do you mean integers, or do you include decimal parts? If you mean integers, log10(x) + 1 is a good start. If you include numbers with decimal parts, the question is a lot messier without knowing a display format. - Deliquesce
log10(x)+1 breaks down with larger numbers (at least it does in the console). You need to wrap log10(x) in floor() before adding 1. Try these two 9-digit numbers:
di log10(999999999)+1
di log10(999099999)+1
di floor(log10(999999999))+1
di floor(log10(999099999))+1
- Jordan
@Jordan That's the kind of detail that made me say "a good start". - Deliquesce
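
A minimal sketch of the floor() variant discussed in these comments, restricted to positive integers. The variable names ndigits and ndigits_chk are illustrative, and values that are exact powers of 10 depend on log10() returning an exact integer, so treat this as a starting point rather than a guarantee:

```stata
sysuse auto, clear

* price holds positive integers, so log10() is defined for every observation
generate ndigits = floor(log10(price)) + 1 if price > 0

* cross-check: convert with a fixed, non-scientific format and count characters
generate ndigits_chk = strlen(trim(string(price, "%20.0f")))

* stops with an error if the two approaches ever disagree
assert ndigits == ndigits_chk
```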
. sysuse auto,clear
(1978 Automobile Data)

. tostring price, gen(price1)
price1 generated as str5

. gen l3=length(price1)

. sum l3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          l3 |         74    4.135135    .3442015          4          5
Quickman answered 6/9, 2020 at 11:31 Comment(2)
In case you want the character count of a numeric variable. - Quickman
This has to seem naive. See my comment underneath my answer. The "length" of a numeric variable is well defined only in certain cases. In your example, price is reported as a positive integer, and for that you don't need to convert to a string variable. You just need to push the maximum value through ceil(log10()). Your code could be problematic for variables in which any numeric value was negative or contained fractional parts, depending on precision issues and what you want precisely. - Deliquesce
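
A hedged sketch of the string-free route for a positive integer variable: take the maximum from summarize and pass it through log10(). It uses floor(log10()) + 1 rather than ceil(log10()), since the two differ when the maximum is an exact power of 10 (for 1000, ceil(log10()) gives 3 although the number has 4 digits):

```stata
sysuse auto, clear

* meanonly skips the variance calculation but still stores r(max)
summarize price, meanonly

* digit count of the largest price
display floor(log10(r(max))) + 1
```

For price this should agree with the maximum of 5 reported for l3 in the answer above.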
