Count unique values in Stata

4

6

codebook is a great command in Stata. It describes the contents of the data, and it also reports the number of unique values of each variable:

sysuse auto, clear
codebook mpg, compact

The number of unique values of mpg is 21. Looking at the help for the command, it does not seem possible to store this value. Am I wrong?

I am aware of other ways to compute the number of unique values in Stata, but it would be so convenient to add this feature to the codebook command.

Catarrhine answered 7/5, 2016 at 17:22 Comment(2)
codebook doesn't save the number of what it reports as unique values. For a review of this territory, see stata-journal.com/sjpdf.html?articlenum=dm0042 - Homecoming
Thanks Nick for this useful reference. - Catarrhine
5

You can easily write a wrapper for codebook that uses Nick's distinct command from SSC to store the info you want as scalar(s).

In my experience, this wrapper approach has proven to be much more effective than asking the nice folks at StataCorp to change their command on an internet forum that they do not participate in.

Here's an example:

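* (1) You can stick this into a file called mycodebook.ado in
* /ado/personal (use adopath to see exact location)
capture program drop mycodebook
program mycodebook, rclass
    syntax [varlist] [if] [in] [, *]
    * run codebook as usual, passing any options through
    codebook `varlist' `if' `in', `options'
    * install distinct from SSC only if it is not already available
    capture which distinct
    if _rc ssc install distinct
    * save the number of distinct values of each variable as r(nv_varname)
    foreach var of varlist `varlist' {
        quietly distinct `var' `if' `in'
        return scalar nv_`var' = r(ndistinct)
    }
end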

* (2) example with mycodebook
sysuse auto, clear
mycodebook price mpg rep78 if foreign==0, compact
return list

This last part will give you:

. mycodebook price mpg rep78 if foreign==0, compact

Variable   Obs Unique      Mean   Min    Max  Label
----------------------------------------------------------------------------------
price       52     52  6072.423  3291  15906  Price
mpg         52     17  19.82692    12     34  Mileage (mpg)
rep78       48      5  3.020833     1      5  Repair Record 1978
----------------------------------------------------------------------------------

. return list

scalars:
           r(nv_rep78) =  5
             r(nv_mpg) =  17
           r(nv_price) =  52

You can then do things like (or whatever it is you want to do with these):

gen x=r(nv_rep78)
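
One thing to watch when using these results: r() is overwritten by the next r-class command, so it is safer to copy a value into a local macro right after the call. A minimal sketch, assuming mycodebook is defined as above and distinct is installed (the local name n_rep78 is only illustrative):

```stata
sysuse auto, clear
mycodebook price mpg rep78 if foreign==0, compact

* copy the result immediately: the next r-class command replaces r()
local n_rep78 = r(nv_rep78)

* summarize is r-class, so r(nv_rep78) is no longer available after this line
summarize price, meanonly

* the local macro still holds the count
display "distinct values of rep78 (domestic cars): `n_rep78'"
```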
Multiphase answered 20/5, 2016 at 1:21 Comment(1)
The latest version of distinct (authors Gary Longton and myself) is to be downloaded from the Stata Journal website. search distinct, sj in Stata to get a link for installation. - Homecoming
4

A convenient alternative is provided by the "unique" package. Here's a quick example:

* Install the unique package
ssc inst unique

* Load toy dataset
sysuse auto, clear

* Get a quick report of unique (and total) values for a variable
unique mpg

* The result will be available as r(unique)
return list
Niggerhead answered 30/4, 2021 at 18:5 Comment(1)
unique clearly can be useful, but this doesn't address the question, which is about extending or modifying codebook. - Homecoming
0

At least in my application, this is considerably faster (though not more elegant) than distinct and nvals:

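* flag the first observation within each distinct value of mpg
bys mpg: gen one = 1 if _n==1
* r(N) counts the flagged observations, i.e. the distinct values
qui sum one, meanonly
local N=r(N)
* formatted copy (with thousands separators) in a global macro
global N: di %12.0fc `N'
drop one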
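
The same idea can be wrapped in a small r-class program so that it does not need a permanent helper variable and accepts if and in. This is only a sketch: the program name countdistinct and the result name r(ndistinct) are made up for illustration, and, like the snippet above, it counts missing values as a distinct value and re-sorts the data.

```stata
capture program drop countdistinct
program countdistinct, rclass
    syntax varname [if] [in]
    * mark the if/in sample only; novarlist keeps observations with missing values
    marksample touse, novarlist
    tempvar first
    * flag the first observation of each distinct value within the sample
    quietly bysort `touse' `varlist': gen byte `first' = (_n == 1) if `touse'
    * the number of flags is the number of distinct values
    quietly count if `first' == 1
    return scalar ndistinct = r(N)
end

* example
sysuse auto, clear
countdistinct mpg
display r(ndistinct)
```

For a variable without missing values, such as mpg in the auto data, the result should agree with the Unique column of codebook, compact.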
Arlina answered 3/6, 2022 at 20:33 Comment(4)
You will speed that up by using the meanonly option with summarize. - Homecoming
Did you compare distinct (Stata Journal)? - Homecoming
@NickCox 1) I did compare distinct. 2) I made a mistake and accidentally wrote codebook when I meant distinct. I edited my answer to fix this mistake. Sorry! 3) Thanks for the meanonly tip! I also edited my answer to include this. 4) For reference, my approach (with meanonly) on a dataset with ~20M observations and ~3.7M distinct values takes 0.73 seconds, while distinct takes 1.89 seconds. - Arlina
Look at the code for distinct: it uses the same idea. - Homecoming
-1

Yes, you are wrong:

ssc install egenmore
egen unique_values=nvals(mpg)
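
Note that nvals() creates a variable holding the count rather than a stored result. If a single number is wanted, one option is to read it back with summarize and then drop the variable. A minimal sketch, assuming egenmore is installed (the names nv_mpg and n_mpg are only illustrative):

```stata
sysuse auto, clear

* nvals() puts the number of distinct values of mpg in the new variable
egen nv_mpg = nvals(mpg)

* read the value back; r(max) is used so that missing entries do not matter
summarize nv_mpg, meanonly
local n_mpg = r(max)
drop nv_mpg

display "distinct values of mpg: `n_mpg'"
```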
Parris answered 19/5, 2016 at 21:52 Comment(6)
I am afraid I am not wrong! Thanks, but I am interested in storing this value after codebook, and that does not seem possible. See Nick's answer. - Catarrhine
You can basically reproduce each part of codebook's output manually, and nvals gives you the unique values. - Onstad
This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - From Review - Narine
@David 'mArm' Ansermot It is an answer. Generating a new variable with the number of distinct values is an alternative to codebook. This is concise, but not cryptic if you read the documentation for the package mentioned. As the original author of both code solutions mentioned so far in this thread, I'd emphasise a different answer personally, but Noobie is answering here. - Homecoming
@NickCox While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. meta.#301337 - Narine
I don't disagree with that, but the assertion made "does not provide an answer" is too harsh and (I have to suggest) based on seeing a brief answer rather than knowing a lot about this language. I don't see that people answering a question are obliged to remind others of basics about a language, in this case to read the help page associated with a command or package. - Homecoming
