R Language: How do I print / see summary statistics for sample subset?

Asked 29/1, 2011 at 8:11 Answered 7/11, 2016 at 18:54

These are some newbie questions about statistical programming for R for which I haven't been able to find an answer online. My dataframe is labeled "eitc" in the code below.

1) Once I've loaded in a data frame, I would like to look at summary statistics. I've used the functions:

eitc <- read.dta(file="/Users/Documents/eitc.dta")
summary(eitc)
sapply(eitc,mean,na.rm=TRUE) #for sample mean, min, max, etc.

How do I find summary statistics on my dataframe when certain conditions are met? For example, I would like to see the summary statistics on all variables when the variable "children" is greater than or equal to 1. The equivalent Stata code is:

summarize if children >= 1

2) Similarly, how do I find specific parameters when certain qualifications are met? For example, I want to find the mean of the variable "work" when both "post93" variable is equal to zero and "anykids" variable is equal to 1. The equivalent Stata code is:

mean work if post93==0 & anykids==1

3) Ideally, when I run the summary statistics above, I would like to find out how many observations were included in the calculation / fit the criteria.

4) When I read in my data frame, it would also be nice to see how many observations are included in the data set (and perhaps how many rows have missing values or "NA" in them).

5) Also, I have been creating dummy variables using the following code. Is this the correct way to do it or is there a more efficient route?

post93.dummy <- as.numeric(eitc$year>1993)
eitc=cbind(eitc,post93.dummy)

Bludge answered 29/1, 2011 at 8:11 Comment(1)

Welcome to StackOverflow. Well done for including some code and a good description, but if I could give two points to be helpful: 1) try to keep to one question per post, even if that means posting four questions at once, and 2) read r-bloggers.com/… which has great advice on posting self-contained, simple code examples. – Venetian 29/1, 2011 at 10:58

A lot of your requirements are answered by subset, e.g.

summary(subset(eitc, post93 == 0 & anykids == 1, select=work))
nrow(subset(eitc, post93 == 0 & anykids == 1, select=work)) # for number of obs.

The ?subset documentation has good examples.
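One detail worth knowing when the filter column contains missing values: subset() treats an NA condition as FALSE, while plain bracket indexing returns an all-NA row for it. The sketch below uses a small made-up data frame (toy and its values are hypothetical stand-ins for eitc) to show the difference in row counts.

```r
# Hypothetical toy data standing in for eitc
toy <- data.frame(children = c(0, 1, 2, NA, 3),
                  work     = c(1, 0, 1, 1, NA))

by_subset  <- subset(toy, children >= 1)        # NA condition is treated as FALSE
by_bracket <- toy[toy$children >= 1, ]          # NA condition yields an all-NA row
by_which   <- toy[which(toy$children >= 1), ]   # which() drops the NA, like subset()

nrow(by_subset)    # 3
nrow(by_bracket)   # 4
nrow(by_which)     # 3

summary(by_subset) # summary statistics on the rows that really met the condition
```

If the observation count is meant to match the rows that actually satisfied the condition, subset() or which() is the safer choice.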

The cbind method of attaching dummy variables is unnecessary. Just do:

eitc$post93.dummy <- as.numeric(eitc$year>1993)

Gainless answered 29/1, 2011 at 8:51 Comment(0)

I'll use the mtcars data available in the datasets package. See ?mtcars.

Ad 1. You can see the summary of mtcars when gear is greater than 3:

summary(mtcars[mtcars$gear > 3, ])
## or by using Tukey's five number summary
sapply(mtcars[mtcars$gear > 3, ], fivenum)

Ad 2. Use with:

with(mtcars, mean(hp[gear > 3 & mpg > 20]))

Ad 3. Ibid (but use length):

with(mtcars, length(hp[gear > 3 & mpg > 20]))
## or
sapply(mtcars[mtcars$gear > 3, ], length) ## which is trivial when there are no NA's
sapply(mtcars[mtcars$gear > 3, ], function(x) sum(!is.na(x))) ## counts non-missing values, good when there are NA's
nrow(mtcars[mtcars$gear > 3, ])

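Ad 4. See previous, but to find out "how many rows have missing values or "NA" in them", do something like this:

sum(!complete.cases(dtf)) ## number of rows containing at least one NA
apply(dtf, 1, function(x) sum(is.na(x))) ## number of NA's in each row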
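As a small worked sketch of the missing-value counts, here is a made-up data frame (the values in dtf are hypothetical) with the base R calls that count NA's per row, per column, and the number of incomplete rows:

```r
# Hypothetical data frame; dtf is the placeholder name used above
dtf <- data.frame(a = c(1, NA, 3, 4),
                  b = c("x", "y", NA, "z"),
                  c = c(NA, NA, 1, 2))

nrow(dtf)                                   # total number of observations

na_per_row   <- rowSums(is.na(dtf))         # NA count in each row: 1 2 1 0
na_per_col   <- colSums(is.na(dtf))         # NA count in each column: a = 1, b = 1, c = 2
rows_with_na <- sum(!complete.cases(dtf))   # rows with at least one NA: 3
```

rowSums(is.na(dtf)) gives the same numbers as an apply() over rows and is usually faster on large data frames.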

Ad 5. This is not a dummy variable, this is some kind of subset of original data, columnwise concatenated. What are you trying to achieve anyway?

Please be concise. One question per question, please!

Turpin answered 29/1, 2011 at 10:37 Comment(0)

I would recommend you look at the plyr package for generating summaries. Here's some quick code (not run):

library(plyr)

# Generate a new factor with 5 levels based on the numeric value of children
eitc$childfac <- cut(eitc$children, 5)

# Generate mean and sd of the variables foo and bar based on that factor
ddply(eitc, .(childfac), function(df) {
  data.frame(meanfoo = mean(df$foo), sdfoo = sd(df$foo),
             meanbar = mean(df$bar), sdbar = sd(df$bar))
})
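If plyr is not installed, roughly the same grouped mean and standard deviation can be sketched in base R with aggregate(). This swaps in a different technique from the one above, and it uses mtcars as a stand-in for eitc (cyl plays the role of the grouping factor, hp and mpg the roles of foo and bar):

```r
# mtcars ships with base R and stands in for eitc here
means <- aggregate(cbind(hp, mpg) ~ cyl, data = mtcars, FUN = mean)
sds   <- aggregate(cbind(hp, mpg) ~ cyl, data = mtcars, FUN = sd)

# One row per group, with columns hp.mean, mpg.mean, hp.sd, mpg.sd
grouped <- merge(means, sds, by = "cyl", suffixes = c(".mean", ".sd"))
grouped
```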

You might also want to look at the Hmisc and psych packages for more descriptive statistics routines. (Check out Quick-R for more info.)

Venetian answered 29/1, 2011 at 10:54 Comment(0)

Here's how you might quickly display some summary statistics for a subset of your data using data.table.

library(data.table)

dt <- data.table(mtcars)

var.names <- c("cyl", "disp", "hp")
dt[mpg > 20, 
   list(name=var.names, N=.N, mean=lapply(.SD, mean), sd=lapply(.SD, sd)), 
   .SDcols=var.names]

You can use model.matrix for creating dummy variables, see here.
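A minimal sketch of both approaches to dummy variables, using a made-up year column (toy and its values are hypothetical): a single indicator only needs a logical comparison, while model.matrix() is handy when a factor should be expanded into one indicator column per level.

```r
# Hypothetical toy data with a year column like the one in eitc
toy <- data.frame(year = c(1991, 1993, 1994, 1996))

# A single 0/1 indicator: compare, then convert the logical to numeric
toy$post93 <- as.numeric(toy$year > 1993)   # 0 0 1 1

# One indicator column per level of a factor;
# "- 1" drops the intercept so every level gets its own column
toy$year_f <- factor(toy$year)
dummies <- model.matrix(~ year_f - 1, data = toy)

colnames(dummies)  # "year_f1991" "year_f1993" "year_f1994" "year_f1996"
```

Each row of dummies contains exactly one 1, marking the level that row belongs to.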

Omega answered 7/11, 2016 at 18:54 Comment(0)
