Summary Statistics table with factors and continuous variables

I am trying to create a simple summary statistics table (min, max, mean, n, etc.) that handles both factor variables and continuous variables, even when there is more than one factor variable. I'd like good-looking HTML output, e.g. from stargazer or huxtable.

For a simple reproducible example, I'll use mtcars but change two of the variables to factors, and simplify to three variables.

library(tidyverse)
library(stargazer)

mtcars_df <- mtcars
mtcars_df <- mtcars_df %>% 
  mutate(vs = factor(vs),
         am = factor(am)) %>% 
  select(mpg, vs, am)
head(mtcars_df)

So the data has two factor variables, vs and am. mpg is left as a double:

#>     mpg     vs     am
#>   <dbl> <fctr> <fctr>
#> 1  21.0      0      1
#> 2  21.0      0      1
#> 3  22.8      1      1
#> 4  21.4      1      0
#> 5  18.7      0      0
#> 6  18.1      1      0

My desired output would look something like this (format only, the numbers aren't all correct for am0):

======================================================
Statistic N   Mean  St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg       32 20.091  6.027   10    15.4     22.8   34 
vs0       32 0.562   0.504    0     0        1      1 
vs1       32 0.438   0.504    0     0        1      1 
am0       32 0.594   0.499    0     0        1      1 
am1       32 0.406   0.499    0     0        1      1 
------------------------------------------------------

A direct call to stargazer drops the factor columns entirely (a workaround that handles a single factor is shown further down):

# this doesn't give factors
stargazer(mtcars_df, type = "text")
======================================================
Statistic N   Mean  St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg       32 20.091  6.027   10    15.4     22.8   34 
------------------------------------------------------

This previous answer from @jake-fisher works very well for summarising one factor variable:
https://mcmap.net/q/1708362/-output-each-factor-level-as-dummy-variable-in-stargazer-summary-statistics-table

The code below, taken from that answer, gives both levels of the first factor, vs (i.e. vs0 and vs1), but for the second factor, am, it only lists summary statistics for one level: am0 is missing.

I do realise that this is because we want to avoid the dummy variable trap when modeling, but my issue is not about modeling, it's about creating a summary table with all values of all factor variables.

options(na.action = "na.pass")  # so that we keep missing values in the data
X <- model.matrix(~ . - 1, data = mtcars_df)
X.df <- data.frame(X)  # stargazer only does summary tables of data.frame objects
#names(X) <- colnames(X)
stargazer(X.df, type = "text")

======================================================
Statistic N   Mean  St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg       32 20.091  6.027   10    15.4     22.8   34 
vs0       32 0.562   0.504    0     0        1      1 
vs1       32 0.438   0.504    0     0        1      1 
am1       32 0.406   0.499    0     0        1      1 
------------------------------------------------------
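
To illustrate why am0 disappears, here is a small base-R sketch (not part of the original post; the names df, cols_vs_first and cols_am_first are made up for illustration). It assumes only base R and the built-in mtcars data, and shows that without an intercept model.matrix() keeps every level of the first factor in the formula only, so swapping the order of vs and am changes which level goes missing.

```r
# Rebuild the example data with base R only (no tidyverse needed here).
df <- data.frame(
  mpg = mtcars$mpg,
  vs  = factor(mtcars$vs),
  am  = factor(mtcars$am)
)

# With the intercept removed, only the first factor in the formula keeps
# all of its levels; later factors lose their baseline level.
cols_vs_first <- colnames(model.matrix(~ mpg + vs + am - 1, data = df))
cols_am_first <- colnames(model.matrix(~ mpg + am + vs - 1, data = df))

print(cols_vs_first)  # vs0 and vs1 are present, am0 is not
print(cols_am_first)  # am0 and am1 are present, vs0 is not
```

So the missing row is a consequence of the default contrast coding rather than of anything in stargazer.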

While use of stargazer or huxtable would be preferred, if there's an easier way to produce this sort of summary table with a different library, that would still be very helpful.

Asphaltite answered 11/6, 2020 at 0:7 Comment(5)
How will you calculate summary stats for factor variables? - Tabard
@RonakShah I'm hoping to expand and one-hot encode all of the factors, as in the examples above (e.g. vs0, vs1), so that the mean shows what proportion of vs is == 0 and == 1. For factors with more levels, I'd create more dummies, e.g. from mtcars: cyl4, cyl6, cyl8. - Asphaltite
Have you tried skimr? - Judsen
Not HTML format, but epiDisplay::codebook(mtcars_df) gives appropriate summaries of numeric variables and factors. - Iatric
gtsummary might be helpful and has both gt and huxtable output. - Holcman
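
As a quick check of the idea in the comments that the mean of a 0/1 dummy column is the proportion of rows at that level, here is a minimal base-R sketch. It is an illustration only; the names dummies, dummy_means and level_props are hypothetical and do not come from the thread.

```r
vs <- factor(mtcars$vs)

# Build one 0/1 indicator column per factor level by hand.
dummies <- sapply(levels(vs), function(lv) as.numeric(vs == lv))
colnames(dummies) <- paste0("vs", levels(vs))

# Column means of the indicators ...
dummy_means <- colMeans(dummies)

# ... should match the share of rows at each level.
level_props <- prop.table(table(vs))

print(dummy_means)
print(level_props)
```

The two printed vectors should agree, which is why the Mean column of the summary table can be read as a share for each factor level.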
A

In the end, instead of using model.matrix(), which by default drops the base level of every factor after the first when creating dummy variables, a simple fix is to use mlr::createDummyFeatures(), which creates a dummy for every level, including the base level.

library(tidyverse)
library(stargazer)
library(mlr)

mtcars_df <- mtcars
mtcars_df <- mtcars_df %>% 
  mutate(vs = factor(vs),
         am = factor(am)) %>% 
  select(mpg, vs, am)
head(mtcars_df)


X <- mlr::createDummyFeatures(obj = mtcars_df)
X.df <- data.frame(X)  # stargazer only does summary tables of data.frame objects
#names(X) <- colnames(X)
stargazer(X.df, type = "text")

This gives the desired output (note that the column names use a dot separator, e.g. vs.0 rather than vs0):

======================================================
Statistic N   Mean  St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg       32 20.091  6.027   10    15.4     22.8   34 
vs.0      32 0.562   0.504    0     0        1      1 
vs.1      32 0.438   0.504    0     0        1      1 
am.0      32 0.594   0.499    0     0        1      1 
am.1      32 0.406   0.499    0     0        1      1 
------------------------------------------------------
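
If you would rather avoid the extra mlr dependency, the following is a possible base-R alternative, offered as a sketch rather than as part of the original answer. It assumes that passing identity codings through the contrasts.arg argument of model.matrix() is acceptable for a purely descriptive table; is_fac and full_coding are illustrative names.

```r
df <- data.frame(
  mpg = mtcars$mpg,
  vs  = factor(mtcars$vs),
  am  = factor(mtcars$am)
)

# For every factor column, request an identity ("no contrasts") coding so
# that model.matrix() emits one indicator column per level.
is_fac <- vapply(df, is.factor, logical(1))
full_coding <- lapply(df[is_fac], contrasts, contrasts = FALSE)

X <- model.matrix(~ . - 1, data = df, contrasts.arg = full_coding)
X.df <- data.frame(X)  # a plain data.frame, as stargazer expects

print(colnames(X.df))
print(colMeans(X.df))
```

X.df can then be passed to stargazer(X.df, type = "text") exactly as above. The column names come out as vs0, vs1, am0 and am1, without the dot separator.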
Asphaltite answered 11/6, 2020 at 0:59 Comment(1)
Hi Jeremy! Your answer is still helping people 4 years later! I followed your comment on @jake-fisher's previous answer all the way here and finally solved my problem. Thanks! - Adalia
