R - describe() output to a data frame

I want to create a data frame from the output of the describe() function (Hmisc package). The dataset under consideration is iris. The data frame should look like this:

Variable       n    missing  unique  Info  Mean   0.05   0.1  0.25  0.5   0.75  0.9   0.95
Sepal.Length   150  0        35      1     5.843  4.6    4.8  5.1   5.8   6.4   6.9   7.255
Sepal.Width    150  0        23      0.99  3.057  2.345  2.5  2.8   3     3.3   3.61  3.8
Petal.Length   150  0        43      1     3.758  1.3    1.4  1.6   4.35  5.1   5.8   6.1
Petal.Width    150  0        22      0.99  1.199  0.2    0.2  0.3   1.3   1.8   2.2   2.3
Species        150  0        3

Is there a way to coerce the output of describe() to a data.frame? When I try to coerce it, I get the error shown below:

library(Hmisc)
statistics <- describe(iris)
statistics[1]
first_vec <- statistics[1]$Sepal.Length
as.data.frame(first_vec)
# Error in as.data.frame.default(first_vec) : cannot coerce class "describe" to a data.frame

Thanks

Macfarlane answered 19/6, 2016 at 14:55 Comment(2)
You should modify the code for describe.vector and alter it so that it produces numeric output of a constant length. - Ingeingeberg
@akrun - the table in my post is the expected output. Thank you for your input. - Macfarlane
V
7

The way to figure this out is to examine the objects with str():

data(iris)
library(Hmisc)
di <- describe(iris)
di
# iris 
# 
# 5  Variables      150  Observations
# -------------------------------------------------------------
# Sepal.Length 
#       n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95 
#     150       0      35       1   5.843   4.600   4.800   5.100   5.800   6.400   6.900   7.255
# 
# lowest : 4.3 4.4 4.5 4.6 4.7, highest: 7.3 7.4 7.6 7.7 7.9 
# -------------------------------------------------------------
# ...
# -------------------------------------------------------------
# Species 
#       n missing  unique 
#     150       0       3 
# 
# setosa (50, 33%), versicolor (50, 33%) 
# virginica (50, 33%) 
# -------------------------------------------------------------
str(di)
# List of 5
# $ Sepal.Length:List of 6
# ..$ descript    : chr "Sepal.Length"
# ..$ units       : NULL
# ..$ format      : NULL
# ..$ counts      : Named chr [1:12] "150" "0" "35" "1" ...
# .. ..- attr(*, "names")= chr [1:12] "n" "missing" "unique" "Info" ...
# ..$ intervalFreq:List of 2
# .. ..$ range: atomic [1:2] 4.3 7.9
# .. .. ..- attr(*, "Csingle")= logi TRUE
# .. ..$ count: int [1:100] 1 0 3 0 0 1 0 0 4 0 ...
# ..$ values      : Named chr [1:10] "4.3" "4.4" "4.5" "4.6" ...
# .. ..- attr(*, "names")= chr [1:10] "L1" "L2" "L3" "L4" ...
# ..- attr(*, "class")= chr "describe"
# $ Sepal.Width :List of 6
# ...
# $ Species     :List of 5
# ..$ descript: chr "Species"
# ..$ units   : NULL
# ..$ format  : NULL
# ..$ counts  : Named num [1:3] 150 0 3
# .. ..- attr(*, "names")= chr [1:3] "n" "missing" "unique"
# ..$ values  : num [1:2, 1:3] 50 33 50 33 50 33
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr [1:2] "Frequency" "%"
# .. .. ..$ : chr [1:3] "setosa" "versicolor" "virginica"
# ..- attr(*, "class")= chr "describe"
# - attr(*, "descript")= chr "iris"
# - attr(*, "dimensions")= int [1:2] 150 5
# - attr(*, "class")= chr "describe"

We see that di is a list of lists. We can take it apart by looking at just the first sublist. You can convert that into a vector:

unlist(di[[1]])
#             descript              counts.n 
#       "Sepal.Length"                 "150" 
#       counts.missing         counts.unique 
#                  "0"                  "35" 
#          counts.Info           counts.Mean 
#                  "1"               "5.843" 
#           counts..05            counts..10 
#              "4.600"               "4.800" 
#           counts..25            counts..50 
#              "5.100"               "5.800" 
#           counts..75            counts..90 
#              "6.400"               "6.900" 
#           counts..95   intervalFreq.range1 
#              "7.255"                 "4.3" 
#  intervalFreq.range2   intervalFreq.count1 
#                "7.9"                   "1" 
#  ...
#            values.H3             values.H2 
#                "7.6"                 "7.7" 
#            values.H1 
#                 "7.9" 
str(unlist(di[[1]]))
# Named chr [1:125] "Sepal.Length" "150" "0" "35" ...
# - attr(*, "names")= chr [1:125] "descript" "counts.n" "counts.missing" "counts.unique" ...

It is very long (125 elements). The elements have all been coerced to the same (and most inclusive) type, namely character. It seems you want the 2nd through 13th elements, which are the entries of counts:

unlist(di[[1]])[2:13]
#     counts.n counts.missing  counts.unique    counts.Info 
#        "150"            "0"           "35"            "1" 
#  counts.Mean     counts..05     counts..10     counts..25 
#      "5.843"        "4.600"        "4.800"        "5.100" 
#   counts..50     counts..75     counts..90     counts..95 
#      "5.800"        "6.400"        "6.900"        "7.255" 

Now you have something you can start to work with. But notice that this only holds for the numeric variables; the factor variable Species is different:

unlist(di[[5]])
#     descript       counts.n counts.missing  counts.unique 
#    "Species"          "150"            "0"            "3" 
#      values1        values2        values3        values4 
#         "50"           "33"           "50"           "33" 
#      values5        values6 
#         "50"           "33" 

In that case, it seems you only want elements two through four.
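One way to line up results of different lengths without recycling can be sketched in base R. The two vectors below are hand-typed stand-ins for di[[1]]$counts and di[[5]]$counts, with values copied from the output above; the names counts, stat_names, rows, and stats_df are illustrative only. Newer Hmisc versions may label or order the counts differently, so treat this as a sketch, not a drop-in solution.

```r
# Stand-ins for di[["Sepal.Length"]]$counts and di[["Species"]]$counts
# (values copied from the describe() output shown above).
counts <- list(
  Sepal.Length = c(n = "150", missing = "0", unique = "35", Info = "1",
                   Mean = "5.843", ".05" = "4.600", ".10" = "4.800",
                   ".25" = "5.100", ".50" = "5.800", ".75" = "6.400",
                   ".90" = "6.900", ".95" = "7.255"),
  Species      = c(n = "150", missing = "0", unique = "3")
)

# Union of all statistic names, in order of first appearance.
stat_names <- unique(unlist(lapply(counts, names)))

# Align every vector to stat_names; statistics a variable lacks become NA.
rows <- lapply(counts, function(cnt) {
  out <- as.character(cnt)[match(stat_names, names(cnt))]
  names(out) <- stat_names
  out
})

# Stack into a character matrix, then build the data frame.
mat <- do.call(rbind, rows)
stats_df <- data.frame(Variable = rownames(mat), mat,
                       check.names = FALSE, stringsAsFactors = FALSE)
rownames(stats_df) <- NULL

# Convert the statistic columns from character to numeric.
stats_df[-1] <- lapply(stats_df[-1], as.numeric)
stats_df
```

With a real describe object, the hand-typed list could presumably be replaced by lapply(di, function(v) v$counts).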

Using this process of discovery and problem solving, you can see how you'd take the output of describe apart and put the information you want into a data frame. However, this will take a lot of work. You'll presumably need to use loops and lots of if(){ ... } else{ ... } blocks. You might just want to code your own dataset description function from scratch.
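As a rough illustration of the from-scratch route, here is a minimal base R sketch. The function name describe_df and the column names q05 to q95 are made up for this example and are not part of Hmisc. It leaves out the Info column, which is specific to Hmisc, and uses R's default quantile() method, so values may not always match describe() exactly.

```r
# Hypothetical helper (not part of Hmisc): one row per variable,
# NA where a statistic does not apply (e.g. the mean of a factor).
describe_df <- function(data,
                        probs = c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95)) {
  # Column names for the numeric statistics: Mean, q05, q10, ...
  stat_cols <- c("Mean", sprintf("q%02d", as.integer(round(probs * 100))))

  rows <- lapply(names(data), function(v) {
    x <- data[[v]]
    if (is.numeric(x)) {
      stats <- c(mean(x, na.rm = TRUE),
                 quantile(x, probs, na.rm = TRUE, names = FALSE))
    } else {
      stats <- rep(NA_real_, length(stat_cols))
    }
    data.frame(
      Variable = v,
      n        = sum(!is.na(x)),
      missing  = sum(is.na(x)),
      unique   = length(unique(x[!is.na(x)])),
      as.list(setNames(stats, stat_cols)),
      stringsAsFactors = FALSE
    )
  })

  # All rows share the same columns, so they can be stacked directly.
  do.call(rbind, rows)
}

result <- describe_df(iris)
result
```

Because every row has the same columns regardless of variable type, no if/else branching is needed when the rows are combined.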

Va answered 19/6, 2016 at 15:12 Comment(5)
One possible starting point for this sort of effort might be: mtx <- do.call(rbind, sapply(statistics, "[[", "counts")[1:3]). It is a bit annoying for this effort that the result is character, but that is how Frank handles the varying precision of the columns. - Ingeingeberg
That's a great start, @42-. It still seems like it's going to take a bit of tedium to get it the rest of the way (e.g., the recycling of the vector from the factor variable). I think my preference would still be to decide what I want and code it from scratch. - Va
@gung - thank you so much for sharing such a descriptive email. This is really helpful. It has solved my purpose. - Macfarlane
@42- thank you for the pointer to a shorter approach using do.call and sapply. I think we can treat numeric and factor variables separately, as shown below, to get the required output (rbind.fill is from the plyr package): num_vars <- do.call(rbind, sapply(statistics, "[[", "counts")[1:4]); fact_var <- do.call(rbind, sapply(statistics, "[[", "counts")[5]); rbind.fill(as.data.frame(num_vars), as.data.frame(fact_var)) - Macfarlane
A further refinement: consider adding print(as.data.frame(mtx)). - Ingeingeberg
D
5

You can do this by using the stat.desc function from the pastecs package:

library(pastecs)
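# stat.desc() is meant for numeric data, so keep only the numeric columns of iris
summary_df <- stat.desc(iris[sapply(iris, is.numeric)])

summary_df is a data frame with one column per variable and one row per statistic. See ?stat.desc for more information.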

Dressmaker answered 9/5, 2017 at 17:20 Comment(0)
P
2

In R, you can simply use summary(iris); it is the counterpart of the describe() method in Python (pandas).
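Note that summary() on a data frame returns a table object, not a data frame. A minimal base R sketch of one way to get a data frame out of it is to apply summary() column by column to the numeric variables; the names num_cols, summ, and summary_df are illustrative only.

```r
# Apply summary() to each numeric column; sapply() returns a matrix with
# one column per variable and one row per statistic.
num_cols <- sapply(iris, is.numeric)
summ <- t(sapply(iris[num_cols], summary))  # transpose: one row per variable

# Turn the matrix into a data frame with the variable name as a column.
summary_df <- data.frame(Variable = rownames(summ), summ,
                         check.names = FALSE, stringsAsFactors = FALSE)
rownames(summary_df) <- NULL
summary_df
```

This gives the minimum, quartiles, mean, and maximum only; the other percentiles in the requested table would need quantile().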

Photocomposition answered 28/3, 2019 at 13:24 Comment(0)
