pandas: pandas.DataFrame.describe returns information on only one column

Asked 29/8, 2016 at 8:16 Answered 6/11, 2016 at 11:44

For a certain Kaggle dataset (the rules prohibit me from sharing the data here, but it is readily accessible here),

import pandas
df_train = pandas.read_csv(
    "01 - Data/act_train.csv.zip"
)
df_train.describe()

I get:

>>> df_train.describe()
            outcome
count  2.197291e+06
mean   4.439544e-01
std    4.968491e-01
min    0.000000e+00
25%    0.000000e+00
50%    0.000000e+00
75%    1.000000e+00
max    1.000000e+00

whereas for the same dataset df_train.columns gives me:

>>> df_train.columns
Index(['people_id', 'activity_id', 'date', 'activity_category', 'char_1',
       'char_2', 'char_3', 'char_4', 'char_5', 'char_6', 'char_7', 'char_8',
       'char_9', 'char_10', 'outcome'],
      dtype='object')

and df_train.dtypes gives me:

>>> df_train.dtypes
people_id            object
activity_id          object
date                 object
activity_category    object
char_1               object
char_2               object
char_3               object
char_4               object
char_5               object
char_6               object
char_7               object
char_8               object
char_9               object
char_10              object
outcome               int64
dtype: object

Am I missing some reason why pandas only describes one column in the dataset?

Sholokhov asked 29/8, 2016 at 8:16 Comment(0)

By default, describe only works on numeric dtype columns. Add a keyword-argument include='all'. From the documentation:

If include is the string ‘all’, the output column-set will match the input one.

To clarify, the default arguments to describe are include=None, exclude=None. The behavior that results is:

None to both (default). The result will include only numeric-typed columns or, if none are, only categorical columns.
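The fallback described in that quote can be checked on a small frame. This is only a sketch: the toy data below is made up to mimic the shape of the Kaggle file (string columns plus one integer column) and is not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical stand-in for act_train.csv: two string columns and one integer column.
df = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_2", "ppl_2", "ppl_3"],
    "char_1": ["type 1", "type 2", "type 2", "type 1"],
    "outcome": [0, 1, 1, 0],
})

# Default call: a numeric column exists, so only that column is summarised.
default_summary = df.describe()
print(list(default_summary.columns))

# Drop the numeric column: with nothing numeric left, the default call
# falls back to summarising the remaining (string) columns.
strings_only = df.drop(columns="outcome").describe()
print(list(strings_only.columns))
```

So the string columns are not ignored in general; they are only skipped when at least one numeric column is present.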

Also, from the Notes section:

The output DataFrame index depends on the requested dtypes:

For numeric dtypes, it will include: count, mean, std, min, max, and lower, 50, and upper percentiles.

For object dtypes (e.g. timestamps or strings), the index will include the count, unique, most common, and frequency of the most common. Timestamps also include the first and last items.
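As a rough illustration of how the index changes with the requested dtypes, the sketch below compares the default call with include='all' on a made-up frame (the column names mirror the question, the values are invented).

```python
import pandas as pd

# Hypothetical toy data, not the real Kaggle file.
df = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_2", "ppl_2", "ppl_3"],
    "char_1": ["type 1", "type 2", "type 2", "type 1"],
    "outcome": [0, 1, 1, 0],
})

# Rows produced for numeric columns only (the default).
numeric_rows = list(df.describe().index)
print(numeric_rows)

# include="all" keeps every input column; the index becomes the union of the
# numeric rows and the rows used for non-numeric columns (unique, top, freq).
all_summary = df.describe(include="all")
print(all_summary)
```

Cells that do not apply to a column (for example the mean of people_id) are filled with NaN in the combined output.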

Omophagia answered 29/8, 2016 at 8:23 Comment(7)

But include='all' is the default if all the columns in the dataset are objects (strings)? – Sholokhov 29/8, 2016 at 8:25

Then the question changes to -- are object columns interpreted as categorical columns by pandas in the latter case? – Sholokhov 29/8, 2016 at 8:29

@Sholokhov Yes, object dtype columns are interpreted as categorical. – Omophagia 29/8, 2016 at 8:30

Can you include some documentation that mentions that object types are interpreted as categorical types? – Sholokhov 29/8, 2016 at 8:32

@Sholokhov Yes, it's included in my final edit. i.e.

For object dtypes (e.g. timestamps or strings), the index will include the count, unique, most common, and frequency of the most common

– Omophagia 29/8, 2016 at 8:37

No, I was referring to your claim that object columns are interpreted as categorical as they are distinct types. – Sholokhov 29/8, 2016 at 8:39

Let us continue this discussion in chat. – Omophagia 29/8, 2016 at 8:46


Try the following code:

import pandas
df_train = pandas.read_csv(
    "01 - Data/act_train.csv.zip"
)

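def describe_categorical(df_train):
    # Render the summary of the object-dtype (string) columns as an HTML table in a notebook.
    from IPython.display import display, HTML
    object_columns = df_train.columns[df_train.dtypes == "object"]
    display(HTML(df_train[object_columns].describe().to_html()))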

describe_categorical(df_train)
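A variant of the same idea that does not need IPython is sketched below. It swaps the boolean mask on dtypes for DataFrame.select_dtypes and returns the summary instead of displaying it; the helper name describe_non_numeric and the toy data are assumptions made for illustration.

```python
import pandas as pd


def describe_non_numeric(df):
    """Return describe() for every column that is not numeric (hypothetical helper)."""
    # select_dtypes(exclude="number") keeps string/object columns and drops ints and floats.
    return df.select_dtypes(exclude="number").describe()


# Made-up data with the same kind of columns as in the question.
df = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_2", "ppl_2", "ppl_3"],
    "char_1": ["type 1", "type 2", "type 2", "type 1"],
    "outcome": [0, 1, 1, 0],
})

summary = describe_non_numeric(df)
print(summary)
```

Returning the DataFrame leaves it to the caller to print it, render it with to_html(), or process it further.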

Larine answered 6/11, 2016 at 11:44 Comment(0)
