Categorical variables usage in pandas for ANOVA and regression?

To prepare a little toy example:

import pandas as pd
import numpy as np

high, size = 100, 20
df = pd.DataFrame({'perception': np.random.randint(0, high, size),
                   'age': np.random.randint(0, high, size),
                   'outlook': pd.Categorical(np.tile(['positive', 'neutral', 'negative'], size//3+1)[:size]),
                   'smokes': pd.Categorical(np.tile(['lots', 'little', 'not'], size//3+1)[:size]),
                   'outcome': np.random.randint(0, high, size)
                  })
df['age_range'] = pd.Categorical(pd.cut(df.age, range(0, high+5, size//2), right=False,
                             labels=["{0} - {1}".format(i, i + 9) for i in range(0, high, size//2)]))
df['smokes'] = df['smokes'].sample(frac=1).reset_index(drop=True)  # shuffle the smokes column

Which will give you something like:

In [2]: df.head(10)
Out[2]:
   perception  age   outlook  smokes  outcome age_range
0          13   65  positive  little       22   60 - 69
1          95   21   neutral    lots       95   20 - 29
2          61   53  negative     not        4   50 - 59
3          27   98  positive     not       42   90 - 99
4          55   99   neutral  little       93   90 - 99
5          28    5  negative     not        4     0 - 9
6          84   83  positive    lots       18   80 - 89
7          66   22   neutral    lots       35   20 - 29
8          13   22  negative    lots       71   20 - 29
9          58   95  positive     not       77   90 - 99

Goal: figure out likelihood of outcome, given {perception, age, outlook, smokes}.

Secondary goal: figure out how important each column is in determining outcome.

Third goal: prove attributes about the distribution (the data here is randomly generated, so a random distribution should imply the null hypothesis is true?)


Clearly these are all questions answerable with statistical hypothesis testing. What's the right way of answering them in pandas?

Felly asked 23/5, 2019 at 1:52 Comments (5)
One-hot encoder and softmax? – Haunted
Tempted to just build out a NN for this in TensorFlow. However, I do want to get p-values and the like as well, so I'll likely end up with two approaches; the p-value one seems ripe for pandas/statsmodels/numpy/researchpy. How am I meant to do this? – Felly
You've asked an important question, but now you're digressing from it. I suggest forgetting about building models for now and focusing on a statistically correct approach for treating the categorical variables. The question could be further enriched by asking how to measure the interplay between categorical and continuous variables. Think about it. – Montane
This sounds like a good use case for one-versus-all classification. For your predictors you can use pd.get_dummies or the one-hot encoder from sklearn. – Artilleryman
Linear regression from statsmodels will give you p-values for each feature. If you're looking for confidence in the regression prediction, take a look at docs.seldon.io/projects/alibi/en/v0.2.0/methods/…; maybe you can adapt it for regression instead of classification. – Homebrew

Finding the likelihood of outcome given the columns, and feature importance (goals 1 and 2)

Categorical data

As the dataset contains categorical values, we can use LabelEncoder() to convert the categorical columns into numeric codes.

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
df['outlook'] = enc.fit_transform(df['outlook'])
df['smokes'] = enc.fit_transform(df['smokes'])

Result

df.head()

   perception  age  outlook  smokes  outcome age_range
0          67   43        2       1       78     0 - 9
1          77   66        1       1       13     0 - 9
2          33   10        0       1        1     0 - 9
3          74   46        2       1       22     0 - 9
4          14   26        1       2       16     0 - 9
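
As an alternative to integer labels (see also the comments below), one-hot encoding keeps the categories unordered, so a model cannot read a spurious ordering into the codes. A minimal sketch, assuming the original string-valued df from the question; the dummies and df_onehot names are just for illustration:

dummies = pd.get_dummies(df[['outlook', 'smokes']])   # one 0/1 indicator column per category
df_onehot = pd.concat([df.drop(columns=['outlook', 'smokes']), dummies], axis=1)

# For plain integer codes straight from the Categorical dtype (same idea as LabelEncoder):
# df['outlook'] = df['outlook'].cat.codes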

Without building any model, we can use a chi-squared test (with its p-values) and a correlation matrix to examine the relationships.

Correlation matrix

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.iloc[:, :-1].corr()
sns.heatmap(corr,
            xticklabels=corr.columns,
            yticklabels=corr.columns)
plt.show()

[Figure: correlation matrix heatmap]
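
To read the same information off numerically (useful for the secondary goal of ranking the columns), the correlations of each feature with outcome can be sorted; a small sketch using the corr frame computed above:

# Absolute correlation of every encoded feature with the outcome, strongest first
outcome_corr = corr['outcome'].drop('outcome').abs().sort_values(ascending=False)
print(outcome_corr)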

Chi-squared test and p-value

from sklearn.feature_selection import chi2

res = chi2(df.iloc[:, :4], df['outcome'])
features = pd.DataFrame({
    'features': df.columns[:4],
    'chi2': res[0],
    'p-value': res[1]
})

Result

features.head()

     features         chi2        p-value
0  perception  1436.012987  1.022335e-243
1         age  1416.063117  1.221377e-239
2     outlook    61.139303   9.805304e-01
3      smokes    57.147404   9.929925e-01
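
Note that sklearn's chi2 is meant for non-negative, count-like features, so applying it to continuous columns such as perception and age (and to a 0-99 outcome treated as class labels) is debatable. For the genuinely categorical columns, the more conventional route is a chi-squared test of independence on a contingency table; a minimal sketch using scipy.stats.chi2_contingency with an arbitrarily binned outcome (the bin edges are purely illustrative):

from scipy.stats import chi2_contingency

# Bin the continuous outcome so it can be cross-tabulated against a categorical column
outcome_binned = pd.cut(df['outcome'], bins=[0, 33, 66, 100], include_lowest=True)

# Contingency table: smokes vs. binned outcome, then test for independence
table = pd.crosstab(df['smokes'], outcome_binned)
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print('chi2 = {0:.2f}, p = {1:.3f}, dof = {2}'.format(chi2_stat, p_value, dof))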

The data is randomly generated, so the null hypothesis (no relationship between the features and the outcome) should hold. As a sanity check on the distribution itself, we can try fitting a normal curve to outcome.

Distribution

from scipy import stats

# Note: distplot is deprecated in recent seaborn versions (histplot/displot are the replacements)
sns.distplot(df['outcome'], fit=stats.norm, kde=False)
plt.show()

[Figure: distribution of outcome with fitted normal curve]

From the plot we can conclude that the outcome does not follow a normal distribution (as expected, since it was generated from a uniform random distribution).
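
Rather than judging the fit by eye, normality can also be tested formally; a short sketch using scipy.stats.normaltest (with only 20 samples the test has little power, so treat the result as illustrative):

from scipy import stats

# D'Agostino-Pearson test: the null hypothesis is that the sample comes from a normal distribution
stat, p = stats.normaltest(df['outcome'])
print('statistic = {0:.3f}, p-value = {1:.4f}'.format(stat, p))
if p < 0.05:
    print('Reject normality at the 5% level')
else:
    print('Cannot reject normality at the 5% level')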

Note: As the data is all randomly generated, your results may vary depending on the size of the data set.
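
Finally, to tie this back to the ANOVA/regression framing in the title (and the statsmodels suggestion in the question's comments): the statsmodels formula API handles pandas categoricals directly via C(...), giving per-coefficient p-values plus an ANOVA table with one test per term. A minimal sketch, assuming statsmodels is installed and using the original (un-encoded) df from the question:

import statsmodels.api as sm
import statsmodels.formula.api as smf

# OLS regression; C(...) expands each categorical column into dummy contrasts
model = smf.ols('outcome ~ perception + age + C(outlook) + C(smokes)', data=df).fit()
print(model.summary())                 # coefficients with per-coefficient p-values

# Type II ANOVA table: one F-test and p-value per term
print(sm.stats.anova_lm(model, typ=2))

With purely random data none of these terms should come out significant, which is consistent with the chi-squared result above.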


Homogenize answered 27/5, 2019 at 13:03 Comments (4)
For the categorical data encoding one could also use pd.get_dummies(). – Abomasum
get_dummies only gives me 0 or 1, and here I have 3 options. Thanks @skillsmuggler. So should I take it that these p-values indicate the columns are independent, and that the χ² test values don't fit the χ² distribution, so we are unable to reject the null hypothesis? Finally, your correlation matrix shows a strong diagonal; is that meaningful? – Felly
get_dummies does one-hot encoding. It splits a single categorical column into a binary (0/1) column for each factor. – Homogenize
You are right about the null hypothesis. The diagonal of the correlation matrix should not be considered; we only look at the upper or lower triangle. The diagonal elements are the correlation of each column with itself, which is always 1. – Homogenize
