Categorical variables usage in pandas for ANOVA and regression?

import pandas as pd import numpy as np high, size = 100, 20 df = pd.DataFrame({'perception': np.random.randint(0, high, size), 'age': np.random.randint(0, high, size), 'outlook': pd.Categorical(np.tile(['positive', 'neutral', 'negative'], size//3+1)[:size]), 'smokes': pd.Categorical(np.tile(['lots', 'little', 'not'], size//3+1)[:size]), 'outcome': np.random.randint(0, high, size) }) df['age_range'] = pd.Categorical(pd.cut(df.age, range(0, high+5, size//2), right=False, labels=["{0} - {1}".format(i, i + 9) for i in range(0, high, size//2)])) np.random.shuffle(df['smokes'])

In [2]: df.head(10) Out[2]: perception age outlook smokes outcome age_range 0 13 65 positive little 22 60 - 69 1 95 21 neutral lots 95 20 - 29 2 61 53 negative not 4 50 - 59 3 27 98 positive not 42 90 - 99 4 55 99 neutral little 93 90 - 99 5 28 5 negative not 4 0 - 9 6 84 83 positive lots 18 80 - 89 7 66 22 neutral lots 35 20 - 29 8 13 22 negative lots 71 20 - 29 9 58 95 positive not 77 90 - 99

Finding out likelihood of outcome given columns and Feature importance (1 and 2)

Categorical data

As the dataset contains categorical values, we can use the LabelEncoder() to convert the categorical data into numeric data.

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
df['outlook'] = enc.fit_transform(df['outlook'])
df['smokes'] = enc.fit_transform(df['smokes'])

Result

df.head()

   perception  age  outlook  smokes  outcome age_range
0          67   43        2       1       78     0 - 9
1          77   66        1       1       13     0 - 9
2          33   10        0       1        1     0 - 9
3          74   46        2       1       22     0 - 9
4          14   26        1       2       16     0 - 9

Without creating any model, we can make use of the chi-squared test, p-value and correlation matrix to determine the relation.

Correlation matrix

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.iloc[:, :-1].corr()
sns.heatmap(corr,
            xticklabels=corr.columns,
            yticklabels=corr.columns)
plt.show()

Chi-squared test and p-value

from sklearn.feature_selection import chi2

res = chi2(df.iloc[:, :4], df['outcome'])
features = pd.DataFrame({
    'features': df.columns[:4],
    'chi2': res[0],
    'p-value': res[1]
})

Result

features.head()

     features         chi2        p-value
0  perception  1436.012987  1.022335e-243
1         age  1416.063117  1.221377e-239
2     outlook    61.139303   9.805304e-01
3      smokes    57.147404   9.929925e-01

Randomly generated data, so null hypothesis is true. We can verify this by trying to fit a normal curve to the outcome.

Distribution

import scipy as sp

sns.distplot(df['outcome'], fit=sp.stats.norm, kde=False)
plt.show()

From the plot we can conclude that the data does not fit a normal distribution (as it is randomly generated.)

Note: As the data is all randomly generated, you results can vary, based on the size of the data set.

References

Recommended topics

Hot tags