Building multi-regression model throws error: `Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).`
I have a pandas DataFrame with some categorical predictors (i.e. variables) coded as 0 and 1, and some numeric variables. When I fit it to a statsmodels model like:

est = sm.OLS(y, X).fit()

It throws:

Pandas data cast to numpy dtype of object. Check input data with np.asarray(data). 

I converted all the dtypes of the DataFrame using df.convert_objects(convert_numeric=True)

After this, all the dtypes of the DataFrame's variables appear as int32 or int64. But at the end it still shows dtype: object, like this:

4516        int32
4523        int32
4525        int32
4531        int32
4533        int32
4542        int32
4562        int32
sex         int64
race        int64
dispstd     int64
age_days    int64
dtype: object

Here 4516, 4523 are variable labels.

Any idea? I need to build a multiple regression model on several hundred variables. To do that, I concatenated 3 pandas DataFrames to produce the final DataFrame used for model building.

Alphabetize answered 20/11, 2015 at 18:42 Comment(3)
The output you're seeing is as expected. The dtype listed at the end of your output is the dtype of the dtypes Series itself (the result of calling pd.DataFrame.dtypes) and has nothing to do with the types inside your DataFrame. Just try pd.DataFrame(range(100)).dtypes – Nissie
Check np.asarray(X).dtype, which should be float64 (or int64, which, I think, is converted to float64 inside statsmodels). Best to check est.model.exog.dtype afterwards to make sure float64 is used in the calculations. – Airborne
Note: if the 'object' dtypes are categorical variables such as strings, then it is better to use the formula interface, which automatically creates a (numerical) dummy encoding for string variables. Otherwise, use pandas categoricals to convert to a dummy encoding. – Airborne
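
For reference, a minimal sketch of the formula interface mentioned in the comment above, using a small made-up DataFrame (column names are hypothetical):

import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: one numeric predictor and one string-valued categorical predictor.
df = pd.DataFrame({
    "y":        [1.2, 3.4, 2.2, 5.1, 4.0, 3.3],
    "age_days": [100, 250, 300, 410, 520, 610],
    "race":     ["a", "b", "a", "c", "b", "c"],
})

# The formula interface dummy-encodes string columns automatically
# (one level is dropped as the reference category).
est = smf.ols("y ~ age_days + C(race)", data=df).fit()
print(est.params)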
If X is your DataFrame, try using the .astype method to convert it to float when running the model:

est = sm.OLS(y, X.astype(float)).fit()
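
As a slightly fuller sketch of the same idea (hypothetical column names, and assuming any categoricals are already 0/1 dummies), you can also add an intercept and verify the dtype used in the fit:

import pandas as pd
import statsmodels.api as sm

# Hypothetical data: a 0/1 dummy plus a numeric predictor.
X = pd.DataFrame({"sex": [0, 1, 1, 0, 1], "age_days": [120, 340, 560, 780, 910]})
y = pd.Series([1.0, 2.5, 3.1, 2.0, 4.2])

X = sm.add_constant(X.astype(float))   # cast to float64 and add an intercept column
est = sm.OLS(y.astype(float), X).fit()
print(est.model.exog.dtype)            # float64, as suggested in an earlier comment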
Randazzo answered 17/2, 2016 at 17:43 Comment(5)
So... converting categorical variables to floats? – Heraclea
All categorical variables should be converted into dummy variables before putting them in the model, so yes. – Randazzo
And integers are not good enough, they must be floats! Int64 produces the same error as object or category... sigh. – Capernaum
Another possibility is that the data types just need to be consistent with each other, e.g. strings with strings, floats with floats, etc. – Scarcely
This is only true if the categorical variable is binary. If not, other coding schemes such as dummy or one-hot coding should be used; otherwise the regression has no way of knowing that the numerical variable is only on a nominal scale and mistakes it for a continuous one. – Pyrology
If both y (the dependent variable) and X are taken from a DataFrame, then cast both:

est = sm.OLS(y.astype(float), X.astype(float)).fit()
Salliesallow answered 8/7, 2016 at 4:56 Comment(1)
So... converting categorical variables to floats? – Heraclea
As Mário and Daniel suggested, yes, the issue is due to categorical values not previously converted into dummy variables.

I faced this issue while working through the StatLearning book's lab on linear regression with the "Carseats" dataset from statsmodels, where the columns 'ShelveLoc', 'US' and 'Urban' are categorical. I assume the categorical values causing issues in your dataset are also strings, like in this one. With that in mind, I will use it as an example, since you didn't provide DataFrames in the question.

The columns we have at the beginning are the following; as stated before, 'ShelveLoc', 'US' and 'Urban' are categorical:

Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'ShelveLoc', 'Age', 'Education', 'Urban', 'US'],
      dtype='object')

In a single line of Python, I converted them to dummy variables and dropped the first level of each (the "No" and "Bad" labels), as this is what the lab in the book asked for.

carseats = pd.get_dummies(carseats, columns=['ShelveLoc', 'US', 'Urban'], drop_first = True)

This will return a dataframe with the following columns:

Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'Age', 'Education', 'ShelveLoc_Good', 'ShelveLoc_Medium', 'US_Yes',
       'Urban_Yes'],
      dtype='object')

And that's it, you have dummy variables ready for OLS. Hope this is useful.
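
For completeness, a sketch of the follow-up fit, assuming carseats is the dummy-encoded DataFrame above and Sales is the response (recent pandas versions return bool dummies, so the cast to float still matters):

import statsmodels.api as sm

y = carseats["Sales"]
X = sm.add_constant(carseats.drop(columns="Sales").astype(float))

est = sm.OLS(y.astype(float), X).fit()
print(est.summary())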

Selfassurance answered 24/11, 2020 at 1:4 Comment(0)
This is because you have NOT generated dummy values for all of your predictors, so the regression cannot run on string literals. That is what the error message is saying: statsmodels is trying to convert your data to valid numpy entries.

Just go back to your pipeline and include the dummies properly.

Eleneeleni answered 21/4, 2020 at 20:28 Comment(0)
Before you run the OLS, check X.info(); this will reveal the available columns and their data types. Confirm that all of them are numeric. If any are object or bool, convert the values appropriately (e.g. into 1 and 0) and then run the OLS.
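
A minimal sketch of that check and conversion, using a hypothetical DataFrame X:

import pandas as pd

# Hypothetical predictors: one numeric, one bool, one string-valued column.
X = pd.DataFrame({
    "age_days": [120, 340, 560],
    "urban":    [True, False, True],
    "race":     ["a", "b", "a"],
})

X.info()                                   # lists each column and its dtype

X["urban"] = X["urban"].astype(int)        # bool -> 0/1
X = pd.get_dummies(X, columns=["race"], drop_first=True).astype(float)
print(X.dtypes)                            # everything numeric, ready for sm.OLS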

Featherston answered 17/4 at 17:19 Comment(0)
