Ignoring missing values in multiple OLS regression with statsmodels

Asked 6/3, 2014 at 19:55 Answered 27/3, 2019 at 14:22

I'm trying to run a multiple OLS regression using statsmodels and a pandas DataFrame. There are missing values in different columns for different rows, and I keep getting the error message:

ValueError: array must not contain infs or NaNs

I saw this SO question, which is similar but doesn't exactly answer my question: statsmodel.api.Logit: valueerror array must not contain infs or nans

What I would like to do is run the regression and ignore all rows that have missing values in any of the variables used in this regression. Right now I have:

import pandas as pd
import numpy as np
import statsmodels.formula.api as sm

df = pd.read_csv('cl_030314.csv')

results = sm.ols(formula = "da ~ cfo + rm_proxy + cpi + year", data=df).fit()

I want something like missing = "drop". Any suggestions would be greatly appreciated. Thanks so much.

Incommunicative answered 6/3, 2014 at 19:55 Comment(0)

You answered your own question. Just pass

missing = 'drop'

to ols

import statsmodels.formula.api as smf
...
results = smf.ols(formula = "da ~ cfo + rm_proxy + cpi + year", 
                 data=df, missing='drop').fit()

If this doesn't work, then it's a bug; please report it with an MWE on GitHub.
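To make the effect of missing='drop' concrete, here is a minimal sketch on synthetic data. The column names follow the question, but the values are made up, and the check against dropna is only an illustration of what "drop" means here, not part of the statsmodels API.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the question's CSV (all values are invented).
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    "cfo": rng.normal(size=n),
    "rm_proxy": rng.normal(size=n),
    "cpi": rng.normal(size=n),
    "year": rng.integers(2000, 2010, size=n).astype(float),
})
df["da"] = 0.5 * df["cfo"] - 0.2 * df["rm_proxy"] + rng.normal(scale=0.1, size=n)

# Missing values in different columns for different rows.
df.loc[0, "cfo"] = np.nan
df.loc[1, "da"] = np.nan
df.loc[2, "cpi"] = np.nan

results = smf.ols(
    formula="da ~ cfo + rm_proxy + cpi + year",
    data=df,
    missing="drop",
).fit()

# Rows that are complete in every variable used by the formula.
model_cols = ["da", "cfo", "rm_proxy", "cpi", "year"]
n_complete = len(df.dropna(subset=model_cols))

print(int(results.nobs), n_complete)
```

The number of observations reported by the fit should match the number of rows that are complete in the model's variables.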

FYI, note the import above. Not everything is available in the formula.api namespace, so you should keep it separate from statsmodels.api. Or just use

import statsmodels.api as sm
sm.formula.ols(...)

Hundred answered 6/3, 2014 at 20:57 Comment(2)

Thank you so, so much for the help. In case anyone else comes across this, you also need to remove any possible infinities by using: pd.set_option('use_inf_as_null', True) – Incommunicative 6/3, 2014 at 22:56
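The pandas option names for treating infinities as missing have changed across versions, so the setting in the comment above may not be available in a current install. As a hedged alternative, this sketch replaces infinities with NaN explicitly, so that any NaN-dropping step also removes them. The tiny DataFrame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data containing both +inf and -inf.
df = pd.DataFrame({
    "da": [0.1, 0.2, np.inf, 0.4],
    "cfo": [1.0, -np.inf, 3.0, 4.0],
})

# Turn +/-inf into NaN so NaN-aware tools treat them as missing.
df_finite = df.replace([np.inf, -np.inf], np.nan)

n_inf_before = int(np.isinf(df.to_numpy()).sum())
n_nan_after = int(df_finite.isna().sum().sum())
n_rows_kept = len(df_finite.dropna())

print(n_inf_before, n_nan_after, n_rows_kept)
```

Passing df_finite instead of df to the regression lets missing='drop' discard those rows along with the ordinary NaN rows.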

This info is missing in the docs. It seems that **kwargs is used for missing=. But the docs don't say which function that argument is passed through to. – Therapist 6/6, 2023 at 10:3

The answer from jseabold works very well, but it may not be enough if you then want to do some computation on the predicted values and true values, e.g. if you want to use the function mean_squared_error. In that case the true values must cover exactly the rows used in the fit, so it may be better to get rid of the NaN rows before fitting:

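import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('cl_030314.csv')
df_cleaned = df.dropna()  # drops every row that has a NaN in any column
results = smf.ols(formula="da ~ cfo + rm_proxy + cpi + year", data=df_cleaned).fit()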
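Note that a bare dropna() also discards rows whose only missing values are in columns the model never uses. A possible refinement, sketched below on invented data with a trimmed-down formula, is to restrict dropna to the model's columns and then compute the mean squared error by hand with NumPy. The "note" column and the model_cols name are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data; "note" is an unrelated column that is entirely missing.
rng = np.random.default_rng(2)
n = 40
df = pd.DataFrame({
    "cfo": rng.normal(size=n),
    "rm_proxy": rng.normal(size=n),
    "note": [None] * n,
})
df["da"] = 1.0 + 0.5 * df["cfo"] + rng.normal(scale=0.1, size=n)
df.loc[5, "cfo"] = np.nan

# Drop rows only when a variable used by the model is missing.
model_cols = ["da", "cfo", "rm_proxy"]
df_cleaned = df.dropna(subset=model_cols)

results = smf.ols("da ~ cfo + rm_proxy", data=df_cleaned).fit()

# True and fitted values now cover the same rows, so they can be compared directly.
y_true = df_cleaned["da"].to_numpy()
y_pred = results.fittedvalues.to_numpy()
mse = float(np.mean((y_true - y_pred) ** 2))

print(len(df.dropna()), len(df_cleaned), mse)
```

Here a plain df.dropna() would leave no rows at all, while the subset version removes only the single row with a missing regressor.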

Methinks answered 27/3, 2019 at 14:22 Comment(0)
