Fixed effect in Pandas or Statsmodels

Asked 12/6, 2014 at 23:23 Answered 28/11, 2023 at 22:16

Solved python pandas regression statsmodels

Is there an existing function in Pandas or Statsmodels to estimate a fixed-effects model (one-way or two-way)?

There used to be a function in Statsmodels but it seems discontinued. And in Pandas, there is something called plm, but I can't import it or run it using pd.plm().

Ceceliacecil answered 12/6, 2014 at 23:23 Comment(11)

Please keep it to one question per question. Also, please explain what you mean by "i can't". Please include full tracebacks (if they exist) and a sample that is small and runnable on its own and that reproduces the problem. – Funicular 12/6, 2014 at 23:34

Also don't avoid telling us relevant information. "there used to be a function" implies you know what that function is, so why you avoid telling us confuses me. – Funicular 12/6, 2014 at 23:36

@EMS Fixed effect is just a routine in my profession to control un-observable effect under assumption that these unobservables won't change over time. I am not with statistics, so I don't know nothing about a Bayesian's perspective.. – Ceceliacecil 12/6, 2014 at 23:49

@Funicular Thanks for your suggestions. I don't see there are 2 questions because they are closely related. "I can't" simply means "I can't", because I can see plm in pandas source code, but I can't find it inside python. – Ceceliacecil 12/6, 2014 at 23:52

@Ceceliacecil Closely related is not the same as "the same". It's OK to have two closely related questions. Unless asking that would get closed as a duplicate of this, they are different questions. – Funicular 13/6, 2014 at 0:1

@Ceceliacecil "I can't" means lots of things. "I can't find it", "I can't run it", "I don't know how to use it", "I can't reproduce the documentation". If you just gave a one-line example and a traceback I wouldn't be asking for clarification. – Funicular 13/6, 2014 at 0:3

@EMS what do you mean by "the theory behind it"? Is that something deeper and beyond a within transformation? Could you elaborate? – Ceceliacecil 13/6, 2014 at 0:29

@EMS Even after not being a student anymore, econometricians still need to use them, see Klaus' and Jennifer's comments. I'm just reading both Cameron/Trivedi and Wooldridge, again. (Not everybody has to buy into the assumptions of Bayesian multilevel models.) – Latreshia 13/6, 2014 at 1:54

@EMS Could you point out some better tools that you use to handle cross-sectional correlation, other than Fama-MacBeth regressions? – Ceceliacecil 13/6, 2014 at 5:4

rfs.oxfordjournals.org/content/22/1/435.short – Latreshia 13/6, 2014 at 12:19

Please change the accepted answer to the linearmodels one, as pandas deprecated and dropped PanelOLS bashtage.github.io/linearmodels/doc/panel/pandas.html – Metaphysical 6/10, 2018 at 15:46

As noted in the comments, PanelOLS has been removed from Pandas as of version 0.20.0. So you really have three options:

1. If you use Python 3, you can use linearmodels as specified in the more recent answer: https://mcmap.net/q/515857/-fixed-effect-in-pandas-or-statsmodels

2. Just specify various dummies in your statsmodels specification, e.g. using pd.get_dummies. This may not be feasible if the number of fixed effects is large.

3. Or do some groupby-based demeaning and then use statsmodels (this would work if you're estimating lots of fixed effects). Here is a barebones version of what you could do for one-way fixed effects:

import statsmodels.api as sm
import patsy

def areg(formula, data=None, absorb=None, cluster=None):
    # Build the outcome and design matrices from a patsy formula
    y, X = patsy.dmatrices(formula, data, return_type='dataframe')

    # Demean within each absorbed group, then add back the overall mean
    # so that the intercept column stays in the model
    ybar = y.mean()
    y = y - y.groupby(data[absorb]).transform('mean') + ybar

    Xbar = X.mean()
    X = X - X.groupby(data[absorb]).transform('mean') + Xbar

    reg = sm.OLS(y, X)
    # Account for df loss from FE transform
    reg.df_resid -= (data[absorb].nunique() - 1)

    # Without a cluster variable, fall back to the default covariance
    if cluster is None:
        return reg.fit()
    return reg.fit(cov_type='cluster', cov_kwds={'groups': data[cluster].values})

For example, suppose you have a panel of stock data: stock returns and other stock data for all stocks, every month over a number of months. You want to regress returns on lagged returns with calendar month fixed effects (where the calendar month variable is called caldt), and you also want to cluster the standard errors by calendar month. You can estimate such a fixed effect model with the following:

reg0 = areg('ret~retlag',data=df,absorb='caldt',cluster='caldt')
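As a rough check on options 2 and 3 above, here is a minimal sketch showing that explicit dummies and groupby demeaning give the same slope. The toy data, the column names (y, x, g) and the true slope of 2.0 are all made up for illustration, and plain numpy least squares stands in for statsmodels to keep the example small:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy panel: 5 groups (the fixed effect), 40 observations each
n_groups, n_per = 5, 40
g = np.repeat(np.arange(n_groups), n_per)
x = rng.normal(size=g.size) + 0.5 * g           # regressor correlated with the group
alpha = rng.normal(size=n_groups)               # unobserved group effects
y = 2.0 * x + alpha[g] + rng.normal(scale=0.1, size=g.size)
df = pd.DataFrame({"y": y, "x": x, "g": g})

# Option 2: one dummy per group (no separate intercept), then ordinary least squares
dummies = pd.get_dummies(df["g"], prefix="g").to_numpy(dtype=float)
X_lsdv = np.column_stack([df["x"].to_numpy(), dummies])
beta_lsdv = np.linalg.lstsq(X_lsdv, df["y"].to_numpy(), rcond=None)[0][0]

# Option 3: subtract group means, then regress demeaned y on demeaned x
y_dm = df["y"] - df.groupby("g")["y"].transform("mean")
x_dm = df["x"] - df.groupby("g")["x"].transform("mean")
beta_within = float((x_dm * y_dm).sum() / (x_dm ** 2).sum())

print(beta_lsdv, beta_within)
```

The two estimates should agree up to floating-point error. The demeaned regression never builds the dummy matrix, which is why it scales better when there are many groups.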

And here is what you can do if using an older version of Pandas:

Here is an example with time fixed effects using pandas' PanelOLS (which is in the plm module). Note the import of PanelOLS:

>>> from pandas.stats.plm import PanelOLS
>>> df

                y    x
date       id
2012-01-01 1   0.1  0.2
           2   0.3  0.5
           3   0.4  0.8
           4   0.0  0.2
2012-02-01 1   0.2  0.7 
           2   0.4  0.5
           3   0.2  0.3
           4   0.1  0.1
2012-03-01 1   0.6  0.9
           2   0.7  0.5
           3   0.9  0.6
           4   0.4  0.5

Note, the dataframe must have a MultiIndex set; PanelOLS determines the time and entity effects based on the index:

>>> reg  = PanelOLS(y=df['y'],x=df[['x']],time_effects=True)
>>> reg

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x>

Number of Observations:         12
Number of Degrees of Freedom:   4

R-squared:         0.2729
Adj R-squared:     0.0002

Rmse:              0.1588

F-stat (1, 8):     1.0007, p-value:     0.3464

Degrees of Freedom: model 3, resid 8

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.3694     0.2132       1.73     0.1214    -0.0485     0.7872
---------------------------------End of Summary---------------------------------

Docstring:

PanelOLS(self, y, x, weights = None, intercept = True, nw_lags = None,
entity_effects = False, time_effects = False, x_effects = None,
cluster = None, dropped_dummies = None, verbose = False,
nw_overlap = False)

Implements panel OLS.

See ols function docs

This is another function (like fama_macbeth) where I believe the plan is to move this functionality to statsmodels.
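For the two-way case (entity and time effects together), the same demeaning idea can be sketched without PanelOLS. This is only an illustration under stated assumptions: the panel is balanced, the helper two_way_demean and the toy data are hypothetical, and the simple "subtract both means, add back the grand mean" formula is not exact for unbalanced panels:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical balanced panel: 6 entities observed in each of 8 periods
ids, dates = np.arange(6), np.arange(8)
df = pd.DataFrame({
    "date": np.repeat(dates, ids.size),
    "id": np.tile(ids, dates.size),
})
a = rng.normal(size=ids.size)       # entity effects
d = rng.normal(size=dates.size)     # time effects
i, t = df["id"].to_numpy(), df["date"].to_numpy()
df["x"] = rng.normal(size=len(df)) + a[i] + d[t]
df["y"] = 1.5 * df["x"] + a[i] + d[t] + rng.normal(scale=0.1, size=len(df))

def two_way_demean(s, entity, time):
    # Exact only for balanced panels (every entity observed in every period)
    return (s
            - s.groupby(entity).transform("mean")
            - s.groupby(time).transform("mean")
            + s.mean())

y_dm = two_way_demean(df["y"], df["id"], df["date"])
x_dm = two_way_demean(df["x"], df["id"], df["date"])
beta_within = float((x_dm * y_dm).sum() / (x_dm ** 2).sum())

# Cross-check against explicit entity and time dummies
D = np.column_stack([
    df["x"].to_numpy(),
    pd.get_dummies(df["id"]).to_numpy(dtype=float),
    pd.get_dummies(df["date"], drop_first=True).to_numpy(dtype=float),
])
beta_dummies = np.linalg.lstsq(D, df["y"].to_numpy(), rcond=None)[0][0]

print(beta_within, beta_dummies)
```

Entity dummies absorb anything constant within an entity over time, and time dummies absorb anything common to all entities in a given period, so the slope on x is identified only from variation left after removing both.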

Glassman answered 13/6, 2014 at 1:12 Comment(20)

If you use the time index or group index id as a categorical variable in a formula for statsmodels ols, then it creates the fixed effects dummies for you. However, removing the fixed effects by demeaning is not yet supported. – Latreshia 13/6, 2014 at 1:44

@Karl D. Thanks a lot, your answers are always very useful! – Ceceliacecil 13/6, 2014 at 4:2

Can I use random effects with pandas? I'm looking for something similar to stata's xtreg, re. Thanks! – Derekderelict 3/5, 2015 at 23:4

Statsmodels will do random effects. – Glassman 4/5, 2015 at 21:35

what is the difference between time_effects and entity_effects? – Dumpish 27/10, 2015 at 14:24

@Moj, In a typical use case, your entity would be stocks, so entity_effects essentially creates dummy variable for each stock, and time_effects create dummy variables for every date. – Glassman 27/10, 2015 at 15:14

when I'm reading tutorials on fixed regression effects they only introduce terms for entities to account for unobserved variables that are constant for each entity over time. Having 2 terms here confuses me a bit. can you point me to a source with good explanation? Thanks – Dumpish 28/10, 2015 at 9:8

and is there any documentation for PanelOLS'? I find nothing but a docstring – Dumpish 28/10, 2015 at 9:54

No, just the docstring, and it's unlikely to get any better documentation. panelOLS will probably be deprecated in pandas by 0.18: github.com/pydata/pandas/issues/6077. – Glassman 28/10, 2015 at 18:40

Hi, @KarlD. I wonder if you could take a look at this question: #37195001. I can't use this method and I don't know why. Thank you. – Receipt 12/5, 2016 at 20:45

@KarlD. I have a question: I'm running this with my data set and the results are all 'nan'. Do you have any idea what the problem might be? – Subsumption 15/2, 2018 at 3:10

pandas dropped PanelOLS in 0.20.0 bashtage.github.io/linearmodels/doc/panel/pandas.html – Metaphysical 6/10, 2018 at 15:44

Thanks, do you know if linearmodels is more efficient with large numbers of fixed effects than adding dummies is, as suggested in the post? – Metaphysical 8/10, 2018 at 16:3

I haven't looked at their code but I imagine that linearmodels is approaching fixed effects like I do in option #3 in my outline above. That's going to be pretty efficient because it avoids doing matrix decomposition with very large matrices filled with dummy variables. – Glassman 9/10, 2018 at 1:53

I'm not sure I understand the function from the 3rd option correctly. I understand that for data I'd include a df with the DV, all IVs/controls and the clusterID. For cluster I'd include cluster = 'clusterID'. But what do the formula and absorb parts do? How do I make use of them? – Avocado 6/6, 2020 at 10:35

absorb refers to the variable that contains the fixed effects: for example, a datetime column if you're estimating time fixed effects. The parameter naming comes from the areg function in Stata, and formula just refers to using patsy formula notation for a regression (statsmodels uses that too). – Glassman 10/6, 2020 at 6:54

Does the order of the two indices in the Multi-index matter? I had to use df = df.set_index(['entity', 'date']) instead of df = df.set_index(['date', 'entity']) to make the PanelOLS work. – Blunger 2/6, 2022 at 19:16

@KarlD. can you elaborate on your point 2. above using pd.get_dummies(). This would control for the fixed variables and one can then use good old OLS? – Sigismond 3/11, 2023 at 16:48

@oliver-angelil Yes, fixed effects models are equivalent to OLS with explicit dummy variables. So, for example, if you are estimating calendar month fixed effects, you need a dummy variable for each month. – Glassman 6/11, 2023 at 4:56

Thanks @KarlD. Let's say I add the dummy variables for my fixed effects categorical variables, and also perform the normalisation (e.g. mean subtraction). Can I then use any regression model to fit the data? I.e. it does not need to be OLS, it could be Random Forest regression? – Sigismond 6/11, 2023 at 9:6

There is a package called linearmodels (https://pypi.org/project/linearmodels/) that has a fairly complete fixed effects and random effects implementation including clustered standard errors. It does not use high-dimensional OLS to eliminate effects and so can be used with large data sets.

import numpy as np
import pandas as pd

# Outer is entity, inner is time
entity = list(map(chr,range(65,91)))
# 'A' is the annual (year-end) frequency alias; pandas 2.2+ prefers 'YE'
time = list(pd.date_range('1-1-2014',freq='A', periods=4))
index = pd.MultiIndex.from_product([entity, time])
df = pd.DataFrame(np.random.randn(26*4, 2),index=index, columns=['y','x'])

from linearmodels.panel import PanelOLS
mod = PanelOLS(df.y, df.x, entity_effects=True)
res = mod.fit(cov_type='clustered', cluster_entity=True)
print(res)

This produces the following output:

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                      y   R-squared:                        0.0029
Estimator:                   PanelOLS   R-squared (Between):             -0.0109
No. Observations:                 104   R-squared (Within):               0.0029
Date:                Thu, Jun 29 2017   R-squared (Overall):             -0.0007
Time:                        23:52:28   Log-likelihood                   -125.69
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      0.2256
Entities:                          26   P-value                           0.6362
Avg Obs:                       4.0000   Distribution:                    F(1,77)
Min Obs:                       4.0000                                           
Max Obs:                       4.0000   F-statistic (robust):             0.1784
                                        P-value                           0.6739
Time periods:                       4   Distribution:                    F(1,77)
Avg Obs:                       26.000                                           
Min Obs:                       26.000                                           
Max Obs:                       26.000                                           

                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
x              0.0573     0.1356     0.4224     0.6739     -0.2127      0.3273
==============================================================================

F-test for Poolability: 1.0903
P-value: 0.3739
Distribution: F(25,77)

Included effects: Entity

It also has a formula interface which is similar to statsmodels:

mod = PanelOLS.from_formula('y ~ x + EntityEffects', df)
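Since several comments ask how the entity and time dimensions are declared: as the code comment above says, the index is expected to have the entity as the outer level and time as the inner level. Here is a minimal pandas-only sketch of getting a flat table into that shape. The column names (firm, date) and the values are hypothetical:

```python
import pandas as pd

# Hypothetical long-format data with explicit identifier columns
flat = pd.DataFrame({
    "date": pd.to_datetime(["2012-01-01", "2012-01-01", "2012-02-01", "2012-02-01"]),
    "firm": ["A", "B", "A", "B"],
    "y": [0.1, 0.3, 0.2, 0.4],
    "x": [0.2, 0.5, 0.7, 0.5],
})

# Entity first (outer level), time second (inner level)
panel = flat.set_index(["firm", "date"]).sort_index()

# If the index was built the other way round, swap the two levels
wrong = flat.set_index(["date", "firm"])
fixed = wrong.swaplevel().sort_index()

print(panel.index.names)
print(fixed.index.names)
```

A frame shaped like panel could then be passed to the estimator in the same way as df above, e.g. PanelOLS(panel.y, panel.x, entity_effects=True).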

Pinot answered 29/6, 2017 at 22:57 Comment(5)

Correct answer should be changed to this, because PanelOLS has been dropped from pandas in 0.20 and I also cannot find it in statsmodels. bashtage.github.io/linearmodels/doc/panel/pandas.html – Sunken 24/9, 2017 at 16:51

Buyer beware: linearmodels requires Python 3. – Buskirk 7/10, 2017 at 23:51

Also, it does not make out of sample predictions. You have to code that yourself. – Saito 27/8, 2018 at 9:29

linearmodels also doesn't currently work with stargazer github.com/mwburke/stargazer/issues/26 – Metaphysical 13/7, 2020 at 4:17

How to declare entity and time? i.e., how could this function know which variable is the entity and which is time? For those who run into this, check this: bashtage.github.io/linearmodels/panel/examples/… – Boyette 8/10, 2020 at 12:30

I have written a new package, PyFixest, that implements several routines for high-dimensional fixed effects regression, following syntax innovations of the R package fixest. PyFixest supports OLS, IV and Poisson regression with as many fixed effects as you want and a range of inference procedures (iid, HC1-3, CRV1 and CRV3 inference as well as the wild cluster bootstrap).

Here is a small code example:

from pyfixest.estimation import feols
from pyfixest.utils import get_data

data = get_data()

# fit a model via OLS
fit = feols("Y ~ X1 | f1 + f2", data=data)
fit.summary()

# Estimation:  OLS
# Dep. var.: Y, Fixed effects: f1+f2
# Inference:  CRV1
# Observations:  997
# 
# | Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5 % |   97.5 % |
# |:--------------|-----------:|-------------:|----------:|-----------:|--------:|---------:|
# | X1            |      0.292 |        0.040 |     7.256 |      0.000 |   0.210 |    0.374 |
# ---
# RMSE: 1.199   R2: 0.554   R2 Within: 0.037

Commandeer answered 28/11, 2023 at 22:16 Comment(0)
