Is it ok to run a plm fixed effect model and add a factor dummy variable (tree way fixed effects)?
Asked Answered
A

1

1

Is it ok to run a "plm" fixed effect model and add a factor dummy variable in R as below?

The three factors "Time", "Firm” and "Country" are all separate indices which I want to fix all together.

Instead of making two indices in total by combining "Firm” and "Country", I find the below specification works much better for my case.

Is this an acceptable format?

plm(y ~ lag(x1, 1) + x2 + x3 + x4 + x5 + factor(Country), data=DATA,
    index=c("Firm","Time"), model="within")
Archiplasm answered 25/1, 2022 at 1:22 Comment(13)
It appears that you have confused your index which in plm should be id-time, i.e. index=c('id', 'time').Rexanna
@Rexanna Does the sequence matter even if my DATA column sequence is time then id?Archiplasm
Yes, of course, plm is picky there.Rexanna
@jay.sf: Thank you.Archiplasm
@jay.sf: The problem is that the plm setting above gives out the best result for me. I am thinking of how to keep this plm setting. Of course I changed the index order between time and id.Archiplasm
@jay.sf: If you do fixed effects for id and time, then does this necessarily mean I need to cluster according to id and time as well? What is the implication of the model and result if I do not cluster the summary according to country? With simply id and time fixed effects, I could simply run using summary without any further specifications like clustering. What does this mean?Archiplasm
It appears you need statistical help, which is more on-topic on Cross Validated than on Stack Overflow. Regardless, you might want to read Cameron and Abadie or this post on Cross Validated. Cheers!Rexanna
@jay.sf: Thanks. I also have put this question as a statistical one in the following thread. Please see: stats.stackexchange.com/questions/561731/…Archiplasm
@jay.sf: Shall I still need to worry about the clustering effect even if I use "pooled" panel regression? Can't summary command without clustering work here?Archiplasm
As said before, this is off-topic here, but if you have reason to believe that there is a common variance within a country (which actually sounds very reasonable), you should account for it, regardless of the method you use. Read the papers I linked above!Rexanna
@jay.sf: Even if it is a pooled panel regression which does not take fixed effects that might have to do with clustering?Archiplasm
@jay.sf: The second option I am thinking is to merge the country and firm indices to one index by concatenating them. If so, there will be one ID and one time indices. I think this can be an alternative.Archiplasm
In this case consider: https://mcmap.net/q/753782/-duplicate-couples-id-time-error-in-plm-with-only-two-idsRexanna
R
1

It is okay to add additional factors. We can prove this by calculating an LSDV model. As a preliminary note, you will of course need robust standard errors, usually clustered at the highest aggregate level, i.e. country in this case.

Note: R >= 4.1 is used in the following.

LSDV

fit1 <- 
  lm(y ~ d + x1 + x2 + x3 + x4 + factor(id) + factor(time) + factor(country), 
     dat)
lmtest::coeftest(
  fit1, vcov.=sandwich::vcovCL(fit1, cluster=dat$country, type='HC0')) |>
  {\(.) .[!grepl('\\(|factor', rownames(.)), ]}()
#      Estimate Std. Error    t value      Pr(>|t|)
# d  10.1398727  0.3181993 31.8664223 4.518874e-191
# x1  1.1217514  1.6509390  0.6794627  4.968995e-01
# x2  3.4913273  2.7782157  1.2566797  2.089718e-01
# x3  0.6257981  3.3162148  0.1887085  8.503346e-01
# x4  0.1942742  0.8998307  0.2159008  8.290804e-01

After adding factor(country), the estimators we get with plm::plm are identical to LSDV:

plm::plm

fit2 <- plm::plm(y ~ d + x1 + x2 + x3 + x4 + factor(country), 
                 index=c('id', 'time'), model='within', effect='twoways', dat)
summary(fit2, vcov=plm::vcovHC(fit2, cluster='group', type='HC1'))$coe
#      Estimate Std. Error    t-value      Pr(>|t|)
# d  10.1398727  0.3232850 31.3651179 5.836597e-186
# x1  1.1217514  1.9440165  0.5770277  5.639660e-01
# x2  3.4913273  3.2646905  1.0694206  2.849701e-01
# x3  0.6257981  3.1189939  0.2006410  8.409935e-01
# x4  0.1942742  0.9250759  0.2100089  8.336756e-01

However, cluster='group' will refer to "id" and not to "country", so the standard errors are wrong. It seems that clustering by the additional factor with plm is currently not possible, at least I am not aware of anything.

Alternatively you may use lfe::felm to not have to do without the immensely reduced computing times relative to LSDV:

lfe::felm

summary(lfe::felm(y ~ d + x1 + x2 + x3 + x4 | id + time + country | 0 | country,
                  dat))$coe
#      Estimate Cluster s.e.    t value     Pr(>|t|)
# d  10.1398727    0.3184067 31.8456637 1.826374e-33
# x1  1.1217514    1.6520151  0.6790201 5.004554e-01
# x2  3.4913273    2.7800267  1.2558611 2.153737e-01
# x3  0.6257981    3.3183765  0.1885856 8.512296e-01
# x4  0.1942742    0.9004173  0.2157602 8.301083e-01

For comparison, here is what Stata computes, the standard errors closely resemble those of LSDV and lfe::felm:

Stata

. reghdfe y d x1 x2 x3 x4, absorb (country time id) vce(cluster country) 

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           d |   10.13987   .3185313    31.83   0.000      9.49907    10.78068
          x1 |   1.121751   1.652662     0.68   0.501    -2.202975    4.446478
          x2 |   3.491327   2.781115     1.26   0.216    -2.103554    9.086209
          x3 |   .6257981   3.319675     0.19   0.851    -6.052528    7.304124
          x4 |   .1942742   .9007698     0.22   0.830    -1.617841    2.006389
       _cons |   14.26801   23.65769     0.60   0.549    -33.32511    61.86114

Simulated Panel Data:

n1 <- 20; t1 <- 4; n2 <- 48
dat <- expand.grid(id=1:n1, time=1:t1, country=1:n2)
set.seed(42)
dat <- within(dat, {
  id <- as.vector(apply(matrix(1:(n1*n2), n1), 2, rep, t1))
  d <- runif(nrow(dat), 70, 80)
  x1 <- sample(0:1, nrow(dat), replace=TRUE)
  x2 <- runif(nrow(dat))
  x3 <- runif(nrow(dat))
  x4 <- rnorm(nrow(dat))
  y <-
    10*d +  ## treatment effect
    as.vector(replicate(n2, rep(runif(n1, 2, 5), t1))) +  ## id FE
    rep(runif(n1, 10, 12), each=t1) +  ## time FE
    rep(runif(n2, 10, 12), each=n1*t1) +  ## country FE
    - .7*x1 + 1.3*x2 + 2.4*x3 +
    .5 * x4 + rnorm(nrow(dat), 0, 50)
})
readstata13::save.dta13(dat, 'panel.dta')  ## for Stata
Rexanna answered 25/1, 2022 at 8:39 Comment(10)
Thank you. Is it necessary to make effect to be twoways? I specified nothing since the run time is very long. Besides I have updated my question changing ID into Firm and hope your answer is still valid. Besides having index=(“Time”, “Firm”, “Country”) seemed awkward and did not work properly so I had to do in the way as above.Archiplasm
@Archiplasm You want to add unit fixed effects for each aggregate where you believe there is heterogeneity, e.g. for ID's (or firms) and countries, and time fixed effects to account for time trends.Rexanna
I don’t know a simpler way to explain this but yes I simply want fixed effects on time, firm and country simultaneously that’s it.Archiplasm
@Archiplasm Then you should now simply have the solution you need :)Rexanna
Thank you, but from the answer you have given you say plm clustering written within the summary command might not work as I desire. Does this mean I cannot use plm at all? What if I do not use clustering command of the summary but simply run the summary to use as for my result?Archiplasm
@Archiplasm You probably will have false standard errors that will affect your inference accordingly.Rexanna
If you do fixed effects for id and time, then does this necessarily mean I need to cluster according to id and time as well? What is the implication of the model and result if I do not cluster the summary according to country? With simply id and time fixed effects, I could simply run using summary without any further specifications like clustering. What does this mean?Archiplasm
@Archiplasm It appears you need statistical help, which is more on-topic on Cross Validated than on Stack Overflow. Regardless, you might want to read Cameron and Abadie or this post on Cross Validated. Cheers!Rexanna
Just another thing. If I use method='arellano' within the vcovHC for the plm you made above, won't this (somehow) solve the problem?Archiplasm
Furthermore, the main variables of my country level ones are not time varying by their nature. If this is the case, I do not think clustering for country is necessary. So clustering with id and time are needed but not necessarily for country due to the nature of the analysis.Archiplasm

© 2022 - 2024 — McMap. All rights reserved.