error in plm regression

Colleagues, I have the following panel data:

    Company year       Beta     NI   Sales  Export Hedge      FL     QR     AT Foreign
1       1 2010 -2.2052800 293000 1881000 78.6816     0 23.5158  1.289 0.6554    3000
2       1 2011 -2.2536069 316000 2647000 81.4885     0 21.7945 1.1787 0.8282   22000
3       1 2012  0.3258693 363000 2987000 82.4908     0 24.5782 1.2428  0.813  -11000
4       1 2013  0.4006030 549000 4546000 79.4325     0 31.4168 0.6038 0.7905   71000
5       1 2014 -0.4508811 348000 5376000 79.2411     0 37.1451 0.6563  0.661  -64000
6       1 2015  0.1494696 355000 5038000 77.1735     0 33.3852 0.9798 0.5483   37000

But R reports an error when I try to run the regression with the plm package:

library(plm)
panel <- read.csv("Panel.csv", header = TRUE, sep = ";")
p <- plm(Beta ~ NI, data = panel, model = "within", index = c("id", "year"))


Error in pdim.default(index[[1]], index[[2]]) : 
  duplicate couples (id-time)
In addition: Warning messages:
1: In pdata.frame(data, index) :
  duplicate couples (id-time) in resulting pdata.frame
 to find out which, use e.g. table(index(your_pdataframe), useNA = "ifany")
2: In is.pbalanced.default(index[[1]], index[[2]]) :
  duplicate couples (id-time)

3: In is.pbalanced.default(index[[1]], index[[2]]) :
  duplicate couples (id-time)

I searched for this error online and read that it is related to the company id and the year, but I did not find a way to avoid the problem. Also, when I call na.omit(panel), R does not show the error, but it is important for me to keep the rows with NA values and their companies in the data. Please tell me what to do about this problem. Thank you.

Laryngology answered 27/4, 2017 at 16:45 Comment(0)

Let us consider the Produc dataset from the plm package.

library(plm)
data("Produc", package = "plm")
head(Produc)

    state year region     pcap     hwy   water    util       pc   gsp    emp unemp
1 ALABAMA 1970      6 15032.67 7325.80 1655.68 6051.20 35793.80 28418 1010.5   4.7
2 ALABAMA 1971      6 15501.94 7525.94 1721.02 6254.98 37299.91 29375 1021.9   5.2
3 ALABAMA 1972      6 15972.41 7765.42 1764.75 6442.23 38670.30 31303 1072.3   4.7
4 ALABAMA 1973      6 16406.26 7907.66 1742.41 6756.19 40084.01 33430 1135.5   3.9
5 ALABAMA 1974      6 16762.67 8025.52 1734.85 7002.29 42057.31 33749 1169.8   5.5
6 ALABAMA 1975      6 17316.26 8158.23 1752.27 7405.76 43971.71 33604 1155.4   7.7

In this dataset, information is collected over time (17 years) on the same sample units (48 US states).

table(Produc$state, Produc$year)
                 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
  ALABAMA           1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
  ARIZONA           1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
  ARKANSAS          1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
  CALIFORNIA        1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
  ...

plm requires that each (state, year) pair be unique.

any(table(Produc$state, Produc$year)!=1)
[1] FALSE
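
Note that the `!= 1` test is only meaningful for a balanced panel. As a small sketch with made-up data (the `toy` data frame and its values are purely illustrative), an unbalanced panel without duplicates contains zero counts, so testing for `> 1` is the safer check:

```r
# Toy unbalanced panel: firm "B" has no observation for 2011 (illustrative data)
toy <- data.frame(
  id   = c("A", "A", "B"),
  year = c(2010, 2011, 2010)
)

counts <- table(toy$id, toy$year)

# The empty (B, 2011) cell has count 0, so this is TRUE although nothing is duplicated
any(counts != 1)

# This is TRUE only when some (id, year) pair occurs more than once; here it is FALSE
any(counts > 1)
```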

The command plm works nicely with this dataset:

plmFit1 <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
          data = Produc, index = c("state","year"))
summary(plmFit1)


Oneway (individual) effect Within Model
Call:
plm(formula = log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp, 
    data = Produc, index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Residuals :
    Min.  1st Qu.   Median  3rd Qu.     Max. 
-0.12000 -0.02370 -0.00204  0.01810  0.17500 

Coefficients :
             Estimate  Std. Error t-value  Pr(>|t|)    
log(pcap) -0.02614965  0.02900158 -0.9017    0.3675    
log(pc)    0.29200693  0.02511967 11.6246 < 2.2e-16 ***
log(emp)   0.76815947  0.03009174 25.5273 < 2.2e-16 ***
unemp     -0.00529774  0.00098873 -5.3582 1.114e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    18.941
Residual Sum of Squares: 1.1112
R-Squared:      0.94134
Adj. R-Squared: 0.93742
F-statistic: 3064.81 on 4 and 764 DF, p-value: < 2.22e-16

Now we duplicate one of the (state, year) pairs:

 Produc[2,2] <- 1970
 any(table(Produc$state, Produc$year)>1)
 [1] TRUE

and plm now generates the same error message that you described above:

zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
      data = Produc, index = c("state","year"))

Error in pdim.default(index[[1]], index[[2]]) : 
  duplicate couples (id-time)
In addition: Warning messages:
1: In pdata.frame(data, index) :
  duplicate couples (id-time) in resulting pdata.frame
 to find out which, use e.g. table(index(your_pdataframe), useNA = "ifany")
2: In is.pbalanced.default(index[[1]], index[[2]]) :
  duplicate couples (id-time)

3: In is.pbalanced.default(index[[1]], index[[2]]) :
  duplicate couples (id-time)
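
One cause that would be consistent with the question's remark that na.omit(panel) makes the error disappear is missing values in the index columns themselves, for example empty trailing rows in the CSV file: several rows with a missing id and year would presumably be counted as the same couple. This is an assumption about the data, not something the question confirms. A minimal sketch with made-up values (the column names follow the question; `index_cols` and `panel_clean` are illustrative names) that drops only rows with a missing index and keeps NAs in the other variables:

```r
# Illustrative data: the last two rows mimic empty trailing lines of a CSV file
panel <- data.frame(
  Company = c(1, 1, 2, 2, NA, NA),
  year    = c(2010, 2011, 2010, 2011, NA, NA),
  Beta    = c(-2.2, 0.3, NA, 0.4, NA, NA),
  NI      = c(293000, NA, 363000, 549000, NA, NA)
)

# Counting NA as a level exposes the repeated (NA, NA) pair
table(panel$Company, panel$year, useNA = "ifany")

# Drop only rows whose index is missing; NAs in Beta or NI are kept
index_cols  <- c("Company", "year")
panel_clean <- panel[complete.cases(panel[, index_cols]), ]

nrow(panel_clean)                            # 4 rows remain
anyNA(panel_clean$Beta)                      # TRUE: missing regressors are preserved
any(duplicated(panel_clean[, index_cols]))   # FALSE: every remaining pair is unique
```

Unlike na.omit(panel), this keeps companies whose Beta or NI is missing in some years.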

Hope this can help you.

Colonial answered 27/4, 2017 at 18:3 Comment(2)
A table of an unbalanced panel without duplicates contains zeros, so I suggest using any(table(Produc$state, Produc$year) > 1). - Spreader
Then how can I identify the duplicate rows in the data frame? - Ishii
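
To locate the offending rows, one option is base R's duplicated() applied to the two index columns. The following is a minimal sketch on made-up data (the `df`, `key`, and `is_dup` names and all values are illustrative), not the only way to do it:

```r
# Illustrative data with one repeated (state, year) pair
df <- data.frame(
  state = c("ALABAMA", "ALABAMA", "ALABAMA", "ARIZONA"),
  year  = c(1970, 1970, 1972, 1970),
  value = c(10, 20, 30, 40)
)

key <- df[, c("state", "year")]

# duplicated() marks only the second and later occurrences, so scan from both
# ends to flag every row that belongs to a repeated pair
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

df[is_dup, ]    # the offending rows (rows 1 and 2 here)
which(is_dup)   # their row numbers
```

With the real data, replace `df` and the column names with your own data frame and index columns.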

I just discovered another case that warns of duplicate couples (id-time) although there are none, which might be worth sharing here: it happens if you name the time variable "id" for some reason.

library(plm)
data(Produc)

## duplicate time variable and name it "id"
Produc <- transform(Produc, id=year)

## check duplicate couples (id-time)
stopifnot(!any(table(Produc[, "state"], Produc[, "id"]) > 1))

f1 <- plm(gsp ~ pcap, Produc, index=c("state", "year"), model="within", effect="twoways")
## OK

f2 <- plm(gsp ~ pcap, Produc, index=c("state", "id"), model="within", effect="twoways")
# Warning messages:
# 1: In pdata.frame(x, index) :
#   duplicate couples (id-time) in resulting pdata.frame
#  to find out which, use e.g. table(index(your_pdataframe), useNA = "ifany")
# 2: In is.pbalanced.default(id, time) : duplicate couples (id-time)

The reason becomes visible when we explicitly create plm panel data frames, which plm::plm does internally if you don't provide one.

## create pdata.frames
p1 <- pdata.frame(Produc, index=c("state", "year"))
p2 <- pdata.frame(Produc, index=c("state", "id"))

head(index(p1))
#     state year
# 1 ALABAMA 1970
# 2 ALABAMA 1971
# 3 ALABAMA 1972
# 4 ALABAMA 1973
# 5 ALABAMA 1974
# 6 ALABAMA 1975

head(index(p2))
#     state state.1
# 1 ALABAMA ALABAMA
# 2 ALABAMA ALABAMA
# 3 ALABAMA ALABAMA
# 4 ALABAMA ALABAMA
# 5 ALABAMA ALABAMA
# 6 ALABAMA ALABAMA

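As we can see, "id" is not used as a variable but is somehow associated with the column "state". I am not sure what exactly goes wrong, though, since all.equal(str(p1), str(p2)) returns TRUE (note that this comparison is uninformative, because str() returns NULL invisibly and only those NULL values are compared).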

Spreader answered 4/11, 2020 at 11:48 Comment(4)
"id", "time", and "group" are column names internally used for indexes. If they refer to an index with a different meaning than the name implies, things get messy. The development version has a warning for this, see NEWS.md: "index: gives warning if argument 'which' contains "confusing" values. "confusing": an index variable called by user 'id', 'time', or 'group' if it does not refer to the respective index (e.g., time index variable is called 'id' in the user's data frame)." - Impossibly
@Impossibly Thanks for the information; I should have filed this as an issue on GitHub. Please let me know when the new version of this great package is released so I can delete the answer. - Spreader
The new version has been released for quite a while now. - Impossibly
@Impossibly Thanks, it now at least throws the announced warning. - Spreader
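
Based on the comments above, a quick pre-check of the index names can catch this situation before calling plm. The helper below is hypothetical (it is not part of plm, and `confusing_index_names` is an invented name); it assumes the index vector is given in the order individual, time, and optionally group:

```r
# Hypothetical helper (not part of plm): return index names that clash with the
# names plm is said to use internally ("id", "time", "group")
confusing_index_names <- function(index) {
  # Position 1 is the individual index, 2 the time index, 3 an optional group index
  reserved <- c("id", "time", "group")
  roles <- reserved[seq_along(index)]
  # A reserved name is confusing when it sits in a position with a different role
  index[index %in% reserved & index != roles]
}

confusing_index_names(c("state", "year"))  # character(0): nothing to report
confusing_index_names(c("state", "id"))    # "id": time index named like the individual index
confusing_index_names(c("id", "time"))     # character(0): names match their roles
```

If the check returns a name, renaming that column in the data frame before building the panel avoids the ambiguity.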