duplicate couples (id-time) error in plm with only two IDs
I'm trying to run a fixed effects regression using the plm package. The regression code is as follows:

fixed = plm(hp~crime,index=c('year','country'),data=data,model='within')

which returns the following error code:

error in pdim.default(index[[1]], index[[2]]) : duplicate couples (id-time)

I have searched the web, including Stack Overflow. What I understand is that plm can only run with two IDs, so if you have several IDs, you'll have to 'cheat' plm by merging them before indexing. However, my data consists only of the columns country, year, hp and crime, so I do not understand how this error is possible.

Essentially, what I'm asking is: am I doing something wrong? Do I still need to merge these two IDs, or does the fault lie in duplicated rows? If the latter, is it possible to find the duplicates with code? (I have tried to look through my panel data manually to find duplicate IDs, i.e. several values of house prices for year 1 for country 1.)

If I run

any(table(data$country,data$year)!=1) 

I get TRUE. As far as I understand, this shows that there aren't any duplicates of the country+year combination.

Cedeno answered 7/5, 2020 at 14:21 Comment(0)

Consider the following well-formed panel data, in which every id-time pair occurs exactly once.

set.seed(42)
(d1 <- transform(expand.grid(id=1:2, time=1:2), X=rnorm(4), y=rnorm(4)))
#   id time          X           y
# 1  1    1  1.3709584  0.40426832
# 2  2    1 -0.5646982 -0.10612452
# 3  1    2  0.3631284  1.51152200
# 4  2    2  0.6328626 -0.09465904

library(plm)
plm(y ~ X, index=c("id", "time"), d1)
# works

Now let's duplicate the last row to simulate a flaw in the data,

(d1 <- rbind(d1, d1[nrow(d1), ]))
#    id time          X           y
# 1   1    1  1.3709584  0.40426832
# 2   2    1 -0.5646982 -0.10612452
# 3   1    2  0.3631284  1.51152200
# 4   2    2  0.6328626 -0.09465904
# 41  2    2  0.6328626 -0.09465904  ## duplicated (X and y may be different though)

where we get an error:

plm(y ~ X, index=c("id", "time"), d1)
# Error in pdim.default(index[[1]], index[[2]]) : 
#   duplicate couples (id-time)
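
The same flaw shows up when you cross-tabulate the two index columns, which is the check used in the question. Note that `any(table(...) != 1)` returning TRUE means at least one id-time cell does not occur exactly once, so it signals a problem rather than ruling one out. A minimal base-R sketch, using a hypothetical toy data frame `dd` that mirrors the flawed `d1` above:

```r
# Hypothetical toy panel mirroring d1: 2 ids x 2 periods, last row repeated
dd <- data.frame(id = c(1, 2, 1, 2, 2), time = c(1, 1, 2, 2, 2))

# Count rows per (id, time) cell; a clean balanced panel has 1 in every cell
tab <- table(id = dd$id, time = dd$time)

any(tab != 1)  # TRUE: some cell is duplicated (> 1) or missing (0)
any(tab > 1)   # TRUE only if some (id, time) pair occurs more than once

# List the offending cells together with their counts
dup_cells <- subset(as.data.frame(tab), Freq > 1)
dup_cells
```

Testing `tab > 1` instead of `tab != 1` separates true duplicates from an unbalanced panel, where some cells are simply 0.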

Similarly, we get an error if the data has id, time and some additional condition, because each id-time pair then occurs once per condition:

(d2 <- transform(expand.grid(id=1:2, time=1:2, cond=0:1), X=rnorm(4), y=rnorm(4)))
#   id time cond          X          y
# 1  1    1    0  2.0184237 -1.3888607
# 2  2    1    0 -0.0627141 -0.2787888
# 3  1    2    0  1.3048697 -0.1333213
# 4  2    2    0  2.2866454  0.6359504
# 5  1    1    1  2.0184237 -1.3888607
# 6  2    1    1 -0.0627141 -0.2787888
# 7  1    2    1  1.3048697 -0.1333213
# 8  2    2    1  2.2866454  0.6359504


plm(y ~ X, index=c("id", "time"), d2)
# Error in pdim.default(index[[1]], index[[2]]) : 
#   duplicate couples (id-time)

To overcome this, we can technically merge the two indices, whatever that means statistically:

(d2 <- transform(d2, id2=apply(d2[c("id", "cond")], 1, paste, collapse=".")))
#   id time cond          X          y id2
# 1  1    1    0  2.0184237 -1.3888607 1.0
# 2  2    1    0 -0.0627141 -0.2787888 2.0
# 3  1    2    0  1.3048697 -0.1333213 1.0
# 4  2    2    0  2.2866454  0.6359504 2.0
# 5  1    1    1  2.0184237 -1.3888607 1.1
# 6  2    1    1 -0.0627141 -0.2787888 2.1
# 7  1    2    1  1.3048697 -0.1333213 1.1
# 8  2    2    1  2.2866454  0.6359504 2.1

plm(y ~ X, index=c("id2", "time"), d2)
# works
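
The merged key can also be built with a vectorized `paste()` and then checked for uniqueness before calling plm. A small sketch on a hypothetical data frame `dd2` with the same layout as `d2`; `anyDuplicated()` returns 0 when no row is repeated:

```r
# Hypothetical data with the same layout as d2: id x time x cond
dd2 <- expand.grid(id = 1:2, time = 1:2, cond = 0:1)

# Vectorized merge of id and cond into a single individual index
dd2$id2 <- paste(dd2$id, dd2$cond, sep = ".")

anyDuplicated(dd2[c("id", "time")])   # > 0: id-time alone is not unique
anyDuplicated(dd2[c("id2", "time")])  # 0: the merged key is unique per period
```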

Finally, the following stopifnot should pass without error on your own data, where c("id", "time") are the columns you passed to plm(..., index=c("id", "time")). On the flawed d1 from above it fails, as expected:

stopifnot(!any(duplicated(d1[c("id", "time")])))
# Error: !any(duplicated(d1[c("id", "time")])) is not TRUE
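
To see which rows are responsible, rather than only that some exist, one option is to flag every row whose index pair occurs more than once. This is a base-R sketch on a hypothetical toy data frame `dd` standing in for your data. Whether dropping the repeats is the right fix is an assumption that only holds if the rows are exact copies:

```r
# Hypothetical toy data mirroring the flawed d1, with the last row repeated
dd <- data.frame(id   = c(1, 2, 1, 2, 2),
                 time = c(1, 1, 2, 2, 2),
                 X    = c(1.37, -0.56, 0.36, 0.63, 0.63))
key <- dd[c("id", "time")]

# Scan from both ends so the first occurrence of each repeated pair is flagged too
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
dd[is_dup, ]  # all rows sharing a repeated (id, time) pair

# One possible fix if the repeats are exact copies: keep the first of each pair
dd_clean <- dd[!duplicated(key), ]
stopifnot(!any(duplicated(dd_clean[c("id", "time")])))
```

If the flagged rows differ in their other columns, inspect them before deciding which to keep or how to aggregate them.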
Shool answered 7/5, 2020 at 16:17 Comment(1)
Thank you for taking the time to respond! I was under the impression that too many IDs were the root of the error, but with your intuition I was able to resolve the problem quickly. To anyone who finds themselves with a similar problem: #6987157 shows how to find the duplicates in large datasets. - Cedeno
