R: avoiding summary.plm
I'm using R to run a Monte-Carlo simulation studying the performance of panel data estimators. Because I'll be running a large number of trials, I need to get at least decent performance from my code.

Using Rprof on 10 trials of my simulation shows that a significant portion of the time is spent in calls to summary.plm. The first few lines of the summaryRprof output are provided below:

$by.total
                            total.time total.pct self.time self.pct
"trial"                          54.48     100.0      0.00      0.0
"coefs"                          53.90      98.9      0.06      0.1
"model.matrix"                   36.72      67.4      0.10      0.2
"model.matrix.pFormula"          35.98      66.0      0.06      0.1
"summary"                        33.82      62.1      0.00      0.0
"summary.plm"                    33.80      62.0      0.08      0.1
"r.squared"                      29.00      53.2      0.02      0.0
"FUN"                            24.84      45.6      7.52     13.8

I'm calling summary in my code because I need the standard errors of the coefficient estimates as well as the coefficients themselves (which I could get from the plm object directly). My calls look like:

regression <- plm(g ~ y0 + Xit, data=panel_data, model=model, index=c("country","period"))

coefficients_estimated <- summary(regression)$coefficients[,"Estimate"]
ses_estimated <- summary(regression)$coefficients[,"Std. Error"]

I have a nagging feeling that this is a huge waste of CPU time, but I don't know enough about how R does things to avoid calling summary. I'd appreciate any information on what's going on behind the scenes here, or some way of reducing the time this takes to execute.
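For reference, the two summary() calls above can at least be collapsed into one, so the model is only summarised once; a minimal sketch of the change (same objects as in the code above):

```r
# compute the summary once and reuse it for both extractions
reg_summary <- summary(regression)
coefficients_estimated <- reg_summary$coefficients[, "Estimate"]
ses_estimated <- reg_summary$coefficients[, "Std. Error"]
```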

Thaxton answered 11/4, 2011 at 2:59
Hmmm, a first thought is to combine the two summary calls. I'm going to try that and see if it helps things... – Thaxton
Combining the two summary calls does help quite a bit, but summary.plm is still a major hog. – Thaxton

You just need to look inside plm:::summary.plm to see what it is doing. When you do, you'll see that your two lines calling summary() on your model fit can be replaced with:

coefficients_estimated <- coef(regression)
ses_estimated <- sqrt(diag(vcov(regression)))

For example:

require(plm)
data("Produc", package = "plm")
zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp, 
          data = Produc, index = c("state","year"))

summary(zz) gives:

> summary(zz)
Oneway (individual) effect Within Model

....

Coefficients :
             Estimate  Std. Error t-value  Pr(>|t|)    
log(pcap) -0.02614965  0.02900158 -0.9017    0.3675    
log(pc)    0.29200693  0.02511967 11.6246 < 2.2e-16 ***
log(emp)   0.76815947  0.03009174 25.5273 < 2.2e-16 ***
unemp     -0.00529774  0.00098873 -5.3582 1.114e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
....

and the two lines I showed return for zz:

> coef(zz)
   log(pcap)      log(pc)     log(emp)        unemp 
-0.026149654  0.292006925  0.768159473 -0.005297741 
> sqrt(diag(vcov(zz)))
   log(pcap)      log(pc)     log(emp)        unemp 
0.0290015755 0.0251196728 0.0300917394 0.0009887257

You don't really provide enough information (your simulation code, or the full output from Rprof(), for example) to say whether this will help; it certainly doesn't look like vast amounts of time are spent in summary(). FUN is far more costly than anything else you show, and of the elements you do show, r.squared() is the only one that appears in plm:::summary.plm(), and its self-time is negligible.

So, whether the above speeds things up appreciably remains to be seen.
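One way to pin down what FUN actually is would be to profile again and rank functions by self-time rather than total time; a minimal sketch, with trial() standing in for the (hypothetical) per-replication function from the simulation:

```r
Rprof("sim.prof")            # start writing profiling samples to a file
for (i in 1:10) trial()      # trial() is the simulation step (hypothetical name)
Rprof(NULL)                  # stop profiling

out <- summaryRprof("sim.prof")
head(out$by.self)            # ranked by time spent in each function's own code
```

The by.self table attributes time to the function whose own code is executing, so the caller of FUN should appear just above or below it in the listing.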

Sexagenary answered 11/4, 2011 at 7:37
You're right, there is a negligible effect on speed. As I noted in my comment on the question, I got a significant improvement from reducing to one call to summary (~30% reduction in time). I didn't think to look into what summary.plm was doing, so I appreciate that insight for future problems. Regarding your advice to look at "FUN", I'm not sure what "FUN" is referring to. I would think an argument to apply, but I don't have any apply loops... Is there any way to easily find out what "FUN" is? In any case, answer accepted. Thank you. – Thaxton
@Thaxton FUN is usually the argument name for a function that is to be applied to chunks of data. We don't have your simulation code, so it is difficult to say what FUN is, but are you using any of the apply(), sapply(), lapply() family, etc.? – Sexagenary
Yeah, that's pretty much what I thought. I'm not using any of the apply family on large chunks of data, so it must be something else. I'm not going to make you go over my simulation code before I've gone through it with a fine-toothed comb. Thanks for all your help. – Thaxton

If you want to take things further, have a look at the actual function code of plm:::plm. You will notice that there is a lot of argument checking before a final call to plm:::plm.fit. You could, if you really wanted, skip straight to plm.fit.
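The internals can be printed directly to see what might be skipped; a sketch (the exact internal structure depends on the installed plm version, so check before relying on it):

```r
library(plm)

plm:::plm       # prints the source of the main wrapper, including argument checks
plm:::plm.fit   # the internal fitting routine it eventually delegates to
```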

One final point. You mention that your problem is a Monte Carlo simulation. Can you leverage parallel computing for your speed increases?
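A Monte Carlo loop parallelises naturally because the replications are independent. A minimal sketch using foreach with a doParallel backend, assuming a hypothetical trial() function that returns the vector of estimates from one replication (foreach and doParallel must be installed):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(4)                  # four worker processes
registerDoParallel(cl)

# run 1000 independent replications; trial() is a hypothetical name
results <- foreach(i = 1:1000, .combine = rbind,
                   .packages = "plm") %dopar% trial(i)

stopCluster(cl)
```

Each worker loads plm via the .packages argument, and rbind stacks the per-replication estimate vectors into a matrix.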

Isle answered 11/4, 2011 at 8:11
Good point; shame that function isn't linked from ?plm, as I wondered whether there was a heavy-lifting compute core that @Thaxton could leverage. I gave up looking after a cursory glance at the help for plm(). – Sexagenary
As you point out, I think looking at FUN may be @wilduck's best bet. – Isle
I'm planning on leveraging parallel computing; since Monte Carlo is embarrassingly parallel, I was thinking foreach would work pretty well. – Thaxton

Just use coeftest(zz). coeftest() is in the lmtest package; it gives you the coefficients and standard errors from plm objects much more quickly than summary.plm.
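Using the Produc example from the earlier answer, the extraction would look like this (a sketch; lmtest must be installed):

```r
library(plm)
library(lmtest)

data("Produc", package = "plm")
zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
          data = Produc, index = c("state", "year"))

ct <- coeftest(zz)              # matrix with estimates, SEs, t-values, p-values
coefs <- ct[, "Estimate"]
ses   <- ct[, "Std. Error"]
```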

At answered 4/3, 2012 at 21:55

© 2022 - 2024 — McMap. All rights reserved.