prcomp and ggbiplot: invalid 'rot' value
R


I'm trying to do a PCA of my data using R, and I found this nice guide, which uses prcomp and ggbiplot. My data is two sample types with three biological replicates each (i.e. 6 rows) and around 20000 genes (i.e. variables). First, getting the PCA model with the code described in the guide doesn't work:

> pca = prcomp(data, center = T, scale. = T)
Error in prcomp.default(data, center = T, scale. = T) : 
cannot rescale a constant/zero column to unit variance

However, if I remove the scale. = T part, it works just fine and I get a model. Why is this, and is this the cause of the error below?

> summary(pca)
Importance of components:
                             PC1       PC2       PC3       PC4       PC5
Standard deviation     4662.8657 3570.7164 2717.8351 1419.3137 819.15844
Proportion of Variance    0.4879    0.2861    0.1658    0.0452   0.01506
Cumulative Proportion     0.4879    0.7740    0.9397    0.9849   1.00000

Secondly, plotting the PCA. Even just using the basic code, I get an error and an empty plot image:

> ggbiplot(pca)
Error: invalid 'rot' value

What does this mean, and how can I fix it? Does it have something to do with the (non)scale in making the PCA, or is it something different? It must be something with my data, I think, since if I use a standard example code (below) I get a really nice PCA plot.

> data(wine)
> wine.pca=prcomp(wine,scale.=T)
> print(ggbiplot(wine.pca, obs.scale = 1, var.scale = 1, groups = wine.class, 
  ellipse = TRUE, circle = TRUE))

[EDIT 1] I have tried subsetting my data in two ways: 1) removing all columns where all rows are 0, and 2) removing all columns where any row is 0. The first subset still gives me the scale error, but the second one does not. Why is this? How does this affect my PCA?

Also, I tried using the normal biplot command for both the original data (non-scaled) and the subsetted data above, and it works in both cases. So is it something to do with ggbiplot?

[EDIT 2] I have uploaded a subset of my data that gives me the error when I don't remove all the zeroes and works when I do. I haven't used gist before, but I think this is it. Or this...

Religiose answered 19/11, 2014 at 12:16 Comment(4)
Is there any way for you to provide your data, such as a dput of your dataset on gist? Or if it is large, a subset that still produces the error? It is difficult to try and diagnose a problem that we can't reproduce.Omentum
I've now added some data, any help is appreciated!Religiose
The data you provided on gist doesn't reproduce the error. I downloaded the file and prcomp and ggbiplot ran without error.Omentum
I realize now that the data I uploaded is not transposed (since I do that inside my script), and that I can also run prcomp on this data as-is. What I'm interested in is a PCA with the 10k variables (or however many variables I subset to) with the 20 or so different sample types. Does prcomp for the transposed dataset work for you?Religiose

After transposing your data, I was able to replicate your error. The first error is the primary problem. PCA seeks to maximize the variance captured by each component, so variables are usually scaled to keep a single variable with very high variance from dominating. The first error:

Error in prcomp.default(tdf, center = T, scale. = T) : 
  cannot rescale a constant/zero column to unit variance

This tells you that some of your variables have zero variance (i.e. no variability), so they cannot be rescaled to unit variance. Since PCA works by maximizing variance, there is no point in retaining these variables. They can easily be removed with the following call:

df_f <- data[,apply(data, 2, var, na.rm=TRUE) != 0]

Once you apply this filter, the remaining calls work as expected:

pca = prcomp(df_f, center = TRUE, scale. = TRUE)
ggbiplot(pca)
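
To see the error and the fix in isolation, here is a minimal sketch on made-up data. The data frame `toy` and its column names are hypothetical and only stand in for a samples-by-genes table; only base R's prcomp is used, so ggbiplot is not needed to run it.

```r
# Hypothetical toy data: 6 samples (rows) x 4 "genes" (columns).
set.seed(1)
toy <- data.frame(
  gene_a = rnorm(6),
  gene_b = rnorm(6),
  gene_c = rep(5, 6),  # constant, non-zero column: its variance is 0
  gene_d = rnorm(6)
)

# scale. = TRUE divides each column by its standard deviation,
# so a constant column makes prcomp stop with an error.
res <- tryCatch(
  prcomp(toy, center = TRUE, scale. = TRUE),
  error = function(e) conditionMessage(e)
)
print(res)

# Keep only the columns whose variance is non-zero; scaling then works.
keep <- apply(toy, 2, var, na.rm = TRUE) != 0
toy_f <- toy[, keep, drop = FALSE]
pca_toy <- prcomp(toy_f, center = TRUE, scale. = TRUE)
summary(pca_toy)
```

The `drop = FALSE` is a small safeguard so the result stays a data frame even if only one column survives the filter.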
Omentum answered 4/12, 2014 at 13:34 Comment(2)
Okay, that's great! I don't fully understand your code, though... You remove columns where there is no variance (var)? In what way is that different from removing all columns where there is a zero? (I understand there is a difference obviously, but not exactly how). My code removing zeroes looks like this: nonzero = data[ , apply(data, 2, function(x) all(x > 0))]Religiose
The difference is that I am removing columns with 0 variance, not those that contain a 0. A zero could be important, but a variable with no variance is not valuable in PCA.Omentum
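
The difference between the two filters can be sketched on a small made-up matrix. The matrix `toy` and its column names are hypothetical, chosen so that each column hits a different case; this is an illustration, not a statement about the asker's data.

```r
# Hypothetical toy matrix: 4 samples (rows) x 4 genes (columns).
toy <- cbind(
  all_zero      = c(0, 0, 0, 0),  # constant at zero
  const_nonzero = c(3, 3, 3, 3),  # constant, but not zero
  has_a_zero    = c(0, 2, 5, 1),  # varies and contains a zero
  no_zeros      = c(4, 2, 5, 1)   # varies, no zeros
)

# Filter from the comment above: keep columns where every value is > 0.
no_zero <- apply(toy, 2, function(x) all(x > 0))

# Filter from the answer: keep columns whose variance is non-zero.
nonzero_var <- apply(toy, 2, var, na.rm = TRUE) != 0

print(rbind(no_zero, nonzero_var))
```

In this toy case the "no zeros" filter throws away has_a_zero, which does vary, and keeps const_nonzero, which does not; the variance filter keeps exactly the two columns that vary.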
