R, filter matrix based on variance cut-offs
Asked Answered
P

3

10

See edit below Using R, I would like to filter a matrix (of gene expression data) and keep only the rows (genes/probes) that have values with high variance. For example, I'd like to only keep the rows that have values in the bottom and top percentiles (e.g. below 20% and above 80%). I want to limit my study to only genes under high variance for downstream analyses. Are there common ways for gene filtering in R?

My matrix has 18 samples (columns) and 47000 probes (rows) with values that are log2 transformed and normalized. I know the quantile() function can identify the 20% and 80% cutoffs within each sample column. I can't figure out how to find these values for the entire matrix, and then subset the original matrix to remove all the "non-varying" rows.

Example matrix with a mean of 5.97, thus the last three rows should be removed because they contain values between the 20% and 80% cutoffs:

> m

                sample1 sample2 sample3 sample4 sample5 sample6
ILMN_1762337    7.86    5.05    4.89    5.74    6.78    6.41
ILMN_2055271    5.72    4.29    4.64    5.00    6.30    8.02
ILMN_1736007    3.82    6.48    6.06    7.13    8.20    4.06
ILMN_2383229    6.34    4.34    6.12    6.83    4.82    5.57
ILMN_1806310    6.15    6.37    5.54    5.22    4.59    6.28
ILMN_1653355    7.01    4.73    6.62    6.27    4.77    6.12
ILMN_1705025    6.09    6.68    6.80    6.85    8.35    4.15
ILMN_1814316    5.77    5.17    5.94    6.51    7.12    7.20
ILMN_1814317    5.97    5.97    5.97    5.97    5.97    5.97
ILMN_1814318    5.97    5.97    5.97    5.97    5.97    5.97
ILMN_1814319    5.97    5.97    5.97    5.97    5.97    5.97

I'd appreciate any suggestions, or functions that I should look into. Thanks!

EDIT

Sorry, I was not very clear in the OP. (1) I'd like to know the 20% and 80% cutoff values for the entire matrix (not just for each individual sample). (2) Then, if any row contains a value in the upper or lower percentiles, R will keep these rows. If a row contains values (for all samples) that fall near the mean, these rows are thrown out.

Phlebosclerosis answered 8/6, 2013 at 21:17 Comment(4)
Thanks for the clarification. I have also updated my answer to reflect what you were hoping to acheive. Quick question - do you have a matrix or a dataframe (i.e. is the ID column the rownames of your matrix or the first column of your dataframe?). A quick way to check would be class(m).Acrospire
It should be a matrix (only expression data) and the ID column is the rownames for my matrix (I should have left the "ID" name off my example).Phlebosclerosis
Ok great! That's what I thought with my example.Acrospire
I ended up using the bioconductor package 'genefilter' that was suggested below, using this code: m.var <- varFilter(m, var.func=IQR, var.cutoff=0.6, filterByQuantile=TRUE) and used nrow(m) and row(m.var) to compare the number of probes remaining after filtering.Phlebosclerosis
L
9

Ok, assuming you have a matrix (so I am assuming that your ID column is actually rownames) then this is very simple to do.

#  First find the desired quantile breaks for the entire matrix
qt <- quantile( m , probs = c(0.2,0.8) )
# 20%  80% 
#5.17 6.62 
#  Next get a logical vector of the rows that have any values outside these breaks
rows <- apply( m , 1 , function(x) any( x < qt[1] | x > qt[2] ) )
#  Subset on this vector
m[ rows , ]
#            sample1 sample2 sample3 sample4 sample5 sample6
#ILMN_1762337    7.86    5.05    4.89    5.74    6.78    6.41
#ILMN_2055271    5.72    4.29    4.64    5.00    6.30    8.02
#ILMN_1736007    3.82    6.48    6.06    7.13    8.20    4.06
#ILMN_2383229    6.34    4.34    6.12    6.83    4.82    5.57
#ILMN_1806310    6.15    6.37    5.54    5.22    4.59    6.28
#ILMN_1653355    7.01    4.73    6.62    6.27    4.77    6.12
#ILMN_1705025    6.09    6.68    6.80    6.85    8.35    4.15
#ILMN_1814316    5.77    5.17    5.94    6.51    7.12    7.20

The any( x < qt[1] | x > qt[2] ) part of the apply function (which is designed to apply a function across the margins of a matrix) returns TRUE if any value in that row is outside the 20% and 80% quantiles of your sample matrix. By definition, if no value is outside these bounds it returns FALSE indicating we will drop that row in the next line.

Landlady answered 8/6, 2013 at 21:49 Comment(4)
Thanks for your edit! Your solution seems very straight-forward.Phlebosclerosis
@Phlebosclerosis Great! If any of the 3 solutions presented adequately solved your problem, you may want to consider marking one of them as the accepted answer so that this question no longer appears unanswered (of course if you need something more please tell us). Here is the about page in case you need any more information about how SO works. Cheers!Acrospire
This solution does exactly what I was asking in my OP! However, I am also exploring the Bioconductor genefilter package for alternative methods that may be very useful. Thank you everyone for your great support and patience with my limited knowledge of R and stackoverflow.Phlebosclerosis
@SimonO'Hanlon I have the same problem. For selecting top 20% highly variable genes in the entire matrix Can I use in this way? h.var <-varFilter(h, var.func=IQR, var.cutoff=0.8, filterByQuantile=TRUE)Baseboard
M
5

The Biocondcutor genefilter package provides common filters relevant to microarray analysis. A typical filter based on row-wise variability would be

m = matrix(rnorm(47000 * 6), 47000)
varFilter(m)

The package landing page references vignettes illustrating basic operation and providing diagnostic guidance for use of filtering.

A basic principle in the analysis of microarrays is that values in a row are comparable, but not values between rows. This is because the probes associated with each row have distinct characteristics that introduce row-specific bias -- a value in the first row could reasonably indicate more, less or equal gene expression compared to a value for the same sample in a second row. This means that @Todd's desire to normalize based on between-row comparison (largest and smallest values in the entire matrix) is not recommended. Instead, varFilter calculates a measure of variability of each row (row inter-quartile range) and selects a fraction (the var.cutoff argument) with most variability.

A quick peak at the definition of varFilter shows that in general this is no more tricky than, for some measure of row-wise variability var.func and a (single) quantile var.cutoff

vars <- apply(m, 1, var.func)
m[vars > quantile(vars, var.cutoff), ]
Martlet answered 8/6, 2013 at 22:15 Comment(4)
+1! looks nice. I did the following ,mm <- as.matrix(dat[,-1]);rownames(mm) <- dat[,1];rr <- varFilter(mm,var.cutoff=c(0.2,0.8)), but I get a warnings, does I miss something?Vacant
Thanks for the tip - I will look at this immediately, as there may be standard ways to do this.Phlebosclerosis
Thanks @Martin for your explanation of "best practices" for gene array filtering. I plan to use your solution for my data set.Phlebosclerosis
@Martin I have the same problem. For selecting top 20% highly variable genes in the entire matrix Can I use in this way? h.var <-varFilter(h, var.func=IQR, var.cutoff=0.8, filterByQuantile=TRUE)Baseboard
V
1

I am not a statistician, So I don't know if there is a general method to resolve this. For me the problem will be simpler if you reshape your data in the long format.

library(reshape2)
dat.m <- melt(dat)
dat.m$value <- as.numeric(dat.m$value)
head(dat.m)
            ID variable value
1 ILMN_1762337  sample1  7.86
2 ILMN_2055271  sample1  5.72
3 ILMN_1736007  sample1  3.82
4 ILMN_2383229  sample1  6.34
5 ILMN_1806310  sample1  6.15
6 ILMN_1653355  sample1  7.01

Then for each variable you do the following :

  1. Compute limits using quantile
  2. remove genes that don't satisfy the condition.

You can do this for example , using ddply from plyr:

res <- ddply(dat.m,.(variable),function(x){
  ## compute limits for each sample
  z <- x$value
  qq <- quantile(z, probs = c(0.2,0.8))
  ## keep only genes with high or low variance
  dd <- x[z < qq[1] | z > qq[2],]
})
## return to the wide format
acast(res,ID~variable)

            sample1 sample2 sample3 sample4 sample5 sample6
ILMN_1653355    7.01      NA    6.62      NA    4.77      NA
ILMN_1705025      NA    6.68    6.80    6.85    8.35    4.15
ILMN_1736007    3.82    6.48      NA    7.13    8.20    4.06
ILMN_1762337    7.86      NA    4.89      NA      NA      NA
ILMN_1806310      NA      NA      NA    5.22    4.59      NA
ILMN_1814316      NA      NA      NA      NA      NA    7.20
ILMN_2055271    5.72    4.29    4.64    5.00      NA    8.02
ILMN_2383229      NA    4.34      NA      NA      NA      NA

EDIT after OP clarification , if you want the 20% and 80% cutoff values for the entire matrix not just for each individual sample, you compute qq outside the ddply

   qq <- quantile(dat.m$value, probs = c(0.2,0.8))

Then you comment the corresponding line , like this :

res <- ddply(dat.m,.(variable),function(x){
  z <- x$value
  ## keep only genes with high or low variance
  dd <- x[z < qq[1] | z > qq[2],]
})

PS here dat is :

dat <- read.table(text='         ID    sample1 sample2 sample3 sample4 sample5 sample6
ILMN_1762337    7.86    5.05    4.89    5.74    6.78    6.41
ILMN_2055271    5.72    4.29    4.64    5.00    6.30    8.02
ILMN_1736007    3.82    6.48    6.06    7.13    8.20    4.06
ILMN_2383229    6.34    4.34    6.12    6.83    4.82    5.57
ILMN_1806310    6.15    6.37    5.54    5.22    4.59    6.28
ILMN_1653355    7.01    4.73    6.62    6.27    4.77    6.12
ILMN_1705025    6.09    6.68    6.80    6.85    8.35    4.15
ILMN_1814316    5.77    5.17    5.94    6.51    7.12    7.20
ILMN_1814317    5.97    5.97    5.97    5.97    5.97    5.97
ILMN_1814318    5.97    5.97    5.97    5.97    5.97    5.97
ILMN_1814319    5.97    5.97    5.97    5.97    5.97    5.97',header=TRUE)
Vacant answered 8/6, 2013 at 22:1 Comment(5)
+1 I think we may both have misunderstood the OP though. I think they want to calculate the 20 and 80% cutoffs of the entire matrix and then subset on these values.Acrospire
Yes, it's the "I can't figure out how to find these values for the entire matrix" and then the "the last three rows should be removed because they contain values between the 20% and 80% cutoffs" that make me doubt our original assumption. Perhaps the OP can clarify?Acrospire
Sorry, I was not very clear in the OP. (1) I'd like to know the 20% and 80% cutoff values for the entire matrix (not just for each individual sample). Then, if any row contains a value in the upper or lower percentiles, R will keep these rows. If a row contains values (for all samples) that fall near the mean, these rows are thrown out. Does that help? Thanks a lot for your suggestions -- I'm pretty new at R.Phlebosclerosis
@Phlebosclerosis I edit my answer. Actually, I recommend you to use M.M solution.Vacant
@Vacant one more clarification - the dataframe is actually a matrix and the ID column is actually rownames - OP put ID there for aesthetic purposes rather than that being the actual data.Acrospire

© 2022 - 2024 — McMap. All rights reserved.