See edit below. Using R, I would like to filter a matrix of gene expression data and keep only the rows (genes/probes) whose values show high variance. For example, I'd like to keep only the rows that have values in the bottom and top percentiles (e.g. below 20% and above 80%). I want to limit my downstream analyses to genes with high variance. Are there common ways to filter genes in R?
My matrix has 18 samples (columns) and 47000 probes (rows), with values that are log2-transformed and normalized. I know the quantile() function can identify the 20% and 80% cutoffs within each sample column. I can't figure out how to find these values for the entire matrix, and then subset the original matrix to remove all the "non-varying" rows.
Example matrix with a mean of 5.97. The last three rows should be removed because all of their values fall between the 20% and 80% cutoffs:
> m
             sample1 sample2 sample3 sample4 sample5 sample6
ILMN_1762337    7.86    5.05    4.89    5.74    6.78    6.41
ILMN_2055271    5.72    4.29    4.64    5.00    6.30    8.02
ILMN_1736007    3.82    6.48    6.06    7.13    8.20    4.06
ILMN_2383229    6.34    4.34    6.12    6.83    4.82    5.57
ILMN_1806310    6.15    6.37    5.54    5.22    4.59    6.28
ILMN_1653355    7.01    4.73    6.62    6.27    4.77    6.12
ILMN_1705025    6.09    6.68    6.80    6.85    8.35    4.15
ILMN_1814316    5.77    5.17    5.94    6.51    7.12    7.20
ILMN_1814317    5.97    5.97    5.97    5.97    5.97    5.97
ILMN_1814318    5.97    5.97    5.97    5.97    5.97    5.97
ILMN_1814319    5.97    5.97    5.97    5.97    5.97    5.97
I'd appreciate any suggestions, or functions that I should look into. Thanks!
EDIT
Sorry, I was not very clear in the OP. (1) I'd like to know the 20% and 80% cutoff values for the entire matrix (not just for each individual sample). (2) Then, if a row contains any value in the upper or lower percentiles, R should keep that row. If all of a row's values (across all samples) fall near the mean, the row should be thrown out.
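To make the two steps concrete, here is a minimal base-R sketch of what I think this would look like, using the example matrix above. The names cutoffs, keep and m.filtered are just placeholders, and the strict comparisons (< and >) are an assumption about how values sitting exactly on a cutoff should be treated.

```r
# Example matrix from the question (11 probes x 6 samples).
vals <- c(
  7.86, 5.05, 4.89, 5.74, 6.78, 6.41,
  5.72, 4.29, 4.64, 5.00, 6.30, 8.02,
  3.82, 6.48, 6.06, 7.13, 8.20, 4.06,
  6.34, 4.34, 6.12, 6.83, 4.82, 5.57,
  6.15, 6.37, 5.54, 5.22, 4.59, 6.28,
  7.01, 4.73, 6.62, 6.27, 4.77, 6.12,
  6.09, 6.68, 6.80, 6.85, 8.35, 4.15,
  5.77, 5.17, 5.94, 6.51, 7.12, 7.20,
  rep(5.97, 18)
)
probes <- c("ILMN_1762337", "ILMN_2055271", "ILMN_1736007", "ILMN_2383229",
            "ILMN_1806310", "ILMN_1653355", "ILMN_1705025", "ILMN_1814316",
            "ILMN_1814317", "ILMN_1814318", "ILMN_1814319")
m <- matrix(vals, ncol = 6, byrow = TRUE,
            dimnames = list(probes, paste0("sample", 1:6)))

# (1) Cutoffs for the entire matrix: quantile() treats a matrix as one
#     long vector, so no per-column loop is needed.
cutoffs <- quantile(m, probs = c(0.2, 0.8))

# (2) Keep a row if any of its values lies outside the two cutoffs
#     (strict comparison is an assumption; use <= / >= to include ties).
keep <- apply(m, 1, function(x) any(x < cutoffs[1] | x > cutoffs[2]))
m.filtered <- m[keep, , drop = FALSE]

nrow(m)           # probes before filtering
nrow(m.filtered)  # probes after filtering
```

On this example the three constant rows are the only ones dropped. For the full 47000 x 18 matrix, rowSums(m < cutoffs[1] | m > cutoffs[2]) > 0 should give the same logical vector without the apply() call.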
class(m). – Acrospire
m.var <- varFilter(m, var.func=IQR, var.cutoff=0.6, filterByQuantile=TRUE) and used nrow(m) and nrow(m.var) to compare the number of probes remaining after filtering. – Phlebosclerosis
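For anyone without the package installed, the idea behind the varFilter() call in the comment can be sketched in base R. This is only an approximation of the concept (rank probes by IQR and keep those above the 0.6 quantile), not the package's actual implementation. The matrix here is simulated, and row.iqr, cutoff and keep are hypothetical names.

```r
# Simulated stand-in for a log2 expression matrix (200 probes x 18 samples).
set.seed(1)
m <- matrix(rnorm(200 * 18, mean = 6, sd = 1), nrow = 200,
            dimnames = list(paste0("probe", 1:200), paste0("sample", 1:18)))

# Per-probe spread, measured as the interquartile range of each row.
row.iqr <- apply(m, 1, IQR)

# Keep probes whose IQR is above the 0.6 quantile of all row IQRs,
# i.e. roughly the most variable 40% of probes.
cutoff <- quantile(row.iqr, probs = 0.6)
keep <- row.iqr > cutoff
m.var <- m[keep, , drop = FALSE]

nrow(m)      # probes before filtering
nrow(m.var)  # probes after filtering
```

Unlike the whole-matrix cutoffs in the question, this ranks each probe by its own spread across samples, so a probe that is consistently high or consistently low in every sample would be removed.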