Efficiently perform row-wise distribution test
I have a matrix in which each row is a sample from a distribution. I want to do a rolling comparison of the distributions using ks.test and save the test statistic in each case. The simplest way to implement this conceptually is with a loop:

set.seed(1942)
mt <- rbind(rnorm(5), rnorm(5), rnorm(5), rnorm(5))

results <- matrix(as.numeric(rep(NA, nrow(mt))))

for (i in 2:nrow(mt)) {
  results[i] <- ks.test(x = mt[i - 1, ], y = mt[i, ])$statistic
}

However, my real data has ~400 columns and ~300,000 rows for a single example, and I have a lot of examples. So I'd like this to be fast. The Kolmogorov-Smirnov test isn't all that mathematically complicated, so if the answer is "implement it in Rcpp," I'll grudgingly accept that, but I'd be somewhat surprised -- it's already very fast to compute on a single pair in R.
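
For context, the two-sample statistic is just the largest vertical gap between the two empirical CDFs, so a bare-bones version takes only a few lines of base R (a sketch; ks_stat is just an illustrative name):

ks_stat <- function(x, y) {
  # Evaluate both empirical CDFs at every pooled data point and take
  # the largest vertical distance between them.
  w <- c(x, y)
  max(abs(ecdf(x)(w) - ecdf(y)(w)))
}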

Methods I've tried but have been unable to get working: dplyr using rowwise/do/lag, zoo using rollapply (which is what I use to generate the distributions), and populating a data.table in a loop (edit: this one works, but it's still slow).

Guv answered 24/4, 2015 at 15:32 Comment(2)
Are you actually using the KernSmooth package? ks.test is in the stats package.Miyamoto
You are correct! I am using KernSmooth, but not for this function -- I'm using it to generate the distributions. I'll edit.Guv

A quick and dirty implementation in Rcpp

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// Two-sample KS statistic for equal-length samples x and y: after
// pooling and sorting, step +1 for a value that came from x and -1 for
// one that came from y; the statistic is the largest |running sum| / n.
double KS(arma::colvec x, arma::colvec y) {
  int n = x.n_rows;
  arma::colvec w = join_cols(x, y);
  arma::uvec z = arma::sort_index(w);
  w.fill(-1); w.elem( find(z <= n - 1) ).ones();  // +1 where the sorted value came from x
  return max(abs(cumsum(w))) / n;
}

// [[Rcpp::export]]
Rcpp::NumericVector K_S(arma::mat mt) {
  int n = mt.n_cols;
  Rcpp::NumericVector results(n);
  for (int i = 1; i < n; i++) {
    arma::colvec x = mt.col(i - 1);
    arma::colvec y = mt.col(i);
    results[i] = KS(x, y);  // compare consecutive columns
  }
  return results;
}
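
Assuming the snippet above is saved to a file (say ks.cpp; the filename is arbitrary), it can be compiled and loaded from R with:

library(Rcpp)
sourceCpp("ks.cpp")  # compiles the C++ above and exposes K_S() to R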

For a 400 × 30,000 input (t(mt) in the call below), it completes in under a second.

system.time(K_S(t(mt)))[3]
#elapsed 
#   0.98 

And the result agrees with the loop from the question:

set.seed(1942)
mt <- matrix(rnorm(400*30000), nrow=30000)
results <- rep(0, nrow(mt))
for (i in 2:nrow(mt)) {
  results[i] <- ks.test(x = mt[i - 1, ], y = mt[i, ])$statistic
}
result <- K_S(t(mt))
all.equal(result, results)
#[1] TRUE
Structural answered 24/4, 2015 at 18:17 Comment(6)
That's fast. I'll test it out!Guv
That's crazy fast. Excellent work. For comparison, I stopped my rollapplyr() solution after about 2 hours (it had generated nearly all the results at that point but was still running). Does it match the results from ks.test()?Lampert
I didn't check for the accuracy, hence the identifier "dirty".Structural
Not quite, but it's damn close: all.equal(results.ks2, results.cpp[2:280007]) [1] "Mean relative difference: 7.642923e-05". And it's approximately 9x faster than ks.test2 on my actual data.Guv
Given the performance and acceptable accuracy, I'd say this is likely to be your best solution, @Guv.Lampert
Indeed. Thanks all for the great input!Guv

One source of speed up is to write a smaller version of ks.test that does less. ks.test2 below is more restrictive than ks.test. It assumes, for example, that you have no missing values and that you always want the statistic associated with a two-sided test.

ks.test2 <- function(x, y) {
  n.x <- length(x)
  n.y <- length(y)
  w <- c(x, y)
  # Walk the pooled sample in sorted order, stepping up by 1/n.x for a
  # value from x and down by 1/n.y for a value from y; z tracks the
  # running difference between the two empirical CDFs.
  z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
  max(abs(z))
}

Verify that the output is consistent with ks.test.

set.seed(999)
x <- rnorm(400)
y <- rnorm(400)

ks.test(x, y)$statistic

    D 
0.045

ks.test2(x, y)

[1] 0.045

Now determine the savings from the smaller function:

library(microbenchmark)

microbenchmark(
  ks.test(x, y),
  ks.test2(x, y)
  )

Unit: microseconds
           expr      min       lq      mean   median        uq      max neval cld
  ks.test(x, y) 1030.238 1070.303 1347.3296 1227.207 1313.8490 6338.918   100   b
 ks.test2(x, y)  709.719  730.048  832.9532  833.861  888.5305 1281.284   100  a 
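
To plug this into the rolling comparison from the question, something like the following sketch should work (reusing the question's mt):

results <- c(NA, vapply(2:nrow(mt),
                        function(i) ks.test2(mt[i - 1, ], mt[i, ]),
                        numeric(1)))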
Miyamoto answered 24/4, 2015 at 17:27 Comment(2)
I'd be interested to see a benchmark of my rollapplyr() solution using this function in place of ks.test(). I'll test that out once the current benchmark finishes.Lampert
I'll be very interested in this as well! I'm currently testing out some of these answers myself.Guv

I was able to compute the pairwise Kolmogorov-Smirnov statistic using ks.test() with rollapplyr().

library(zoo)

results <- rollapplyr(data = big,
                      width = 2,
                      FUN = function(x) ks.test(x[1, ], x[2, ])$statistic,
                      by.column = FALSE)

This gets the expected result, but it's painfully slow for a dataset of your size. Part of the cost is that ks.test() computes much more than just the statistic at each iteration: it also calculates the p-value and does a lot of input checking.
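
If that overhead is the bottleneck, swapping the stripped-down ks.test2 from the answer above into the same call should help; an untested sketch:

results <- rollapplyr(data = big,
                      width = 2,
                      FUN = function(x) ks.test2(x[1, ], x[2, ]),
                      by.column = FALSE)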

Indeed, if we simulate a large dataset like so:

# 300,000 rows x 400 columns of standard-normal draws
big <- matrix(rnorm(300000 * 400), ncol = 400)

The rollapplyr() solution takes a long time; I halted execution after about 2 hours, at which point it had computed nearly all (but not all) results.

So while rollapplyr() is likely faster than a for loop, it's unlikely to be the best overall solution in terms of performance.

Lampert answered 24/4, 2015 at 17:33 Comment(0)

Here's a dplyr solution that gets the same result as your loop. I doubt it's actually faster than the loop, but perhaps it can serve as a first step towards a solution.

library(dplyr)

mt %>%
  as.data.frame %>%
  mutate_each(funs(lag)) %>%  # shift every column down: row i now holds row i - 1
  cbind(mt) %>%               # bind the lagged and original columns side by side
  slice(-1) %>%               # drop the first row, where the lag is NA
  rowwise %>%
  do({
    x <- unlist(.)
    n <- length(x)
    # the first half of x is the lagged row, the second half the current row
    data.frame(ks = ks.test(head(x, n/2), tail(x, n/2))$statistic)
  }) %>%
  unlist %>%
  c(NA, .) %>%  # reinstate the leading NA so the output lines up with mt's rows
  matrix
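
Assuming the pipeline's output is assigned to res_dplyr (a hypothetical name) and results comes from the question's loop, the two can be compared directly:

all.equal(c(res_dplyr), c(results))  # should be TRUE, matching the loop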
Agneta answered 24/4, 2015 at 17:29 Comment(0)
