In R ggplot2, include stat_ecdf() endpoints (0,0) and (1,1)

# libraries
require(ggplot2)
require(scales)

# fake data for reproducibility
set.seed(123)
n <- 200
df <- data.frame(model_score = rexp(n = n, rate = 1:n),
                 obs_set = sample(c("training", "validation"), n, replace = TRUE))
df$model_rank <- rank(df$model_score) / n
df$target_outcome <- rbinom(n, 1, 1 - df$model_rank)

# Plot Gain Chart using stat_ecdf()
ggplot(subset(df, target_outcome == 1), aes(x = model_rank)) +
  stat_ecdf(aes(colour = obs_set), size = 1) +
  scale_x_continuous(limits = c(0, 1), labels = percent, breaks = seq(0, 1, .1)) +
  xlab("Model Percentile") + ylab("Percent of Target Outcome") +
  scale_y_continuous(limits = c(0, 1), labels = percent) +
  geom_segment(aes(x = 0, y = 0, xend = 1, yend = 1),
               colour = "gray", linetype = "longdash", size = 1) +
  ggtitle("Gain Chart")

Unfortunately, the definition of stat_ecdf gives no wiggle room here; it determines the endpoints internally.

There is a somewhat advanced solution. The latest version of ggplot2 (devtools::install_github("hadley/ggplot2")) has improved extensibility to the point where this behavior can be overridden, though not without some boilerplate.

# Layer constructor: same interface as stat_ecdf(), plus optional minval/maxval
stat_ecdf2 <- function(mapping = NULL, data = NULL, geom = "step",
                       position = "identity", n = NULL, show.legend = NA,
                       inherit.aes = TRUE, minval = NULL, maxval = NULL, ...) {
  layer(
    data = data,
    mapping = mapping,
    stat = StatEcdf2,
    geom = geom,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    # parameters handed to StatEcdf2's calculate()
    stat_params = list(n = n, minval = minval, maxval = maxval),
    params = list(...)
  )
}


StatEcdf2 <- ggproto("StatEcdf2", StatEcdf,
  calculate = function(data, scales, n = NULL, minval = NULL, maxval = NULL, ...) {
    # Let the stock stat compute the ECDF, then move only its two end points
    df <- StatEcdf$calculate(data, scales, n, ...)
    # The first and last rows are the end points of the curve
    if (!is.null(minval)) { df$x[1] <- minval }
    if (!is.null(maxval)) { df$x[length(df$x)] <- maxval }
    df
  }
)

Now stat_ecdf2 behaves the same as stat_ecdf, but accepts optional minval and maxval parameters that set the x positions of the first and last points. So this will do the trick:

ggplot(subset(df,target_outcome==1),aes(x = model_rank)) +
  stat_ecdf2(aes(colour = obs_set), size=1, minval=0, maxval=1) +
  scale_x_continuous(limits=c(0,1), labels=percent,breaks=seq(0,1,.1)) +
  xlab("Model Percentile") + ylab("Percent of Target Outcome") +
  scale_y_continuous(limits=c(0,1), labels=percent) +
  geom_segment(aes(x=0,y=0,xend=1,yend=1),
               colour = "gray", linetype="longdash", size=1) +
  ggtitle("Gain Chart")

The big caveat here is that I don't know if the current extensibility model will be supported in the future; it has changed several times in the past, and the change to use "ggproto" is recent -- like July 15th 2015 recent.
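If relying on ggplot2's extension mechanism is a concern, one possible way around it is to compute the ECDF yourself with base R and add the endpoints before plotting. The following is only a rough sketch under that assumption: ecdf_with_endpoints is a hypothetical helper name (not part of ggplot2), and the data is the same fake data as in the question.

```r
# Hypothetical helper (not part of ggplot2): ECDF of x as a data frame,
# with explicit (minval, 0) and (maxval, 1) end points added.
ecdf_with_endpoints <- function(x, minval = 0, maxval = 1) {
  xs <- sort(unique(x))
  data.frame(x = c(minval, xs, maxval),
             y = c(0, ecdf(x)(xs), 1))
}

# Same fake data as in the question
set.seed(123)
n <- 200
df <- data.frame(model_score = rexp(n = n, rate = 1:n),
                 obs_set = sample(c("training", "validation"), n, replace = TRUE))
df$model_rank <- rank(df$model_score) / n
df$target_outcome <- rbinom(n, 1, 1 - df$model_rank)

# One ECDF per obs_set, using only the rows with the target outcome
hits <- subset(df, target_outcome == 1)
gain <- do.call(rbind, lapply(split(hits, hits$obs_set), function(d) {
  data.frame(obs_set = d$obs_set[1], ecdf_with_endpoints(d$model_rank))
}))
```

The resulting gain data frame can then be drawn with an ordinary step layer, for example geom_step(data = gain, aes(x = x, y = y, colour = obs_set)), in place of the stat_ecdf() layer. This trades the convenience of a stat for code that only depends on base R's ecdf().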

As a plus, this gave me a chance to really dig into ggplot's internals, which is something that I've been meaning to do for a while.
