How to combine stat_ecdf with geom_ribbon?
Asked Answered
P

3

7

I am trying to draw an ECDF of some data with a "confidence interval" represented via a shaded region using ggplot2. I am having trouble combining geom_ribbon() with stat_ecdf() to achieve the effect I am after.

Consider the following example data:

set.seed(1)
dat <- data.frame(variable = rlnorm(100) + 2)
dat <- transform(dat, lower = variable - 2, upper = variable + 2)

> head(dat)
  variable     lower    upper
1 2.534484 0.5344838 4.534484
2 3.201587 1.2015872 5.201587
3 2.433602 0.4336018 4.433602
4 6.929713 4.9297132 8.929713
5 3.390284 1.3902836 5.390284
6 2.440225 0.4402254 4.440225

I am able to produce an ECDF of variable using

library("ggplot2")
ggplot(dat, aes(x = variable)) +
    geom_step(stat = "ecdf")

However I am unable to use lower and upper as the ymin and ymax aesthetics of geom_ribbon() to superimpose the confidence interval on the plot as another layer. I have tried:

ggplot(dat, aes(x = variable)) +
    geom_ribbon(aes(ymin = lower, ymax = upper), stat = "ecdf") +
    geom_step(stat = "ecdf")

but this raises the following error

Error: geom_ribbon requires the following missing aesthetics: ymin, ymax

Is there a way to coax geom_ribbon() into working with stat_ecdf() to produce a shaded confidence interval? Or, can anyone suggest an alternative means of adding a shaded polygon defined by lower and upper as a layer to the ECDF plot?

Postaxial answered 29/11, 2013 at 2:55 Comment(0)
V
3

Try this (a bit of shot in the dark):

ggplot(dat, aes(x = variable)) +
  geom_ribbon(aes(x = variable,ymin = ..y..-2,ymax = ..y..+2), stat = "ecdf",alpha=0.2) +
  geom_step(stat = "ecdf")

Ok, so that's not the same thing as what you trying to do, but it should explain what's going on. The stat is returning a data frame with just the original x and the computed y, so I think that's all you have to work with. i.e. stat_ecdf only computes the cumulative distribution function for a single x at a time.

The only other thing I can think of is the obvious, calculating the lower and upper separately, something like this:

l <- ecdf(dat$lower)
u <- ecdf(dat$upper)
v <- ecdf(dat$variable)
dat$lower1 <- l(dat$variable)
dat$upper1 <- u(dat$variable)
dat$variable1 <- v(dat$variable)

ggplot(dat,aes(x = variable)) + 
  geom_step(aes(y = variable1)) + 
  geom_ribbon(aes(ymin = upper1,ymax = lower1),alpha = 0.2)
Verbal answered 29/11, 2013 at 3:46 Comment(3)
Thanks Joran. Could you expand on your last sentence? Not sure I follow this fully, but as far as I can tell from your answer, I can't do this via stat_ecdf if lower and upper already exist? The +/- 2 bit was just dummy data; the CI info I have is the result of posterior simulation of a derived statistic computed from a fitted model.Postaxial
@GavinSimpson Yes, I think it's not possible directly in ggplot (although that would be a nice feature to add, I think). All I meant with the last bit was that you might have to calculate all the ECDF values manually and then plot them.Verbal
Thanks, I see what you mean, compute the cumulative proportion directly. I'll give that a do. +1Postaxial
E
2

Not sure exactly how you want to reflect the CI, but ggplot_build() lets you get the generated data back from the plot, you can then overplot what you like.

This chart shows:

  • red = original ribbon
  • blue = takes the original CI vectors and applies to the ecdf curve
  • green = calculates the ecdf of upper and lower series and plots

enter image description here

    g<-ggplot(dat, aes(x = variable)) +
      geom_step(stat = "ecdf") +
      geom_ribbon(aes(ymin = lower, ymax = upper), alpha=0.5, fill="red") 

    inside<-ggplot_build(g)
    matched<-merge(inside$data[[1]],data.frame(x=dat$variable,dat$lower,dat$upper),by=("x"))

    g + 
      geom_ribbon(data=matched, aes(x = x, 
                                      ymin = y + dat.upper-x,
                                      ymax = y - x + dat.lower), 
                    alpha=0.5, fill="blue") +
      geom_ribbon(data=matched, aes(x = x, 
                                      ymin = ecdf(dat.lower)(x),
                                      ymax = ecdf(dat.upper)(x)), 
                    alpha=0.5, fill="green")
Euhemerism answered 29/11, 2013 at 4:41 Comment(1)
Thanks Troy; your final idea, once I realised what the plot showed, is similar to @joran's idea, namely that one could compute the appropriate y data for the lower and upper CIs using ecdf(). The green ribbon is essentially what I want to depict.Postaxial
H
0

I was looking for a solution to a similar problem and came across this answer. I slightly modified the procedure to make it more universal - it also worked on charts with different aesthetics and facets.

First, I load the necessary packages and generate the dummy data.

library(reprex)
library(tidyverse)

set.seed(118)
data1 <- tibble(a = rnorm(100, mean = 20, sd = 3),
                ID = "a",
                gr = "A") %>% 
  bind_rows(tibble(a = rnorm(100, mean = 30, sd = 3),
                   ID = "z",
                   gr = "A")) %>% 
  bind_rows(tibble(a = rnorm(100, mean = 25, sd = 3),
                   ID = "a",
                   gr = "B")) %>% 
  bind_rows(tibble(a = rnorm(100, mean = 40, sd = 3),
                   ID = "z",
                   gr = "B")) %>% 
  mutate(CI_low = a - 3,
         CI_high = a + 3)

Then I generate a chart:

data1 %>% 
  ggplot(aes(x = a, color = gr))+
  geom_step(stat = "ecdf")+
  facet_wrap(~ ID)

After generating the chart, you can preview its data, along with the calculated ecdf:


ggplot_build(last_plot())$data[[1]] %>% 
  as_tibble()
#> # A tibble: 408 × 10
#>    colour      y      x  ecdf flipped_aes PANEL group linewidth linetype alpha
#>    <chr>   <dbl>  <dbl> <dbl> <lgl>       <fct> <int>     <dbl>    <dbl> <lgl>
#>  1 #F8766D  0    -Inf    0    FALSE       1         1       0.5        1 NA   
#>  2 #F8766D  0.09   15.0  0.09 FALSE       1         1       0.5        1 NA   
#>  3 #F8766D  0.55   20.5  0.55 FALSE       1         1       0.5        1 NA   
#>  4 #F8766D  0.51   20.0  0.51 FALSE       1         1       0.5        1 NA   
#>  5 #F8766D  0.6    20.9  0.6  FALSE       1         1       0.5        1 NA   
#>  6 #F8766D  0.08   14.9  0.08 FALSE       1         1       0.5        1 NA   
#>  7 #F8766D  0.19   16.9  0.19 FALSE       1         1       0.5        1 NA   
#>  8 #F8766D  0.47   19.7  0.47 FALSE       1         1       0.5        1 NA   
#>  9 #F8766D  0.73   22.0  0.73 FALSE       1         1       0.5        1 NA   
#> 10 #F8766D  0.34   18.5  0.34 FALSE       1         1       0.5        1 NA   
#> # ℹ 398 more rows

The next step is to assign the values ​​of the grouping variables to individual aesthetics (in my case, color) and groups of variables (facets' names). This must be done manually, paying special attention to ensure that the assignment is correct.

graph_data <- ggplot_build(last_plot())$data[[1]] %>% 
  as_tibble() %>% 
  mutate(gr = case_match(group,
                         1 ~ "A",
                         2 ~ "B")) %>% 
  mutate(ID = case_match(PANEL,
                         "1" ~ "a",
                         "2" ~ "z")) %>% 
  select(a = x, ecdf, gr, ID)

In the next step, I combine the data with the original data and generate a new chart:

data1 %>% 
  left_join(graph_data) %>% 
  ggplot(aes(x = a, color = gr))+
  geom_ribbon(aes(xmin = CI_low, xmax = CI_high, y = ecdf, fill = gr, color = NULL), alpha = 0.1)+
  geom_step(stat = "ecdf")+
  facet_wrap(~ ID)
#> Joining with `by = join_by(a, ID, gr)`

Created on 2024-07-25 with reprex v2.1.1

Halliehallman answered 25/7 at 9:47 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.