Different number of outliers with ggplot2
Can somebody explain why I get a different number of outliers with the base boxplot command than with geom_boxplot from ggplot2? Here is an example:

x <- c(280.9, 135.9, 321.4, 333.7, 0.2, 71.3, 33.0, 102.6, 126.8, 194.8, 35.5, 
       107.3, 45.1, 107.2, 55.2, 28.1, 36.9, 24.3, 68.7, 163.5, 0.8, 31.8, 121.4, 
       84.7, 34.3, 25.2, 101.4, 203.2, 194.1, 27.9, 42.5, 47.0, 85.1, 90.4, 103.8, 
       45.1, 94.0, 36.0, 60.9, 97.1, 42.5, 96.4, 58.4, 174.0, 173.2, 164.1, 92.1, 
       41.9, 130.2, 94.7, 121.5, 261.4, 46.7, 16.3, 50.7, 112.9, 112.2, 242.5, 140.6, 
       112.6, 31.2, 36.7, 97.4, 140.5, 123.5, 42.9, 59.4, 94.5, 37.4, 232.2, 114.6, 
       60.7, 27.8, 115.5, 111.9, 60.1)
data <- data.frame(x)
boxplot(data$x)
library(ggplot2)
ggplot(data, aes(y=x)) + geom_boxplot()

With the boxplot command I get the plot below with 4 outliers.

And with ggplot2 I get the plot below with 5 outliers.

Whitleywhitlock answered 15/12, 2018 at 16:10 Comment(4)
Look at the y limits. You're essentially zooming in. – Integer
Given that both plots show data from 200-300, and that's where the extra outlier is, this isn't a zoom issue. – Deer
ggplot2 and base boxplot use the same range (1.5), but do they calculate quantiles the same way? – Mortify
(boxplot(data$x)) shows that its upper hinge is at 122.5, not 122.0 as suggested by quantile(data$x). This would put the end of the whisker at 242.5, which is above the 241.25 point. @dww's excellent answer demonstrates a way to mitigate this. – Toole

ggplot and boxplot use slightly different methods to calculate the statistics. From ?geom_boxplot we can see:

The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). This differs slightly from the method used by the boxplot() function, and may be apparent with small samples. See boxplot.stats() for more information on how hinge positions are calculated for boxplot().
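To make the difference concrete for the data in the question, here is a minimal sketch that compares the two hinge definitions and the outlier counts they lead to. It assumes ggplot2 computes its quartiles with the default quantile() method, which is consistent with the values reported elsewhere in this thread; count_out is a hypothetical helper name introduced only for this illustration.

```r
# Data from the question
x <- c(280.9, 135.9, 321.4, 333.7, 0.2, 71.3, 33.0, 102.6, 126.8, 194.8, 35.5,
       107.3, 45.1, 107.2, 55.2, 28.1, 36.9, 24.3, 68.7, 163.5, 0.8, 31.8, 121.4,
       84.7, 34.3, 25.2, 101.4, 203.2, 194.1, 27.9, 42.5, 47.0, 85.1, 90.4, 103.8,
       45.1, 94.0, 36.0, 60.9, 97.1, 42.5, 96.4, 58.4, 174.0, 173.2, 164.1, 92.1,
       41.9, 130.2, 94.7, 121.5, 261.4, 46.7, 16.3, 50.7, 112.9, 112.2, 242.5, 140.6,
       112.6, 31.2, 36.7, 97.4, 140.5, 123.5, 42.9, 59.4, 94.5, 37.4, 232.2, 114.6,
       60.7, 27.8, 115.5, 111.9, 60.1)

# Box edges as boxplot.stats() defines them: Tukey hinges from fivenum()
hinges <- fivenum(x)[c(2, 4)]                  # 42.5 122.5

# Box edges assumed for ggplot2: default quantile() at 25% and 75%
quarts <- unname(quantile(x, c(0.25, 0.75)))   # 42.5 122.0

# Hypothetical helper: count points beyond 1.5 * box height from the box edges
count_out <- function(x, box) {
  reach <- 1.5 * diff(box)
  sum(x < box[1] - reach | x > box[2] + reach)
}

n_base <- count_out(x, hinges)   # upper fence 242.5, so 242.5 is not an outlier
n_gg   <- count_out(x, quarts)   # upper fence 241.25, so 242.5 is an outlier
c(base = n_base, ggplot2 = n_gg)
```

The half-unit shift in the upper box edge moves the upper fence from 242.5 to 241.25, which is just enough to turn the observation 242.5 into a fifth outlier.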

You can get ggplot to use boxplot.stats if you want the same results:

# Function to use boxplot.stats to set the box-and-whisker locations  
f.bxp = function(x) {
  bxp = boxplot.stats(x)[["stats"]]
  names(bxp) = c("ymin","lower", "middle","upper","ymax")
  bxp
}  

# Function to use boxplot.stats for the outliers
f.out = function(x) {
  data.frame(y=boxplot.stats(x)[["out"]])
}

To use those functions in ggplot:

ggplot(data, aes(x=0, y=x)) + 
  stat_summary(fun.data=f.bxp, geom="boxplot") + 
  stat_summary(fun.data=f.out, geom="point")


If you want to replicate the statistics that ggplot uses natively, these are explained in ?geom_boxplot as follows:

ymin = lower whisker = smallest observation greater than or equal to lower hinge - 1.5 * IQR

lower = lower hinge, 25% quantile

notchlower = lower edge of notch = median - 1.58 * IQR / sqrt(n)

middle = median, 50% quantile

notchupper = upper edge of notch = median + 1.58 * IQR / sqrt(n)

upper = upper hinge, 75% quantile

ymax = upper whisker = largest observation less than or equal to upper hinge + 1.5 * IQR

We can calculate these accordingly:

y = sort(x)
iqr = quantile(y,0.75) - quantile(y,0.25)
ymin = y[which(y >= quantile(y,0.25) - 1.5*iqr)][1]
ymax = tail(y[which(y <= quantile(y,0.75) + 1.5*iqr)],1)
lower = quantile(y,0.25)
upper = quantile(y,0.75)
middle = quantile(y,0.5)

ggplot(data, aes(y=x)) + 
  geom_boxplot() +
  geom_hline(yintercept=c(ymin, lower, middle, upper, ymax),
             color='red', linetype='dashed')


We can also extract these statistics directly from a ggplot object using ggplot_build:

p <- ggplot(data, aes(y=x)) + geom_boxplot() 
ggplot_build(p)$data[[1]][, c("ymin", "lower", "middle", "upper", "ymax")]

#   ymin lower middle upper  ymax 
# 1  0.2  42.5  93.05   122 232.2 
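The same built layer also carries the outliers, so something close to the output of boxplot.stats() can be assembled from it. The following is a sketch only: it assumes the layer data exposes a list column named outliers (as current ggplot2 versions do), and the names layer, gg_stats and gg_out are introduced here for illustration.

```r
library(ggplot2)

# Data from the question
x <- c(280.9, 135.9, 321.4, 333.7, 0.2, 71.3, 33.0, 102.6, 126.8, 194.8, 35.5,
       107.3, 45.1, 107.2, 55.2, 28.1, 36.9, 24.3, 68.7, 163.5, 0.8, 31.8, 121.4,
       84.7, 34.3, 25.2, 101.4, 203.2, 194.1, 27.9, 42.5, 47.0, 85.1, 90.4, 103.8,
       45.1, 94.0, 36.0, 60.9, 97.1, 42.5, 96.4, 58.4, 174.0, 173.2, 164.1, 92.1,
       41.9, 130.2, 94.7, 121.5, 261.4, 46.7, 16.3, 50.7, 112.9, 112.2, 242.5, 140.6,
       112.6, 31.2, 36.7, 97.4, 140.5, 123.5, 42.9, 59.4, 94.5, 37.4, 232.2, 114.6,
       60.7, 27.8, 115.5, 111.9, 60.1)

p <- ggplot(data.frame(x), aes(y = x)) + geom_boxplot()

# Computed data for the first (and only) layer: one row per box
layer <- ggplot_build(p)$data[[1]]

# Five-number summary as ggplot2 draws it
gg_stats <- unlist(layer[1, c("ymin", "lower", "middle", "upper", "ymax")])

# Outliers are stored as a list column, one vector per box (assumed column name)
gg_out <- layer$outliers[[1]]

# Roughly analogous to boxplot.stats(x)[c("stats", "out")]
list(stats = gg_stats, out = gg_out)
```

For this data the outlier vector should contain five values, including 242.5, which boxplot.stats() keeps inside the whisker.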
Tugboat answered 15/12, 2018 at 16:29 Comment(2)
Is it possible to get the stats from the geom_boxplot like in boxplot.stats()? – Wagram
Thanks @Tugboat for answering so quickly. Just one thing: in the computation of ymin you must use >=, and in the computation of ymax <=, mustn't you? – Wagram
