The discrepancy arises from an ambiguity in the definition of quantiles. No single method is strictly correct or incorrect; there are simply different ways to estimate quantiles in situations (such as an even number of data points) where they do not neatly coincide with a specific data point and must be interpolated. Somewhat disconcertingly, boxplot and quantile (and other functions that provide summary statistics) use different default methods to calculate quantiles, although the default can be overridden using the type = argument in quantile.
We can see these differences in action by looking at some of the various ways to generate quantile statistics in R.
Both boxplot and fivenum give the same values:
boxplot.stats(X)$stats
# [1] 18.0 25.5 32.0 48.0 63.0
fivenum(X)
# [1] 18.0 25.5 32.0 48.0 63.0
In boxplot and fivenum, the lower (upper) quartile is equivalent to the median of the lower (upper) half of the data, where each half includes the median of the complete data:
c(median(X[ X <= median(X) ]), median(X[ X >= median(X) ]))
# [1] 25.5 48.0
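The same median-of-halves rule can be checked on any small vector. The sketch below uses a made-up data set, x_demo, which is an assumption for illustration and is not the X used above; it compares the hand-computed hinges with those reported by fivenum.

```r
# Hypothetical data with an even number of points (not the X used above)
x_demo <- c(2, 4, 7, 9, 12, 15, 20, 26)

med <- median(x_demo)                 # 10.5, falls between two data points
lower_half <- x_demo[x_demo <= med]   # 2 4 7 9
upper_half <- x_demo[x_demo >= med]   # 12 15 20 26

# Hinges by hand: the median of each half
hinges_by_hand <- c(median(lower_half), median(upper_half))

# Hinges as reported by fivenum (elements 2 and 4)
hinges_fivenum <- fivenum(x_demo)[c(2, 4)]

hinges_by_hand   # 5.5 17.5
hinges_fivenum   # 5.5 17.5
```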
But quantile and summary do things differently:
summary(X)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 18.00 26.25 32.00 35.75 46.50 63.00
quantile(X, c(0.25,0.5,0.75))
# 25% 50% 75%
# 26.25 32.00 46.50
The difference between these values and the results from boxplot and fivenum hinges on how the functions interpolate between data points. quantile attempts to interpolate by estimating the shape of the cumulative distribution function. According to ?quantile:
quantile returns estimates of underlying distribution quantiles based
on one or two order statistics from the supplied elements in x at
probabilities in probs. One of the nine quantile algorithms discussed
in Hyndman and Fan (1996), selected by type, is employed.
The full details of the nine methods that quantile can use to estimate the distribution function of the data can be found in ?quantile and are too lengthy to reproduce in full here. The important point to note is that the nine methods are taken from Hyndman and Fan (1996), who recommended type 8. The default method used by quantile is type 7, for historical reasons of compatibility with S.
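To make the interpolation concrete, here is a minimal sketch of types 5 and 7 only, written from the formulas given in ?quantile. Both types interpolate linearly between two neighbouring order statistics and differ only in the fractional position h at which they read off the sorted data. The helper interp_quantile and the vector x_demo are hypothetical names introduced for illustration; they are not part of base R or of the data above.

```r
# Hypothetical helper (not part of base R): linear interpolation between
# order statistics for quantile types 5 and 7 only.
interp_quantile <- function(x, p, type = 7) {
  x <- sort(x)
  n <- length(x)
  # Fractional position in the sorted data for probability p
  h <- switch(as.character(type),
              "5" = n * p + 0.5,        # type 5
              "7" = (n - 1) * p + 1,    # type 7 (the default in quantile)
              stop("only types 5 and 7 are sketched here"))
  h <- pmin(pmax(h, 1), n)              # keep the position inside 1..n
  lo <- floor(h)
  hi <- ceiling(h)
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

# Made-up data, assumed only for this example
x_demo <- c(2, 4, 7, 9, 12, 15, 20, 26)
probs <- c(0.25, 0.5, 0.75)

interp_quantile(x_demo, probs, type = 5)   # 5.5 10.5 17.5
interp_quantile(x_demo, probs, type = 7)   # 6.25 10.50 16.25
```

With eight values, type 5 reads the lower quartile at position 2.5 (halfway between the second and third values), whereas type 7 reads it at position 2.75, which is why the two disagree.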
We can see the estimates of the quartiles provided by different methods in quantile using:
# One row per quantile type (1 to 9), one column per quartile
quantile_methods = data.frame(
  q25 = sapply(1:9, function(method) quantile(X, 0.25, type = method)),
  q50 = sapply(1:9, function(method) quantile(X, 0.50, type = method)),
  q75 = sapply(1:9, function(method) quantile(X, 0.75, type = method)))
quantile_methods
# q25 q50 q75
# 1 24.0000 30 45.000
# 2 25.5000 32 48.000
# 3 24.0000 30 45.000
# 4 24.0000 30 45.000
# 5 25.5000 32 48.000
# 6 24.7500 32 49.500
# 7 26.2500 32 46.500
# 8 25.2500 32 48.500
# 9 25.3125 32 48.375
Here, with an even number of data points, type = 5 provides the same estimated quartiles as boxplot. However, when there is an odd number of data points, it is type = 7 that coincides with the boxplot statistics.
We can show that this works by automatically selecting type 5 or type 7 depending on whether there is an even or odd number of data points. The boxplots in the image below show quantiles for data sets with 1 to 30 values, with boxplot and quantile giving the same values for both odd and even N:
layout(matrix(1:30, 5, 6, byrow = TRUE), respect = TRUE)
par(mar=c(0.2,0.2,0.2,0.2), bty="n", yaxt="n", xaxt="n")
for (N in 1:30){
X = sample(100, N)
boxplot(X)
abline(h=quantile(X, c(0.25, 0.5, 0.75), type=c(5,7)[(N %% 2) + 1]), col="red", lty=2)
}
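The same claim can also be checked numerically rather than by eye. This is a sketch under the assumption that comparing against fivenum is sufficient, since boxplot takes its quartiles from the same hinges; the seed and the name matches are arbitrary choices for the example.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# For each sample size, compare the fivenum hinges and median with
# quantile(), using type 5 for even N and type 7 for odd N.
matches <- vapply(1:30, function(N) {
  X <- sample(100, N)
  type <- if (N %% 2 == 0) 5 else 7
  q <- unname(quantile(X, c(0.25, 0.5, 0.75), type = type))
  isTRUE(all.equal(fivenum(X)[2:4], q))
}, logical(1))

all(matches)  # TRUE when the parity rule holds for every N
```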
Hyndman, R. J. and Fan, Y. (1996) Sample quantiles in statistical packages, American Statistician 50, 361–365
boxplot returns an object that can be used as needed: bX = boxplot(X); abline(h = bX$stats[c(2, 4), 1], col = "red", lty = 2) – Olericulture