cut, when not given explicit break points, divides the values into bins of equal width, so in general the bins won't contain an equal number of items:
x <- c(1:4,10)
lengths(split(x, cut(x, 2)))
# (0.991,5.5] (5.5,10]
# 4 1
Hmisc::cut2 and ggplot2::cut_number use quantiles, which usually produce groups of equal size (in terms of number of elements) when the data is well spread and reasonably large, but this is not always the case. mltools::bin_data is also based on quantiles but can give different results.
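To make the quantile idea concrete, here is a minimal base R sketch that reuses the small vector from the first example. It only illustrates the principle of cutting at quantiles; it is not meant to reproduce the exact rules (interval closure, tie handling) of any of the packages above.

```r
x <- c(1:4, 10)

# Break points at the 0%, 50% and 100% quantiles instead of equal widths
brk <- quantile(x, probs = seq(0, 1, length.out = 3))

# include.lowest = TRUE keeps the minimum value inside the first bin
bins <- cut(x, brk, include.lowest = TRUE)
table(bins)
# [1,3] (3,10]
# 3 2
```

The bins are now as balanced as five values allow. Note that with many ties the quantiles themselves can be duplicated, in which case cut() stops with an error because the breaks are not unique.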
These functions don't always give neat results when the data contains a small number of distinct values:
x <- rep(c(1:20),c(15, 7, 10, 3, 9, 3, 4, 9, 3, 2,
23, 2, 4, 1, 1, 7, 18, 37, 6, 2))
table(x)
# x
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# 15 7 10 3 9 3 4 9 3 2 23 2 4 1 1 7 18 37 6 2
table(Hmisc::cut2(x, g=4))
# [ 1, 6) [ 6,12) [12,19) [19,20]
# 44 44 70 8
table(ggplot2::cut_number(x, 4))
# [1,5] (5,11] (11,18] (18,20]
# 44 44 70 8
table(mltools::bin_data(x, bins=4, binType = "quantile"))
# [1, 5) [5, 11) [11, 18) [18, 20]
# 35 30 56 45
It is not clear whether the optimal solution has been found here.
Which binning approach is best is a subjective matter, but one reasonable way to approach it is to look for the bins that minimize the variance around the expected bin size.
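As a rough way to put a number on this, the sketch below scores the bin sizes reported above by their sum of squared deviations from the expected bin size, which is proportional to the variance. The helper name imbalance is made up for this illustration and does not come from any package.

```r
# Hypothetical helper: sum of squared deviations from the expected bin size
# (lower means better balanced)
imbalance <- function(sizes) sum((sizes - sum(sizes) / length(sizes))^2)

imbalance(c(44, 44, 70, 8))   # Hmisc::cut2 and ggplot2::cut_number
# [1] 1947
imbalance(c(35, 30, 56, 45))  # mltools::bin_data
# [1] 397
```

With 166 items in 4 bins the expected size is 41.5, so a perfectly balanced split would score close to 0.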
The function smart_cut from (my) package cutr offers such a feature. It is computationally heavy, though, and should be reserved for cases where cut points and unique values are few (which usually happens to be where it matters).
# devtools::install_github("moodymudskipper/cutr")
table(cutr::smart_cut(x, list(4, "balanced"), "g"))
# [1,6) [6,12) [12,18) [18,20]
# 44 44 33 45
We see the groups are much better balanced.
"balanced" in the call can in fact be replaced by a custom function to optimize or restrict the bins as desired if the method based on variance isn't enough.
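For readers who want to see the criterion at work without installing anything, here is a brute-force sketch in base R. It is an assumption-laden illustration of the idea (try every placement of the cut points between distinct values and keep the one with the smallest squared deviation from the expected bin size), not the code cutr actually runs; balanced_breaks is a hypothetical name.

```r
x <- rep(1:20, c(15, 7, 10, 3, 9, 3, 4, 9, 3, 2,
                 23, 2, 4, 1, 1, 7, 18, 37, 6, 2))

# Hypothetical helper (not from cutr): exhaustive search over cut points
balanced_breaks <- function(x, k) {
  vals <- sort(unique(x))
  # Cumulative number of items up to each distinct value
  cum  <- cumsum(tabulate(match(x, vals), nbins = length(vals)))
  n    <- length(x)
  # Each column holds k - 1 positions; a bin ends after vals[position]
  candidates <- combn(length(vals) - 1, k - 1)
  cost <- apply(candidates, 2, function(idx) {
    sizes <- diff(c(0, cum[idx], n))
    sum((sizes - n / k)^2)
  })
  best <- candidates[, which.min(cost)]
  # Left-closed bins start at the value following each chosen position
  c(vals[1], vals[best + 1], vals[length(vals)])
}

brk  <- balanced_breaks(x, 4)
bins <- cut(x, brk, right = FALSE, include.lowest = TRUE)
table(bins)
# [1,6) [6,12) [12,18) [18,20]
# 44 44 33 45
```

On this data the search visits choose(19, 3) = 969 candidates, which is instant, but the count grows quickly with the number of distinct values and bins, which is why this kind of search only suits small cases.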