Splitting a continuous variable into equal sized groups
Asked Answered
U

11

74

I need to split/divide up a continuous variable into 3 equal sized groups.

Example data frame:

das <- data.frame(anim = 1:15,
                  wt = c(181,179,180.5,201,201.5,245,246.4,
                         189.3,301,354,369,205,199,394,231.3))

After being cut up (according to the value of wt), I would need to have the 3 classes under the new variable wt2 like this:

> das 
   anim    wt wt2
1     1 181.0   1
2     2 179.0   1
3     3 180.5   1
4     4 201.0   2
5     5 201.5   2
6     6 245.0   2
7     7 246.4   3
8     8 189.3   1
9     9 301.0   3
10   10 354.0   3
11   11 369.0   3
12   12 205.0   2
13   13 199.0   1
14   14 394.0   3
15   15 231.3   2

This would be applied to a large data set.

Unerring answered 24/5, 2011 at 1:24 Comment(2)
See eg : #5916416 , #2648139 , #5570793 , #5161555 , #5731616 , #3288861 , ...Fork
Are you sure that the answer by @Singapore Bolker is not the correct one? You specify that you want equal sized groups.Colleencollege
S
70

try this:

split(das, cut(das$anim, 3))

if you want to split based on the value of wt, then

library(Hmisc) # cut2
split(das, cut2(das$wt, g=3))

anyway, you can do that by combining cut, cut2 and split.

UPDATED

if you want a group index as an additional column, then

das$group <- cut(das$anim, 3)

if the column should be index like 1, 2, ..., then

das$group <- as.numeric(cut(das$anim, 3))

UPDATED AGAIN

try this:

> das$wt2 <- as.numeric(cut2(das$wt, g=3))
> das
   anim    wt wt2
1     1 181.0   1
2     2 179.0   1
3     3 180.5   1
4     4 201.0   2
5     5 201.5   2
6     6 245.0   2
7     7 246.4   3
8     8 189.3   1
9     9 301.0   3
10   10 354.0   3
11   11 369.0   3
12   12 205.0   2
13   13 199.0   1
14   14 394.0   3
15   15 231.3   2
Seifert answered 24/5, 2011 at 1:31 Comment(4)
You can remove the as.numeric and use cut(das$anim, 3, labels=FALSE)Singapore
This should be updated so it is clear that it is different from the answer by @Singapore below. I mistakenly used this code in the belief that it would divide the observations evenly.Colleencollege
are you sure that the Hmisc::cut2() solution doesn't? Can you give a small example where it doesn't?Magnet
Confusing to me why this is the accepted answer, when the question specifically says "equal sized groups", which cut() doesn't achieve.Girasol
M
79

Or see cut_number from the ggplot2 package, e.g.

das$wt_2 <- as.numeric(cut_number(das$wt,3))

Note that cut(...,3) divides the range of the original data into three ranges of equal lengths; it doesn't necessarily result in the same number of observations per group if the data are unevenly distributed (you can replicate what cut_number does by using quantile appropriately, but it's a nice convenience function). On the other hand, Hmisc::cut2() using the g= argument does split by quantiles, so is more or less equivalent to ggplot2::cut_number. I might have thought that something like cut_number would have made its way into dplyr by so far, but as far as I can tell it hasn't.

Magnet answered 1/11, 2011 at 11:41 Comment(1)
This should be best answer, wish I had seen this first...!Palomo
S
70

try this:

split(das, cut(das$anim, 3))

if you want to split based on the value of wt, then

library(Hmisc) # cut2
split(das, cut2(das$wt, g=3))

anyway, you can do that by combining cut, cut2 and split.

UPDATED

if you want a group index as an additional column, then

das$group <- cut(das$anim, 3)

if the column should be index like 1, 2, ..., then

das$group <- as.numeric(cut(das$anim, 3))

UPDATED AGAIN

try this:

> das$wt2 <- as.numeric(cut2(das$wt, g=3))
> das
   anim    wt wt2
1     1 181.0   1
2     2 179.0   1
3     3 180.5   1
4     4 201.0   2
5     5 201.5   2
6     6 245.0   2
7     7 246.4   3
8     8 189.3   1
9     9 301.0   3
10   10 354.0   3
11   11 369.0   3
12   12 205.0   2
13   13 199.0   1
14   14 394.0   3
15   15 231.3   2
Seifert answered 24/5, 2011 at 1:31 Comment(4)
You can remove the as.numeric and use cut(das$anim, 3, labels=FALSE)Singapore
This should be updated so it is clear that it is different from the answer by @Singapore below. I mistakenly used this code in the belief that it would divide the observations evenly.Colleencollege
are you sure that the Hmisc::cut2() solution doesn't? Can you give a small example where it doesn't?Magnet
Confusing to me why this is the accepted answer, when the question specifically says "equal sized groups", which cut() doesn't achieve.Girasol
T
11

If you want to split into 3 equally distributed groups, the answer is the same as Ben Bolker's answer above - use ggplot2::cut_number(). For sake of completion here are the 3 methods of converting continuous to categorical (binning).

  • cut_number(): Makes n groups with (approximately) equal numbers of observation
  • cut_interval(): Makes n groups with equal range
  • cut_width(): Makes groups of width

My go-to is cut_number() because this uses evenly spaced quantiles for binning observations. Here's an example with skewed data.

library(tidyverse)

skewed_tbl <- tibble(
    counts = c(1:100, 1:50, 1:20, rep(1:10, 3), 
               rep(1:5, 5), rep(1:2, 10), rep(1, 20))
    ) %>%
    mutate(
        counts_cut_number   = cut_number(counts, n = 4),
        counts_cut_interval = cut_interval(counts, n = 4),
        counts_cut_width    = cut_width(counts, width = 25)
        ) 

# Data
skewed_tbl
#> # A tibble: 265 x 4
#>    counts counts_cut_number counts_cut_interval counts_cut_width
#>     <dbl> <fct>             <fct>               <fct>           
#>  1      1 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  2      2 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  3      3 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  4      4 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  5      5 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  6      6 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  7      7 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  8      8 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  9      9 (3,13]            [1,25.8]            [-12.5,12.5]    
#> 10     10 (3,13]            [1,25.8]            [-12.5,12.5]    
#> # ... with 255 more rows

summary(skewed_tbl$counts)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.00    3.00   13.00   25.75   42.00  100.00

# Histogram showing skew
skewed_tbl %>%
    ggplot(aes(counts)) +
    geom_histogram(bins = 30)

# cut_number() evenly distributes observations into bins by quantile
skewed_tbl %>%
    ggplot(aes(counts_cut_number)) +
    geom_bar()

# cut_interval() evenly splits the interval across the range
skewed_tbl %>%
    ggplot(aes(counts_cut_interval)) +
    geom_bar()

# cut_width() uses the width = 25 to create bins that are 25 in width
skewed_tbl %>%
    ggplot(aes(counts_cut_width)) +
    geom_bar()

Created on 2018-11-01 by the reprex package (v0.2.1)

Teacup answered 1/11, 2018 at 20:54 Comment(0)
S
10

Here's another solution using the bin_data() function from the mltools package.

library(mltools)

# Resulting bins have an equal number of observations in each group
das[, "wt2"] <- bin_data(das$wt, bins=3, binType = "quantile")

# Resulting bins are equally spaced from min to max
das[, "wt3"] <- bin_data(das$wt, bins=3, binType = "explicit")

# Or if you'd rather define the bins yourself
das[, "wt4"] <- bin_data(das$wt, bins=c(-Inf, 250, 322, Inf), binType = "explicit")

das
   anim    wt                                  wt2                                  wt3         wt4
1     1 181.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
2     2 179.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
3     3 180.5              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
4     4 201.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
5     5 201.5 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
6     6 245.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
7     7 246.4              [245.466666666667, 394]              [179, 250.666666666667) [-Inf, 250)
8     8 189.3              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
9     9 301.0              [245.466666666667, 394] [250.666666666667, 322.333333333333)  [250, 322)
10   10 354.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
11   11 369.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
12   12 205.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
13   13 199.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
14   14 394.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
15   15 231.3 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
Singapore answered 13/7, 2017 at 4:2 Comment(0)
U
8

Alternative without using cut2.

das$wt2 <- as.factor( as.numeric( cut(das$wt,3)))

or

das$wt2 <- as.factor( cut(das$wt,3, labels=F))

As pointed out by @ben-bolker this splits into equal-widths rather occupancy. I think that using quantiles one can approximate equal-occupancy

x = rnorm(10)
x
 [1] -0.1074316  0.6690681 -1.7168853  0.5144931  1.6460280  0.7014368
 [7]  1.1170587 -0.8503069  0.4462932 -0.1089427
bin = 3 #for 1/3 rd, 4 for 1/4, 100 for 1/100th etc
xx = cut(x, quantile(x, breaks=1/bin*c(1:bin)), labels=F, include.lowest=T)
table(xx)
1 2 3 4
3 2 2 3
Utley answered 5/10, 2011 at 10:27 Comment(1)
I think this splits into equal-width rather than equal-occupancy bins ?Magnet
P
8

ntile from dplyr now does this but behaves weirdly with NA's.

I've used similar code in the following function that works in base R and does the equivalent of the cut2 solution above:

ntile_ <- function(x, n) {
    b <- x[!is.na(x)]
    q <- floor((n * (rank(b, ties.method = "first") - 1)/length(b)) + 1)
    d <- rep(NA, length(x))
    d[!is.na(x)] <- q
    return(d)
}
Phil answered 15/10, 2016 at 1:22 Comment(0)
W
5

cut, when not given explicit break points divides values into bins of same width, they won't contain an equal number of items in general:

x <- c(1:4,10)
lengths(split(x, cut(x, 2)))
# (0.991,5.5]    (5.5,10] 
#           4           1 

Hmisc::cut2 and ggplot2::cut_number use quantiles, which will usually create groups of same size (in term of number of elements) if the data is well spread and of decent size, it's not always the case however. mltools::bin_data can give different results but is also based on quantiles.

These functions don't always give neat results when the data contains a small number of distinct values :

x <- rep(c(1:20),c(15, 7, 10, 3, 9, 3, 4, 9, 3, 2,
                   23, 2, 4, 1, 1, 7, 18, 37, 6, 2))

table(x)
# x
#  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
# 15  7 10  3  9  3  4  9  3  2 23  2  4  1  1  7 18 37  6  2   

table(Hmisc::cut2(x, g=4))
# [ 1, 6) [ 6,12) [12,19) [19,20] 
#      44      44      70       8

table(ggplot2::cut_number(x, 4))
# [1,5]  (5,11] (11,18] (18,20] 
#    44      44      70       8

table(mltools::bin_data(x, bins=4, binType = "quantile"))
# [1, 5)  [5, 11) [11, 18) [18, 20] 
#     35       30       56       45

This is not clear if the optimal solution has been found here.

What is the best binning approach is a subjective matter, but one reasonable way to approach it is to look for the bins that minimize the variance around the expected bin size.

The function smart_cut from (my) package cutr proposes such feature. It's computationally heavy though and should be reserved to cases where cut points and unique values are few (which happen to be usually the case where it matters).

# devtools::install_github("moodymudskipper/cutr")
table(cutr::smart_cut(x, list(4, "balanced"), "g"))
# [1,6)  [6,12) [12,18) [18,20] 
# 44      44      33      45 

We see the groups are much better balanced.

"balanced" in the call can in fact be replaced by a custom function to optimize or restrict the bins as desired if the method based on variance isn't enough.

Waldack answered 14/9, 2018 at 13:50 Comment(0)
S
1

equal_freq from funModeling takes a vector and the number of bins (based on equal frequency):

das <- data.frame(anim=1:15,
                  wt=c(181,179,180.5,201,201.5,245,246.4,
                       189.3,301,354,369,205,199,394,231.3))

das$wt_bin=funModeling::equal_freq(das$wt, 3)

table(das$wt_bin)

#[179,201) [201,246) [246,394] 
#        5         5         5 
Soissons answered 10/4, 2019 at 14:23 Comment(0)
C
1

You can also use the bin function with method = "content" from the OneR package for that:

library(OneR)
das$wt_2 <- as.numeric(bin(das$wt, nbins = 3, method = "content"))
das
##    anim    wt wt_2
## 1     1 181.0    1
## 2     2 179.0    1
## 3     3 180.5    1
## 4     4 201.0    2
## 5     5 201.5    2
## 6     6 245.0    2
## 7     7 246.4    3
## 8     8 189.3    1
## 9     9 301.0    3
## 10   10 354.0    3
## 11   11 369.0    3
## 12   12 205.0    2
## 13   13 199.0    1
## 14   14 394.0    3
## 15   15 231.3    2
Colourable answered 2/9, 2019 at 21:32 Comment(0)
A
0

Without any extra package, 3 being the number of groups:

> findInterval(das$wt, unique(quantile(das$wt, seq(0, 1, length.out = 3 + 1))), rightmost.closed = TRUE)
 [1] 1 1 1 2 2 2 3 1 3 3 3 2 1 3 2

You can speed up the quantile computation by using a representative sample of the values of interest. Double check the documentation of the FindInterval function.

Almonte answered 17/12, 2017 at 16:28 Comment(0)
F
0

So, interestingly if you would like to cut the variable "wt", into sections with equal 3 sub-sections (i.e., 179-181, 181-183, etc); you could do this:

x<-table(as.matrix(cut(das$wt,breaks = ((max(das$wt)-min(das$wt))/3)),as.numeric(cut(das$wt,breaks = ((max(das$wt)-min(das$wt))/3)))))

Which gives the result, based on the dataset "das":

x
(179,182] (188,191] (197,200] (200,203] (203,206] (230,234] (243,246] 
    3         1         1         2         1         1         1 
(246,249] (300,303] (352,355] (367,370] (391,394] 
    1         1         1         1         1 

(that digit 3 in codes is an arbitrary item which can be changed by your interest.)

Fleabitten answered 8/12, 2021 at 10:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.