Categorize numeric variable with mutate

I would like to categorize a numeric variable in my data.frame object using dplyr, but I have no idea how to do it.

Without dplyr, I would probably do something like:

df <- data.frame(a = rnorm(1e3), b = rnorm(1e3))
df$a <- cut(df$a , breaks=quantile(df$a, probs = seq(0, 1, 0.2)))

and that would be it. However, I would strongly prefer to do it with a dplyr function (mutate, I suppose), as part of the chain of other operations I perform on my data.frame.

Dialectic answered 18/4, 2014 at 22:48 Comment(1)
At a guess (from Google and reading the online manual; I've never used dplyr), I'd say mutate( df , a = cut( a , breaks = quantile( a , probs = seq( 0 , 1 , 0.2 ) ) ) )... – Cyrie
set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10))

df %>% mutate(a = cut(a, breaks = quantile(a, probs = seq(0, 1, 0.2))))

giving:

                 a          b
1  (-0.586,-0.316]  1.2240818
2   (-0.316,0.094]  0.3598138
3      (0.68,1.72]  0.4007715
4   (-0.316,0.094]  0.1106827
5     (0.094,0.68] -0.5558411
6      (0.68,1.72]  1.7869131
7     (0.094,0.68]  0.4978505
8             <NA> -1.9666172
9   (-1.27,-0.586]  0.7013559
10 (-0.586,-0.316] -0.4727914
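
Note the <NA> in row 8: by default cut() builds intervals that are open on the left, so the smallest value, which sits exactly on the lowest break, falls outside every bin. Here is a minimal sketch of one way around that, assuming the same df as above. include.lowest = TRUE is a standard cut() argument, while the column name a_bin is just an illustrative choice:

```r
library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10))

# Quintile breaks computed from the column itself
binned <- df %>%
  mutate(
    a_bin = cut(
      a,
      breaks = quantile(a, probs = seq(0, 1, 0.2)),
      include.lowest = TRUE  # close the first interval on the left so the minimum is binned
    )
  )

# Check that no value was dropped into NA
anyNA(binned$a_bin)
table(binned$a_bin)
```

Writing to a new column (a_bin here) rather than overwriting a keeps the original values available for later steps in the chain.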
Pacifist answered 18/4, 2014 at 23:3 Comment(2)
What is the benefit of doing df %.% mutate( a = ... ) over df <- mutate( df , a = ... )? Does the first form modify df by reference? – Cyrie
It improves readability when you perform a series of actions on one data.frame: instead of nesting function calls, you can write them sequentially with %.% and therefore read the code from left to right (rather than from the inside out). More here: blog.rstudio.org/2014/01/17/introducing-dplyr – Dialectic
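
To make the comparison in these comments concrete, here is a small sketch that writes the same two steps nested and piped. It uses %>%, as in the answer above, rather than the older %.% operator mentioned in the comments; the a_pos column and the filter condition are made up for illustration. It also suggests an answer to the by-reference question: the pipe only changes how the call is written, so df itself is left alone unless the result is assigned back.

```r
library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10))

# Nested form: read from the inside out
nested <- filter(mutate(df, a_pos = a > 0), b > 0)

# Piped form: the same steps, read from left to right
piped <- df %>%
  mutate(a_pos = a > 0) %>%
  filter(b > 0)

# Both forms should produce the same data frame
identical(nested, piped)

# df was not modified by either form; it still has only columns a and b
names(df)
```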

The ggplot2 package has 3 functions that work well for these tasks:

  • cut_number(): Makes n groups with (approximately) equal numbers of observations
  • cut_interval(): Makes n groups with equal range
  • cut_width(): Makes groups of a given width (the width argument)

My go-to is cut_number() because this uses evenly spaced quantiles for binning observations. Here's an example with skewed data.

library(tidyverse)

skewed_tbl <- tibble(
    counts = c(1:100, 1:50, 1:20, rep(1:10, 3), 
               rep(1:5, 5), rep(1:2, 10), rep(1, 20))
    ) %>%
    mutate(
        counts_cut_number   = cut_number(counts, n = 4),
        counts_cut_interval = cut_interval(counts, n = 4),
        counts_cut_width    = cut_width(counts, width = 25)
        ) 

# Data
skewed_tbl
#> # A tibble: 265 x 4
#>    counts counts_cut_number counts_cut_interval counts_cut_width
#>     <dbl> <fct>             <fct>               <fct>           
#>  1      1 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  2      2 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  3      3 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  4      4 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  5      5 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  6      6 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  7      7 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  8      8 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  9      9 (3,13]            [1,25.8]            [-12.5,12.5]    
#> 10     10 (3,13]            [1,25.8]            [-12.5,12.5]    
#> # ... with 255 more rows

summary(skewed_tbl$counts)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.00    3.00   13.00   25.75   42.00  100.00

# Histogram showing skew
skewed_tbl %>%
    ggplot(aes(counts)) +
    geom_histogram(bins = 30)

# cut_number() evenly distributes observations into bins by quantile
skewed_tbl %>%
    ggplot(aes(counts_cut_number)) +
    geom_bar()

# cut_interval() evenly splits the interval across the range
skewed_tbl %>%
    ggplot(aes(counts_cut_interval)) +
    geom_bar()

# cut_width() uses the width = 25 to create bins that are 25 in width
skewed_tbl %>%
    ggplot(aes(counts_cut_width)) +
    geom_bar()

Created on 2018-11-01 by the reprex package (v0.2.1)

Nob answered 1/11, 2018 at 10:56 Comment(2)
I wasn't aware of these functions. Can they be used directly in dplyr chains? – Violetteviolin
@Violetteviolin Use them inside of mutate(). I showed an example of using all three inside of a dplyr chain with mutate(). – Nob
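
A minimal sketch of that, assuming only dplyr and ggplot2 are loaded; the toy column x and the names bin and bin_counts are illustrative, not from the answer above. Since cut_number() returns an ordinary factor, its result can feed straight into later verbs such as count():

```r
library(dplyr)
library(ggplot2)

# Toy data: the integers 1 to 100 (illustrative only)
bin_counts <- tibble(x = 1:100) %>%
  mutate(bin = cut_number(x, n = 4)) %>%  # four quantile-based bins
  count(bin)                              # rows per bin

bin_counts
```

With evenly spread data like this, each of the four bins should hold the same number of rows; with skewed data the bin edges shift so the counts stay roughly equal.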
