Categorize numeric variable into groups/bins/breaks

Asked 19/10, 2012 at 17:34 Answered 23/12, 2019 at 2:28

I am trying to categorize a numeric variable (age) into groups defined by intervals so it will not be continuous. I have this code:

data$agegrp(data$age >= 40 & data$age <= 49) <- 3
data$agegrp(data$age >= 30 & data$age <= 39) <- 2
data$agegrp(data$age >= 20 & data$age <= 29) <- 1

The above code is not working under the survival package. It gives me:

invalid function in complex assignment

Can you point me where the error is? data is the dataframe I am using.

Orchidectomy answered 19/10, 2012 at 17:34 Comment(4)

Use [ for subsetting, not (. – Apomorphine 19/10, 2012 at 17:35
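For illustration, here is a minimal sketch of the question's code with square brackets instead of parentheses. The sample data frame is made up, and the agegrp column is initialised first so that every row has a value (NA where no range matches):

```r
# Hypothetical sample data standing in for the asker's data frame
data <- data.frame(age = c(22, 35, 41, 29, 49, 55))

# Start with NA so ages outside 20-49 stay missing
data$agegrp <- NA_real_

# Square brackets index the column; parentheses are parsed as a function call
data$agegrp[data$age >= 40 & data$age <= 49] <- 3
data$agegrp[data$age >= 30 & data$age <= 39] <- 2
data$agegrp[data$age >= 20 & data$age <= 29] <- 1

data$agegrp
# [1]  1  2  3  1  3 NA
```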

The function you'll want to use is cut. – Romanist 19/10, 2012 at 17:36

@joan can you show me how it is done using cut? – Orchidectomy 19/10, 2012 at 17:39

The answer depends on what result you want: a) just an integer (or NA), b) factor labels, or c) an array of dichotomized/dummy variables. findInterval() can only do the first, whereas cut() does the first two. findInterval() is faster (O(log(no. of bins))), although that's rarely an issue. – Letti 16/9, 2015 at 22:10
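A small sketch of the three result types mentioned in the comment above, using made-up ages. The model.matrix() step is one possible way to get dummy variables in base R, not something the commenter prescribed:

```r
ages   <- c(25, 34, 47)        # hypothetical values
breaks <- c(20, 30, 40, 50)

# a) plain integer codes
codes <- findInterval(ages, breaks)

# b) a factor with interval labels
groups <- cut(ages, breaks, right = FALSE)
levels(groups)
# [1] "[20,30)" "[30,40)" "[40,50)"

# c) one indicator column per group (assumed approach: model.matrix)
dummies <- model.matrix(~ groups - 1)
dim(dummies)
# [1] 3 3
```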

I would use findInterval() here:

First, make up some sample data

set.seed(1)
ages <- floor(runif(20, min = 20, max = 50))
ages
# [1] 27 31 37 47 26 46 48 39 38 21 26 25 40 31 43 34 41 49 31 43

Use findInterval() to categorize your "ages" vector.

findInterval(ages, c(20, 30, 40))
# [1] 1 2 2 3 1 3 3 2 2 1 1 1 3 2 3 2 3 3 2 3

Alternatively, as recommended in the comments, cut() is also useful here:

cut(ages, breaks=c(20, 30, 40, 50), right = FALSE)
cut(ages, breaks=c(20, 30, 40, 50), right = FALSE, labels = FALSE)
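One difference worth knowing is how the two functions treat values outside the breaks. The sketch below uses made-up edge values (19, 20 and 50) with the same breaks as above:

```r
edge <- c(19, 20, 50)   # hypothetical values at and beyond the boundaries

# findInterval() returns 0 below the first break and the last index above it
fi <- findInterval(edge, c(20, 30, 40))
fi
# [1] 0 1 3

# cut() returns NA for anything outside [20, 50) when right = FALSE
ct <- cut(edge, breaks = c(20, 30, 40, 50), right = FALSE, labels = FALSE)
ct
# [1] NA  1 NA
```

So if ages above 49 can occur, findInterval() will silently put them in group 3, while cut() marks them as missing.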

Allanadale answered 19/10, 2012 at 17:40 Comment(3)

@leian, have you tried the code? It should. However, when asking questions here in the R tag at SO, it is best to include a minimal reproducible example if you want more targeted help. – Allanadale 19/10, 2012 at 17:51

but what will be the variable name of the result of this findInterval()? – Orchidectomy 19/10, 2012 at 18:2

Whatever you want it to be! From your example, I would assume you would do something like data$agegrp <- findInterval(data$age, c(20, 30, 40)). – Allanadale 19/10, 2012 at 18:8

We can use dplyr:

library(dplyr)

data <- data %>%
  mutate(agegroup = case_when(
    age >= 40 & age <= 49 ~ '3',
    age >= 30 & age <= 39 ~ '2',
    age >= 20 & age <= 29 ~ '1'
  ))

Compared to other approaches, dplyr is easier to write and interpret.

Abstention answered 23/12, 2019 at 2:28 Comment(2)

You can also use cut in mutate instead of case_when. E.g. data %>% mutate(agegroup = cut(age, breaks = c(20, 30, 40, 50), right = FALSE, labels = FALSE)) – Filigree 22/1, 2021 at 14:35

@Filigree This is /such/ a good answer, many thanks. labels=TRUE will even give reasonable labels. – Immix 25/10, 2021 at 2:10

This answer provides two ways to solve the problem using the data.table package, which can be considerably faster. This matters when working with large data sets.

1st Approach: an adaptation of the previous answer, but now using data.table and including labels:

library(data.table)

agebreaks <- c(0,1,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,500)
agelabels <- c("0-1","1-4","5-9","10-14","15-19","20-24","25-29","30-34",
               "35-39","40-44","45-49","50-54","55-59","60-64","65-69",
               "70-74","75-79","80-84","85+")

setDT(data)[ , agegroups := cut(age, 
                                breaks = agebreaks, 
                                right = FALSE, 
                                labels = agelabels)]

2nd Approach: This is a more verbose method, but it also makes it clearer what exactly falls within each age group:

setDT(data)[age < 1, agegroup := "0-1"]
data[age >= 1 & age < 5, agegroup := "1-4"]
data[age >= 5 & age < 10, agegroup := "5-9"]
data[age >= 10 & age < 15, agegroup := "10-14"]
data[age >= 15 & age < 20, agegroup := "15-19"]
data[age >= 20 & age < 25, agegroup := "20-24"]
data[age >= 25 & age < 30, agegroup := "25-29"]
data[age >= 30 & age < 35, agegroup := "30-34"]
data[age >= 35 & age < 40, agegroup := "35-39"]
data[age >= 40 & age < 45, agegroup := "40-44"]
data[age >= 45 & age < 50, agegroup := "45-49"]
data[age >= 50 & age < 55, agegroup := "50-54"]
data[age >= 55 & age < 60, agegroup := "55-59"]
data[age >= 60 & age < 65, agegroup := "60-64"]
data[age >= 65 & age < 70, agegroup := "65-69"]
data[age >= 70 & age < 75, agegroup := "70-74"]
data[age >= 75 & age < 80, agegroup := "75-79"]
data[age >= 80 & age < 85, agegroup := "80-84"]
data[age >= 85, agegroup := "85+"]

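Although the two approaches should give the same groups, I prefer the 1st one for two reasons: (a) it is shorter to write, and (b) cut() returns a factor whose levels follow the order of the breaks, so the age groups are ordered correctly, which is crucial when it comes to visualizing the data. The 2nd approach produces a plain character column, which sorts alphabetically.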
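To illustrate the ordering point, here is a base R sketch with made-up ages and a shortened set of breaks and labels (not the full list above). It compares the factor returned by cut() with the same values stored as plain character strings:

```r
ages <- c(7, 12, 3, 27)   # hypothetical values

grp_factor <- cut(ages,
                  breaks = c(0, 5, 10, 15, 30),
                  right  = FALSE,
                  labels = c("0-4", "5-9", "10-14", "15-29"))
grp_char <- as.character(grp_factor)

# Factor levels keep the order of the breaks
levels(grp_factor)
# [1] "0-4"   "5-9"   "10-14" "15-29"

# Character strings sort alphabetically, so "5-9" ends up last
sort(unique(grp_char), method = "radix")
# [1] "0-4"   "10-14" "15-29" "5-9"
```

Plotting functions generally follow the factor level order, which is why the factor version gives correctly ordered axes.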

Year answered 22/8, 2015 at 19:40 Comment(2)

The second approach doesn't seem to work with R 3.2. It gives an error could not find function ":=" – Looker 15/3, 2016 at 9:54

It works for me. Make sure you load the data.table package with library(data.table), and that you are working with a data.table (not a data frame): setDT(your_dataframe) converts your data frame into a data.table. – Year 15/3, 2016 at 21:13

Let's say that your ages were stored in the dataframe column labeled age. Your dataframe is df, and you want a new column age_grouping containing the "bucket" that your ages fall in.

In this example, suppose that your ages range from 0 to 100, and you want to group them in 10-year intervals. The following code accomplishes this by storing the intervals in a new age_grouping column:

df$age_grouping <- cut(df$age, seq(0, 100, 10))
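One thing to watch with the call above: by default cut() uses intervals that are open on the left, such as (0,10], so an age of exactly 0 falls outside the first interval and becomes NA. A sketch with made-up ages, where age_grouping2 is a hypothetical second column added only for comparison:

```r
df <- data.frame(age = c(0, 10, 11, 100))   # hypothetical values

# Default: intervals are (0,10], (10,20], ..., (90,100]
df$age_grouping <- cut(df$age, seq(0, 100, 10))

# include.lowest = TRUE closes the first interval on the left: [0,10]
df$age_grouping2 <- cut(df$age, seq(0, 100, 10), include.lowest = TRUE)

as.character(df$age_grouping)
# [1] NA         "(0,10]"   "(10,20]"  "(90,100]"
as.character(df$age_grouping2)
# [1] "[0,10]"   "(0,10]"   "(10,20]"  "(90,100]"
```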

Malvoisie answered 13/10, 2017 at 17:34 Comment(0)
