Numbering rows within groups in a data frame

Asked 16/10, 2012 at 23:38 Answered 7/6, 2023 at 13:57

248

Working with a data frame similar to this:

set.seed(100)  
df <- data.frame(cat = c(rep("aaa", 5), rep("bbb", 5), rep("ccc", 5)), val = runif(15))             
df <- df[order(df$cat, df$val), ]  
df

   cat        val  
1  aaa 0.05638315  
2  aaa 0.25767250  
3  aaa 0.30776611  
4  aaa 0.46854928  
5  aaa 0.55232243  
6  bbb 0.17026205  
7  bbb 0.37032054  
8  bbb 0.48377074  
9  bbb 0.54655860  
10 bbb 0.81240262  
11 ccc 0.28035384  
12 ccc 0.39848790  
13 ccc 0.62499648  
14 ccc 0.76255108  
15 ccc 0.88216552

I am trying to add a column with numbering within each group. Doing it this way obviously isn't using the powers of R:

 df$num <- 1  
 for (i in 2:(length(df[,1]))) {  
   if (df[i,"cat"]==df[(i-1),"cat"]) {  
     df[i,"num"]<-df[i-1,"num"]+1  
     }  
 }  
 df

   cat        val num  
1  aaa 0.05638315   1  
2  aaa 0.25767250   2  
3  aaa 0.30776611   3  
4  aaa 0.46854928   4  
5  aaa 0.55232243   5  
6  bbb 0.17026205   1  
7  bbb 0.37032054   2  
8  bbb 0.48377074   3  
9  bbb 0.54655860   4  
10 bbb 0.81240262   5  
11 ccc 0.28035384   1  
12 ccc 0.39848790   2  
13 ccc 0.62499648   3  
14 ccc 0.76255108   4  
15 ccc 0.88216552   5

What would be a good way to do this?

Neoprene answered 16/10, 2012 at 23:38 Comment(2)

I would suggest to add something like "seq along levels" or "counting along replicates" in the question title as this is how I found this question and it is exactly what I was looking for – Kovacev 17/12, 2019 at 9:25

@Kovacev If that were the title, I wouldn't have found what I was looking for :-( I was literally looking for "how to number rows within groups in a data frame" – Evadnee 30/1, 2020 at 15:47

396

Use ave, ddply, dplyr or data.table:

df$num <- ave(df$val, df$cat, FUN = seq_along)

or:

library(plyr)
ddply(df, .(cat), mutate, id = seq_along(val))

or:

library(dplyr)
df %>% group_by(cat) %>% mutate(id = row_number())

or (the most memory efficient, as it assigns by reference within DT):

library(data.table)
DT <- data.table(df)

# either of the following works; both add the column by reference
DT[, id := seq_len(.N), by = cat]
DT[, id := rowid(cat)]
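
A note on the ave variant, as a minimal base R sketch: ave() returns a vector of the same type as its first argument, so passing the numeric val column gives doubles, while passing an integer vector gives integers. The names num_dbl and num_int are only illustrative.

```r
# Rebuild the example data from the question
set.seed(100)
df <- data.frame(cat = rep(c("aaa", "bbb", "ccc"), each = 5), val = runif(15))
df <- df[order(df$cat, df$val), ]

# First argument is double, so the counter comes back as double
num_dbl <- ave(df$val, df$cat, FUN = seq_along)

# First argument is integer, so the counter stays integer
num_int <- ave(seq_len(nrow(df)), df$cat, FUN = seq_along)

typeof(num_dbl)  # "double"
typeof(num_int)  # "integer"
```

Both vectors hold the same numbering; only the storage type differs.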

Pictish answered 16/10, 2012 at 23:41 Comment(9)

It might be worth mentioning that ave gives a float instead of an int here. Alternately, could change df$val to seq_len(nrow(df)). I just ran into this over here: #42797357 – Moncada 14/3, 2017 at 22:7

Interestingly this data.table solution seems to be quicker than using frank:

library(microbenchmark)
microbenchmark(
  a = DT[, .(val, num = frank(val)), by = list(cat)],
  b = DT[, .(val, id = seq_len(.N)), by = list(cat)],
  times = 1000L
)

– Affective 28/7, 2017 at 12:23

Thanks! The dplyr solution is good. But if, like me, you kept getting weird errors when trying this approach, make sure that you are not getting conflicts between plyr and dplyr, as explained in this post. They can be avoided by explicitly calling dplyr::mutate(...) – Minnesota 10/4, 2018 at 14:16

another data.table method is setDT(df)[, id:=rleid(val), by=.(cat)] – Zipangu 23/5, 2018 at 0:14

How can the library(plyr) and library(dplyr) answers be modified to rank the val column in descending order? – Gentlewoman 24/7, 2018 at 9:31

I tried using the plyr method and got an error: "Error in unique.default(x) : unique() applies only to vectors" - has anyone ever seen that happen? – Saccharin 17/9, 2018 at 1:7

@PrzemyslawRemin You can simply sort the whole dataset in advance. df <- df[order(df$val),] – Loredo 1/10, 2018 at 9:31

data.table was the most effective way; it took less than a second to compute about 17000 rows. Using ddply, it ran forever, so I had to kill the R process. – Loredo 1/10, 2018 at 10:30

How could I do this if I have some NAs in the variable val and I don't want to count those rows when creating the variable num, although I still want them to appear in my data frame with NA in the column num? – Montpelier 15/9, 2020 at 17:44

For making this r-faq question more complete, a base R alternative with sequence and rle:

df$num <- sequence(rle(df$cat)$lengths)

which gives the intended result:

> df
   cat        val num
4  aaa 0.05638315   1
2  aaa 0.25767250   2
1  aaa 0.30776611   3
5  aaa 0.46854928   4
3  aaa 0.55232243   5
10 bbb 0.17026205   1
8  bbb 0.37032054   2
6  bbb 0.48377074   3
9  bbb 0.54655860   4
7  bbb 0.81240262   5
13 ccc 0.28035384   1
14 ccc 0.39848790   2
11 ccc 0.62499648   3
15 ccc 0.76255108   4
12 ccc 0.88216552   5

If df$cat is a factor variable, you need to wrap it in as.character first:

df$num <- sequence(rle(as.character(df$cat))$lengths)
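
One thing to keep in mind, shown here as a small sketch on a made-up vector x: rle() counts consecutive runs, so on unsorted data the counter restarts every time the value changes, whereas a grouped approach such as ave() keeps counting per group.

```r
# Toy grouping vector that is NOT sorted (illustrative data only)
x <- c("a", "a", "b", "a")

# Run-based numbering: restarts at each change of value
run_num <- sequence(rle(x)$lengths)

# Group-based numbering: counts occurrences of each value overall
grp_num <- ave(seq_along(x), x, FUN = seq_along)

run_num  # 1 2 1 1
grp_num  # 1 2 1 3
```

On data sorted by the grouping column, the two give the same result.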

Ramonramona answered 6/10, 2017 at 20:1 Comment(2)

Just noticed: does this solution require the cat column to be sorted? – Keystroke 26/4, 2019 at 20:1

@Keystroke yes, unless you want to number by consecutive occurrences of cat – Ramonramona 26/4, 2019 at 20:44

Here is a small improvement that also sorts 'val' within each group before numbering:

# 1. Data set
set.seed(100)
df <- data.frame(
  cat = c(rep("aaa", 5), rep("ccc", 5), rep("bbb", 5)),
  val = runif(15))

# 2. 'dplyr' approach
library(dplyr)
df %>%
  arrange(cat, val) %>%
  group_by(cat) %>%
  mutate(id = row_number())

Brader answered 22/9, 2018 at 7:40 Comment(1)

Can you not sort after the group_by? – Crosscut 9/1, 2019 at 20:40

Another dplyr possibility could be:

df %>%
 group_by(cat) %>%
 mutate(num = 1:n())

   cat      val   num
   <fct>  <dbl> <int>
 1 aaa   0.0564     1
 2 aaa   0.258      2
 3 aaa   0.308      3
 4 aaa   0.469      4
 5 aaa   0.552      5
 6 bbb   0.170      1
 7 bbb   0.370      2
 8 bbb   0.484      3
 9 bbb   0.547      4
10 bbb   0.812      5
11 ccc   0.280      1
12 ccc   0.398      2
13 ccc   0.625      3
14 ccc   0.763      4
15 ccc   0.882      5
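
A caveat with 1:n(), sketched in plain base R with a stand-in variable n for the group size: when the size is zero, the colon operator counts down instead of producing an empty vector, while seq_len() handles that case.

```r
# Stand-in for a group size of zero (n is a plain variable here, not dplyr::n())
n <- 0L

colon_seq <- 1:n         # counts down: c(1, 0)
safe_seq  <- seq_len(n)  # empty integer vector

length(colon_seq)  # 2
length(safe_seq)   # 0
```

For non-empty groups both forms give the same sequence.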

Rod answered 28/6, 2019 at 19:52 Comment(1)

In some cases instead of 1:n() using seq_len(n()) is safer, in the event that in your sequence of operations you have a situation where n() might return 0, because 1:0 gives you a length two vector while seq_len(0) gives a length zero vector, thus avoiding a length mismatch error with mutate(). – Dovev 11/7, 2019 at 19:26

I would like to add a data.table variant using the rank() function. It also lets you change the ordering, which makes it a bit more flexible than the seq_len() solution and quite similar to the row_number functions in RDBMS.

# Variant with ascending ordering
library(data.table)
dt <- data.table(df)
dt[, .( val
   , num = rank(val))
    , by = list(cat)][order(cat, num),]

    cat        val num
 1: aaa 0.05638315   1
 2: aaa 0.25767250   2
 3: aaa 0.30776611   3
 4: aaa 0.46854928   4
 5: aaa 0.55232243   5
 6: bbb 0.17026205   1
 7: bbb 0.37032054   2
 8: bbb 0.48377074   3
 9: bbb 0.54655860   4
10: bbb 0.81240262   5
11: ccc 0.28035384   1
12: ccc 0.39848790   2
13: ccc 0.62499648   3
14: ccc 0.76255108   4
15: ccc 0.88216552   5

# Variant with descending ordering
# (desc() comes from dplyr, so it is called with its namespace here)
dt[, .( val
   , num = rank(dplyr::desc(val)))
    , by = list(cat)][order(cat, num),]

Edit on 2021-04-16 to make the switch between descending and ascending order more fail-safe
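
One difference from a plain counter, shown as a base R sketch on a made-up vector: with its default settings rank() averages tied values, so the numbers are not always whole. Setting ties.method = "first" breaks ties by position, which is closer to how row_number behaves.

```r
# Toy values with a tie (illustrative data only)
val <- c(0.3, 0.1, 0.3, 0.2)

rank_avg   <- rank(val)                          # ties share an averaged rank
rank_first <- rank(val, ties.method = "first")   # ties broken by position
rank_desc  <- rank(-val, ties.method = "first")  # descending order

rank_avg    # 3.5 1.0 3.5 2.0
rank_first  # 3 1 4 2
rank_desc   # 1 4 2 3
```

With unique values, as in the example data, all tie methods give the same result.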

Affective answered 18/6, 2018 at 9:28 Comment(0)

Here is an option using a for loop over groups rather than over rows (as the OP did):

for (i in unique(df$cat)) df$num[df$cat == i] <- seq_len(sum(df$cat == i))

Beaufort answered 16/10, 2012 at 23:51 Comment(0)

Using the rowid() function in data.table:

> set.seed(100)  
> df <- data.frame(cat = c(rep("aaa", 5), rep("bbb", 5), rep("ccc", 5)), val = runif(15))
> df <- df[order(df$cat, df$val), ]  
> df$num <- data.table::rowid(df$cat)
> df
   cat        val num
4  aaa 0.05638315   1
2  aaa 0.25767250   2
1  aaa 0.30776611   3
5  aaa 0.46854928   4
3  aaa 0.55232243   5
10 bbb 0.17026205   1
8  bbb 0.37032054   2
6  bbb 0.48377074   3
9  bbb 0.54655860   4
7  bbb 0.81240262   5
13 ccc 0.28035384   1
14 ccc 0.39848790   2
11 ccc 0.62499648   3
15 ccc 0.76255108   4
12 ccc 0.88216552   5

Lauralauraceous answered 10/1, 2020 at 13:55 Comment(1)

Thanks for your answer but it seems to be already covered in the last suggestion in @mnel's answer – Neoprene 10/1, 2020 at 14:2

Very simple, tidy solutions.

Row number for entire data.frame

library(tidyverse)

iris %>%
  mutate(row_num = seq_along(Sepal.Length)) %>%
  head

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species row_num
1            5.1         3.5          1.4         0.2     setosa       1
2            4.9         3.0          1.4         0.2     setosa       2
3            4.7         3.2          1.3         0.2     setosa       3
..           ...         ...          ...         ...     ......     ...
148          6.5         3.0          5.2         2.0  virginica     148
149          6.2         3.4          5.4         2.3  virginica     149
150          5.9         3.0          5.1         1.8  virginica     150

Row number by group in data.frame

iris %>% 
  group_by(Species) %>% 
  mutate(num_in_group=seq_along(Species)) %>% 
  as.data.frame


    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species num_in_group
1            5.1         3.5          1.4         0.2     setosa            1
2            4.9         3.0          1.4         0.2     setosa            2
3            4.7         3.2          1.3         0.2     setosa            3
..           ...         ...          ...         ...     ......           ..
48           4.6         3.2          1.4         0.2     setosa           48
49           5.3         3.7          1.5         0.2     setosa           49
50           5.0         3.3          1.4         0.2     setosa           50
51           7.0         3.2          4.7         1.4 versicolor            1
52           6.4         3.2          4.5         1.5 versicolor            2
53           6.9         3.1          4.9         1.5 versicolor            3
..           ...         ...          ...         ...     ......           ..
98           6.2         2.9          4.3         1.3 versicolor           48
99           5.1         2.5          3.0         1.1 versicolor           49
100          5.7         2.8          4.1         1.3 versicolor           50
101          6.3         3.3          6.0         2.5  virginica            1
102          5.8         2.7          5.1         1.9  virginica            2
103          7.1         3.0          5.9         2.1  virginica            3
..           ...         ...          ...         ...     ......           ..
148          6.5         3.0          5.2         2.0  virginica           48
149          6.2         3.4          5.4         2.3  virginica           49
150          5.9         3.0          5.1         1.8  virginica           50

Pronuba answered 4/5, 2021 at 0:14 Comment(0)

In dplyr 1.1.0 and later (the development version when this was written), mutate() accepts a .by argument:

library(dplyr)
df %>%
  mutate(num = row_number(), .by = "cat")

Manzoni answered 6/1, 2023 at 0:23 Comment(0)

Another base R solution is to split the data frame by cat, use lapply to add a column numbered 1 to nrow(x) to each piece, and then bind the pieces back into a single data frame with do.call:

df_split <- split(df, df$cat)
# add a within-group counter to each piece
df_lapply <- lapply(df_split, function(x) {
  x$num <- seq_len(nrow(x))
  return(x)
})
# bind the pieces back together
df <- do.call(rbind, df_lapply)
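
A side effect worth knowing, sketched on a small made-up data frame (toy, parts and out are illustrative names): do.call(rbind, ...) on a named list prefixes each row name with the group name and returns the rows grouped by cat rather than in their original order. The row names can be reset afterwards if they are not wanted.

```r
# Toy data, deliberately not sorted by cat (illustrative only)
toy <- data.frame(cat = c("b", "a", "b", "a"), val = c(4, 3, 2, 1))

parts <- lapply(split(toy, toy$cat), function(x) {
  x$num <- seq_len(nrow(x))  # counter within each piece
  x
})
out <- do.call(rbind, parts)

rownames(out)  # "a.2" "a.4" "b.1" "b.3"
out$num        # 1 2 1 2

# Drop the composite row names if plain 1..n is preferred
rownames(out) <- NULL
```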

Thenar answered 27/10, 2020 at 12:51 Comment(0)

A collapse/data.table solution which uses a grouped cumulative sum on a sequence of ones.

library(data.table)
library(collapse)

set.seed(100) 
df <- data.table(cat = c(rep("aaa", 5), rep("bbb", 5), rep("ccc", 5)), 
                 val = runif(15))
setorder(df, cat, val)

df[, id := fcumsum(alloc(1L, .N), g = cat)][]
#>     cat        val id
#>  1: aaa 0.05638315  1
#>  2: aaa 0.25767250  2
#>  3: aaa 0.30776611  3
#>  4: aaa 0.46854928  4
#>  5: aaa 0.55232243  5
#>  6: bbb 0.17026205  1
#>  7: bbb 0.37032054  2
#>  8: bbb 0.48377074  3
#>  9: bbb 0.54655860  4
#> 10: bbb 0.81240262  5
#> 11: ccc 0.28035384  1
#> 12: ccc 0.39848790  2
#> 13: ccc 0.62499648  3
#> 14: ccc 0.76255108  4
#> 15: ccc 0.88216552  5

Created on 2023-06-07 with reprex v2.0.2

Pupil answered 7/6, 2023 at 13:57 Comment(1)

Perhaps one of several "pure" {collapse} options: X |> fgroup_by(cat) |> fmutate(id = seq_along(val)), where X is df. – Asafoetida 19/4 at 15:57
