Numbering rows within groups in a data frame
N

11

248

Working with a data frame similar to this:

set.seed(100)  
df <- data.frame(cat = c(rep("aaa", 5), rep("bbb", 5), rep("ccc", 5)), val = runif(15))             
df <- df[order(df$cat, df$val), ]  
df  
   cat        val  
1  aaa 0.05638315  
2  aaa 0.25767250  
3  aaa 0.30776611  
4  aaa 0.46854928  
5  aaa 0.55232243  
6  bbb 0.17026205  
7  bbb 0.37032054  
8  bbb 0.48377074  
9  bbb 0.54655860  
10 bbb 0.81240262  
11 ccc 0.28035384  
12 ccc 0.39848790  
13 ccc 0.62499648  
14 ccc 0.76255108  
15 ccc 0.88216552 

I am trying to add a column that numbers the rows within each group. The loop below works, but it obviously doesn't use the strengths of R:

 df$num <- 1  
 for (i in 2:(length(df[,1]))) {  
   if (df[i,"cat"]==df[(i-1),"cat"]) {  
     df[i,"num"]<-df[i-1,"num"]+1  
     }  
 }  
 df  
   cat        val num  
1  aaa 0.05638315   1  
2  aaa 0.25767250   2  
3  aaa 0.30776611   3  
4  aaa 0.46854928   4  
5  aaa 0.55232243   5  
6  bbb 0.17026205   1  
7  bbb 0.37032054   2  
8  bbb 0.48377074   3  
9  bbb 0.54655860   4  
10 bbb 0.81240262   5  
11 ccc 0.28035384   1  
12 ccc 0.39848790   2  
13 ccc 0.62499648   3  
14 ccc 0.76255108   4  
15 ccc 0.88216552   5  

What would be a good way to do this?

Neoprene answered 16/10, 2012 at 23:38 Comment(2)
I would suggest adding something like "seq along levels" or "counting along replicates" to the question title, as this is how I found this question and it is exactly what I was looking for. - Kovacev
@Kovacev If that were the title, I wouldn't have found what I was looking for :-( I was literally looking for "how to number rows within groups in a data frame". - Evadnee
P
396

Use ave, ddply, dplyr or data.table:

df$num <- ave(df$val, df$cat, FUN = seq_along)

or:

library(plyr)
ddply(df, .(cat), mutate, id = seq_along(val))

or:

library(dplyr)
df %>% group_by(cat) %>% mutate(id = row_number())

or (the most memory efficient, as it assigns by reference within DT):

library(data.table)
DT <- data.table(df)

# Either of the following adds the counter by reference:
DT[, id := seq_len(.N), by = cat]
DT[, id := rowid(cat)]
Pictish answered 16/10, 2012 at 23:41 Comment(9)
It might be worth mentioning that ave gives a float instead of an int here. Alternatively, you could change df$val to seq_len(nrow(df)). I just ran into this over here: #42797357 - Moncada
Interestingly, this data.table solution seems to be quicker than using frank: library(microbenchmark); microbenchmark(a = DT[, .(val, num = frank(val)), by = list(cat)], b = DT[, .(val, id = seq_len(.N)), by = list(cat)], times = 1000L) - Affective
Thanks! The dplyr solution is good. But if, like me, you keep getting weird errors when trying this approach, make sure that you are not getting conflicts between plyr and dplyr, as explained in this post. It can be avoided by explicitly calling dplyr::mutate(...). - Minnesota
Another data.table method is setDT(df)[, id := rleid(val), by = .(cat)] - Zipangu
How can the library(plyr) and library(dplyr) answers be modified so that the numbering follows the val column in descending order? - Gentlewoman
I tried using the plyr method and got an error: "Error in unique.default(x) : unique() applies only to vectors". Has anyone ever seen that happen? - Saccharin
@PrzemyslawRemin You can simply sort the whole dataset in advance. df <- df[order(df$val),] - Loredo
data.table was the most effective way; it took less than a second to compute about 17000 rows. With ddply it was running forever, so I had to kill the R process. - Loredo
What can I do if val contains some NAs and I don't want those rows to be counted when creating num, although I still want them to appear in the data frame with NA in the num column? - Montpelier
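Several of the points raised in the comments above (the non-integer result from ave, descending order, and NA values in val) can be handled in base R. The following is a minimal sketch and not part of the original answer; the example data and the column names num, num_desc and num_na are made up for illustration.

```r
# Hypothetical example data with one NA in val
df <- data.frame(cat = rep(c("aaa", "bbb"), each = 3),
                 val = c(0.3, 0.1, 0.2, 0.9, NA, 0.5))

# Passing an integer vector to ave() keeps the result an integer
df$num <- ave(seq_along(df$cat), df$cat, FUN = seq_along)

# Descending order of val within each group: rank the negated values.
# ties.method = "first" gives distinct ranks; NA values stay NA because
# rank() defaults to na.last = "keep". The result is a numeric column.
df$num_desc <- ave(-df$val, df$cat,
                   FUN = function(x) rank(x, ties.method = "first"))

# Number only the non-NA rows; rows with NA in val keep NA in num_na
ok <- !is.na(df$val)
df$num_na <- NA_integer_
df$num_na[ok] <- ave(seq_along(df$cat)[ok], df$cat[ok], FUN = seq_along)
```

The third variant counts rows in their existing order, so sort the data frame first if the counter should follow val.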
R
35

To make this Q&A more complete, here is a base R alternative using sequence and rle:

df$num <- sequence(rle(df$cat)$lengths)

which gives the intended result:

> df
   cat        val num
4  aaa 0.05638315   1
2  aaa 0.25767250   2
1  aaa 0.30776611   3
5  aaa 0.46854928   4
3  aaa 0.55232243   5
10 bbb 0.17026205   1
8  bbb 0.37032054   2
6  bbb 0.48377074   3
9  bbb 0.54655860   4
7  bbb 0.81240262   5
13 ccc 0.28035384   1
14 ccc 0.39848790   2
11 ccc 0.62499648   3
15 ccc 0.76255108   4
12 ccc 0.88216552   5

If df$cat is a factor variable, you need to wrap it in as.character first:

df$num <- sequence(rle(as.character(df$cat))$lengths)
Ramonramona answered 6/10, 2017 at 20:1 Comment(2)
Just noticed: does this solution require the cat column to be sorted? - Keystroke
@Keystroke Yes, unless you want to number by consecutive occurrences of cat. - Ramonramona
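To illustrate the point made in these comments, here is a small sketch with made-up, unsorted data (the vector x is hypothetical). sequence(rle(...)) restarts at every run of consecutive equal values, whereas ave() numbers within the whole group regardless of row order.

```r
x <- c("a", "a", "b", "a", "b", "b")

# Numbering within runs of consecutive equal values
by_run <- sequence(rle(x)$lengths)
by_run    # 1 2 1 1 1 2

# Numbering within groups, regardless of row order
by_group <- ave(seq_along(x), x, FUN = seq_along)
by_group  # 1 2 1 3 2 3
```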
B
14

Here is a small improvement that also sorts 'val' within the groups:

# 1. Data set
set.seed(100)
df <- data.frame(
  cat = c(rep("aaa", 5), rep("ccc", 5), rep("bbb", 5)), 
  val = runif(15))             

# 2. 'dplyr' approach
library(dplyr)
df %>% 
  arrange(cat, val) %>% 
  group_by(cat) %>% 
  mutate(id = row_number())
Brader answered 22/9, 2018 at 7:40 Comment(1)
Can you not sort after the group_by? - Crosscut
R
12

Another dplyr possibility could be:

df %>%
 group_by(cat) %>%
 mutate(num = 1:n())

   cat      val   num
   <fct>  <dbl> <int>
 1 aaa   0.0564     1
 2 aaa   0.258      2
 3 aaa   0.308      3
 4 aaa   0.469      4
 5 aaa   0.552      5
 6 bbb   0.170      1
 7 bbb   0.370      2
 8 bbb   0.484      3
 9 bbb   0.547      4
10 bbb   0.812      5
11 ccc   0.280      1
12 ccc   0.398      2
13 ccc   0.625      3
14 ccc   0.763      4
15 ccc   0.882      5
Rod answered 28/6, 2019 at 19:52 Comment(1)
In some cases seq_len(n()) is safer than 1:n(): if n() can return 0 somewhere in your sequence of operations, 1:0 gives a length-two vector while seq_len(0) gives a length-zero vector, which avoids a length mismatch error in mutate(). - Dovev
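The difference described in this comment can be checked in plain R without dplyr. A minimal sketch, where n stands in for the size of an empty group:

```r
n <- 0L                     # size of a hypothetical empty group

colon_result <- 1:n         # the colon operator counts down: 1 0
safe_result  <- seq_len(n)  # integer(0)

length(colon_result)        # 2
length(safe_result)         # 0
```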
A
9

I would like to add a data.table variant using the rank() function. It makes it possible to change the ordering, which makes it a bit more flexible than the seq_len() solution, and it is quite similar to the row_number functions in an RDBMS.

# Variant with ascending ordering
library(data.table)
dt <- data.table(df)
dt[, .( val
   , num = rank(val))
    , by = list(cat)][order(cat, num),]

    cat        val num
 1: aaa 0.05638315   1
 2: aaa 0.25767250   2
 3: aaa 0.30776611   3
 4: aaa 0.46854928   4
 5: aaa 0.55232243   5
 6: bbb 0.17026205   1
 7: bbb 0.37032054   2
 8: bbb 0.48377074   3
 9: bbb 0.54655860   4
10: bbb 0.81240262   5
11: ccc 0.28035384   1
12: ccc 0.39848790   2
13: ccc 0.62499648   3
14: ccc 0.76255108   4
15: ccc 0.88216552   5

# Variant with descending ordering
# (desc() comes from dplyr; for a numeric column, rank(-val) is equivalent)
dt[, .( val
   , num = rank(dplyr::desc(val)))
    , by = list(cat)][order(cat, num),]

Edit on 2021-04-16 to make the switch between descending and ascending order more fail-safe
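
One caveat when using rank() as a row counter: with the default ties.method = "average", tied values share a fractional rank, so the result is not a plain 1..n sequence. The following sketch uses made-up values and assumes that distinct row numbers are wanted; the variable names are illustrative only.

```r
val <- c(0.2, 0.5, 0.2, 0.9)

r_default <- rank(val)                          # 1.5 3.0 1.5 4.0
r_first   <- rank(val, ties.method = "first")   # 1 3 2 4
r_desc    <- rank(-val, ties.method = "first")  # descending: 3 2 4 1
```

With ties.method = "first", tied values are numbered in the order in which they appear.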

Affective answered 18/6, 2018 at 9:28 Comment(0)
B
8

Here is an option using a for loop over groups rather than over rows (as the OP did):

for (i in unique(df$cat)) df$num[df$cat == i] <- seq_len(sum(df$cat == i))
Beaufort answered 16/10, 2012 at 23:51 Comment(0)
L
4

Using the rowid() function in data.table:

> set.seed(100)  
> df <- data.frame(cat = c(rep("aaa", 5), rep("bbb", 5), rep("ccc", 5)), val = runif(15))
> df <- df[order(df$cat, df$val), ]  
> df$num <- data.table::rowid(df$cat)
> df
   cat        val num
4  aaa 0.05638315   1
2  aaa 0.25767250   2
1  aaa 0.30776611   3
5  aaa 0.46854928   4
3  aaa 0.55232243   5
10 bbb 0.17026205   1
8  bbb 0.37032054   2
6  bbb 0.48377074   3
9  bbb 0.54655860   4
7  bbb 0.81240262   5
13 ccc 0.28035384   1
14 ccc 0.39848790   2
11 ccc 0.62499648   3
15 ccc 0.76255108   4
12 ccc 0.88216552   5
Lauralauraceous answered 10/1, 2020 at 13:55 Comment(1)
Thanks for your answer, but it seems to be already covered by the last suggestion in @mnel's answer. - Neoprene
P
2

Very simple, tidy solutions.

Row number for entire data.frame

library(tidyverse)

iris %>%
  mutate(row_num = seq_along(Sepal.Length)) %>%
  head

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species row_num
1            5.1         3.5          1.4         0.2     setosa       1
2            4.9         3.0          1.4         0.2     setosa       2
3            4.7         3.2          1.3         0.2     setosa       3
..           ...         ...          ...         ...     ......     ...
148          6.5         3.0          5.2         2.0  virginica     148
149          6.2         3.4          5.4         2.3  virginica     149
150          5.9         3.0          5.1         1.8  virginica     150

Row number by group in data.frame

iris %>% 
  group_by(Species) %>% 
  mutate(num_in_group=seq_along(Species)) %>% 
  as.data.frame


    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species num_in_group
1            5.1         3.5          1.4         0.2     setosa            1
2            4.9         3.0          1.4         0.2     setosa            2
3            4.7         3.2          1.3         0.2     setosa            3
..           ...         ...          ...         ...     ......           ..
48           4.6         3.2          1.4         0.2     setosa           48
49           5.3         3.7          1.5         0.2     setosa           49
50           5.0         3.3          1.4         0.2     setosa           50
51           7.0         3.2          4.7         1.4 versicolor            1
52           6.4         3.2          4.5         1.5 versicolor            2
53           6.9         3.1          4.9         1.5 versicolor            3
..           ...         ...          ...         ...     ......           ..
98           6.2         2.9          4.3         1.3 versicolor           48
99           5.1         2.5          3.0         1.1 versicolor           49
100          5.7         2.8          4.1         1.3 versicolor           50
101          6.3         3.3          6.0         2.5  virginica            1
102          5.8         2.7          5.1         1.9  virginica            2
103          7.1         3.0          5.9         2.1  virginica            3
..           ...         ...          ...         ...     ......           ..
148          6.5         3.0          5.2         2.0  virginica           48
149          6.2         3.4          5.4         2.3  virginica           49
150          5.9         3.0          5.1         1.8  virginica           50
Pronuba answered 4/5, 2021 at 0:14 Comment(0)
M
2

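With dplyr 1.1.0 or later (the development version when this was written), the grouping can be given inline with the .by argument: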

library(dplyr)
df %>%
  mutate(num = row_number(), .by = "cat")
Manzoni answered 6/1, 2023 at 0:23 Comment(0)
T
0

Another base R solution is to split the data frame by cat, use lapply to add a column numbered from 1 to nrow(x) to each piece, and finally combine the pieces back into one data frame with do.call:

        df_split <- split(df, df$cat)
        df_lapply <- lapply(df_split, function(x) {
          x$num <- seq_len(nrow(x))
          return(x)
        })
        df <- do.call(rbind, df_lapply)
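
Note that do.call(rbind, ...) returns the rows grouped by cat and prefixes the row names with the group name (for example aaa.1). If the original row order and row names should be kept, base R's unsplit() can reassemble the pieces instead. A minimal sketch with made-up data, where pieces and res are illustrative names:

```r
df <- data.frame(cat = c("b", "a", "b", "a"), val = c(4, 3, 2, 1))

# Add the within-group counter to each piece
pieces <- lapply(split(df, df$cat), function(x) {
  x$num <- seq_len(nrow(x))
  x
})

# unsplit() puts every row back in its original position
res <- unsplit(pieces, df$cat)
res$num  # 1 1 2 2
```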
Thenar answered 27/10, 2020 at 12:51 Comment(0)
P
0

A collapse/data.table solution which uses a grouped cumulative sum on a sequence of ones.

library(data.table)
library(collapse)

set.seed(100) 
df <- data.table(cat = c(rep("aaa", 5), rep("bbb", 5), rep("ccc", 5)), 
                 val = runif(15))
setorder(df, cat, val)

df[, id := fcumsum(alloc(1L, .N), g = cat)][]
#>     cat        val id
#>  1: aaa 0.05638315  1
#>  2: aaa 0.25767250  2
#>  3: aaa 0.30776611  3
#>  4: aaa 0.46854928  4
#>  5: aaa 0.55232243  5
#>  6: bbb 0.17026205  1
#>  7: bbb 0.37032054  2
#>  8: bbb 0.48377074  3
#>  9: bbb 0.54655860  4
#> 10: bbb 0.81240262  5
#> 11: ccc 0.28035384  1
#> 12: ccc 0.39848790  2
#> 13: ccc 0.62499648  3
#> 14: ccc 0.76255108  4
#> 15: ccc 0.88216552  5

Created on 2023-06-07 with reprex v2.0.2

Pupil answered 7/6, 2023 at 13:57 Comment(1)
Perhaps one of several "pure" {collapse} options: X |> fgroup_by(cat) |> fmutate(id = seq_along(val)), where X is df. - Asafoetida
