Repeat each row of data.frame the number of times specified in a column

Asked 24/5, 2010 at 4:46 Answered 12/10, 2022 at 5:19

202

df <- data.frame(var1 = c('a', 'b', 'c'), var2 = c('d', 'e', 'f'),
                 freq = 1:3)

What is the simplest way to expand each row the first two columns of the data.frame above, so that each row is repeated the number of times specified in the column 'freq'?

In other words, go from this:

df
  var1 var2 freq
1    a    d    1
2    b    e    2
3    c    f    3

To this:

df.expanded
  var1 var2
1    a    d
2    b    e
3    b    e
4    c    f
5    c    f
6    c    f

Photoengrave answered 24/5, 2010 at 4:46 Comment(0)

201

Here's one solution:

df.expanded <- df[rep(row.names(df), df$freq), 1:2]

Result:

    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f

Joviality answered 24/5, 2010 at 5:1 Comment(5)

Great! I always forget you can use square brackets that way. I keep thinking of indexing just for subsetting or reordering. I had another solution that is far less elegant and no doubt less efficient. I might post anyway so that others can compare. – Photoengrave 24/5, 2010 at 10:30

For large data.frame more efficient is to replace row.names(df) with seq.int(1,nrow(df)) or seq_len(nrow(df)). – Disproportionation 25/5, 2010 at 11:54

This worked fantastically for a big data frame -- 1.5million rows, 5 cols, went very quick. Thanks! – Swithbert 21/11, 2012 at 6:16

1:2 hard codes the solution to this example, 1:ncol(df) will work for an arbitrary dataframe. – Lowndes 30/8, 2018 at 6:28

vladim, it should be 1:(ncol(df)-1) to remove the "freq" column – Snazzy 18/12, 2023 at 23:55

118

old question, new verb in tidyverse:

library(tidyr) # version >= 0.8.0
df <- data.frame(var1=c('a', 'b', 'c'), var2=c('d', 'e', 'f'), freq=1:3)
df %>% 
  uncount(freq)

    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f

Galibi answered 1/2, 2018 at 21:12 Comment(0)

Use expandRows() from the splitstackshape package:

library(splitstackshape)
expandRows(df, "freq")

Simple syntax, very fast, works on data.frame or data.table.

Result:

    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f

Nymphalid answered 11/5, 2015 at 16:18 Comment(0)

@neilfws's solution works great for data.frames, but not for data.tables since they lack the row.names property. This approach works for both:

df.expanded <- df[rep(seq(nrow(df)), df$freq), 1:2]

The code for data.table is a tad cleaner:

# convert to data.table by reference
setDT(df)
df.expanded <- df[rep(seq(.N), freq), !"freq"]

Headwork answered 11/12, 2014 at 23:1 Comment(2)

another alternative: df[rep(seq(.N), freq)][, freq := NULL] – Clobber 10/4, 2017 at 6:4

another alternative df[rep(1:.N, freq)][, freq:=NULL] – Caboodle 11/8, 2019 at 12:46

Another dplyr alternative with slice where we repeat each row number freq times

library(dplyr)

df %>%  
  slice(rep(seq_len(n()), freq)) %>% 
  select(-freq)

#  var1 var2
#1    a    d
#2    b    e
#3    b    e
#4    c    f
#5    c    f
#6    c    f

seq_len(n()) part can be replaced with any of the following.

df %>% slice(rep(1:nrow(df), freq)) %>% select(-freq)
#Or
df %>% slice(rep(row_number(), freq)) %>% select(-freq)
#Or
df %>% slice(rep(seq_len(nrow(.)), freq)) %>% select(-freq)

Disquiet answered 8/7, 2019 at 6:20 Comment(0)

I know this is not the case but if you need to keep the original freq column, you can use another tidyverse approach together with rep:

library(purrr)

df <- data.frame(var1 = c('a', 'b', 'c'), var2 = c('d', 'e', 'f'), freq = 1:3)

df %>% 
  map_df(., rep, .$freq)
#> # A tibble: 6 x 3
#>   var1  var2   freq
#>   <fct> <fct> <int>
#> 1 a     d         1
#> 2 b     e         2
#> 3 b     e         2
#> 4 c     f         3
#> 5 c     f         3
#> 6 c     f         3

^{Created on 2019-12-21 by the reprex package (v0.3.0)}

Carin answered 21/12, 2019 at 3:54 Comment(1)

Or just use .remove = FALSE in uncount() – Kandrakandy 30/6, 2020 at 15:35

In case you have to do this operation on very large data.frames I would recommend converting it into a data.table and use the following, which should run much faster:

library(data.table)
dt <- data.table(df)
dt.expanded <- dt[ ,list(freq=rep(1,freq)),by=c("var1","var2")]
dt.expanded[ ,freq := NULL]
dt.expanded

See how much faster this solution is:

df <- data.frame(var1=1:2e3, var2=1:2e3, freq=1:2e3)
system.time(df.exp <- df[rep(row.names(df), df$freq), 1:2])
##    user  system elapsed 
##    4.57    0.00    4.56
dt <- data.table(df)
system.time(dt.expanded <- dt[ ,list(freq=rep(1,freq)),by=c("var1","var2")])
##    user  system elapsed 
##    0.05    0.01    0.06

Nosebleed answered 6/7, 2015 at 10:18 Comment(8)

I get an error: Error in rep(1, freq) : invalid 'times' argument. And given that there is already a data.table answer to this question, you may want to describe how your approach is different or when it is better than the current data.table answer. Or if there's not a major difference, you could add it as a comment to the existing answer instead. – Nymphalid 7/7, 2015 at 16:20

@SamFirke: Thank you for your comment. Strange, I just tried it again and I get no such error. Do you use the original dffrom the OP's question? My answer is better because the other answer is kind of misusing the data.table package by using data.frame syntax, see the FAQ of data.table: " It is generally bad practice to refer to columns by number rather than name." – Nosebleed 7/7, 2015 at 16:37

Thanks for the explanation. Your code works for me on the sample df posted by the OP, but when I tried to benchmark this on a larger data.frame I got that error. The data.frame I used was:

set.seed(1) dfbig <- data.frame(var1=sample(letters, 1000, replace = TRUE), var2=sample(LETTERS, 1000, replace = TRUE), freq=sample(1:10, 1000, replace = TRUE))

On the tiny data.frame, the base answer does well in my benchmarking, it just doesn't scale well to bigger data.frames. The other three answers ran successfully with this larger data.frame. – Nymphalid 7/7, 2015 at 17:6

@SamFirke: This is indeed strange, it should work there too and I don't know why it doesn't. Do you want to create a question out of it or shall I? – Nosebleed 7/7, 2015 at 17:18

Good idea. Can you? I don't know data.table syntax so I shouldn't be the one judging the answers. – Nymphalid 7/7, 2015 at 17:22

@SamFirke: Did it, so let's see what comes in... #31276657 – Nosebleed 7/7, 2015 at 18:6

@SamFirke: We found the problem, it was quite a ride: #31276657 – Nosebleed 7/7, 2015 at 19:17

Let us continue this discussion in chat. – Nymphalid 7/7, 2015 at 19:42

Another possibility is using tidyr::expand:

library(dplyr)
library(tidyr)

df %>% group_by_at(vars(-freq)) %>% expand(temp = 1:freq) %>% select(-temp)

#> # A tibble: 6 x 2
#> # Groups:   var1, var2 [3]
#>   var1  var2 
#>   <fct> <fct>
#> 1 a     d    
#> 2 b     e    
#> 3 b     e    
#> 4 c     f    
#> 5 c     f    
#> 6 c     f

One-liner version of vonjd's answer:

library(data.table)

setDT(df)[ ,list(freq=rep(1,freq)),by=c("var1","var2")][ ,freq := NULL][]

#>    var1 var2
#> 1:    a    d
#> 2:    b    e
#> 3:    b    e
#> 4:    c    f
#> 5:    c    f
#> 6:    c    f

^{Created on 2019-05-21 by the reprex package (v0.2.1)}

Palestrina answered 21/5, 2019 at 20:14 Comment(0)

I am providing one more addition to this wonderful thread of nice answers! Use the tidyr package (included in tidyverse) for a one-liner solution:

df %>% tidyr::uncount(weights = freq)

Requiem answered 12/10, 2022 at 5:19 Comment(0)

in fact. use the methods of vector and index. we can also achieve the same result, and more easier to understand:

rawdata <- data.frame('time' = 1:3, 
           'x1' = 4:6,
           'x2' = 7:9,
           'x3' = 10:12)

rawdata[rep(1, time=2), ] %>% remove_rownames()
#  time x1 x2 x3
# 1    1  4  7 10
# 2    1  4  7 10

Manama answered 2/12, 2020 at 5:54 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags