Repeat each row of data.frame the number of times specified in a column
Asked Answered
P

10

202
df <- data.frame(var1 = c('a', 'b', 'c'), var2 = c('d', 'e', 'f'),
                 freq = 1:3)

What is the simplest way to expand each row the first two columns of the data.frame above, so that each row is repeated the number of times specified in the column 'freq'?

In other words, go from this:

df
  var1 var2 freq
1    a    d    1
2    b    e    2
3    c    f    3

To this:

df.expanded
  var1 var2
1    a    d
2    b    e
3    b    e
4    c    f
5    c    f
6    c    f
Photoengrave answered 24/5, 2010 at 4:46 Comment(0)
J
201

Here's one solution:

df.expanded <- df[rep(row.names(df), df$freq), 1:2]

Result:

    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f
Joviality answered 24/5, 2010 at 5:1 Comment(5)
Great! I always forget you can use square brackets that way. I keep thinking of indexing just for subsetting or reordering. I had another solution that is far less elegant and no doubt less efficient. I might post anyway so that others can compare.Photoengrave
For large data.frame more efficient is to replace row.names(df) with seq.int(1,nrow(df)) or seq_len(nrow(df)).Disproportionation
This worked fantastically for a big data frame -- 1.5million rows, 5 cols, went very quick. Thanks!Swithbert
1:2 hard codes the solution to this example, 1:ncol(df) will work for an arbitrary dataframe.Lowndes
vladim, it should be 1:(ncol(df)-1) to remove the "freq" columnSnazzy
G
118

old question, new verb in tidyverse:

library(tidyr) # version >= 0.8.0
df <- data.frame(var1=c('a', 'b', 'c'), var2=c('d', 'e', 'f'), freq=1:3)
df %>% 
  uncount(freq)

    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f
Galibi answered 1/2, 2018 at 21:12 Comment(0)
N
49

Use expandRows() from the splitstackshape package:

library(splitstackshape)
expandRows(df, "freq")

Simple syntax, very fast, works on data.frame or data.table.

Result:

    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f
Nymphalid answered 11/5, 2015 at 16:18 Comment(0)
H
28

@neilfws's solution works great for data.frames, but not for data.tables since they lack the row.names property. This approach works for both:

df.expanded <- df[rep(seq(nrow(df)), df$freq), 1:2]

The code for data.table is a tad cleaner:

# convert to data.table by reference
setDT(df)
df.expanded <- df[rep(seq(.N), freq), !"freq"]
Headwork answered 11/12, 2014 at 23:1 Comment(2)
another alternative: df[rep(seq(.N), freq)][, freq := NULL]Clobber
another alternative df[rep(1:.N, freq)][, freq:=NULL]Caboodle
D
9

Another dplyr alternative with slice where we repeat each row number freq times

library(dplyr)

df %>%  
  slice(rep(seq_len(n()), freq)) %>% 
  select(-freq)

#  var1 var2
#1    a    d
#2    b    e
#3    b    e
#4    c    f
#5    c    f
#6    c    f

seq_len(n()) part can be replaced with any of the following.

df %>% slice(rep(1:nrow(df), freq)) %>% select(-freq)
#Or
df %>% slice(rep(row_number(), freq)) %>% select(-freq)
#Or
df %>% slice(rep(seq_len(nrow(.)), freq)) %>% select(-freq)
Disquiet answered 8/7, 2019 at 6:20 Comment(0)
C
8

I know this is not the case but if you need to keep the original freq column, you can use another tidyverse approach together with rep:

library(purrr)

df <- data.frame(var1 = c('a', 'b', 'c'), var2 = c('d', 'e', 'f'), freq = 1:3)

df %>% 
  map_df(., rep, .$freq)
#> # A tibble: 6 x 3
#>   var1  var2   freq
#>   <fct> <fct> <int>
#> 1 a     d         1
#> 2 b     e         2
#> 3 b     e         2
#> 4 c     f         3
#> 5 c     f         3
#> 6 c     f         3

Created on 2019-12-21 by the reprex package (v0.3.0)

Carin answered 21/12, 2019 at 3:54 Comment(1)
Or just use .remove = FALSE in uncount()Kandrakandy
N
7

In case you have to do this operation on very large data.frames I would recommend converting it into a data.table and use the following, which should run much faster:

library(data.table)
dt <- data.table(df)
dt.expanded <- dt[ ,list(freq=rep(1,freq)),by=c("var1","var2")]
dt.expanded[ ,freq := NULL]
dt.expanded

See how much faster this solution is:

df <- data.frame(var1=1:2e3, var2=1:2e3, freq=1:2e3)
system.time(df.exp <- df[rep(row.names(df), df$freq), 1:2])
##    user  system elapsed 
##    4.57    0.00    4.56
dt <- data.table(df)
system.time(dt.expanded <- dt[ ,list(freq=rep(1,freq)),by=c("var1","var2")])
##    user  system elapsed 
##    0.05    0.01    0.06
Nosebleed answered 6/7, 2015 at 10:18 Comment(8)
I get an error: Error in rep(1, freq) : invalid 'times' argument. And given that there is already a data.table answer to this question, you may want to describe how your approach is different or when it is better than the current data.table answer. Or if there's not a major difference, you could add it as a comment to the existing answer instead.Nymphalid
@SamFirke: Thank you for your comment. Strange, I just tried it again and I get no such error. Do you use the original dffrom the OP's question? My answer is better because the other answer is kind of misusing the data.table package by using data.frame syntax, see the FAQ of data.table: " It is generally bad practice to refer to columns by number rather than name."Nosebleed
Thanks for the explanation. Your code works for me on the sample df posted by the OP, but when I tried to benchmark this on a larger data.frame I got that error. The data.frame I used was: set.seed(1) dfbig <- data.frame(var1=sample(letters, 1000, replace = TRUE), var2=sample(LETTERS, 1000, replace = TRUE), freq=sample(1:10, 1000, replace = TRUE)) On the tiny data.frame, the base answer does well in my benchmarking, it just doesn't scale well to bigger data.frames. The other three answers ran successfully with this larger data.frame.Nymphalid
@SamFirke: This is indeed strange, it should work there too and I don't know why it doesn't. Do you want to create a question out of it or shall I?Nosebleed
Good idea. Can you? I don't know data.table syntax so I shouldn't be the one judging the answers.Nymphalid
@SamFirke: Did it, so let's see what comes in... #31276657Nosebleed
@SamFirke: We found the problem, it was quite a ride: #31276657Nosebleed
Let us continue this discussion in chat.Nymphalid
P
5

Another possibility is using tidyr::expand:

library(dplyr)
library(tidyr)

df %>% group_by_at(vars(-freq)) %>% expand(temp = 1:freq) %>% select(-temp)
#> # A tibble: 6 x 2
#> # Groups:   var1, var2 [3]
#>   var1  var2 
#>   <fct> <fct>
#> 1 a     d    
#> 2 b     e    
#> 3 b     e    
#> 4 c     f    
#> 5 c     f    
#> 6 c     f

One-liner version of vonjd's answer:

library(data.table)

setDT(df)[ ,list(freq=rep(1,freq)),by=c("var1","var2")][ ,freq := NULL][]
#>    var1 var2
#> 1:    a    d
#> 2:    b    e
#> 3:    b    e
#> 4:    c    f
#> 5:    c    f
#> 6:    c    f

Created on 2019-05-21 by the reprex package (v0.2.1)

Palestrina answered 21/5, 2019 at 20:14 Comment(0)
R
4

I am providing one more addition to this wonderful thread of nice answers! Use the tidyr package (included in tidyverse) for a one-liner solution:

df %>% tidyr::uncount(weights = freq)
Requiem answered 12/10, 2022 at 5:19 Comment(0)
M
0

in fact. use the methods of vector and index. we can also achieve the same result, and more easier to understand:

rawdata <- data.frame('time' = 1:3, 
           'x1' = 4:6,
           'x2' = 7:9,
           'x3' = 10:12)

rawdata[rep(1, time=2), ] %>% remove_rownames()
#  time x1 x2 x3
# 1    1  4  7 10
# 2    1  4  7 10


Manama answered 2/12, 2020 at 5:54 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.