Cumulatively paste (concatenate) values grouped by another variable
Asked Answered
E

6

17

I have a problem dealing with a data frame in R. I would like to paste the contents of cells in different rows together based on the values of the cells in another column. My problem is that I want the output to be progressively (cumulatively) printed. The output vector must be of the same length as the input vector. Here is a sampel table similar to the one I am dealing with:

id <- c("a", "a", "a", "b", "b", "b")
content <- c("A", "B", "A", "B", "C", "B")
(testdf <- data.frame(id, content, stringsAsFactors=FALSE))
#  id content
#1  a       A
#2  a       B
#3  a       A
#4  b       B
#5  b       C
#6  b       B

And this is want I want the result to look like:

result <- c("A", "A B", "A B A", "B", "B C", "B C B") 
result

#[1] "A"     "A B"   "A B A" "B"     "B C"   "B C B"

What I do NOT need something like this:

ddply(testdf, .(id), summarize, content_concatenated = paste(content, collapse = " "))

#  id content_concatenated
#1  a                A B A
#2  b                B C B
Etan answered 21/7, 2014 at 9:44 Comment(5)
You want something like a "cumulative paste". You could use Reduce: ave(as.character(testdf$content), testdf$id, FUN = function(x) Reduce(paste, x, acc = T))Tysontyumen
@alexis_laz, this is the comment box, not the answer box :-)Broker
@alexis_laz: great! it worked! thanks a lot! all the bestEtan
@AnandaMahto : I tend to see it as the "it-has-to-be-in-SO-somewhere-but-I'm-too-lazy-to-search" box :PTysontyumen
@alexis_laz, but people don't read comments for the answers. So unless you're going to do the work of finding a duplicate to mark, it's much more helpful to the community if you took the 10 seconds to post an answer, let the OP accept it, and show clearly that this is a resolved question.... At least that's my perspective.Broker
T
38

You could define a "cumulative paste" function using Reduce:

cumpaste = function(x, .sep = " ") 
          Reduce(function(x1, x2) paste(x1, x2, sep = .sep), x, accumulate = TRUE)

cumpaste(letters[1:3], "; ")
#[1] "a"       "a; b"    "a; b; c"

Reduce's loop avoids re-concatenating elements from the start as it elongates the previous concatenation by the next element.

Applying it by group:

ave(as.character(testdf$content), testdf$id, FUN = cumpaste)
#[1] "A"     "A B"   "A B A" "B"     "B C"   "B C B"

Another idea, could to concatenate the whole vector at start and, then, progressively substring:

cumpaste2 = function(x, .sep = " ")
{
    concat = paste(x, collapse = .sep)
    substring(concat, 1L, cumsum(c(nchar(x[[1L]]), nchar(x[-1L]) + nchar(.sep))))
}
cumpaste2(letters[1:3], " ;@-")
#[1] "a"           "a ;@-b"      "a ;@-b ;@-c"

This seems to be somewhat faster, too:

set.seed(077)
X = replicate(1e3, paste(sample(letters, sample(0:5, 1), TRUE), collapse = ""))
identical(cumpaste(X, " --- "), cumpaste2(X, " --- "))
#[1] TRUE
microbenchmark::microbenchmark(cumpaste(X, " --- "), cumpaste2(X, " --- "), times = 30)
#Unit: milliseconds
#                  expr      min       lq     mean   median       uq      max neval cld
#  cumpaste(X, " --- ") 21.19967 21.82295 26.47899 24.83196 30.34068 39.86275    30   b
# cumpaste2(X, " --- ") 14.41291 14.92378 16.87865 16.03339 18.56703 23.22958    30  a

...which makes it the cumpaste_faster.

Tysontyumen answered 21/7, 2014 at 11:29 Comment(6)
Terrible name for a function, but +1 for the answer (and thanks for bearing with my comments) :-)Broker
@AnandaMahto : Ha, I knew I had to go with "foo", but sometimes you have to call it as you see it :). (or, in this case, as you use it..)Tysontyumen
How should I modify cumpaste2 function if I want to group by two columns? e.g., testdf$id and testdf$id2 ?Sicard
@Sicard : If I understand correctly, you don't need to modify the function. Just group by both columns; e.g. something like ave(x, id, id2, FUN = cumpaste)Tysontyumen
@Tysontyumen ave(x, id, id2, FUN = cumpaste) works but when I use compaste2 function it gives me the subscript out of bound error from the function itself. I'm guessing I need to change substring(concat, 1L, ... ) part..Sicard
@Sicard : I think I've debugged the issue. It seems that adding if(!length(x)) return(x) before all in the function fixes it?Tysontyumen
M
5

data.table solution

library(data.table)
setDT(testdf)[, content2 := sapply(seq_len(.N), function(x) paste(content[seq_len(x)], collapse = " ")), by = id]
testdf

##    id content content2
## 1:  a       A        A
## 2:  a       B      A B
## 3:  a       A    A B A
## 4:  b       B        B
## 5:  b       C      B C
## 6:  b       B    B C B
Melodee answered 21/7, 2014 at 11:33 Comment(2)
Best and simpler solution!Lumper
This is awesome. If you only want to paste the unique values just add unique before content.Bilection
I
4

One option using dplyr and purrr could be:

testdf %>%
 group_by(id) %>%
 transmute(content_concatenated = accumulate(content, ~ paste(.x, .y)))

  id    content_concatenated
  <chr> <chr>               
1 a     A                   
2 a     A B                 
3 a     A B A               
4 b     B                   
5 b     B C                 
6 b     B C B  
Indian answered 2/11, 2019 at 10:22 Comment(0)
H
2

Here's a ddply method using sapply and subsetting to paste together incrementally:

library(plyr)
ddply(testdf, .(id), mutate, content_concatenated = sapply(seq_along(content), function(x) paste(content[seq(x)], collapse = " ")))
  id content content_concatenated
1  a       A                    A
2  a       B                  A B
3  a       A                A B A
4  b       B                    B
5  b       C                  B C
6  b       B                B C B
Heartsome answered 21/7, 2014 at 10:50 Comment(1)
You probably would want to mention where ddply comes fromMelodee
C
2

You may also try dplyr

 library(dplyr)
 res <- testdf%>%
        mutate(n=row_number()) %>%
        group_by(id) %>%
        mutate(n1=n[1L]) %>%
        rowwise() %>% 
        do(data.frame(cont_concat= paste(content[.$n1:.$n],collapse=" "),stringsAsFactors=F))

 res$cont_concat
 #[1] "A"     "A B"   "A B A" "B"     "B C"   "B C B"
Crocodilian answered 21/7, 2014 at 17:23 Comment(2)
This only seems to work because content is manually defined as a vector above. Is there a way to do it all within dplyr without having to refer to objects outside the dataframe?Puga
@ErikShilts Please post as a new questionCrocodilian
N
1

For cumulative functions I recommend runner package with runner function which can apply any algorithm on cumulative window. It can't compete with @alexis_laz solution in terms of speed, but if one needs window of certain size, lag or windows dependent on date - I would suggest to use runner.


id <- c("a", "a", "a", "b", "b", "b")
content <- c("A", "B", "A", "B", "C", "B")
testdf <- data.frame(id, content, stringsAsFactors=FALSE)

library(runner)
library(dplyr)
testdf %>%
  group_by(id) %>%
  mutate(
    result = runner(x = content, 
                    f = function(x) paste(x, collapse = " "),
                    type = "character")) # specify output type - by default numeric

For more go to documentation and vignettes

Negation answered 10/12, 2019 at 17:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.