Sum elements with conditions on indices
R

4

8

I have 3 vectors:

x <- c(3, 5, 2)
y <- c(3, 2, 1,  1, 2, 3, 4, 5,  4, 5)
z <- c(2, 4, 8, 1, 5)

x is the number of elements in each group.

y gives indices to extract elements from z. The first three indices belong to group 1 (corresponding to first element in x, 3); the next five indices belong to group 2 (the second element in x, 5), and so on.

z holds the values to be extracted using the indices in y and then summed by group.

For each element in x, I want to sum the elements in z indexed by corresponding indices in y. For example:

i = 1; indices = 3, 2, 1; sum = 8 + 4 + 2

i = 2; indices = 1, 2, 3, 4, 5; sum = 2 + 4 + 8 + 1 + 5

i = 3; indices = 4, 5; sum = 1 + 5

I've thought of creating a new vector with the same length as y whose elements are taken from z at the respective indices, but this loop sits inside a bigger program and it may get too complicated. I'd really appreciate your advice on how to specify conditions on the indices to get the sums.

Thanks!

Rufous answered 15/7, 2022 at 10:19 Comment(0)
C
4

First index z with y to get a vector of elements you want to sum. Then create a group index from x, and use tapply() to sum in each group:

x <- c(3, 5, 2)
y <- c(3, 2, 1, 1, 2, 3, 4, 5, 4, 5)
z <- c(2, 4, 8, 1, 5)

g <- rep(seq_along(x), x)
tapply(z[y], g, sum)
#>  1  2  3 
#> 14 20  6
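
To see why this works, it can help to print the intermediate vectors. A minimal sketch using the question's data (the names `vals` and `sums` are introduced here for illustration only and are not part of the answer above):

```r
x <- c(3, 5, 2)
y <- c(3, 2, 1, 1, 2, 3, 4, 5, 4, 5)
z <- c(2, 4, 8, 1, 5)

# Values of z picked out by the indices in y, in the order given by y
vals <- z[y]
vals
#> [1] 8 4 2 2 4 8 1 5 1 5

# Group label for each position of y: 1 repeated x[1] times, 2 repeated x[2] times, ...
g <- rep(seq_along(x), x)
g
#> [1] 1 1 1 2 2 2 2 2 3 3

# Sum the values within each group label; as.vector() drops the group names
sums <- tapply(vals, g, sum)
as.vector(sums)
#> [1] 14 20  6
```

Note that this assumes sum(x) equals length(y); otherwise tapply() stops with an error because the arguments have different lengths.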
Cynar answered 15/7, 2022 at 10:25 Comment(0)
H
3

A vectorized solution with base R:

diff(c(0, cumsum(z[y])[cumsum(x)]))
#> [1] 14 20  6
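
The one-liner can be unpacked step by step. This is only a sketch on the question's data; the names `running`, `ends` and `totals` are made up here for illustration:

```r
x <- c(3, 5, 2)
y <- c(3, 2, 1, 1, 2, 3, 4, 5, 4, 5)
z <- c(2, 4, 8, 1, 5)

# Running total over all selected values, ignoring group boundaries
running <- cumsum(z[y])
running
#> [1]  8 12 14 16 20 28 29 34 35 40

# Position of the last element of each group within y
ends <- cumsum(x)
ends
#> [1]  3  8 10

# Running total at the end of each group
totals <- running[ends]
totals
#> [1] 14 34 40

# Differences between consecutive group-end totals are the per-group sums
group_sums <- diff(c(0, totals))
group_sums
#> [1] 14 20  6
```

This relies on each group occupying a contiguous run of y, which is how the question defines the groups. With non-integer values in z the result may differ from a direct per-group sum by floating-point rounding, and a leading group of size zero in x would presumably need separate handling because an index of 0 is silently dropped.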

Just for fun, let's benchmark the different solutions proposed using a larger dataset (all integer values so the results of the four functions are identical):

f1 <- function(x, y, z) diff(c(0L, cumsum(z[y])[cumsum(x)]))
f2 <- function(x, y, z) as.integer(tapply(z[y], rep(seq_along(x), x), sum))
f3 <- function(x, y, z) sapply(unname(split(y, rep(1:length(x), x))), function(x) sum(z[x]))
# @Mael's for loop answer modified for speed
f4 <- function(x, y, z) {
  s <- integer(length(x))
  cx <- cumsum(x)
  s[1] <- sum(z[y[seq(x[1])]])
  for(i in 2:length(x)) s[i] <- sum(z[y[seq(cx[i - 1L] + 1L, cx[i])]])
  s
}

x <- sample(10, 1e4, TRUE)
y <- unlist(mapply(function(x) sample(10, x), x))
z <- sample(10)

microbenchmark::microbenchmark(f1(x, y, z),
                               f2(x, y, z),
                               f3(x, y, z),
                               f4(x, y, z),
                               check = "equal")
#> Unit: microseconds
#>         expr     min       lq      mean   median       uq     max neval
#>  f1(x, y, z)   221.7   237.35   269.056   246.75   259.25  1621.3   100
#>  f2(x, y, z)  8659.5  8966.05  9436.873  9097.60  9551.10 13567.7   100
#>  f3(x, y, z)  9960.1 10746.35 11759.939 11030.05 12242.85 43611.5   100
#>  f4(x, y, z) 47574.7 50506.10 51927.481 51728.70 53416.30 58262.5   100
Hyden answered 15/7, 2022 at 20:51 Comment(1)
Nice way to take advantage of the inherent ordering in the question! Your answer reminded me of this series of blog posts with more about these sorts of group statistics optimizations: brodieg.com/2019/06/10/base-vs-data-table/#group-sums (Cynar)
C
1

With a for loop:

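s <- c()
# seq_along(x) is safe even when x has length 1 (seq(x) would expand to 1:x)
for(i in seq_along(x)){
  if(i == 1){
    # positions in y that belong to the first group
    idx <- seq_len(x[i])
  } else {
    # positions from just after the previous group's end to this group's end
    idx <- seq(from = cumsum(x)[i - 1] + 1, to = cumsum(x)[i])
  }
  s <- c(s, sum(z[y[idx]]))
}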

output

> s
# [1] 14 20  6
Cremate answered 15/7, 2022 at 10:28 Comment(0)
D
1

Another possible solution, in base R:

sapply(unname(split(y, rep(1:length(x), x))), \(x) sum(z[x]))

#> [1] 14 20  6
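
For readers who want to see the intermediate list, here is a hedged variant of the same idea. It swaps sapply() for vapply() so the return type is checked; that swap, and the names `groups` and `res`, are my own choices and not part of the answer above:

```r
x <- c(3, 5, 2)
y <- c(3, 2, 1, 1, 2, 3, 4, 5, 4, 5)
z <- c(2, 4, 8, 1, 5)

# Split the indices in y into one vector per group
groups <- split(y, rep(seq_along(x), x))

# Each list element should have as many indices as the matching entry of x
stopifnot(all(lengths(groups) == x))

# For each group, look up its values in z and sum them;
# numeric(1) declares that every call returns a single number
res <- vapply(groups, function(idx) sum(z[idx]), numeric(1))
unname(res)
#> [1] 14 20  6
```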

Drysalt answered 15/7, 2022 at 10:42 Comment(0)
