Summarise over all columns
Asked Answered
N

3

10

I have data of the following format:

gen = function () sample.int(10, replace = TRUE)
x = data.frame(A = gen(), C = gen(), G = gen(), T = gen())

I would now like to attach, to each row, the total sum of all the elements in the row (my actual function is more complex but sum illustrates the problem).

Without dplyr, I’d write

cbind(x, Sum = apply(x, 1, sum))

Resulting in:

   A C  G T Sum
1  3 1  6 9  19
2  3 4  3 3  13
3  3 1 10 5  19
4  7 2  1 6  16
…

But it seems surprisingly hard to do this with dplyr.

I’ve tried

x %>% rowwise() %>% mutate(Sum = sum(A : T))

But the result is not the sum of the columns of each row, it’s something unexpected and (to me) inexplicable.

I’ve also tried

x %>% rowwise() %>% mutate(Sum = sum(.))

But here, . is simply a placeholder for the whole x. Providing no argument does, unsurprisingly, also not work (results are all 0). Needless to say, none of these variants works without rowwise(), either.

(There isn’t really any reason to necessarily do this in dplyr, but (a) I’d like to keep my code as uniform as possible, and jumping between different APIs doesn’t help; and (b) I’m hoping to one day get automatic and free parallelisation of such commands in dplyr.)

Nikolai answered 22/1, 2015 at 17:54 Comment(6)
If sum is just for illustration it's perhaps a bad choice because there is a special function for that, as shown in the answer below. For other functions it might be more dplyr-like to gather your columns of interest, compute the values you are interested in and then spread (using for example tidyr). That's my interpretation at least..Encourage
I'm wondering if some like this library(data.table) ; setDT(x)[, Sum := Reduce("+", .SD)][] would be of any use...Lodestar
@DavidArenburg, nice approach which works the same way with dplyr. How would you use it with a different function, for example the one from my answer? (I think the difference is that + is a binary function taking 2 inputs which can then be applied / reduced multiple times while f from my answer takes a whole vector at once..)Encourage
@docendodiscimus, this is why I've posted this in comments. We don't have Konrads real function so it may will be that it will work with Reduce too. Will wait and see what he says.Lodestar
@DavidArenburg, sure - would certainly be interesting to know a bit more about the function input/output design.Encourage
@David I wanted to have a general solution, since this problem crops up for me again and again, with different functions. In my current case, the function isn’t amenable to Reduce – it calculates the GC bias from a frequency table of codons. Here’s an implementation: gist.github.com/klmr/4898c3eb1a5216850134Nikolai
S
5

I once did something similar, and by that time I ended up with:

x %>%
  rowwise() %>%
  do(data.frame(., res = sum(unlist(.))))
#    A  C G  T res
# 1  3  2 8  6  19
# 2  6  1 7 10  24
# 3  4  8 6  7  25
# 4  6  4 7  8  25
# 5  6 10 7  2  25
# 6  7  1 2  2  12
# 7  5  4 8  5  22
# 8  9  2 3  2  16
# 9  3  4 7  6  20
# 10 7  5 3  9  24

Perhaps your more complex function works fine without unlist, but it seems like it is necessary for sum. Because . refers to the "current group", I initially thought that . for e.g. the first row in the rowwise machinery would correspond to x[1, ], which is a list, which sum swallows happily outside do

is.list((x[1, ]))
# [1] TRUE

sum(x[1, ])
# [1] 19 

However, without unlist in do an error is generated, and I am not sure why:

x %>%
  rowwise() %>%
  do(data.frame(., res = sum(.)))
# Error in sum(.) : invalid 'type' (list) of argument
Signorina answered 22/1, 2015 at 22:0 Comment(4)
Nice and more succinct approach! (+1) OTOH I think calling do and data.frame for each row might also be costly (perhaps it would be slightly more efficient with data_frame?)Encourage
Thanks for your feedback! For my original use I didn't need to count microseconds, it was enough that it eventually worked... Yes, I have tried data_frame, but it doesn't return the desired result.Signorina
Huh. This works. So . is something for do (= the current group/row) than for mutate (= the whole table). Amazing. In my case, I also don’t need the unlist, but then, my function also accepts a one-row data.frame so that’s probably why. And yes, it’s quite slow in my case, but so is the apply approach, and my actual function is simply hellishly inefficient and not at all optimised.Nikolai
@KonradRudolph I have tried to find an official text on the use of ., in addition to what's written in ?do and ?rowwise. Perhaps the comment by @hadley here ("The pronoun is supplied by either %>% or do() so it won't with arbitrary dplyr functions.") suggests that . is most happy together with do? Quite a lot is happening between versions though, so I am not sure about the current status of the friend and foes of .Signorina
P
3

Does this do what you'd like?

Data %>%
   mutate(SumVar=rowSums(.))
Painful answered 22/1, 2015 at 18:1 Comment(1)
It probably does but the actual function I’m using isn’t vectorised.Nikolai
E
1

I'll try to show an example of what I wrote in my comment. Let's assume you have a custom-function f:

f <- function(vec) sum(vec)^2

And you want to apply this function to each row of your data.frame x. One option in base R would be to use apply, as you show in your question:

> transform(x, z = apply(x, 1, f))
#   A  C  G T   z
#1  5  7 10 7 841
#2  1  9  5 9 576
#3  7 10  2 4 529
#4  1  4 10 1 256
#5  4  4  5 2 225
#6  9  1  6 8 576
#7  9  3  7 1 400
#8  5  2  7 5 361
#9  6  3 10 4 529
#10 5 10  1 6 484

Little disadvantage here is, because you are using apply on a data.frame, the whole data.frame is converted to matrix first and this would mean of course that all columns are converted to the same type.

With dplyr (and tidyr) you could solve the problem with gathering/melting and spreading/casting afterwards.

library(dplyr)
library(tidyr)
x %>% 
  mutate(n = row_number()) %>%    # add row numbers for grouping 
  gather(key, value, A:T) %>%
  group_by(n) %>% 
  mutate(z = f(value)) %>%
  ungroup() %>%
  spread(key, value) %>%
  select(-n)

#Source: local data frame [10 x 5]
#
#     z A  C  G T
#1  841 5  7 10 7
#2  576 1  9  5 9
#3  529 7 10  2 4
#4  256 1  4 10 1
#5  225 4  4  5 2
#6  576 9  1  6 8
#7  400 9  3  7 1
#8  361 5  2  7 5
#9  529 6  3 10 4
#10 484 5 10  1 6

This is obviously quite a bit longer code than using apply but as soon as the data get a bit larger, I expect this to be a lot faster than any apply over the rows of a data.frame.

Alternatively, you could use rowwise if you specify the columns manually:

x %>%
  rowwise %>%
  mutate(z = f(c(A,C,G,T)))  # manual column specification

#Source: local data frame [10 x 5]
#Groups: <by row>
# 
#  A  C  G T   z
#1  5  7 10 7 841
#2  1  9  5 9 576
#3  7 10  2 4 529
#4  1  4 10 1 256
#5  4  4  5 2 225
#6  9  1  6 8 576
#7  9  3  7 1 400
#8  5  2  7 5 361
#9  6  3 10 4 529
#10 5 10  1 6 484

I haven't figured out yet, if the rowwise solution can be changed so that it would work with character input of the column names - perhaps with lazyeval somehow.

data:

set.seed(16457)
gen = function () sample.int(10, replace = TRUE)
x = data.frame(A = gen(), C = gen(), G = gen(), T = gen())
Encourage answered 22/1, 2015 at 18:56 Comment(2)
To be honest, the melting solution looks incredibly inefficient: Melting and spreading doesn’t come free, it involves a lot of (in this case) gratuitous copying. Manually specifying the columns isn’t feasible in my case either, since my actual data has 61 columns, not four (well, I could obviously use mutate_ to work around that). I get the impression that this is a missing verb in dplyr … that said, thanks for this thorough answer.Nikolai
@KonradRudolph, You are right about the efficiency concerns (I made a small benchmark). Perhaps someone will come up with a cleaner and more efficient alternative - I'll also be interested in learning thatEncourage

© 2022 - 2024 — McMap. All rights reserved.