Efficient way to calculate sum or return NA if all values are NA

During a simulation I created multiple data sets with more than 1,000,000 variables. Some of the values of these variables are NA, and in some cases all of them are. Now I'd like to calculate the sum of each variable's values, but get NA if all of its values are NA.

The problem with the common sum(x, na.rm = TRUE) or sum(na.omit(x)) is that both return 0 if all values are NA. Thus, I've written my own function that deals with NA in the expected way:
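A quick illustration of the problem (base R only, nothing assumed beyond standard functions):

```r
x <- c(NA, NA, NA)
sum(x, na.rm = TRUE)  # 0, not NA: every value gets removed before summing
sum(na.omit(x))       # also 0: na.omit() leaves a zero-length vector
sum(numeric(0))       # the underlying reason: the empty sum is defined as 0
```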

sumna <- function(x) {
  ifelse(all(is.na(x)), NA, sum(na.omit(x)))
}

However, that implementation is rather slow.

Thus, I'm looking for an implementation, or an already existing function, that sums the values of a vector, omits NAs, and returns NA if all values are NA.

Many thanks in advance!

Sancha answered 6/6, 2019 at 7:50 Comment(1)
Can you explain what you mean by 'rather slow'? It takes about 5 - 10 times longer to generate a million random deviates x <- rnorm(1000000) than it does to apply sumna(x), and sumna(x) takes a few milliseconds, so I guess most of your computational time is in the simulations rather than in this step?Epiphenomenon

The sum_ function from hablar has the same behavior the OP wants, so there is no need to reinvent the wheel:

library(hablar)
sum_(c(1:10, NA))
#[1] 55
sum_(c(NA, NA, NA))
#[1] NA

and it can be used in tidyverse or data.table

library(dplyr)
df1 %>%
    summarise_all(sum_)

But if we need to change the OP's custom function, a better option than ifelse is if/else:

sumna <- function(x) {
  if(all(is.na(x))) NA else sum(x, na.rm = TRUE)
}
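which gives the same behavior as sum_ (definition repeated so the example is self-contained):

```r
sumna <- function(x) {
  if(all(is.na(x))) NA else sum(x, na.rm = TRUE)
}
sumna(c(1:10, NA))
#[1] 55
sumna(c(NA, NA, NA))
#[1] NA
```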

Also, we can use the vectorized colSums

v1 <- colSums(df1, na.rm = TRUE)
v1[colSums(is.na(df1)) == nrow(df1)] <- NA
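A small sketch of the two-step colSums approach; df1 here is a hypothetical stand-in for the OP's data:

```r
df1 <- data.frame(a = c(1, NA, 3), b = c(NA, NA, NA))  # hypothetical data
v1 <- colSums(df1, na.rm = TRUE)             # a = 4, b = 0
v1[colSums(is.na(df1)) == nrow(df1)] <- NA   # b had only NAs, so overwrite its 0 with NA
v1
#  a  b 
#  4 NA 
```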

As the dataset is huge, we can also make use of the efficient data.table

library(data.table)
setDT(df1)[, lapply(.SD, sumna)]

Or using tidyverse

library(tidyverse)
df1 %>%
     summarise_all(sumna)
Stealing answered 6/6, 2019 at 7:52 Comment(2)
Thanks a lot, I think this is exactly what I need!Sancha
sum_ is slow? x = rnorm(1000000); microbenchmark(sumna(x), sum_(x)) for your sumna()?Epiphenomenon

You can remove all NAs and test whether the remaining length is greater than 0.

sumna_a <- function(x) {
    x <- na.omit(x)
    if(length(x)) return(sum(x))
    NA
}

#or
sumna_b <- function(x) {
    if(length(na.omit(x))) return(sum(x, na.rm = TRUE))
    NA
}
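Both variants behave the same way; for example (definition repeated so the snippet is self-contained):

```r
sumna_a <- function(x) {
  x <- na.omit(x)
  if(length(x)) return(sum(x))  # at least one non-NA value remained
  NA                            # everything was NA
}
sumna_a(c(2, NA, 5))
#[1] 7
sumna_a(c(NA, NA))
#[1] NA
```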

I think in most cases it should be efficient to simply use sum(x, na.rm = TRUE) and, only when the sum is 0, test afterwards whether all values are NA:

sumna_c <- function(x) {
    ret <- sum(x, na.rm = TRUE)
    if(ret == 0 && all(is.na(x))) {NA} else {ret}
}
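The extra all(is.na(x)) check only runs in the rare case that the sum is exactly 0, which is what distinguishes an all-NA vector from one that genuinely sums to 0 (definition repeated for a self-contained example):

```r
sumna_c <- function(x) {
  ret <- sum(x, na.rm = TRUE)
  if(ret == 0 && all(is.na(x))) {NA} else {ret}
}
sumna_c(c(-1, 1, NA))  # sums to 0 but is not all-NA, so 0 is returned
#[1] 0
sumna_c(c(NA, NA))
#[1] NA
```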
Wendt answered 6/6, 2019 at 9:9 Comment(4)
This is not an answer, please make this post a wiki post.Goodrow
it's faster to use x[!is.na(x)] for simple vectors, rather than na.omit().Epiphenomenon
@MartinMorgan On my PC and my R-Version na.omit() is faster than x[!is.na(x)]Wendt
Ah; try y = rep(NA_real_, 1000000). Also I'm kind of unsure from the original question whether the concern is a million variables, or a million simulations each of a few variables.Epiphenomenon

When comparing the posted methods, it looks like user10488504's function c is currently the most performant for cases where the sum is != 0 and not all values are NA. In case all values are NA, the versions from Anti, RonakShah and akrun are good:

sumna_Anti <- function(x) {
  sumna <- NULL
  return(ifelse(all(is.na(x)), NA, sum(na.omit(x))))
}

sumna_RonakShah <- function(x) {
   c(NA, sum(x, na.rm = TRUE))[(sum(is.na(x)) != length(x)) +1]
}

sumna_akrun <- function(x) {
       if(all(is.na(x))) NA else sum(x, na.rm = TRUE)
   }

sumna_user10488504_a <- function(x) {
    x <- na.omit(x)
    if(length(x)) return(sum(x))
    NA
}

sumna_user10488504_b <- function(x) {
    if(length(na.omit(x))) return(sum(x, na.rm = TRUE))
    NA
}

sumna_user10488504_c <- function(x) {
    ret <- sum(x, na.rm = TRUE)
    if(ret == 0 && all(is.na(x))) {NA} else {ret}
}

set.seed(0)
x <- rnorm(99999)

library(microbenchmark)
microbenchmark(sumna_Anti(x),
               sumna_RonakShah(x),
               sumna_akrun(x),
               sumna_user10488504_a(x),
               sumna_user10488504_b(x),
               sumna_user10488504_c(x)
               )

                    expr     min       lq      mean   median       uq      max neval  cld
           sumna_Anti(x) 307.288 310.0280 390.01838 319.2800 410.2040 2056.284   100    d
      sumna_RonakShah(x) 245.251 247.4715 269.40054 253.1650 259.7850  393.495   100  bc 
          sumna_akrun(x) 165.998 167.5005 209.39315 171.8925 190.8330 1768.761   100  b  
 sumna_user10488504_a(x) 221.275 222.6740 315.93037 229.0330 263.6405 1944.602   100   cd
 sumna_user10488504_b(x) 224.614 225.8170 261.77913 231.2305 234.6465 1934.120   100  bc 
 sumna_user10488504_c(x)  83.367  84.2610  86.16793  84.5900  86.4585  119.629   100 a   


x[sample(1:99999, 100)] <- NA
microbenchmark(sumna_Anti(x),
               sumna_RonakShah(x),
               sumna_akrun(x),
               sumna_user10488504_a(x),
               sumna_user10488504_b(x),
               sumna_user10488504_c(x)
               )

                    expr     min       lq      mean   median        uq      max neval  cld
           sumna_Anti(x) 607.367 628.4000 907.53974 634.3195  692.0845 4205.011   100    d
      sumna_RonakShah(x) 246.992 251.1290 273.80595 254.6195  261.4470  455.446   100  b  
          sumna_akrun(x) 167.058 168.5790 196.13280 170.4125  186.2650  373.708   100 ab  
 sumna_user10488504_a(x) 517.615 539.2940 684.20267 543.6295  582.5330 2360.247   100   c 
 sumna_user10488504_b(x) 523.769 544.6195 869.76645 558.0240 1125.6725 3914.266   100    d
 sumna_user10488504_c(x)  84.142  85.2940  89.04266  86.4255   87.4020  207.624   100 a   

x = rep(NA_real_, 99999)
microbenchmark(sumna_Anti(x),
               sumna_RonakShah(x),
               sumna_akrun(x),
               sumna_user10488504_a(x),
               sumna_user10488504_b(x),
               sumna_user10488504_c(x)
               )

                    expr      min        lq      mean    median        uq       max neval cld
           sumna_Anti(x)  243.848  252.3825  308.2693  260.5285  286.8025  2198.275   100  a 
      sumna_RonakShah(x)  242.706  249.3855  287.1426  258.8390  278.5905  1882.114   100  a 
          sumna_akrun(x)  240.459  244.9125  269.2847  255.1230  274.9215   409.886   100  a 
 sumna_user10488504_a(x) 1241.069 1285.3075 1534.1261 1343.7185 1486.4220  3117.453   100  ab
 sumna_user10488504_b(x) 1244.580 1281.9825 2461.2302 1349.9840 1427.9640 97590.279   100   b
 sumna_user10488504_c(x)  320.084  323.1905  353.9885  340.3555  358.7490   478.537   100  a 
Stuppy answered 6/6, 2019 at 7:50 Comment(0)

You can also subset based on the number of NAs in x:

sumna <- function(x) {
   c(NA, sum(x, na.rm = TRUE))[(sum(is.na(x)) != length(x)) +1]
}

sumna(c(1:10, NA))
#[1] 55
sumna(c(NA, NA, NA))
#[1] NA
sumna(1:5)
#[1] 15
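The trick here is that (sum(is.na(x)) != length(x)) + 1 evaluates to 1 when all values are NA (selecting the NA in position 1 of the two-element vector) and to 2 otherwise (selecting the computed sum):

```r
x <- c(NA, NA, NA)
(sum(is.na(x)) != length(x)) + 1  # FALSE + 1 = 1, so the NA is picked
#[1] 1
y <- c(1:10, NA)
(sum(is.na(y)) != length(y)) + 1  # TRUE + 1 = 2, so sum(y, na.rm = TRUE) is picked
#[1] 2
```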
Rubella answered 6/6, 2019 at 7:58 Comment(0)

You can use colSums. Assuming that you don't have a column consisting only of 0s,

ifelse(colSums(abs(df), na.rm = TRUE) == 0, NA, colSums(df, na.rm = TRUE))
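A small sketch of why abs() is used: a column whose values cancel out to 0 still has a nonzero absolute sum, so only all-NA (or all-zero) columns get flagged as NA. Here df is a hypothetical stand-in for the real data:

```r
df <- data.frame(a = c(-1, 1, NA), b = c(NA, NA, NA))  # hypothetical data
ifelse(colSums(abs(df), na.rm = TRUE) == 0, NA, colSums(df, na.rm = TRUE))
#  a  b 
#  0 NA 
```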
Electrotechnology answered 6/6, 2019 at 7:55 Comment(1)
Could work. But in case some of the numerical values add up to 0, I'd have doubts about using it.Sancha

© 2022 - 2024 — McMap. All rights reserved.