Remove trailing (last) rows with NAs in all columns

Asked 12/1, 2021 at 17:20 Answered 1/2, 2021 at 21:29

Solved r data.table subset na missing-data

I am trying to exclude rows have missing values (NA) in all columns for that row AND for which all subsequent rows have only missing values (or is the last empty row itself), i.e. I want to remove trailing "all-NA" rows.

I came up with the solution below, which works but is too slow (I am using this function on thousands of tables), probably because of the while loop.

## Aux function to remove NA rows below table
remove_empty_row_last <- function(dt){
  dt[ , row_empty := rowSums(is.na(dt)) == ncol(dt)] 
  while (dt[.N, row_empty] == TRUE) {
    dt <- dt[1:(.N-1)]
    
  }
  dt %>% return()
}

d <- data.table(a = c(1,NA,3,NA,5,NA,NA), b = c(1,NA,3,4,5,NA,NA))
remove_empty_row_last(d)

#EDIT2: adding more test cases
d2 <- data.table(A = c(1,NA,3,NA,5,1 ,NA), B = c(1,NA,3,4,5,NA,NA))
remove_empty_row_last(d2)
d3 <- data.table(A = c(1,NA,3,NA,5,NA,NA), B = c(1,NA,3,4,5,1,NA))
remove_empty_row_last(d3)

#Edit3:adding no NA rows test case
d4 <- data.table(A = c(1,2,3,NA,5,NA,NA), B = c(1,2,3,4,5,1,7))
d4 %>% remove_empty_row_last()

Falsehood answered 12/1, 2021 at 17:20 Comment(2)

You should try more to provide neat example, dt %>% return() is really bad. – Lordan 30/1, 2021 at 12:16

Don't know about speed, but worth mentioning in this context is zoo::na.trim: na.trim(d, is.na = "all", sides = "right") – Henkel 1/2, 2021 at 23:20

This seems to work with all test cases.
The idea is to use a reverse cumsum to filter out the NA rows at the end.

library(data.table)

remove_empty_row_last_new <- function(d) {
  d[d[,is.na(rev(cumsum(rev(ifelse(rowSums(!is.na(.SD))==0,1,NA)))))]]
}

d <- data.table(a=c(1,NA,3,NA,5,NA,NA),b=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_new(d)
#>     a  b
#> 1:  1  1
#> 2: NA NA
#> 3:  3  3
#> 4: NA  4
#> 5:  5  5

d2 <- data.table(A=c(1,NA,3,NA,5,1 ,NA),B=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_new(d2)
#>     A  B
#> 1:  1  1
#> 2: NA NA
#> 3:  3  3
#> 4: NA  4
#> 5:  5  5
#> 6:  1 NA

d3 <- data.table(A=c(1,NA,3,NA,5,NA,NA),B=c(1,NA,3,4,5,1,NA))
remove_empty_row_last_new(d3)
#>     A  B
#> 1:  1  1
#> 2: NA NA
#> 3:  3  3
#> 4: NA  4
#> 5:  5  5
#> 6: NA  1

d4 <- data.table(A=c(1,2,3,NA,5,NA,NA),B=c(1,2,3,4,5,1,7))
remove_empty_row_last_new(d4)
#>     A B
#> 1:  1 1
#> 2:  2 2
#> 3:  3 3
#> 4: NA 4
#> 5:  5 5
#> 6: NA 1
#> 7: NA 7

You'll have to check performance on your real dataset, but it seems a bit faster :

> microbenchmark::microbenchmark(remove_empty_row_last(d),remove_empty_row_last_new(d))
Unit: microseconds
                         expr     min      lq     mean  median       uq      max neval cld
     remove_empty_row_last(d) 384.701 411.800 468.5251 434.251 483.7515 1004.401   100   b
 remove_empty_row_last_new(d) 345.201 359.301 416.1650 382.501 450.5010 1104.401   100  a

Brundisium answered 30/1, 2021 at 10:12 Comment(1)

Great idea. This works and does not produce the errors found before in the private data. Performance is also a bit better, 20s instead of 26s on a list with 7k nested tables (100rows x 10cols each). – Falsehood 30/1, 2021 at 15:34

Maybe this will be fast enough?

d[!d[,any(rowSums(is.na(.SD)) == ncol(.SD)) & rleid(rowSums(is.na(.SD)) == ncol(.SD)) == max(rleid(rowSums(is.na(.SD)) == ncol(.SD))),]]
    a  b
1:  1  1
2: NA NA
3:  3  3
4: NA  4
5:  5  5

Bash answered 12/1, 2021 at 17:29 Comment(4)

That should do it. – Bash 12/1, 2021 at 18:23

thank you for helping. The current version of your answer still fails 2 test cases (for d2 and d3, see edit 2) – Falsehood 12/1, 2021 at 18:36

thank you for updating again. Still failing for the case with no NA lines (see d4, from edit3 above) – Falsehood 12/1, 2021 at 19:46

This is super fast BTW – Falsehood 12/1, 2021 at 19:46

Here's another approach that relies on rcpp.

library(Rcpp)
library(data.table)

Rcpp::cppFunction("
IntegerVector which_end_cont(LogicalVector x) {
  const int n = x.size();
  int consecutive = 0;
  
  for (int i = n - 1; i >= 0; i--) {
    if (x[i]) consecutive++; else break;
  }
  IntegerVector out(consecutive);
  if (consecutive == 0) 
    return(out);
  else
    return(seq(1, n - consecutive));
}
")

remove_empty_row_last3 <- function(dt) {
  lgl = rowSums(is.na(dt)) == length(dt)
  ind = which_end_cont(lgl)
  if (length(ind)) return(dt[ind]) else return(dt)
}

Basically, it

uses R to find out which rows are completely NA.
it uses rcpp to loop through the logical vector to determine how many consecutive empty rows there are at the end. Using rcpp allows us to minimize the memory allocated.
If there are no rows empty at the end, we prevent allocating memory by just returning the input rcpp. Otherwise, we allocate the sequence in rcpp and return it to subset the data.table.

Using microbenchmark, this is about 3 times faster for cases in which there are empty rows at the end and about 15 times faster in which there are no empty rows.

Edit

If you have taken the time to add rcpp, the nice thing is that data.table has exported some of their internal functions so that they can be called directly from C. That can further simplify things and make it very, very quick, mainly because we can skip the NSE performed during [data.table which is why all conditions are now ~15 times faster than the OP original function.

Rcpp::cppFunction("
SEXP mysub2(SEXP dt, LogicalVector x) {
const int n = x.size();
int consecutive = 0;
  
  for (int i = n - 1; i >= 0; i--) {
    if (x[i]) consecutive++; else break;
  }
  if (consecutive == 0) 
    return(dt);
  else
    return(DT_subsetDT(dt, wrap(seq(1, n - consecutive)), wrap(seq_len(LENGTH(dt)))));
}",
                  include="#include <datatableAPI.h>",
                  depends="data.table")

remove_empty_row_last4 <- function(dt) {
  lgl = rowSums(is.na(dt)) == length(dt)
  return(mysub2(dt, lgl))
}

dt = copy(d)
dt2 = copy(d2)
dt3 = copy(d3)
dt4 = copy(d4)
microbenchmark::microbenchmark(original = remove_empty_row_last(d3),
                               rcpp_subset = remove_empty_row_last4(dt3), 
                               rcpp_ind_only = remove_empty_row_last3(dt3),
                               waldi = remove_empty_row_last_new(dt3),
                               ian = dt3[!dt3[,any(rowSums(is.na(.SD)) == ncol(.SD)) & rleid(rowSums(is.na(.SD)) == ncol(.SD)) == max(rleid(rowSums(is.na(.SD)) == ncol(.SD))),]])


## Unit: microseconds
##           expr   min     lq    mean median     uq   max neval
##       original 498.0 519.00 539.602 537.65 551.85 621.6   100
##    rcpp_subset  34.0  39.95  43.422  43.30  46.70  59.0   100
##  rcpp_ind_only 116.9 129.75 139.943 140.15 146.35 177.7   100
##          waldi 370.9 387.70 408.910 400.55 417.90 683.4   100
##            ian 432.0 445.30 461.310 456.25 473.35 554.1   100
##         andrew 120.0 131.40 143.153 141.60 151.65 197.5   100

Deil answered 30/1, 2021 at 16:26 Comment(2)

FYI: @JanGorecki about your PR at work. See Jan's pending PR about data.table C exports here: github.com/Rdatatable/data.table/pull/4753 – Deil 30/1, 2021 at 19:41

nice! OP here. I can confirm the 10x speed gain in my data. After parsing subtables I now have a list of 6k small tables with trailling NAs. Waldi`s answer takes 7.7s and yours 0,78s. – Falsehood 1/2, 2021 at 19:50

I am late to the party but here is another option that should be relatively memory efficient and only uses base R.

library(data.table)

d <- data.table(a=c(1,NA,3,NA,5,NA,NA),b=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_andrew(d)
#>     a  b
#> 1:  1  1
#> 2: NA NA
#> 3:  3  3
#> 4: NA  4
#> 5:  5  5

d2 <- data.table(A=c(1,NA,3,NA,5,1 ,NA),B=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_andrew(d2)
#>     A  B
#> 1:  1  1
#> 2: NA NA
#> 3:  3  3
#> 4: NA  4
#> 5:  5  5
#> 6:  1 NA

d3 <- data.table(A=c(1,NA,3,NA,5,NA,NA),B=c(1,NA,3,4,5,1,NA))
remove_empty_row_last_andrew(d3)
#>     A  B
#> 1:  1  1
#> 2: NA NA
#> 3:  3  3
#> 4: NA  4
#> 5:  5  5
#> 6: NA  1

d4 <- data.table(A=c(1,2,3,NA,5,NA,NA),B=c(1,2,3,4,5,1,7))
remove_empty_row_last_andrew(d4)
#>     A B
#> 1:  1 1
#> 2:  2 2
#> 3:  3 3
#> 4: NA 4
#> 5:  5 5
#> 6: NA 1
#> 7: NA 7

^{Created on 2021-02-01 by the reprex package (v0.3.0)}

Function:

remove_empty_row_last_andrew = function(x) {
  idx = do.call(pmin.int, lapply(x, is.na))
  length_idx = length(idx)
  
  if(idx[length_idx] == 0) {
    return(x)
  }
  
  start_idx = length_idx - which.min(idx[length_idx:1L]) + 2
  
  x = x[-(start_idx:length_idx), ]
  x
}

Distinguish answered 1/2, 2021 at 21:29 Comment(2)

Very clever. For data.frames that do not end with empty rows, this is the fastest of the answers. – Deil 2/2, 2021 at 1:2

Thanks @Cole, always impressed by your rcpp answers (including this one)! – Distinguish 2/2, 2021 at 2:26

Recommended topics

Hot tags