Remove trailing (last) rows with NAs in all columns
Asked Answered
F

4

6

I am trying to exclude rows have missing values (NA) in all columns for that row AND for which all subsequent rows have only missing values (or is the last empty row itself), i.e. I want to remove trailing "all-NA" rows.

I came up with the solution below, which works but is too slow (I am using this function on thousands of tables), probably because of the while loop.

## Aux function to remove NA rows below table
remove_empty_row_last <- function(dt){
  dt[ , row_empty := rowSums(is.na(dt)) == ncol(dt)] 
  while (dt[.N, row_empty] == TRUE) {
    dt <- dt[1:(.N-1)]
    
  }
  dt %>% return()
}

d <- data.table(a = c(1,NA,3,NA,5,NA,NA), b = c(1,NA,3,4,5,NA,NA))
remove_empty_row_last(d)

#EDIT2: adding more test cases
d2 <- data.table(A = c(1,NA,3,NA,5,1 ,NA), B = c(1,NA,3,4,5,NA,NA))
remove_empty_row_last(d2)
d3 <- data.table(A = c(1,NA,3,NA,5,NA,NA), B = c(1,NA,3,4,5,1,NA))
remove_empty_row_last(d3)

#Edit3:adding no NA rows test case
d4 <- data.table(A = c(1,2,3,NA,5,NA,NA), B = c(1,2,3,4,5,1,7))
d4 %>% remove_empty_row_last()
Falsehood answered 12/1, 2021 at 17:20 Comment(2)
You should try more to provide neat example, dt %>% return() is really bad.Lordan
Don't know about speed, but worth mentioning in this context is zoo::na.trim: na.trim(d, is.na = "all", sides = "right")Henkel
B
6

This seems to work with all test cases.
The idea is to use a reverse cumsum to filter out the NA rows at the end.

library(data.table)

remove_empty_row_last_new <- function(d) {
  d[d[,is.na(rev(cumsum(rev(ifelse(rowSums(!is.na(.SD))==0,1,NA)))))]]
}

d <- data.table(a=c(1,NA,3,NA,5,NA,NA),b=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_new(d)
#>     a  b
#> 1:  1  1
#> 2: NA NA
#> 3:  3  3
#> 4: NA  4
#> 5:  5  5

d2 <- data.table(A=c(1,NA,3,NA,5,1 ,NA),B=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_new(d2)
#>     A  B
#> 1:  1  1
#> 2: NA NA
#> 3:  3  3
#> 4: NA  4
#> 5:  5  5
#> 6:  1 NA

d3 <- data.table(A=c(1,NA,3,NA,5,NA,NA),B=c(1,NA,3,4,5,1,NA))
remove_empty_row_last_new(d3)
#>     A  B
#> 1:  1  1
#> 2: NA NA
#> 3:  3  3
#> 4: NA  4
#> 5:  5  5
#> 6: NA  1

d4 <- data.table(A=c(1,2,3,NA,5,NA,NA),B=c(1,2,3,4,5,1,7))
remove_empty_row_last_new(d4)
#>     A B
#> 1:  1 1
#> 2:  2 2
#> 3:  3 3
#> 4: NA 4
#> 5:  5 5
#> 6: NA 1
#> 7: NA 7

You'll have to check performance on your real dataset, but it seems a bit faster :

> microbenchmark::microbenchmark(remove_empty_row_last(d),remove_empty_row_last_new(d))
Unit: microseconds
                         expr     min      lq     mean  median       uq      max neval cld
     remove_empty_row_last(d) 384.701 411.800 468.5251 434.251 483.7515 1004.401   100   b
 remove_empty_row_last_new(d) 345.201 359.301 416.1650 382.501 450.5010 1104.401   100  a 
Brundisium answered 30/1, 2021 at 10:12 Comment(1)
Great idea. This works and does not produce the errors found before in the private data. Performance is also a bit better, 20s instead of 26s on a list with 7k nested tables (100rows x 10cols each).Falsehood
B
5

Maybe this will be fast enough?

d[!d[,any(rowSums(is.na(.SD)) == ncol(.SD)) & rleid(rowSums(is.na(.SD)) == ncol(.SD)) == max(rleid(rowSums(is.na(.SD)) == ncol(.SD))),]]
    a  b
1:  1  1
2: NA NA
3:  3  3
4: NA  4
5:  5  5
Bash answered 12/1, 2021 at 17:29 Comment(4)
That should do it.Bash
thank you for helping. The current version of your answer still fails 2 test cases (for d2 and d3, see edit 2)Falsehood
thank you for updating again. Still failing for the case with no NA lines (see d4, from edit3 above)Falsehood
This is super fast BTWFalsehood
D
5

Here's another approach that relies on .

library(Rcpp)
library(data.table)

Rcpp::cppFunction("
IntegerVector which_end_cont(LogicalVector x) {
  const int n = x.size();
  int consecutive = 0;
  
  for (int i = n - 1; i >= 0; i--) {
    if (x[i]) consecutive++; else break;
  }
  IntegerVector out(consecutive);
  if (consecutive == 0) 
    return(out);
  else
    return(seq(1, n - consecutive));
}
")

remove_empty_row_last3 <- function(dt) {
  lgl = rowSums(is.na(dt)) == length(dt)
  ind = which_end_cont(lgl)
  if (length(ind)) return(dt[ind]) else return(dt)
}

Basically, it

  1. uses R to find out which rows are completely NA.
  2. it uses to loop through the logical vector to determine how many consecutive empty rows there are at the end. Using allows us to minimize the memory allocated.
  3. If there are no rows empty at the end, we prevent allocating memory by just returning the input . Otherwise, we allocate the sequence in and return it to subset the data.table.

Using , this is about 3 times faster for cases in which there are empty rows at the end and about 15 times faster in which there are no empty rows.

Edit

If you have taken the time to add , the nice thing is that has exported some of their internal functions so that they can be called directly from C. That can further simplify things and make it very, very quick, mainly because we can skip the NSE performed during [data.table which is why all conditions are now ~15 times faster than the OP original function.

Rcpp::cppFunction("
SEXP mysub2(SEXP dt, LogicalVector x) {
const int n = x.size();
int consecutive = 0;
  
  for (int i = n - 1; i >= 0; i--) {
    if (x[i]) consecutive++; else break;
  }
  if (consecutive == 0) 
    return(dt);
  else
    return(DT_subsetDT(dt, wrap(seq(1, n - consecutive)), wrap(seq_len(LENGTH(dt)))));
}",
                  include="#include <datatableAPI.h>",
                  depends="data.table")

remove_empty_row_last4 <- function(dt) {
  lgl = rowSums(is.na(dt)) == length(dt)
  return(mysub2(dt, lgl))
}

dt = copy(d)
dt2 = copy(d2)
dt3 = copy(d3)
dt4 = copy(d4)
microbenchmark::microbenchmark(original = remove_empty_row_last(d3),
                               rcpp_subset = remove_empty_row_last4(dt3), 
                               rcpp_ind_only = remove_empty_row_last3(dt3),
                               waldi = remove_empty_row_last_new(dt3),
                               ian = dt3[!dt3[,any(rowSums(is.na(.SD)) == ncol(.SD)) & rleid(rowSums(is.na(.SD)) == ncol(.SD)) == max(rleid(rowSums(is.na(.SD)) == ncol(.SD))),]])


## Unit: microseconds
##           expr   min     lq    mean median     uq   max neval
##       original 498.0 519.00 539.602 537.65 551.85 621.6   100
##    rcpp_subset  34.0  39.95  43.422  43.30  46.70  59.0   100
##  rcpp_ind_only 116.9 129.75 139.943 140.15 146.35 177.7   100
##          waldi 370.9 387.70 408.910 400.55 417.90 683.4   100
##            ian 432.0 445.30 461.310 456.25 473.35 554.1   100
##         andrew 120.0 131.40 143.153 141.60 151.65 197.5   100
Deil answered 30/1, 2021 at 16:26 Comment(2)
FYI: @JanGorecki about your PR at work. See Jan's pending PR about data.table C exports here: github.com/Rdatatable/data.table/pull/4753Deil
nice! OP here. I can confirm the 10x speed gain in my data. After parsing subtables I now have a list of 6k small tables with trailling NAs. Waldi`s answer takes 7.7s and yours 0,78s.Falsehood
D
1

I am late to the party but here is another option that should be relatively memory efficient and only uses base R.

library(data.table)

d <- data.table(a=c(1,NA,3,NA,5,NA,NA),b=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_andrew(d)
#>     a  b
#> 1:  1  1
#> 2: NA NA
#> 3:  3  3
#> 4: NA  4
#> 5:  5  5

d2 <- data.table(A=c(1,NA,3,NA,5,1 ,NA),B=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_andrew(d2)
#>     A  B
#> 1:  1  1
#> 2: NA NA
#> 3:  3  3
#> 4: NA  4
#> 5:  5  5
#> 6:  1 NA

d3 <- data.table(A=c(1,NA,3,NA,5,NA,NA),B=c(1,NA,3,4,5,1,NA))
remove_empty_row_last_andrew(d3)
#>     A  B
#> 1:  1  1
#> 2: NA NA
#> 3:  3  3
#> 4: NA  4
#> 5:  5  5
#> 6: NA  1

d4 <- data.table(A=c(1,2,3,NA,5,NA,NA),B=c(1,2,3,4,5,1,7))
remove_empty_row_last_andrew(d4)
#>     A B
#> 1:  1 1
#> 2:  2 2
#> 3:  3 3
#> 4: NA 4
#> 5:  5 5
#> 6: NA 1
#> 7: NA 7

Created on 2021-02-01 by the reprex package (v0.3.0)

Function:

remove_empty_row_last_andrew = function(x) {
  idx = do.call(pmin.int, lapply(x, is.na))
  length_idx = length(idx)
  
  if(idx[length_idx] == 0) {
    return(x)
  }
  
  start_idx = length_idx - which.min(idx[length_idx:1L]) + 2
  
  x = x[-(start_idx:length_idx), ]
  x
}
Distinguish answered 1/2, 2021 at 21:29 Comment(2)
Very clever. For data.frames that do not end with empty rows, this is the fastest of the answers.Deil
Thanks @Cole, always impressed by your rcpp answers (including this one)!Distinguish

© 2022 - 2024 — McMap. All rights reserved.