Efficient use of a list for filtering in `dplyr`
Asked Answered
L

3

6

My filter_list has a large number of elements. The filtering below works but how would one make the dplyr::filter more concise?

I couldn't make all_of work.

filter_list <- list(
  hair_color = c("blond", "brown"),
  skin_color = "light"
)

dplyr::starwars |> 
  dplyr::filter(
    hair_color %in% filter_list[["hair_color"]],
    skin_color %in% filter_list[["skin_color"]]
  )
Laudanum answered 27/8, 2024 at 21:28 Comment(1)
Here's an incomplete solution, since it doesn't generalize without expanding to create all combinations of data, but maybe in some cases something like this could suffice: library(tidyverse); dplyr::starwars |> inner_join(filter_list |> as.data.frame()). It would probably be infeasible if there are large number of combinations.Larrigan
T
9

We could use reduce2 to iteratively apply filter statements, e.g.:

library(purrr); library(dplyr)

out <- starwars |> 
  reduce2(
    .x = filter_list, .y = names(filter_list), .init = _,
    .f = \(df, x, y) filter(df, .data[[y]] %in% x)
  )
# A tibble: 8 × 14
  name     height  mass hair_color skin_color eye_color birth_year sex   gender homeworld species films vehicles
  <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  <chr>     <chr>   <lis> <list>  
1 Leia Or…    150    49 brown      light      brown             19 fema… femin… Alderaan  Human   <chr> <chr>   
2 Beru Wh…    165    75 brown      light      blue              47 fema… femin… Tatooine  Human   <chr> <chr>   
3 Padmé A…    185    45 brown      light      brown             46 fema… femin… Naboo     Human   <chr> <chr>   
4 Cordé       157    NA brown      light      brown             NA NA    NA     Naboo     NA      <chr> <chr>   
5 Dormé       165    NA brown      light      brown             NA fema… femin… Naboo     Human   <chr> <chr>   
6 Raymus …    188    79 brown      light      brown             NA male  mascu… Alderaan  Human   <chr> <chr>   
7 Rey          NA    NA brown      light      hazel             NA fema… femin… NA        Human   <chr> <chr>   
8 Poe Dam…     NA    NA brown      light      brown             NA male  mascu… NA        Human   <chr> <chr>

Check if correct:

all.equal(
  out, 
  dplyr::starwars |> 
    dplyr::filter(
      hair_color %in% filter_list[["hair_color"]],
      skin_color %in% filter_list[["skin_color"]]
    )
)
Theran answered 27/8, 2024 at 21:59 Comment(0)
F
7

You can try rowMeans + mapply like below

starwars %>%
  filter(
    rowMeans(mapply(`%in%`, select(., names(filter_list)), filter_list)) == 1
  )

or Reduce + Map

starwars %>%
  filter(Reduce(`&`, Map(`%in%`, select(., names(filter_list)), filter_list)))

or just a base R combo subset + Reduce + Map

subset(starwars, Reduce(`&`, Map(`%in%`, starwars[names(filter_list)], filter_list)))

which gives

# A tibble: 8 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
2 Beru Whi…    165    75 brown      light      blue              47 fema… femin…
3 Padmé Am…    185    45 brown      light      brown             46 fema… femin…
4 Cordé        157    NA brown      light      brown             NA NA    NA
5 Dormé        165    NA brown      light      brown             NA fema… femin…
6 Raymus A…    188    79 brown      light      brown             NA male  mascu…
7 Rey           NA    NA brown      light      hazel             NA fema… femin…
8 Poe Dame…     NA    NA brown      light      brown             NA male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Benchmark

If "Efficiency" (in the title) refers to the speed, you can check it out the benchmarking here

axeman <- \() {
  starwars |>
    reduce2(
      .x = filter_list, .y = names(filter_list), .init = _,
      .f = \(df, x, y) filter(df, .data[[y]] %in% x)
    )
}

i_o <- \() {
  names(filter_list) |>
    Map(f = \(varname) starwars |> filter(.data[[varname]] %in% filter_list[[varname]])) |>
    Reduce(f = \(stack, piece) inner_join(stack, piece))
}

tic1 <- \() {
  starwars %>%
    filter(
      rowMeans(mapply(`%in%`, select(., names(filter_list)), filter_list)) == 1
    )
}

tic2 <- \() {
  starwars %>%
    filter(Reduce(`&`, Map(`%in%`, select(., names(filter_list)), filter_list)))
}

tic3 <- \() {
  subset(starwars, Reduce(`&`, Map(`%in%`, starwars[names(filter_list)], filter_list)))
}

microbenchmark(
  axeman(),
  i_o(),
  tic1(),
  tic2(),
  tic3(),
  unit = "relative",
  check = "equal"
)

which shows

Unit: relative
     expr       min         lq      mean    median        uq       max neval
 axeman()  11.98158   9.977999  9.679677  10.74786  9.652521  4.009427   100
    i_o() 172.43091 130.316298 96.607907 121.01399 96.094325 15.142344   100
   tic1()  12.45654  11.237299 11.433905  12.15965 12.796552  2.417425   100
   tic2()  12.14343  10.864622 10.723350  11.55505 11.580282  4.656169   100
   tic3()   1.00000   1.000000  1.000000   1.00000  1.000000  1.000000   100
Fronton answered 27/8, 2024 at 22:43 Comment(4)
tic2 could be shortened to starwars %>% filter(!!!Map(%in%, select(., names(filter_list)), filter_list)) It is shorter but not as fast as tic3.Boudoir
@G.Grothendieck thank you, that's cool! I haven't used !!! before, but it looks really succinct.Fronton
Nice work! (+1) Only data.table might add further speed-ups. :)Penetrance
@Penetrance haha, thank you. sometimes base R is surprisingly performantFronton
L
4

Using base Maping and Reduceing:


    names(filter_list) |> 
      Map(f = \(varname) starwars |> filter(.data[[varname]] %in% filter_list[[varname]])) |> 
      Reduce(f = \(stack, piece) inner_join(stack, piece))

Note that the accepted solution with purrr::reduce2 is more than twice as fast.

Lavina answered 27/8, 2024 at 22:7 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.