Cartesian product with dplyr
Asked Answered
F

7

39

I'm trying to find the dplyr function for cartesian product. I've two simple data.frame with no common variable:

x <- data.frame(x = c("a", "b", "c"))
y <- data.frame(y = c(1, 2, 3))

I would like to reproduce the result of

merge(x, y)

  x y
1 a 1
2 b 1
3 c 1
4 a 2
5 b 2
6 c 2
7 a 3
8 b 3
9 c 3

I've already looked for this (for example here or here) without finding anything useful.

Foreleg answered 5/4, 2017 at 10:21 Comment(1)
expand.grid(x$x,y$y) or CJ(x$x, y$y) from data.tableGravamen
C
53

Use crossing from the tidyr package:

x <- data.frame(x=c("a","b","c"))
y <- data.frame(y=c(1,2,3))

crossing(x, y)

Result:

   x y
 1 a 1
 2 a 2
 3 a 3
 4 b 1
 5 b 2
 6 b 3
 7 c 1
 8 c 2
 9 c 3
Cob answered 3/4, 2018 at 13:3 Comment(0)
P
10

When x and y are database tbls (tbl_dbi / tbl_sql) you can now also do:

full_join(x, y, by = character())

Added to dplyr at the end of 2017, and also gets translated to a CROSS JOIN in the DB world. Saves the nastiness of having to introduce the fake variables.

I'm seeing comments now (Nov2022) that this does also work on standard dataframes! Great news!

Polydactyl answered 21/6, 2018 at 14:22 Comment(4)
I'm running with dplyr 0.7.6 and this gives the error Error in full_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches) : by must specify variables to join by for the given example. Any ideas?Buskined
Same error as @Lyngbakr with dplyr 0.7.8. Did this ever appear in dplyr? Absent this, I use crossing(), as in the accepted answer.Leprose
@andyyy What was the underlying type when this error occur? I've seen odd and different behaviours especially on data.tables.Polydactyl
Works fine with dataframes.Analphabetic
G
6

If we need a tidyverse output, we can use expand from tidyr

library(tidyverse)
y %>% 
   expand(y, x= x$x) %>%
   select(x,y)
# A tibble: 9 × 2
#       x     y
#  <fctr> <dbl>
#1      a     1
#2      b     1
#3      c     1
#4      a     2
#5      b     2
#6      c     2
#7      a     3
#8      b     3
#9      c     3
Gravamen answered 5/4, 2017 at 10:28 Comment(0)
J
5

When faced with this problem, I tend to do something like this:

x <- data.frame(x=c("a","b","c"))
y <- data.frame(y=c(1,2,3))
x %>% mutate(temp=1) %>% 
inner_join(y %>% mutate(temp=1),by="temp") %>%
dplyr::select(-temp) 

If x and y are multi-column data frames, but I want to do every combination of a row of x with a row of y, then this is neater than any expand.grid() option that I can come up with

Johst answered 23/5, 2017 at 10:32 Comment(0)
R
4
expand.grid(x=c("a","b","c"),y=c(1,2,3))

Edit: Consider also this following elegant solution from "Y T" for n more complex data.frame :

https://mcmap.net/q/378229/-alternative-to-expand-grid-for-data-frames

in short:

expand.grid.df <- function(...) Reduce(function(...) merge(..., by=NULL), list(...))
expand.grid.df(df1, df2, df3)
Rawlinson answered 5/4, 2017 at 10:28 Comment(1)
There is also expand_grid in package tidyr.Restrict
P
4

This is a continuation of dsz's comment. Idea came from: http://jarrettmeyer.com/2018/07/10/cross-join-dplyr.

tbl_1$fake <- 1
tbl_2$fake <- 1
my_cross_join <- full_join(tbl_1, tbl_2, by = "fake") %>%
                 select(-fake)

I tested this on four columns of data ranging in size from 4 to 640 obs, and it took about 1.08 seconds.

Praedial answered 17/12, 2018 at 2:53 Comment(1)
I like this one. It's nice and simple, but not very performant in my experience.Mansoor
F
2

Using two answers above, using full_join() with by = character() seems to be faster:

library(tidyverse)
library(microbenchmark)

df <- data.frame(blah = 1:10)

microbenchmark(diamonds %>% crossing(df))
Unit: milliseconds
                      expr      min       lq     mean   median       uq     max neval
 diamonds %>% crossing(df) 21.70086 22.63943 23.72622 23.01447 24.25333 30.3367   100
microbenchmark(diamonds %>% full_join(df, by = character()))
Unit: milliseconds
                                         expr      min       lq     mean   median       uq      max neval
 diamonds %>% full_join(df, by = character()) 9.814783 10.23155 10.76592 10.44343 11.18464 15.71868   100
Fathomless answered 23/4, 2021 at 17:17 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.