How to do cross join in R?
Asked Answered
I

11

45

How can I achieve a cross join in R ? I know that "merge" can do inner join, outer join. But I do not know how to achieve a cross join in R.

Thanks

Ithunn answered 15/5, 2012 at 1:32 Comment(3)
Possible duplicate of How to generate a matrix of combinationsGaylord
Within the tidyverse or more specifically the dplyr package you can simply do: full_join(df1, df2, by=character())Falsecard
data.table::foverlapseEligible
A
39

Is it just all=TRUE?

x<-data.frame(id1=c("a","b","c"),vals1=1:3)
y<-data.frame(id2=c("d","e","f"),vals2=4:6)
merge(x,y,all=TRUE)

From documentation of merge:

If by or both by.x and by.y are of length 0 (a length zero vector or NULL), the result, r, is the Cartesian product of x and y, i.e., dim(r) = c(nrow(x)*nrow(y), ncol(x) + ncol(y)).

Aircool answered 15/5, 2012 at 7:30 Comment(3)
Why all? Don't see how all=FALSE (default) would influence the outcome. Also note that merge sets by.x = by.y = by = intersect(names(x), names(y) so x and y may not share any column names (otherwise you will not get a cross join with default settings).Ballista
Not sure why this is the accepted solution. As pointed out in the comments, it doesn't work to give a cross-join for several use-cases.Wealthy
Just FYI: this only works with data.frames and not with data.tablesTortious
C
65

If speed is an issue, I suggest checking out the excellent data.table package. In the example at the end it's ~90x faster than merge.

You didn't provide example data. If you just want to get all combinations of two (or more individual) columns, you can use CJ (cross join):

library(data.table)
CJ(x=1:2,y=letters[1:3])
#   x y
#1: 1 a
#2: 1 b
#3: 1 c
#4: 2 a
#5: 2 b
#6: 2 c

If you want to do a cross join on two tables, I haven't found a way to use CJ(). But you can still use data.table:

x2<-data.table(id1=letters[1:3],vals1=1:3)
y2<-data.table(id2=letters[4:7],vals2=4:7)

res<-setkey(x2[,c(k=1,.SD)],k)[y2[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
res
#    id1 vals1 id2 vals2
# 1:   a     1   d     4
# 2:   b     2   d     4
# 3:   c     3   d     4
# 4:   a     1   e     5
# 5:   b     2   e     5
# 6:   c     3   e     5
# 7:   a     1   f     6
# 8:   b     2   f     6
# 9:   c     3   f     6
#10:   a     1   g     7
#11:   b     2   g     7
#12:   c     3   g     7

Explanation of the res line:

  • Basically you add a dummy column (k in this example) to one table and set it as the key (setkey(tablename,keycolumns)), add the dummy column to the other table, and then join them.
  • The data.table structure uses column positions and not names in the join, so you have to put the dummy column at the beginning. The c(k=1,.SD) part is one way that I have found to add columns at the beginning (the default is to add them to the end).
  • A standard data.table join has a format of X[Y]. The X in this case is setkey(x2[,c(k=1,.SD)],k), and the Y is y2[,c(k=1,.SD)].
  • allow.cartesian=TRUE tells data.table to ignore the duplicate key values, and perform a cartesian join (prior versions didn't require this)
  • The [,k:=NULL] at the end just removes the dummy key from the result.

You can also turn this into a function, so it's cleaner to use:

# Version 1; easier to write:
CJ.table.1 <- function(X,Y)
  setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]

CJ.table.1(x2,y2)
#    id1 vals1 id2 vals2
# 1:   a     1   d     4
# 2:   b     2   d     4
# 3:   c     3   d     4
# 4:   a     1   e     5
# 5:   b     2   e     5
# 6:   c     3   e     5
# 7:   a     1   f     6
# 8:   b     2   f     6
# 9:   c     3   f     6
#10:   a     1   g     7
#11:   b     2   g     7
#12:   c     3   g     7

# Version 2; faster but messier:
CJ.table.2 <- function(X,Y) {
  eval(parse(text=paste0("setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],list(",paste0(unique(c(names(X),names(Y))),collapse=","),")][,k:=NULL]")))
}

Here are some speed benchmarks:

# Create a bigger (but still very small) example:
n<-1e3
x3<-data.table(id1=1L:n,vals1=sample(letters,n,replace=T))
y3<-data.table(id2=1L:n,vals2=sample(LETTERS,n,replace=T))

library(microbenchmark)
microbenchmark(merge=merge.data.frame(x3,y3,all=TRUE),
               CJ.table.1=CJ.table.1(x3,y3),
               CJ.table.2=CJ.table.2(x3,y3),
               times=3, unit="s")
#Unit: seconds
#       expr        min         lq     median         uq        max neval
#      merge 4.03710225 4.23233688 4.42757152 5.57854711 6.72952271     3
# CJ.table.1 0.06227603 0.06264222 0.06300842 0.06701880 0.07102917     3
# CJ.table.2 0.04740142 0.04812997 0.04885853 0.05433146 0.05980440     3

Note that these data.table methods are much faster than the merge method suggested by @danas.zuokas. The two tables with 1,000 rows in this example result in a cross-joined table with 1 million rows. So even if your original tables are small, the result can get big quickly and speed becomes important.

Lastly, recent versions of data.table require you to add the allow.cartesian=TRUE (as in CJ.table.1) or specify the names of the columns that should be returned (CJ.table.2). The second method (CJ.table.2) seems to be faster, but requires some more complicated code if you want to automatically specify all the column names. And it may not work with duplicate column names. (Feel free to suggest a simpler version of CJ.table.2)

Churchman answered 4/1, 2013 at 21:44 Comment(7)
Not sure if this is due to subsequent package changes, but to get this working I had to amend the function slightly to CJ.table<-function(X,Y) setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]Tittletattle
You're correct @StephLocke, the data.table behavior has changed since my original answer. I've updated it and added some timings. Thanks.Churchman
variant that makes sure the dummy variable name used is unique: CJ.table.3 <- function(X,Y){ unique_name <- last(make.unique(c(colnames(X),colnames(Y),"k"))) X[,c(setNames(1,unique_name),.SD)][Y[,c(setNames(1,unique_name),.SD)],on=unique_name,allow.cartesian=TRUE][,(unique_name):=NULL] }Cachexia
During a recent update data.table does no longer allow for the latter version, but instead throws an error. adding allow.cartesian alleviates this, but data.table suggests using the by = .EACHI. Note that all three methods provide almost exactly the same speed as of today (benchmarking provides around 35 miliseconds on my machine for all 3 methods over 1000 replications), as such there is no performance benefit over the less readable version CJ.table.2. (with the added allow.cartesian argument)Sisyphean
Is there a version faster than CJ.table.1 that works on data table? I have a problem of a massive scale so even a small improvement saves a lot of time!Libby
@Churchman can this extended for more than two data.tables? Both CJ.table.1 and CJ.table.2 are not working for me, but "res" does. thanksHartzel
Noting that this no longer works in the most up to date version of data tableAcreage
A
39

Is it just all=TRUE?

x<-data.frame(id1=c("a","b","c"),vals1=1:3)
y<-data.frame(id2=c("d","e","f"),vals2=4:6)
merge(x,y,all=TRUE)

From documentation of merge:

If by or both by.x and by.y are of length 0 (a length zero vector or NULL), the result, r, is the Cartesian product of x and y, i.e., dim(r) = c(nrow(x)*nrow(y), ncol(x) + ncol(y)).

Aircool answered 15/5, 2012 at 7:30 Comment(3)
Why all? Don't see how all=FALSE (default) would influence the outcome. Also note that merge sets by.x = by.y = by = intersect(names(x), names(y) so x and y may not share any column names (otherwise you will not get a cross join with default settings).Ballista
Not sure why this is the accepted solution. As pointed out in the comments, it doesn't work to give a cross-join for several use-cases.Wealthy
Just FYI: this only works with data.frames and not with data.tablesTortious
R
23

This was asked years ago, but you can use tidyr::crossing() to do a cross-join. Definitely the simplest solution of the bunch.

library(tidyr)

league <- c("MLB", "NHL", "NFL", "NBA")
season <- c("2018", "2017")

tidyr::crossing(league, season)
#> # A tibble: 8 x 2
#>   league season
#>   <chr>  <chr> 
#> 1 MLB    2017  
#> 2 MLB    2018  
#> 3 NBA    2017  
#> 4 NBA    2018  
#> 5 NFL    2017  
#> 6 NFL    2018  
#> 7 NHL    2017  
#> 8 NHL    2018

Created on 2018-12-08 by the reprex package (v0.2.0).

Raster answered 7/12, 2018 at 14:48 Comment(2)
Better. The behavior you show (with vectors as inputs) is identical to the base function expand.grid. The advantage crossing has is that it works with data.frame inputs (and the point of the question). Using the example from the accepted answer, x <- data.frame(id1 = c("a", "b", "c"), vals1 = 1:3); y <- data.frame(id2 = c("d", "e", "f"), vals2 = 4:6), then crossing(x, y) works as expected, while expand.grid(x, y) fails.Lympho
This appears to be much faster than merge()Dysphonia
P
12

If you want to do it via data.table, this is one way:

cjdt <- function(a,b){
  cj = CJ(1:nrow(a),1:nrow(b))
  cbind(a[cj[[1]],],b[cj[[2]],])
}

A = data.table(ida = 1:10)
B = data.table(idb = 1:10)
cjdt(A,B)

Having said the above, if you are doing many little joins, and you don't need a data.table object and the overhead of producing it, a significant speed increase can be achieved by writing a c++ code block using Rcpp and the like:

// [[Rcpp::export]]
NumericMatrix crossJoin(NumericVector a, NumericVector b){
  int szA = a.size(), 
      szB = b.size();
  int i,j,r;
  NumericMatrix ret(szA*szB,2);
  for(i = 0, r = 0; i < szA; i++){
    for(j = 0; j < szB; j++, r++){
      ret(r,0) = a(i);
      ret(r,1) = b(j);
    }
  }
  return ret;
}

To compare, firstly for a large join:

C++

n = 1
a = runif(10000)
b = runif(10000)
system.time({for(i in 1:n){
  crossJoin(a,b)
}})

user system elapsed 1.033 0.424 1.462


data.table

system.time({for(i in 1:n){
  CJ(a,b)
}})

user system elapsed 0.602 0.569 2.452


Now for lots of little joins:

C++

n = 1e5
a = runif(10)
b = runif(10)
system.time({for(i in 1:n){
  crossJoin(a,b)
}})

user system elapsed 0.660 0.077 0.739


data.table

system.time({for(i in 1:n){
  CJ(a,b)
}})

user system elapsed 26.164 0.056 26.271

Priscella answered 8/8, 2015 at 12:23 Comment(1)
Joining atomic vectors is quite different from joining data.frames. Basically, you simplified it to much. Neither CJ or your solution address the problem in question.Hurless
W
8

Using sqldf:

x <- data.frame(id1 = c("a", "b", "c"), vals1 = 1:3)
y <- data.frame(id2 = c("d", "e", "f"), vals2 = 4:6) 

library(sqldf)
sqldf("SELECT * FROM x
      CROSS JOIN y")

Output:

  id1 vals1 id2 vals2
1   a     1   d     4
2   a     1   e     5
3   a     1   f     6
4   b     2   d     4
5   b     2   e     5
6   b     2   f     6
7   c     3   d     4
8   c     3   e     5
9   c     3   f     6

Just for the record, with the base package, we can use the by=NULL instead of all=TRUE:

merge(x, y, by= NULL)
Waldon answered 10/8, 2015 at 8:16 Comment(0)
C
6

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)
Cornelius answered 23/10, 2015 at 12:23 Comment(0)
R
4

For data.table use

dt1[, as.list(dt2), by = names(dt1)]

Note that this only works if there are no duplicate rows.

Roney answered 18/11, 2020 at 13:14 Comment(0)
G
2

I don't know of a built-in way to do it with data.frame's but it isn't hard to make.

@danas showed there is an easy built-in way, but I'll leave my answer here in case it is useful for other purposes.

cross.join <- function(a, b) {
    idx <- expand.grid(seq(length=nrow(a)), seq(length=nrow(b)))
    cbind(a[idx[,1],], b[idx[,2],])
}

and showing that it works with some built-in data sets:

> tmp <- cross.join(mtcars, iris)
> dim(mtcars)
[1] 32 11
> dim(iris)
[1] 150   5
> dim(tmp)
[1] 4800   16
> str(tmp)
'data.frame':   4800 obs. of  16 variables:
 $ mpg         : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl         : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp        : num  160 160 108 258 360 ...
 $ hp          : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat        : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt          : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec        : num  16.5 17 18.6 19.4 17 ...
 $ vs          : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am          : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear        : num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb        : num  4 4 1 1 2 1 4 2 2 4 ...
 $ Sepal.Length: num  5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1 ...
 $ Sepal.Width : num  3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 ...
 $ Petal.Length: num  1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Girhiny answered 15/5, 2012 at 3:42 Comment(0)
K
2

I'd love to know if there exists a convenient way to crossjoin two data.tables. I do this so often I ended up rolling my own function that others may find helpful

library(data.table)

cartesian_join <- function(i, j){
  # Cartesian join of two data.tables
  # If i has M rows and j has N rows, the result will have M*N rows
  # Example: cartesian_join(as.data.table(iris), as.data.table(mtcars))

  # Check inputs
  if(!is.data.table(i)) stop("'i' must be a data.table")
  if(!is.data.table(j)) stop("'j' must be a data.table")
  if(nrow(i) == 0) stop("'i' has 0 rows. Not sure how to handle cartesian join")
  if(nrow(j) == 0) stop("'j' has 0 rows. Not sure how to handle cartesian join")

  # Do the join (use a join column name that's unlikely to clash with a pre-existing column name)
  i[, MrJoinyJoin := 1L]
  j[, MrJoinyJoin := 1L]
  result <- j[i, on = "MrJoinyJoin", allow.cartesian = TRUE]
  result[, MrJoinyJoin := NULL]
  i[, MrJoinyJoin := NULL]
  j[, MrJoinyJoin := NULL]

  return(result[])
}

foo <- data.frame(Foo = c(1,2,3))
foo
  Foo
1   1
2   2
3   3

bar <- data.frame(Bar = c("a", "b", "c"))
bar
  Bar
1   a
2   b
3   c

cartesian_join(as.data.table(foo), as.data.table(bar))
   Bar Foo
1:   a   1
2:   b   1
3:   c   1
4:   a   2
5:   b   2
6:   c   2
7:   a   3
8:   b   3
9:   c   3
Kuhlman answered 6/10, 2018 at 15:27 Comment(0)
I
2

dplyr solution:

You can perform a cross-join within dplyr joins (i.e. left_join, inner_join, etc.).

For example, according to the left_join help: "To perform a cross-join, generating all combinations of x and y, use by = character()"

So you could use something like:

df3 <- left_join(df1, df2, by = character())
Impress answered 23/11, 2021 at 17:3 Comment(0)
A
0

In dplyr, there is cross_join:

library(dplyr)
A <- data.frame(id1=letters[1:3],vals1=1:3)
B <- data.frame(id2=letters[4:7],vals2=4:7)

cross_join(A, B)

output

   id1 vals1 id2 vals2
1    a     1   d     4
2    a     1   e     5
3    a     1   f     6
4    a     1   g     7
5    b     2   d     4
6    b     2   e     5
7    b     2   f     6
8    b     2   g     7
9    c     3   d     4
10   c     3   e     5
11   c     3   f     6
12   c     3   g     7
Archi answered 6/2, 2023 at 13:17 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.