Cartesian product data frame

I have three or more independent variables represented as R vectors, like so:

A <- c(1,2,3)
B <- factor(c('x','y'))
C <- c(0.1,0.5)

and I want to take the Cartesian product of all of them and put the result into a data frame, like this:

A B C
1 x 0.1
1 x 0.5
1 y 0.1
1 y 0.5
2 x 0.1
2 x 0.5
2 y 0.1
2 y 0.5
3 x 0.1
3 x 0.5
3 y 0.1
3 y 0.5

I can do this by manually writing out calls to rep:

d <- data.frame(A = rep(A, each=length(B)*length(C)),
                B = rep(B, times=length(A), each=length(C)),
                C = rep(C, times=length(A)*length(B)))

but there must be a more elegant way to do it, yes? product in itertools does part of the job, but I can't find any way to absorb the output of an iterator and put it into a data frame. Any suggestions?

p.s. The next step in this calculation looks like

d$D <- f(d$A, d$B, d$C)

so if you know a way to do both steps at once, that would also be helpful.

Sigmon answered 29/11, 2010 at 23:41 Comment(6)
It would be useful if you specified what the function f does. - Inhumane
f is a placeholder for one of several different hairy mathematical calculations, but for purposes of this question, I think the thing you need to know is that they all take N vectors of appropriate type and produce one vector; all inputs must be the same length, and the output is also that length. - Sigmon
I would recommend changing the title of this question... "data table" now means something different in R. - Lorca
@Lorca I changed it to "data frame". If that's not what you meant, please clarify. (I don't know what you are talking about, but it was always a data frame that I meant, and the title was indeed sloppy of me.) - Sigmon
@Lorca Dunno if you're planning on systematically arguing for changing such titles, but I would recommend you don't. Tabular data is a concept folks have (from SQL, Excel or elsewhere), and they may well google for answers using that term, not knowing the minutiae of R packages. I think it's best that we let them do so and not rewrite questions for "correctness". Besides, the R thing is data.table, not data table. - Incondite
@Incondite I'm being picky because I found this question while searching for exactly what the title stated: how to do a Cartesian product with data.tables in R. This question doesn't pertain to that topic, so I suggested changing it to avoid future confusion and misdirection. - Lorca
I
84

You can use expand.grid(A, B, C)
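
A minimal sketch of the call with named arguments, using the vectors from the question. Note that expand.grid varies its first argument fastest, so the rows come out in a different order from the table in the question. The order() step below is one way to get that layout back, and is only needed if the row order matters.

```r
A <- c(1, 2, 3)
B <- factor(c("x", "y"))
C <- c(0.1, 0.5)

# Naming the arguments gives the columns the names A, B, C
d <- expand.grid(A = A, B = B, C = C)

# Optional: reorder so that A varies slowest and C fastest, as in the question
d <- d[order(d$A, d$B, d$C), ]
rownames(d) <- NULL
```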


EDIT: An alternative to do.call for the second part is the function mdply from the plyr package:

library(plyr)

d = expand.grid(x = A, y = B, z = C)
d = mdply(d, f)

To illustrate with a trivial function, paste, you can try:

d = mdply(d, 'paste', sep = '+')
Inhumane answered 30/11, 2010 at 0:4 Comment(3)
Aha! I knew there had to be a standard library routine that did this, but could not find what it was called. I am going to leave the question open in case someone has an answer to part two, though. - Sigmon
If f is a custom function, you could modify it to accept a data frame as an argument and let the function handle the splitting into component vectors. - Inhumane
Was staring at the plyr documentation, but didn't catch that this was what mdply was for. Thanks. - Sigmon
P
22

The base function merge, which operates on data frames, is helpful here.

It can produce various joins (in SQL terminology), and the Cartesian product is a special case.

You have to convert the variables to data frames first, because merge takes data frames as arguments.

So something like this will do:

A.B <- merge(data.frame(A=A), data.frame(B=B), by=NULL)
A.B.C <- merge(A.B, data.frame(C=C), by=NULL)

The one caveat is that the rows are not in the order shown in the question. Sort them afterwards if the order matters.

merge(x, y, by = intersect(names(x), names(y)),
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x",".y"),
      incomparables = NULL, ...)

"If by or both by.x and by.y are of length 0 (a length zero vector or NULL), the result, r, is the Cartesian product of x and y"

See the documentation for details: http://stat.ethz.ch/R-manual/R-patched/library/base/html/merge.html
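
With three or more variables, the pairwise merge calls above can be folded into one step. The following is only a sketch: the names vars and frames are made up for illustration, and the final order() call is one way to restore the row order shown in the question.

```r
A <- c(1, 2, 3)
B <- factor(c("x", "y"))
C <- c(0.1, 0.5)

# Wrap each vector in a one-column data frame named after the variable
vars <- list(A = A, B = B, C = C)
frames <- lapply(names(vars), function(nm) setNames(data.frame(vars[[nm]]), nm))

# Cross join the frames one after another; by = NULL requests the Cartesian product
d <- Reduce(function(x, y) merge(x, y, by = NULL), frames)

# Optional: sort into the order used in the question
d <- d[order(d$A, d$B, d$C), ]
rownames(d) <- NULL
```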

Pylos answered 24/1, 2013 at 18:26 Comment(0)
P
16

With the tidyr package you can use tidyr::crossing (the row order matches the question):

library(tidyr)
crossing(A,B,C)
# A tibble: 12 x 3
#        A B         C
#    <dbl> <fct> <dbl>
#  1     1 x       0.1
#  2     1 x       0.5
#  3     1 y       0.1
#  4     1 y       0.5
#  5     2 x       0.1
#  6     2 x       0.5
#  7     2 y       0.1
#  8     2 y       0.5
#  9     3 x       0.1
# 10     3 x       0.5
# 11     3 y       0.1
# 12     3 y       0.5 

The next step would be to use tidyverse and especially the purrr::pmap* family:

library(tidyverse)
crossing(A,B,C) %>% mutate(D = pmap_chr(.,paste,sep="_"))
# A tibble: 12 x 4
#        A B         C D      
#    <dbl> <fct> <dbl> <chr>  
#  1     1 x       0.1 1_1_0.1
#  2     1 x       0.5 1_1_0.5
#  3     1 y       0.1 1_2_0.1
#  4     1 y       0.5 1_2_0.5
#  5     2 x       0.1 2_1_0.1
#  6     2 x       0.5 2_1_0.5
#  7     2 y       0.1 2_2_0.1
#  8     2 y       0.5 2_2_0.5
#  9     3 x       0.1 3_1_0.1
# 10     3 x       0.5 3_1_0.5
# 11     3 y       0.1 3_2_0.1
# 12     3 y       0.5 3_2_0.5
Pontonier answered 4/6, 2018 at 22:14 Comment(0)
D
8

Consider using the wonderful data.table library for expressiveness and speed. It handles many plyr use-cases (relational group by), along with transform, subset and relational join using a fairly simple uniform syntax.

library(data.table)
d <- CJ(x=A, y=B, z=C)  # Cross join
d[, w:=f(x,y,z)]  # Mutates the data.table

or in one line

d <- CJ(x=A, y=B, z=C)[, w:=f(x,y,z)]
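
For a version that runs end to end, here is a sketch with a stand-in for f. The function body is purely hypothetical and only there so the example executes. By default CJ sorts its result, so the rows should come out with x varying slowest, as in the question.

```r
library(data.table)

A <- c(1, 2, 3)
B <- factor(c("x", "y"))
C <- c(0.1, 0.5)

# Hypothetical placeholder for the real calculation
f <- function(x, y, z) x * z

d <- CJ(x = A, y = B, z = C)   # cross join of the three vectors
d[, w := f(x, y, z)]           # adds column w to d by reference
```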
Discontinuation answered 31/5, 2014 at 23:7 Comment(0)
U
6

Here's a way to do both, using Ramnath's suggestion of expand.grid:

f <- function(x,y,z) paste(x,y,z,sep="+")
d <- expand.grid(x=A, y=B, z=C)
d$D <- do.call(f, d)

Note that do.call works on d "as-is" because a data.frame is a list. But do.call expects the column names of d to match the argument names of f.
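
If the column names of d do not match the argument names of f, one workaround is to drop the names so that the columns are passed by position. This is a small sketch, assuming the columns are already in the order f expects:

```r
A <- c(1, 2, 3)
B <- factor(c("x", "y"))
C <- c(0.1, 0.5)

f <- function(x, y, z) paste(x, y, z, sep = "+")

# Columns are named A, B, C, which are not argument names of f
d <- expand.grid(A = A, B = B, C = C)

# do.call(f, d) would fail here with an "unused arguments" error,
# so strip the names and pass the columns positionally instead
d$D <- do.call(f, unname(as.list(d)))
```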

Ugric answered 30/11, 2010 at 0:46 Comment(1)
@Zack: Thanks; I've updated my response. It's not a one-liner, but evaluating f is still easier with do.call than typing in each argument. - Ugric
B
1

Using cross join in sqldf:

library(sqldf)

A <- data.frame(c1 = c(1,2,3))
B <- data.frame(c2 = factor(c('x','y')))
C <- data.frame(c3 = c(0.1,0.5))

result <- sqldf('SELECT * FROM (A CROSS JOIN B) CROSS JOIN C') 
Blanc answered 12/7, 2019 at 3:23 Comment(0)
U
0

I can never remember that standard function expand.grid. So here's another version.

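crossproduct <- function(...,FUN='data.frame') {
  args <- list(...)
  # Column names: explicit argument names if given, else the argument expressions
  n1 <- names(args)
  n2 <- sapply(match.call()[1+1:length(args)], as.character)
  nn <- if (is.null(n1)) n2 else ifelse(n1!='',n1,n2)
  dims <- sapply(args,length)
  dimtot <- prod(dims)  # total number of rows in the product
  # reps[j]: number of consecutive rows that share one value of argument j
  reps <- rev(cumprod(c(1,rev(dims))))[-1]
  # Build each column by indexing its vector; the first argument varies slowest
  cols <- lapply(1:length(dims), function(j)
                 args[[j]][1+((1:dimtot-1) %/% reps[j]) %% dims[j]])
  names(cols) <- nn
  # Combine the columns with FUN (a data frame by default)
  do.call(match.fun(FUN),cols)
}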

A <- c(1,2,3)
B <- factor(c('x','y'))
C <- c(.1,.5)

crossproduct(A,B,C)

crossproduct(A,B,C, FUN=function(...) paste(...,sep='_'))
Unfold answered 30/11, 2010 at 0:28 Comment(0)
