"recursive" self join in data.table

Asked 30/6, 2019 at 2:39 Answered 30/6, 2019 at 8:52

Solved r recursion join data.table self-join

I have a component list made of 3 columns: product, component and quantity of component used:

a <- structure(list(prodName = c("prod1", "prod1", "prod2", "prod3", 
"prod3", "int1", "int1", "int2", "int2"), component = c("a", 
"int1", "b", "b", "int2", "a", "b", "int1", "d"), qty = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L)), row.names = c(NA, -9L), class = c("data.table", 
"data.frame"))

  prodName component qty
1    prod1         a   1
2    prod1      int1   2
3    prod2         b   3
4    prod3         b   4
5    prod3      int2   5
6     int1         a   6
7     int1         b   7
8     int2      int1   8
9     int2         d   9

Products with names starting with prod are final products, those with names like int are intermediate products, and those with letters are raw materials.

I need the full component list of final products with only raw materials as components. That is, I want to convert any int into raw materials.

Intermediate products can be composed of raw materials and other intermediate products, hence my reference to "recursive".
I can't know in advance the level of nesting / recursion of an intermediate product (2 levels in this example, in excess of 6 in actual data).

For this example, my expected result is (I explicitly stated the computation of the resulting number):

prodName  |component  |qty
prod1     |a          |1+2*6 = 13
prod1     |b          |0+2*7 = 14
prod2     |b          |3
prod3     |b          |4+5*8*7 = 284
prod3     |a          |0+5*8*6 = 240
prod3     |d          |0+5*9 = 45

What I have done:

I solved this by creating a very cumbersome sequence of joins with merge. While this approach worked for the toy data, it's unlikely I can apply it to the real one.

#load data.table
library(data.table)

# split the tables between products and different levels of intermediate
a1 <- a[prodName %like% "prod",]
b1 <- a[prodName %like% "int1",]
c1 <- a[prodName %like% "int2",]

# convert int2 to raw materials
d1 <- merge(c1, 
            b1, 
            by.x = "component", 
            by.y = "prodName", 
            all.x = TRUE)[
              is.na(component.y),
              component.y := component][
                is.na(qty.y),
                qty.y := 1][,
                                .(prodName, qty = qty.x*qty.y),
                                by = .(component = component.y)]

# Since int1 is already exploded into raw materials, rbind both tables:
d1 <- rbind(d1, b1)

# convert all final products into raw materials, except that the raw mats that go directly into the product won't appear:
e1 <- merge(a1, 
            d1, 
            by.x = "component", 
            by.y = "prodName", 
            all.x = TRUE)

# rbind the last calculated raw mats (those coming from intermediate products) with those coming _directly_ into the final product:
result <- rbind(e1[!is.na(qty.y), 
                   .(prodName, qty = qty.x * qty.y), 
                   by = .(component = component.y)], 
                e1[is.na(qty.y), 
                   .(prodName, component, qty = qty.x)])[, 
                                                         .(qty = sum(qty)), 
                                                         keyby = .(prodName, component)]

I'm aware I can split the data into tables and perform joins until every intermediate product is expressed in terms of raw materials only, but as mentioned above, that will be a last resort due to the size of the data and the levels of recursion of intermediate products.

Is there an easier / better way to do this sort of recursive join?

Paulinepauling answered 30/6, 2019 at 2:39 Comment(2)

Can you change your example for qty to be different numbers. Maybe 1:9 (not sure if they all can be different). – Gensmer 30/6, 2019 at 2:48

@M-M Please see my edited code. – Paulinepauling 30/6, 2019 at 3:8

Here's my attempt using your dataset.

It uses a while loop that checks whether any components also appear in the prodName field. The loop always needs to have the same fields, so instead of adding a column for each recursive multiplier (i.e., 5*8*7 at the end), the multipliers are folded into a single column at each iteration. That is, 5*8*7 becomes 5*56 at the end.

library(data.table)

a[, qty_multiplier := 1]
b <- copy(a)

while (b[component %in% prodName, .N] > 0) {
  b <- b[a
         , on = .(prodName = component)
         , .(prodName = i.prodName
             , component = ifelse(is.na(x.component), i.component, x.component)
             , qty = i.qty
             , qty_multiplier = ifelse(is.na(x.qty), 1, x.qty * qty_multiplier)
         )
         ]
}

b[prodName %like% 'prod', .(qty = sum(qty * qty_multiplier)), by = .(prodName, component)] 

   prodName component qty
1:    prod1         a  13
2:    prod1         b  14
3:    prod2         b   3
4:    prod3         b 284
5:    prod3         a 240
6:    prod3         d  45
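
For comparison, the same totals can be cross-checked with a plain recursive function in base R. This is only a minimal sketch, not the answerer's code: `explode_item()` and `bom` are hypothetical names introduced here, and the sketch assumes the component list contains no cycles (an item never ends up containing itself).

```r
# Component list from the question, as a plain data.frame.
bom <- data.frame(
  prodName  = c("prod1", "prod1", "prod2", "prod3", "prod3",
                "int1", "int1", "int2", "int2"),
  component = c("a", "int1", "b", "b", "int2", "a", "b", "int1", "d"),
  qty       = 1:9,
  stringsAsFactors = FALSE
)

# explode_item() (hypothetical helper): raw-material totals needed for one
# unit of `item`, returned as a named numeric vector.
explode_item <- function(item, bom) {
  rows <- bom[bom$prodName == item, ]
  if (nrow(rows) == 0L) {
    # Nothing is listed under this item, so treat it as a raw material.
    return(setNames(1, item))
  }
  totals <- numeric(0)
  for (k in seq_len(nrow(rows))) {
    # Scale the requirements of each component by the quantity used.
    sub <- rows$qty[k] * explode_item(rows$component[k], bom)
    for (nm in names(sub)) {
      totals[nm] <- sum(totals[nm], sub[[nm]], na.rm = TRUE)
    }
  }
  totals
}

explode_item("prod3", bom)
```

The sketch recomputes shared intermediate products every time they are reached, so for large data the join-based loop above is likely the better fit; the recursion is mainly useful for checking individual products.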

Vizor answered 30/6, 2019 at 4:9 Comment(0)

Essentially, your data represents a weighted edge list of a directed graph. The code below uses the igraph library to compute, for each pair of raw component and final product, the sum over all simple paths from component to product of the product of the edge weights along the path:

library(igraph)

## transform edgelist into graph
graph <- graph_from_edgelist(as.matrix(a[, c(2, 1)])) %>%
  set_edge_attr("weight", value = unlist(a[, 3]))

## combinations raw components -> final products
out <- expand.grid(prodname = c("prod1", "prod2", "prod3"), component = c("a", "b", "d"), stringsAsFactors = FALSE)

## calculate quantities
out$qty <- mapply(function(component, prodname) {

  ## all simple paths from component -> prodname
  all_paths <- all_simple_paths(graph, from = component, to = prodname)

  ## if simple paths exist, sum over product of weights for each path
  ifelse(length(all_paths) > 0,
         sum(sapply(all_paths, function(path) prod(E(graph, path = path)$weight))), 0)

}, out$component, out$prodname)

out
#>   prodname component qty
#> 1    prod1         a  13
#> 2    prod2         a   0
#> 3    prod3         a 240
#> 4    prod1         b  14
#> 5    prod2         b   3
#> 6    prod3         b 284
#> 7    prod1         d   0
#> 8    prod2         d   0
#> 9    prod3         d  45
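
The product and component names in the `expand.grid()` call above are hard-coded. A minimal sketch of deriving them from the data instead, assuming a final product is any name that never appears as a component and a raw material is any name that never appears as a product (`bom`, `finals` and `raws` are names introduced here for illustration):

```r
# Component list from the question, as a plain data.frame.
bom <- data.frame(
  prodName  = c("prod1", "prod1", "prod2", "prod3", "prod3",
                "int1", "int1", "int2", "int2"),
  component = c("a", "int1", "b", "b", "int2", "a", "b", "int1", "d"),
  qty       = 1:9,
  stringsAsFactors = FALSE
)

# Assumption: final products are never used as components,
# and raw materials are never listed as products.
finals <- setdiff(bom$prodName, bom$component)
raws   <- setdiff(bom$component, bom$prodName)

# Same grid of (final product, raw material) pairs as in the answer.
out <- expand.grid(prodname = finals, component = raws,
                   stringsAsFactors = FALSE)
```

If the real data contains intermediate products that are never used anywhere, they would be classified as final products under this assumption and would need to be filtered out by name.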

Imre answered 30/6, 2019 at 8:52 Comment(1)

Wow! This is a radically different approach. I'll take a look at the igraph package. Thank you – Paulinepauling 30/6, 2019 at 13:59

I think you are better off representing the information in a set of adjacency matrices that tell you "how much of this is made of that". You need 4 matrices, corresponding to all the possible relationships. For example you put the relationship between final product and intermediate in a matrix with 3 rows and 2 columns like this:

QPI <- matrix(0,3,2)
row.names(QPI) <- c("p1","p2","p3")
colnames(QPI) <- c("i1","i2")

QPI["p1","i1"] <- 2
QPI["p3","i2"] <- 5

   i1 i2
p1  2  0
p2  0  0
p3  0  5

This tells you that it takes 2 units of intermediate product i1 to make one unit of final product p1.

Similarly you define the other matrices:

QPR <- matrix(0,3,3)
row.names(QPR) <- c("p1","p2","p3")
colnames(QPR) <- c("a","b","d")

QPR["p1","a"] <- 1
QPR["p2","b"] <- 3
QPR["p3","b"] <- 4

QIR <- matrix(0,2,3)
row.names(QIR) <- c("i1","i2")
colnames(QIR) <- c("a","b","d")

QIR["i1","a"] <- 6
QIR["i1","b"] <- 7
QIR["i2","d"] <- 9

QII <- matrix(0,2,2)
row.names(QII) <- colnames(QII) <- c("i1","i2")

QII["i2","i1"] <- 8

For example, looking at QIR, we see it takes 6 units of raw material a to make one unit of intermediate product i1. Once you have the data in this form, you sum over all possible ways of going from raw material to final product using matrix multiplication.

You have 3 terms: you can go directly from raw to final [QPR], or go from raw to intermediate to final [QPI%*%QIR], or go from raw to intermediate to another intermediate to final [QPI%*%QII%*%QIR].

Your result is in the end represented by the matrix

result <- QPI%*%QIR + QPI%*%QII%*%QIR + QPR

I put all the code together below. If you run it you will see that the result looks like this:

     a   b  d
p1  13  14  0
p2   0   3  0
p3 240 284 45

which says exactly the same thing as

prodName  |component  |qty
prod1     |a          |1+2*6 = 13
prod1     |b          |0+2*7 = 14
prod2     |b          |3
prod3     |b          |4+5*8*7 = 284
prod3     |a          |0+5*8*6 = 240
prod3     |d          |0+5*9 = 45

Hope this helps.

QPI <- matrix(0,3,2)
row.names(QPI) <- c("p1","p2","p3")
colnames(QPI) <- c("i1","i2")

QPI["p1","i1"] <- 2
QPI["p3","i2"] <- 5

QPR <- matrix(0,3,3)
row.names(QPR) <- c("p1","p2","p3")
colnames(QPR) <- c("a","b","d")

QPR["p1","a"] <- 1
QPR["p2","b"] <- 3
QPR["p3","b"] <- 4

QIR <- matrix(0,2,3)
row.names(QIR) <- c("i1","i2")
colnames(QIR) <- c("a","b","d")

QIR["i1","a"] <- 6
QIR["i1","b"] <- 7
QIR["i2","d"] <- 9

QII <- matrix(0,2,2)
row.names(QII) <- colnames(QII) <- c("i1","i2")


QII["i2","i1"] <- 8

result <- QPI%*%QIR + QPI%*%QII%*%QIR + QPR
print(result)

Revolt answered 30/6, 2019 at 7:8 Comment(4)

I appreciate your input. I'll think about how to programmatically transform my data into an unknown set of qpr qpi qii qir matrices. One of the doubts I have, as I stated in the question, is that I can have many nested levels of intermediate products, that will require (nested?) qii matrices. If you have any ideas on how to do that, I appreciate if you can share with me. – Paulinepauling 30/6, 2019 at 14:30

I am not 100% sure I understand "nesting". Can you give me an example of one more level of nesting? Just to understand better. Thank you – Revolt 30/6, 2019 at 15:6

Prod20 made of a + int18; int18 made of k + int14; int14 made of int3 + int4; int3 made of b and c; int4 made of g + h. That is what I call "nested intermediate products" (lacking a better name), and converting each one of them into raw materials is what I call "recursing" (it may be a poor name, too) – Paulinepauling 30/6, 2019 at 17:7

Sorry for the delay in answering. Thanks for the clarification, I see what you mean now. The approach outlined above, as well as the excellent solution that uses igraph, will work independently of how many levels of nesting you have. They all go in one QII matrix, and the matrix multiplication takes care of summing all the possible contributions. I am not very familiar with igraph, but I am pretty sure that if you wanted to extract the Q matrices, that library would do it for you: those are adjacency matrices of different subgraphs. I hope this is helpful. – Revolt 2/7, 2019 at 16:31
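
One caveat on the nesting discussion: the three-term formula in the answer covers chains with at most one intermediate-to-intermediate step, so deeper nesting needs the higher powers of QII as well. A minimal sketch of one way to do that, under the assumption that the component list has no cycles: put every item in a single square matrix Q, where Q[product, component] is the quantity used. Then Q + Q^2 + Q^3 + ... has finitely many nonzero terms and equals solve(I - Q) - I. The names `bom`, `Q`, `total`, `finals` and `raws` are introduced here for illustration and are not part of the answer.

```r
# Component list from the question, as a plain data.frame.
bom <- data.frame(
  prodName  = c("prod1", "prod1", "prod2", "prod3", "prod3",
                "int1", "int1", "int2", "int2"),
  component = c("a", "int1", "b", "b", "int2", "a", "b", "int1", "d"),
  qty       = 1:9,
  stringsAsFactors = FALSE
)

# One square matrix over all items: Q[product, component] = quantity used.
# Assumption: each (product, component) pair appears at most once.
nodes <- unique(c(bom$prodName, bom$component))
n <- length(nodes)
Q <- matrix(0, n, n, dimnames = list(nodes, nodes))
Q[cbind(bom$prodName, bom$component)] <- bom$qty

# Sum over paths of every length: Q + Q^2 + ... = (I - Q)^(-1) - I.
# This relies on the component list being acyclic.
total <- solve(diag(n) - Q) - diag(n)
dimnames(total) <- dimnames(Q)

# Keep only final products (rows) and raw materials (columns).
finals <- setdiff(bom$prodName, bom$component)
raws   <- setdiff(bom$component, bom$prodName)
result <- round(total[finals, raws])
print(result)
```

The `round()` call only removes floating-point noise from `solve()`; it assumes integer quantities, as in the example. A dense matrix over all items may become large for real data, in which case a sparse representation would be worth considering.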
