subcomponent(mode = "in") for multiple source vertices
Asked Answered
H

1

7

Question

In the igraph R package, is there an efficient implementation of subcomponent() and/or BFS that can handle multiple source vertices?

Motivation

The drake R package models a user's workflow as a DAG of interdependent objects and files. The DAG should only contain the user's targets and their upstream dependencies, so drake uses igraph::subcomponent() to eliminate superfluous vertices. This approach is inefficient because the v argument must be a single vertex, so drake ends up conducting a new BFS for every single target the user wants to build.

EDIT: 2019-01-10

drake now uses a different approach that ultimately relies on sequential calls to adjacent_vertices(). The approach is clunky, but the speed improvement is actually quite nice. Still holding out for something more elegant and sophisticated.

Hying answered 7/1, 2019 at 3:16 Comment(4)
Can you maybe use neighbourhood functions to get all ancestors : ?ego & hereAblative
make_ego_graph() creates a new graph for every single vertex in v and then puts those graphs together in a list. ego() is similar. I will check out the performance, but I am still looking for a more direct solution.Hying
Yeah, the performance improvement is really small.Hying
I took care of the performance issue by applying adjacent_vertices() over and over again, but the technique I used is really naive and unsatisfying. I am still holding out for something more sophisticated, if implemented already.Hying
C
2

I think you can do this using the distances() function which generates a matrix of distances (no of edges) between nodes. This seems to do the search once and is a good deal faster than iterating through each vertex.

Example code:

library(igraph)
library(microbenchmark)

# generate some random testing data
set.seed(1234)
g <- erdos.renyi.game(50, .01)

# Here we make a function that iterates 
# across the vector of IDs applying the function
# and returns a list where each element is the
# ids of the subcomponents
sc_apply <- function(g) {
  vs <- V(g)
  res <- sapply(vs, function(v){as.numeric( # to facilitate comparison
    subcomponent(g, v, mode = "in")
    )})
  res
}

# Try it for testing
t1 <- sc_apply(g)

# Here we get the matrix of node distances. Infinite distance
# implies a seperate component. We iterate through rows of
# matrix to extract the set of nodes that are connected 
sc_distmat <- function(g) {
  dmat <- distances(g, mode = "in")
  res <- apply(dmat, 1, function(row){which(is.finite(row))})
  res
}

# extract for testing
t2 <- sc_distmat(g)

# check that equal (we need to sort the 
# subcomponent list elements first to facilitate comparison)
all.equal(lapply(t1, sort), t2)
#> [1] TRUE

The results are identical -- although worth noting that if your graph is one giant component than apply will return a matrix to you instead of a list so you will need to do your comparisons in a slightly different way.

Ok now lets see how much/whether this is faster:

# generate graphs of different sizes (not too big because my
# laptop is borderline antique!)
set.seed(42)
small_g <- erdos.renyi.game(10, .2)
mid_g <- erdos.renyi.game(50, .1)
big_g <- erdos.renyi.game(100, .1)

# check speed improvement
microbenchmark(sc_apply(small_g), sc_distmat(small_g))
#> Unit: microseconds
#>                 expr      min        lq      mean   median        uq
#>    sc_apply(small_g) 2181.465 2243.4895 2734.7132 2313.005 2777.7135
#>  sc_distmat(small_g)  451.333  471.8565  840.4742  521.865  598.0845
#>        max neval cld
#>   9152.262   100   b
#>  27139.262   100  a
microbenchmark(sc_apply(mid_g), sc_distmat(mid_g))
#> Unit: microseconds
#>               expr       min        lq       mean    median         uq
#>    sc_apply(mid_g) 11006.113 11327.794 13590.9536 12919.492 15397.2510
#>  sc_distmat(mid_g)   757.752   795.308   949.2551   841.834   965.4545
#>        max neval cld
#>  27068.296   100   b
#>   2061.824   100  a
microbenchmark(sc_apply(big_g), sc_distmat(big_g))
#> Unit: milliseconds
#>               expr      min        lq      mean    median        uq
#>    sc_apply(big_g) 23.11678 26.696373 29.940675 29.191045 33.012796
#>  sc_distmat(big_g)  1.67531  1.727873  2.156307  1.855994  2.244872
#>        max neval cld
#>  47.081647   100   b
#>   7.576123   100  a

As you can see the distances() approach is faster, and increasingly faster as the size of the graph grows.

Created on 2019-01-10 by the reprex package (v0.2.1)

Collectivize answered 10/1, 2019 at 14:43 Comment(9)
I like it, it's elegant. However, drake's clunky approach with adjacent_vertices() actually appears to scale better. Benchmarks here.Hying
Sorry, I forgot to mention that this is what drake is now doing. Please see the 2019-01-10 edit in the OP.Hying
hey @landau, cool benchmarks (and cool package incidentally!). You might be being too modest and your implementation is already excellent! However, looking at the benchmarks you posted I'm a bit confused because the sc_drake() function seems to return completely different output: a single vector of vertices rather than a list of vectors of vertices. E.g. sc_drake(big_g) returns a length-200 vector to me which can't be the right output. Do you get the same? Maybe I'm copying the function to my computer wrong.Collectivize
Thanks, gfgm! The output format is actually intentional. I find the usual list of vectors unhelpful for my use case. What drake really needs is the vector of vertices that sc_drake() currently produces. I realize this may not be an apples-to-apples comparison for the purposes of benchmarking against distances(), but it is exactly the output I am after.Hying
sorry follow up q: if I run set.seed(42); small_g <- named_graph(10, .2); plot(small_g) you can see right away that vertices 1-9 are a separate component. If I then call sc_drake(small_g) the output is just the vector 1:10. How can you know from this that these are separate components? Indeed how can you know from that output which vertices depend on which other vertices?Collectivize
You are right, we can't parse components or individual dependency relationships from the output of sc_drake(). However, drake does not need this information from sc_drake(). Given an initial vector of vertices v, all drake needs to know is which vertices are upstream (or downstream) of anything in v.Hying
Let us continue this discussion in chat.Collectivize
Sorry, my company blocks chats. How about github.com/ropensci/drake/issues/665?Hying
For onlookers: the discussion continues at github.com/ropensci/drake/issues/665#issuecomment-453529369Hying

© 2022 - 2024 — McMap. All rights reserved.