Apache Spark GraphX connected components

Asked 25/5, 2015 at 21:24 Answered 27/5, 2015 at 3:18

How to use subgraph function to get a graph that would include only vertexes and edges from the specific connected component? Let's say I know the connected component id, the final goal is to create a new graph based on the connected component. I'd like to keep the vertex attributes from the original graph.

Gloze answered 25/5, 2015 at 21:24 Comment(0)

You have to join the graph with the component IDs to the original graph, filter (take the subgraph) by the component ID, and then discard the component ID.

import scala.reflect._
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib.ConnectedComponents

def getComponent[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED], component: VertexId): Graph[VD, ED] = {
  val cc: Graph[VertexId, ED] = ConnectedComponents.run(g)
  // Join component ID to the original graph.
  val joined = g.outerJoinVertices(cc.vertices) {
    (vid, vd, cc) => (vd, cc)
  }
  // Filter by component ID.
  val filtered = joined.subgraph(vpred = {
    (vid, vdcc) => vdcc._2 == Some(component)
  })
  // Discard component IDs.
  filtered.mapVertices {
    (vid, vdcc) => vdcc._1
  }
}

Workable answered 26/5, 2015 at 15:11 Comment(4)

Thanks for this function! But, can we not assume that if you have the CC id thenyou have probably already constructed the CC, such that this can be pulled in as one of getComponent's arguments saving the potentially expensive task of recomputing. I am curious as to why this function works, but when I did the steps manually putting a Long or whatever in as the component it fails. Must it be cast Some or VertexId? – Dg 22/9, 2015 at 19:31

You're right — now that I look at it it's quite strange how we would have a component ID before even running ConnectedComponents. I guess I just wanted to include ConnectedComponents somewhere. I'll rework the code to be more sensible. For your other question: VertexId is an alias for Long. A plain old number should work there. What do you mean by "fails" here? Do you get an exception? – Workable 22/9, 2015 at 19:36

Do you know how to count the number of connected components in a graph? – Logical 8/11, 2017 at 1:44

It should be cc.vertices.values.distinct.count. – Workable 10/11, 2017 at 12:16

I take your question to be, given a VertexId in a source graph, create a new graph with the nodes and edges connected to this VertexId from the source graph.

Given that, here's what I would do:

val targetVertexId = ...
val graph = Graph(..., ...)
val newGraph = Graph(
  graph.vertices.filter{case (vid,attr) => vid == targetVertexId} ++
  graph.collectNeighbors(EdgeDirection.Either)
    .filter{ case (vid,arr) => vid == targetVertexId}
    .flatMap{ case (vid,arr) => arr},
  graph.edges
).subgraph(vpred = { case (vid,attr) => attr != null})

Couple of things to note:

You can change EdgeDirection.Either to EdgeDirection.In or EdgeDirection.Out as needed.

The .subgraph at the end removes all Vertices where the attribute is set to null. If the original val graph has Vertices with attributes set to null this isn't going to work. Otherwise this works, without having to know the Vertex attribute type in advance.

Blomquist answered 27/5, 2015 at 3:18 Comment(0)

Recommended topics

Hot tags