efficiently calculating connected components in pyspark
Asked Answered
P

1

1

I'm trying to find the connected components for friends in a city. My data is a list of edges with an attribute of city.

City | SRC | DEST

Houston Kyle -> Benny

Houston Benny -> Charles

Houston Charles -> Denny

Omaha Carol -> Brian

etc.

I know the connectedComponents function of pyspark's GraphX library will iterate over all the edges of a graph to find the connected components and I'd like to avoid that. How would I do so?

edit: I thought I could do something like

select connected_components(*) from dataframe groupby city

where connected_components generates a list of items.

Potato answered 25/9, 2017 at 1:59 Comment(2)
Avoid asking the same question twice: #46386682Oilstone
Deleted the old one, this one has better phrasing.Potato
S
1

Suppose your data is like this

import org.apache.spark._
import org.graphframes._

val l = List(("Houston","Kyle","Benny"),("Houston","Benny","charles"),
            ("Houston","Charles","Denny"),("Omaha","carol","Brian"),
            ("Omaha","Brian","Daniel"),("Omaha","Sara","Marry"))
var df = spark.createDataFrame(l).toDF("city","src","dst")

Create a list of cities for which you want to run connected components cities = List("Houston","Omaha")

Now run a filter on city column for every city in cities list, then create an edge and vertex dataframes from the resulting dataframe. Create a graphframe from these edge and vertices dataframes and run connected components algorithm

val cities = List("Houston","Omaha")

for(city <- cities){
    val edges = df.filter(df("city") === city).drop("city")
    val vert = edges.select("src").union(edges.select("dst")).
                     distinct.select(col("src").alias("id"))
    val g = GraphFrame(vert,edges)
    val res = g.connectedComponents.run()
    res.select("id", "component").orderBy("component").show()
}

Output

|     id|   component|
+-------+------------+
|   Kyle|249108103168|
|charles|249108103168|
|  Benny|249108103168|
|Charles|721554505728|
|  Denny|721554505728|
+-------+------------+

+------+------------+                                                           
|    id|   component|
+------+------------+
| Marry|858993459200|
|  Sara|858993459200|
| Brian|944892805120|
| carol|944892805120|
|Daniel|944892805120|
+------+------------+
Spouse answered 26/9, 2017 at 16:46 Comment(1)
Thank you for your work! And okay, darn. I thought there might be something a bit closer to the metal than just looping through the values I wanted to block on but I'm still grateful for your answerPotato

© 2022 - 2024 — McMap. All rights reserved.