I have constructed a graph in Spark's GraphX. This graph is going to have potentially 1 billion nodes and upwards of 10 billion edges, so I don't want to have to build this graph over and over again.
I want to have the ability to build it once, save it (I think the best is in HDFS), run some processes on it, and then access it in a couple of days or weeks, add some new nodes and edges, and run some more processes on it.
How can I do that in Apache Spark's GraphX?
EDIT: I think I have found a potential solution, but I would like someone to confirm if this is the best way.
If I have a graph, say graph, I can store its vertex RDD and its edge RDD separately, and then read those files back later. The saving would look like this:
graph.vertices.saveAsTextFile(verticesPath)
graph.edges.saveAsTextFile(edgesPath)
One question I have now is: should I use saveAsTextFile() or saveAsObjectFile()? And how should I access those files at a later time?
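For reference, here is a sketch of what the full save/reload round trip might look like with saveAsObjectFile(), which writes the serialized objects and so avoids re-parsing strings on load (saveAsTextFile() only writes each element's toString). The HDFS paths and the vertex/edge attribute types (String and Int) are placeholders for illustration, and the snippet assumes an existing SparkContext sc and a graph of type Graph[String, Int]:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Save: write the two RDDs to separate HDFS directories.
graph.vertices.saveAsObjectFile("hdfs:///graphs/myGraph/vertices")
graph.edges.saveAsObjectFile("hdfs:///graphs/myGraph/edges")

// Load (days or weeks later, possibly in a new application).
// The type parameters must match the attribute types used when saving.
val vertices: RDD[(VertexId, String)] =
  sc.objectFile[(VertexId, String)]("hdfs:///graphs/myGraph/vertices")
val edges: RDD[Edge[Int]] =
  sc.objectFile[Edge[Int]]("hdfs:///graphs/myGraph/edges")

// Rebuild the graph; new vertices and edges could be unioned into
// these RDDs before this call.
val restored: Graph[String, Int] = Graph(vertices, edges)
```

I am not sure whether this is the idiomatic approach, or whether the Java serialization used by saveAsObjectFile() is a problem at this scale.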