I have constructed a graph in Spark's GraphX. This graph is going to have potentially 1 billion nodes and upwards of 10 billion edges, so I don't want to have to build this graph over and over again.
I want to have the ability to build it once, save it (I think the best is in HDFS), run some processes on it, and then access it in a couple of days or weeks, add some new nodes and edges, and run some more processes on it.
How can I do that in Apache Spark's GraphX?
EDIT: I think I have found a potential solution, but I would like someone to confirm if this is the best way.
If I have a graph, say graph, I can store its vertex RDD and its edge RDD separately, and then read those files back later. The saving would look like this:
graph.vertices.saveAsTextFile(verticesPath)
graph.edges.saveAsTextFile(edgesPath)
One question I have now is: should I use saveAsTextFile() or saveAsObjectFile()? And how should I access those files at a later time?
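For reference, here is a sketch of what the full save/reload round trip might look like with saveAsObjectFile(), which writes the serialized objects and so avoids re-parsing strings on load (saveAsTextFile() only writes each element's toString). The HDFS paths and the vertex/edge attribute types (String and Int) are placeholders for illustration, and the snippet assumes an existing SparkContext sc and a graph of type Graph[String, Int]:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Save: write the two RDDs to separate HDFS directories.
graph.vertices.saveAsObjectFile("hdfs:///graphs/myGraph/vertices")
graph.edges.saveAsObjectFile("hdfs:///graphs/myGraph/edges")

// Load (days or weeks later, possibly in a new application).
// The type parameters must match the attribute types used when saving.
val vertices: RDD[(VertexId, String)] =
  sc.objectFile[(VertexId, String)]("hdfs:///graphs/myGraph/vertices")
val edges: RDD[Edge[Int]] =
  sc.objectFile[Edge[Int]]("hdfs:///graphs/myGraph/edges")

// Rebuild the graph; new vertices and edges could be unioned into
// these RDDs before this call.
val restored: Graph[String, Int] = Graph(vertices, edges)
```

I am not sure whether this is the idiomatic approach, or whether the Java serialization used by saveAsObjectFile() is a problem at this scale.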