I am trying to create a large database in Neo4j with around 2 million nodes and around 4 million edges. I was able to speed up node creation by creating the nodes in batches of 1000. However, when I try to create edges between these nodes, the process slows down and eventually times out. Initially I thought it might be slow because I was merging on node name, but it is still slow when I use ids, which I created manually. Below are snippets of the data and code to illustrate the problem.
Node.csv - this file contains details about the node
NodeName  NodeType      NodeId
Sachin    Person        1
UNO       Organisation  2
Obama     Person        3
Cricket   Sports        4
Tennis    Sports        5
USA       Place         6
India     Place         7
Edges.csv - this file just contains the node ids and their relationship
Node1Id  Relationship  Node2Id
1        Plays         4
3        PresidentOf   6
1        CitizenOf     7
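For context, the code further down indexes `Node` and `Edges` as lists of rows. A minimal sketch of how such lists could be loaded, assuming tab-delimited files with a header row; the delimiter and the helper name `load_rows` are assumptions for illustration and not part of my actual script:

```python
import csv
import io


def load_rows(handle, delimiter="\t"):
    """Read a delimited file object and return its data rows, skipping the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header row (assumed to be present)
    return [row for row in reader if row]  # drop blank lines


# Inline sample mirroring the first rows of Node.csv;
# real code would pass an open file handle instead.
sample_nodes = io.StringIO(
    "NodeName\tNodeType\tNodeId\n"
    "Sachin\tPerson\t1\n"
    "UNO\tOrganisation\t2\n"
)
Node = load_rows(sample_nodes)
```

With this shape, `Node[i][0]` is the name, `Node[i][1]` the type and `Node[i][2]` the id, all as strings.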
The code to create the nodes is given below:
from py2neo import Graph

graph = Graph()
tx = graph.cypher.begin()
for i in range(len(Node)):
    # Parameterised Cypher: one CREATE per row of Node.csv
    statement = "CREATE (n {name: {A}, label: {C}, id: {B}})"
    tx.append(statement, {"A": Node[i][0], "C": str(Node[i][1]), "B": str(Node[i][2])})
    if i % 1000 == 0:
        # Commit the current batch and start a new transaction
        print(str(i) + " Node Created")
        tx.commit()
        tx = graph.cypher.begin()
# Commit whatever is left in the last, partial batch
tx.commit()
The above code works wonderfully and finished creating the 2 million nodes in 5 minutes. The code to create the edges is given below:
tx = graph.cypher.begin()
for i in range(len(Edges)):
    # A relationship type cannot be passed as a Cypher parameter,
    # so it is concatenated into the statement text
    statement = "MATCH (a {id: {A}}), (b {id: {B}}) CREATE (a)-[:" + Edges[i][1] + "]->(b)"
    tx.append(statement, {"A": Edges[i][0], "B": Edges[i][2]})
    if i % 1000 == 0:
        # Commit the current batch and start a new transaction
        print(str(i) + " Relationship Created")
        tx.commit()
        tx = graph.cypher.begin()
# Commit whatever is left in the last, partial batch
tx.commit()
The above code works well for the first 1000 relationships, but after that it takes a lot of time and the connection times out.
I need to fix this urgently, and any help that speeds up relationship creation would be much appreciated.
Please note: I am not using Neo4j's CSV import or the Neo4j shell import, because these assume that the relationship between nodes is fixed. In my data the relationship varies, and importing one relationship at a time is not feasible because it would mean running the import almost 2000 times manually.
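To make the "relationships vary" point concrete: the relationship type is just a column in Edges.csv, so the set of types is only known from the data. A minimal sketch of grouping edge rows by type, using the sample rows from above; the helper name `group_by_relationship` is hypothetical and not part of my import script:

```python
from collections import defaultdict


def group_by_relationship(edges):
    """Map each relationship type to its list of (source_id, target_id) pairs."""
    grouped = defaultdict(list)
    for source_id, rel_type, target_id in edges:
        grouped[rel_type].append((source_id, target_id))
    return dict(grouped)


# Sample rows mirroring Edges.csv: [Node1Id, Relationship, Node2Id]
Edges = [
    ["1", "Plays", "4"],
    ["3", "PresidentOf", "6"],
    ["1", "CitizenOf", "7"],
]
grouped = group_by_relationship(Edges)
```

On my full data a grouping like this would have almost 2000 keys, one per relationship type, which is why a per-type manual import is not practical.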
id property? – Garget