Fastest way to perform bulk add/insert in Neo4j with Python?
K

5

18

I am finding Neo4j slow to add nodes and relationships/arcs/edges when using the REST API via py2neo for Python. I understand that this is due to each REST API call executing as a single self-contained transaction.

Specifically, adding a few hundred pairs of nodes with relationships between them takes a number of seconds, running on localhost.

What is the best approach to significantly improve performance whilst staying with Python?

Would using bulbflow and Gremlin be a way of constructing a bulk insert transaction?

Thanks!

Klingensmith answered 28/9, 2012 at 16:15 Comment(2)
I don't know how this works in Python, but in Java you would normally use batch processing. There should be something similar for Python too. - Lapidate
I tried py2neo and found it to be too slow for batch inserts (or anything really). Using the raw REST endpoint was much faster. - Zuniga
P
9

There are several ways to do a bulk create with py2neo, each making only a single call to the server.

  1. Use the create method to build a number of nodes and relationships in a single batch.
  2. Use a cypher CREATE statement.
  3. Use the new WriteBatch class (just released this week) to manually make a batch of nodes and relationships (this is really just a manual version of 1).
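
As a rough illustration of option 2, here is a minimal sketch of building one parameterised Cypher statement that creates every node pair and its relationship in a single call. It assumes a Neo4j version that supports UNWIND and the $param syntax (both are newer than this answer). The Item label, the LINKS_TO relationship type, and the build_pair_statement helper are hypothetical names chosen for the example, not part of py2neo.

```python
def build_pair_statement(pairs):
    """Return (query, parameters) creating all pairs in one Cypher statement.

    `pairs` is an iterable of (source_name, target_name) tuples.
    The label `Item` and type `LINKS_TO` are placeholders for illustration.
    """
    query = (
        "UNWIND $rows AS row "
        "CREATE (a:Item {name: row.source})"
        "-[:LINKS_TO]->"
        "(b:Item {name: row.target})"
    )
    # One map per pair; the server iterates over them inside one transaction.
    rows = [{"source": source, "target": target} for source, target in pairs]
    return query, {"rows": rows}


query, params = build_pair_statement([("a1", "b1"), ("a2", "b2")])
```

The query and params can then be handed to whichever client you use to run Cypher. The point is that a few hundred pairs travel as one statement rather than a few hundred separate calls.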

If you have some code, I'm happy to look at it and make suggestions on performance tweaks. There are also quite a few tests you may be able to get inspiration from.

Cheers, Nige

Peggy answered 29/9, 2012 at 0:48 Comment(5)
Good answer with options to try. Thank you for the offer of your time too - I will get in touch if I come unstuck. - Klingensmith
I still find it takes hours to create 600k simple relationships between a category node and a data node with get_or_create_relationships(). Any ideas? - Narrows
Are these still the fastest ways to write to Neo4j? What about creating elements within a transaction, and committing when everything is done? - Felicle
@NigelSmall, Is this preferred over the two step process where you create a GEOFF file and then batch import using Load2Neo? - Grubbs
@NigelSmall, any pointers for this py2neo SO post? - Grubbs
W
6

Neo4j's write performance is slow unless you are doing a batch insert.

The Neo4j batch importer (https://github.com/jexp/batch-import) is the fastest way to load data into Neo4j. It's a Java utility, but you don't need to know any Java because you're just running the executable. It handles typed data and indexes, and it imports from a CSV file.

To use it with Bulbs (http://bulbflow.com/) Models, call the model's get_bundle() method to get the data, index name, and index keys, all prepared for insert, and then write the data to a CSV file. Or, if you don't want to model your data, just write it from Python straight to the CSV file.

Will that work for you?

Wilds answered 1/10, 2012 at 15:17 Comment(1)
Is the Neo4j batch importer still the best way to go? - Crupper
M
2

There are so many old answers to this question online that it took me forever to realize there's an import tool that comes with Neo4j. It's very fast and the best tool I was able to find.

Here's a simple example if we want to import student nodes:

bin/neo4j-import --into [path-to-your-neo4j-directory]/data/graph.db --nodes students

The students file contains data that looks like this, for example:

studentID:Id(Student),name,year:int,:LABEL
1111,Amy,2000,Student
2222,Jane,2012,Student
3333,John,2013,Student

Explanation:

  • The header explains how the data below it should be interpreted.
  • studentID is a property with type Id(Student).
  • name is of type string, which is the default.
  • year is an integer.
  • :LABEL is the label you want for these nodes; in this case it is "Student".
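
Since the question is about Python, here is a minimal sketch of producing a students file in that header format with the standard csv module. The write_students helper and the in-memory buffer are only for illustration; in practice you would pass an open file instead.

```python
import csv
import io

# Header copied from the example above; the type annotations live in the names.
HEADER = ["studentID:Id(Student)", "name", "year:int", ":LABEL"]


def write_students(rows, out):
    """Write (student_id, name, year) tuples to `out` in the import-tool layout."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for student_id, name, year in rows:
        # Every row gets the same label in this example.
        writer.writerow([student_id, name, year, "Student"])


buf = io.StringIO()
write_students([(1111, "Amy", 2000), (2222, "Jane", 2012), (3333, "John", 2013)], buf)
print(buf.getvalue())
```

Using csv.writer rather than joining strings by hand means names containing commas or quotes get escaped properly.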

Here's the documentation for it: http://neo4j.com/docs/stable/import-tool-usage.html

Note: I realize the question specifically mentions python, but another useful answer mentions a non-python solution.

Mikkimiko answered 24/6, 2015 at 11:20 Comment(0)
H
2

Well, I needed massive performance from Neo4j myself. I ended up doing the following things to improve graph performance.

  1. Ditched py2neo, since there were a lot of issues with it. Besides, it is very convenient to use the REST endpoint provided by Neo4j; just make sure to use requests sessions.
  2. Use raw Cypher queries for bulk inserts instead of any OGM (Object-Graph Mapper). That is crucial if you need a high-performance system.
  3. Performance was still not enough for my needs, so I ended up writing a custom system that merges 6-10 queries together using WITH * and UNION clauses. That improved performance by a factor of 3 to 5.
  4. Use a larger transaction size, with at least 1000 queries.
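
A minimal sketch of points 1, 2 and 4 together might look like the following: raw Cypher statements are grouped into batches of 1000 and each batch is sent as one transaction over a shared session. This is not the custom system described above. The chunked and commit_in_batches helpers are hypothetical, the session argument is assumed to behave like a requests.Session, and the payload shape assumes Neo4j's transactional Cypher HTTP endpoint.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def commit_in_batches(session, url, statements, batch_size=1000):
    """POST statements in groups; each POST is committed as one transaction.

    `session` is assumed to be a requests.Session (or anything with a
    compatible .post method). Each statement is a dict such as
    {"statement": "CREATE (n:Item {name: $name})", "parameters": {"name": "x"}}.
    Returns the number of statements sent.
    """
    sent = 0
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    for batch in chunked(statements, batch_size):
        payload = json.dumps({"statements": batch})
        response = session.post(url, data=payload, headers=headers)
        response.raise_for_status()
        # The endpoint reports Cypher failures in the body, not the status code.
        errors = response.json().get("errors", [])
        if errors:
            raise RuntimeError(errors)
        sent += len(batch)
    return sent
```

With the requests library installed, you would call it with session = requests.Session() and, on a default local install of Neo4j 2.x or 3.x, url = "http://localhost:7474/db/data/transaction/commit" (later versions use a different path, so check the docs for yours). Reusing one session keeps the HTTP connection alive between batches instead of reconnecting for every call.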
Hominoid answered 24/6, 2015 at 12:13 Comment(1)
I'd be interested to hear what issues you had with py2neo. - Peggy
C
0

To insert nodes in bulk into Neo4j at very high speed, use the:

Batch Inserter

http://neo4j.com/docs/stable/batchinsert-examples.html

In my case I'm working in Java.

Catherine answered 24/12, 2015 at 20:17 Comment(0)