Best way to connect n nodes to a single node?

H

2

9

I am modelling a graph for an application that I am building currently, where I have n Users connected to n Users, I also have n Posts which can be liked by n Users. So the structure would look something like this, for a given user,

if a user likes hundred Post nodes, It would generate 100 edges (realtionships) to the node, when the post is n, the edges will also be n. so one user will be connected to n Users and n posts and n future node types.

So of using an intermediate node thus reducing the edges to the given node, which would look something like this,

Where the users has an intermediate node named Collection, which will be connected to likes, since this is a property graph, I can add a property to the intermediate node and make it behave like the connections are from the user (something like, Likes.username = User.username)

my thought is

This way of intermediate connecting nodes can isolate junk from the primary node, thus can speed up custom algorithms.

My questions,

What is the best solution to this that scales?
Why should I consider this solution over the other?

Hypochromia answered 18/4, 2017 at 19:14 Comment(0)

F

4

According to the very good book Learning Neo4j (by Rik Van Bruggen, available for download in the Neo4j's web site) your problem is called "Dense node" or "Supernode": nodes that have too many connections.

Still according to the book, supernodes

"becomes a real problem for graph traversals because the graph database management system will have to evaluate all of the connected relationships to that node in order to determine what the next step will be in the graph traversal."

The solution proposed by Rik is very close to your solution (Add an intermediate node): it consists of add a "meta" node between a User and your liked Posts. This meta node should got at most a hundred of connections. If the current meta node reaches 100 connections a new meta node must be created and added to the hierarchy, according to the example of figure, showing a example with popular artists and your fans:

The Neo4j team has been doing a great effort to improve the performance about supernodes, as can be seen in this Github commit (for example) that changes how relationships for a node are stored on the disk, in a linked list structure.

I believe that the best is keep your graph model as simple as possible at first. You don't have a dense node problem (yet) and a premature optimization probably will add some not necessary complexity to your model. If in the future dense nodes becomes a problem you can change your model more accurately. Simplicity is a good choice at first.

You can read a bit more about super nodes in graph database in these links:

Frustrated answered 26/4, 2017 at 23:7 Comment(0)

H

6

This solution will have advantages and disadvantages.

The main disadvantage is that traversal operations will be more expensive, ie. you will have to traverse one more node before finding the posts.

The advantages are following:

less contention on User node when you add new "likes", ie. User @version won't be incremented, so you will be able to do updates on User in concurrency, without having version conflicts
the ability to add information to the "likes" node. You can do it also with edges alone, but you would have to duplicate the information on all the edges.
a smaller User, in particular below the embedded/tree ridbag threshold http://orientdb.com/docs/2.2.x/RidBag.html Just consider that with binary protocol, when you have a tree RidBag, it won't be serialized immediately to the client, but you will have an iterator for that, so the overhead in fetching a single User won't be very significant. On the other hand, with HTTP protocol you will receive all the edge RIDs with the vertex, so in this case you will save a lot of bandwidth and computation time using the second approach.

About your questions, the best solution is the solution that better fits your workload: if you do a lot of updates on the User, the second solution will give you immediate advantages; if you fetch users alone very often, the second solution will give advantages as well; on the other hand, if your main concern is fast traversal, the second solution won't be a very good fit.

Henka answered 19/4, 2017 at 6:47 Comment(0)

F

4