Can graph databases distribute data efficiently across nodes?

If someone builds a database on top of another database, as Twitter has done, does the new database inherit the limitations and inefficiencies of the underlying one?

I'm specifically interested in Titan (http://thinkaurelius.com) because of its claim to split the dataset efficiently across nodes.

Titan claims to support distributing data across nodes because of the efficiency of Cassandra. Neo4j, however, says that it duplicates the whole dataset on every node instead of distributing it, because any graph traversal that leaves one node, and therefore has to cross an Ethernet network, is far too slow to be practical.

Since Cassandra has no knowledge of the graph, it cannot optimize to keep graph traversals on one node. Therefore, most graph traversals will cross node boundaries.

Is Titan's claim to scale efficiently across nodes true?

Campion answered 23/7, 2013 at 13:23 Comment(0)

Titan determines the key sort order of the underlying storage backend (BOP, the ByteOrderedPartitioner, for Cassandra; the default for HBase) and then assigns ids so that vertices in the same partition block get ids that the backend stores on the same physical machine. In other words, Titan "understands" how the underlying storage backend distributes the data and uses graph partitioning techniques that exploit this awareness. Titan uses semi-automatic partitioning, which incorporates domain knowledge.
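To make the id idea concrete, here is a minimal Python sketch. It is not Titan's actual id scheme: the bit widths, the function names (`make_vertex_id`, `row_key`, `owner`) and the equal-range placement are all assumptions chosen for illustration. It only shows why an order-preserving key layout lets ids that share a partition prefix end up on the same machine.

```python
# Hypothetical id layout (an assumption, not Titan's real format):
# the high bits hold the partition block, the low bits a per-block counter.
PARTITION_BITS = 8
COUNTER_BITS = 56


def make_vertex_id(partition: int, counter: int) -> int:
    """Pack a partition block and a counter into one 64-bit id."""
    if not 0 <= partition < (1 << PARTITION_BITS):
        raise ValueError("partition out of range")
    if not 0 <= counter < (1 << COUNTER_BITS):
        raise ValueError("counter out of range")
    return (partition << COUNTER_BITS) | counter


def partition_of(vertex_id: int) -> int:
    """Recover the partition block from an id."""
    return vertex_id >> COUNTER_BITS


def row_key(vertex_id: int) -> bytes:
    """Big-endian bytes sort in the same order as the unsigned integers."""
    return vertex_id.to_bytes(8, "big")


def owner(key: bytes, num_machines: int) -> int:
    """Toy order-preserving placement: each machine owns one contiguous,
    equally sized slice of the 64-bit key space."""
    return int.from_bytes(key, "big") * num_machines // (1 << 64)


a = make_vertex_id(3, 1)
b = make_vertex_id(3, 10**9)
c = make_vertex_id(200, 5)
print(owner(row_key(a), 4), owner(row_key(b), 4), owner(row_key(c), 4))
```

In this toy setup, ids with the same partition prefix form one contiguous key range, so they stay together as long as the machine boundaries fall on partition-block boundaries (here, 4 machines and 256 blocks). A hash-based partitioner would scatter the same keys regardless of prefix.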

In the Pearson benchmark (http://arli.us/edu-planet-scale) the graph was partitioned by university, which is a near-optimal partitioning criterion for this particular dataset. Without partitioning, scaling to 120 billion edges would be near impossible.
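A rough way to see why a domain criterion helps is to count how many edges cross machine boundaries under different placements. The Python sketch below uses a tiny made-up graph (4 universities, 50 students each, mostly intra-university edges); it is not the Pearson dataset, and the round-robin placement is only a stand-in for any placement that ignores graph structure.

```python
# Toy graph, purely illustrative (all sizes are assumptions).
STUDENTS_PER_UNIVERSITY = 50
UNIVERSITIES = 4
MACHINES = 4


def build_edges():
    """Each student links to the next three students of the same university;
    each university also has one edge to the next university."""
    edges = []
    for u in range(UNIVERSITIES):
        base = u * STUDENTS_PER_UNIVERSITY
        for i in range(STUDENTS_PER_UNIVERSITY):
            for k in (1, 2, 3):
                edges.append((base + i, base + (i + k) % STUDENTS_PER_UNIVERSITY))
        next_base = ((u + 1) % UNIVERSITIES) * STUDENTS_PER_UNIVERSITY
        edges.append((base, next_base))
    return edges


def by_university(vertex: int) -> int:
    """Domain-aware placement: one university per machine."""
    return vertex // STUDENTS_PER_UNIVERSITY


def round_robin(vertex: int) -> int:
    """Structure-blind placement, standing in for hash-style distribution."""
    return vertex % MACHINES


def cut_edges(edges, place) -> int:
    """Count edges whose endpoints live on different machines."""
    return sum(1 for a, b in edges if place(a) != place(b))


edges = build_edges()
print("total edges:", len(edges))
print("cut (by university):", cut_edges(edges, by_university))
print("cut (round robin):", cut_edges(edges, round_robin))
```

Every cut edge is a potential network hop during a traversal, so the placement with fewer cut edges keeps more of a traversal local. How close a real dataset gets to this depends on how well the domain criterion matches its edge structure.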

Titan builds on top of proven technologies (for scale, persistence, hot backup, availability, disaster recovery, etc.) while innovating on the graph layer. This is the same route that both Twitter's FlockDB and Facebook's TAO have taken. While this means that Titan is slower at very deep traversals, it does allow Titan to scale to very large graphs or very many concurrent transactions (read and write).

Megilp answered 27/7, 2013 at 2:17 Comment(0)

Good question. I think this is all about calibration. Twitter (which uses Cassandra) uses its graph database in a very specific way (they only have two levels of "depth"), so queries do not have to traverse long paths and they are not forced to replicate the entire dataset. I think both Titan and Neo4j are right. Neo4j tries to provide a general-purpose graph database; there are multiple solutions depending on how you use it, and since they can't know how people will use it, they apply the most common one: replicating the entire dataset.

In fact, if you do not replicate the entire dataset and you want to traverse a long path in your graph, it will be slow.

So, what will your usage be? I have never used Titan, but a good test would be to compare its performance with Neo4j's depending on the "depth" of the queries.

Lawabiding answered 25/7, 2013 at 19:32 Comment(2)
It seems like Titan supports any type of graph, because Cassandra (supposedly) knows nothing about the graph, just the raw data. Unfortunately, I don't have a few servers to use for testing. I don't think a benchmark on 5 VMs running on the same hard drive, with an internal network, would be a fair way to test this. Campion
Objectivity/DB is a massively scalable object/graph database that distributes nodes across hosts and allows edges to connect distributed nodes. This provides the advantage of making the architecture horizontally scalable. The object page caching mechanism makes up for some of the inefficiencies of having to cross the network. Many of our customers have distributed architectures that require a distributed solution. Moment
