How does Titan achieve constant time lookup using HBase / Cassandra?
In chapter 6 of the O'Reilly book "Graph Databases", which is about how Neo4j stores a graph database, it says:

To understand why native graph processing is so much more efficient than graphs based on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships. To traverse a network of m steps, the cost of the indexed approach, at O(m log n), dwarfs the cost of O(m) for an implementation that uses index-free adjacency.
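To make the asymptotics concrete, here is a minimal Python sketch (my own illustration, not from the book) of the two lookup strategies: a global sorted edge index searched with a binary search per hop, versus index-free adjacency where each node directly holds its neighbor list:

```python
import bisect

# "Index-backed" graph: a global, sorted edge index of (source, target)
# pairs. Finding the neighbors of a vertex requires an O(log n) binary
# search into this structure on every hop.
edges = sorted([(0, 1), (0, 2), (1, 2), (2, 3)])

def neighbors_indexed(v):
    """O(log n) per hop: binary-search the global edge index."""
    i = bisect.bisect_left(edges, (v, -1))
    out = []
    while i < len(edges) and edges[i][0] == v:
        out.append(edges[i][1])
        i += 1
    return out

# Index-free adjacency: each node stores direct references to its
# neighbors, so reaching them is O(1) pointer-chasing, no index at all.
adjacency = {0: [1, 2], 1: [2], 2: [3], 3: []}

def neighbors_native(v):
    """O(1) per hop: just follow the stored pointers."""
    return adjacency[v]
```

Over a traversal of m steps, the first approach pays O(m log n) while the second pays O(m), which is the comparison the quote is making.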

It is then explained that Neo4j achieves this constant-time lookup by storing all nodes and relationships as fixed-size records:

With fixed-size records and pointer-like record IDs, traversals are implemented simply by chasing pointers around a data structure, which can be performed at very high speed. To traverse a particular relationship from one node to another, the database performs several cheap ID computations (these computations are much cheaper than searching global indexes, as we'd have to do if faking a graph in a non-graph native database).
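The "cheap ID computations" boil down to offset arithmetic: with fixed-size records, a record ID maps to a byte offset with a single multiplication. A toy sketch (the record sizes here are illustrative, not Neo4j's actual store format):

```python
# Illustrative sizes only -- NOT Neo4j's real on-disk layout.
NODE_RECORD_SIZE = 15          # assumed bytes per node record
RELATIONSHIP_RECORD_SIZE = 34  # assumed bytes per relationship record

def node_offset(node_id):
    """O(1): record N lives at byte N * record_size in the node store."""
    return node_id * NODE_RECORD_SIZE

def relationship_offset(rel_id):
    """Same trick for the relationship store file."""
    return rel_id * RELATIONSHIP_RECORD_SIZE
```

No search is involved: given an ID, the engine seeks straight to the computed offset, which is what makes the per-hop cost constant.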

This last sentence triggers my question: how does Titan, which uses Cassandra or HBase as a storage backend, achieve these performance gains, or how does it make up for their absence?

Ungovernable answered 24/9, 2014 at 5:19 Comment(3)
And I would vote for the same question for OrientDB!Eddy
Good question, yeah. OrientDB handles its own storage, so I am guessing they have something similar going on as Neo4j, but I'd love to know.Ungovernable
Neo4j is O(1) in algorithmic complexity regardless of whether you're accessing a cached object or one on disk, because it simply chases pointers rather than calling out to some external index for traversing relationships. Neubauer and Rodriguez (see above) call this "index-free adjacency" and I suspect it is central to all sensible graph databases.Fraser

Neo4j only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Neo4j is slow because of pointer chasing on disk (they have a poor disk representation).

Titan only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Titan is faster than Neo4j because it has a better disk representation.

Please see the following blog post that explains the above quantitatively: http://thinkaurelius.com/2013/11/24/boutique-graph-data-with-titan/

Thus, it's important to understand, when people say O(1), what part of the memory hierarchy they are in. When you are in a single JVM (single machine), it's easy to be fast, as both Neo4j and Titan demonstrate with their respective caching engines. When you can't put the entire graph in memory, you have to rely on intelligent disk layouts, distributed caches, and the like.

Please see the following two blog posts for more information:

http://thinkaurelius.com/2013/11/01/a-letter-regarding-native-graph-databases/
http://thinkaurelius.com/2013/07/22/scalable-graph-computing-der-gekrummte-graph/

Zulmazulu answered 24/9, 2014 at 21:20 Comment(6)
To extend, Neo4j does not support O(1) on all traversals. Assume an edge predicate. To get the most recent friend of vertex v costs O(|out(v)|) in Neo4j (i.e. a vertex-centric linear scan). Why? Because Neo4j does not sort its data in memory (nor on disk). Neo4j must iterate over all the outgoing friend-edges of v to determine which one has the most recent timestamp. In Titan, this (can be) an O(1) operation. You can learn more here: thinkaurelius.com/2012/10/25/… This advancement is realized both on disk and in-memory with Titan.Zulmazulu
I believe you are wrong there. In the last versions of Neo4j (2.1.x) Neo4j also implements a so-called "vertex-centric index" to solve that problem.Disinfection
Neo4j is always algorithmically O(1) since it is an id * offset calculation that yields a record. Mechanically, as with any tiered storage approach, there are shorter and longer paths. Hit a cached object and you short-circuit the path from database to disk and back; hit cold caches and you will go to disk. This is still O(1). There is no index or other O(log n) or worse search to perform. That is a massive benefit of being graph-native all the way down the stack.Fraser
@Marko Thank you very much for your answer. I have carefully read all your blog posts. I have learned a lot, but, to be honest, I am still puzzled. As I understand it, Titan heavily reduces the number of cache misses through colocation, but still requires logarithmic-time lookups for vertices at the storage level. You say that Neo4j is slower than Titan because of the non-sequential disk representation, but this seems to me a fair tradeoff for a graph database, where it's all about traversal. Apparently you don't agree: why? What makes you so sure you don't need this 'native' storage model?Ungovernable
@RikVanBruggen I understand this only applies to labels and not to properties right?Ungovernable
Though I am still a bit puzzled, I'll accept the answer as it really furthered my understanding. Once again, thank you.Ungovernable
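The vertex-centric index discussed in the comments above can be sketched as follows (a hypothetical in-memory layout of my own, not Titan's actual storage format): keeping each vertex's edges sorted by timestamp turns "most recent friend" into a tail lookup instead of a scan over all outgoing edges:

```python
import bisect

class Vertex:
    """Hypothetical vertex with a vertex-centric index on 'friend' edges."""

    def __init__(self):
        self.friends = []  # (timestamp, neighbor) pairs, kept sorted

    def add_friend(self, timestamp, neighbor):
        # Insert in sorted position -- this IS the vertex-centric index.
        bisect.insort(self.friends, (timestamp, neighbor))

    def most_recent_friend(self):
        # O(1): sorted order puts the newest edge at the tail,
        # versus O(|out(v)|) for a scan over unsorted edges.
        return self.friends[-1][1]

    def friends_since(self, t):
        """O(log n + k) range scan instead of a full edge scan."""
        i = bisect.bisect_left(self.friends, (t, ""))
        return [name for _, name in self.friends[i:]]

v = Vertex()
v.add_friend(2012, "alice")
v.add_friend(2014, "carol")
v.add_friend(2013, "bob")
```

The same sorted-by-sort-key layout can be applied on disk, which is the "realized both on disk and in-memory" point made in the first comment.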

OrientDB uses a similar approach where relationships are managed without indexes (index-free adjacency), but rather with direct pointers (LINKs) between vertices. They are like in-memory pointers, but on disk. In this way OrientDB achieves O(1) traversal both in memory and on disk.

But if you have a vertex "City" with thousands of edges to "Person" vertices, and you're looking for all the people with age > 18, then OrientDB uses indexes because a query is involved, so in this case it's O(log N).
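The two access paths can be contrasted in a short sketch (a hypothetical Python model, not OrientDB's actual Java/SQL API): traversing a LINK is a direct reference dereference, while a property filter like age > 18 goes through a sorted index:

```python
import bisect

class Person:
    def __init__(self, name, age):
        self.name, self.age = name, age

class City:
    def __init__(self):
        self.residents = []  # LINK-style direct references: O(1) per hop

city = City()
people = [Person("ann", 17), Person("bob", 25), Person("eve", 40)]
city.residents.extend(people)  # traversal just follows these links

# Property query: a sorted (age, name) index, searched in O(log N).
age_index = sorted((p.age, p.name) for p in people)
i = bisect.bisect_right(age_index, (18, chr(0x10FFFF)))  # skip age <= 18
adults = [name for _, name in age_index[i:]]
```

Following `city.residents` never consults the index; only the value-based filter pays the logarithmic search cost.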

Liva answered 25/9, 2014 at 23:50 Comment(0)
