How to optimize graph traversals in ArangoDB?
Asked Answered
Q

1

15

I primarily intended to ask this question : "Is ArangoDB a true graph database ?"

But, this question would sound quite offending.

You, peoples at triAGENS, did a really great job in creating a "multi-paradigm" database. As a user of PostgreSQL, PostGIS, MongoDB and Neo4J/Titan, I really appreciate to see an "all-in-one" solution :)

But the question remains, basically creating a graph in ArangoDB requires to create two separate collections : one for edges and one for vertices, thus, as far as I understand, it already means that vertices and related edges are not "physically" neighbors.

Moreover, even after creating appropriate index, I'm facing some serious performance issues when doing this kind of stuff in Gremlin

g.v('an_id').out('likes').in('likes').count()

Which returns a result after ~ 3 seconds (perceived time)

I assumed I poorly understood how Gremlin and Blueprint/ArangoDB worked so I tried to rewrite the same query using AQL :

LET lst = (FOR e1 in NEIGHBORS(vertices, edges, "an_id", "outbound", [ { "$label": "likes" } ] )
    FOR e2 in NEIGHBORS(vertices, edges, e1.edge._to, "inbound", [ { "$label": "likes" } ] )
        RETURN 1
    )
RETURN length(lst)

Which gives me a delay of same order of magnitude.

If I tried to run the same query on a Titan or Neo4j database (with the very same data), queries returns almost immediately (perceived time : <200ms)

So it seems to me that ArangoDB graph features are a "smart graph layer" above a "traditionnal document database" but that ArangoDB is not a "native" graph database.

To confirm this feeling, I transform data to load it in PostgreSQL and run a query (with a multiple table JOIN as you can assume) and got similar (to ArangoDB) execution delays

Did I do something wrong (in AQL query) ?

Is there a way to optimize the database to get better traversal times ?

In PostgreSQL, conceptually, I would mix edge and node and use a CLUSTER clause to physically order data, does something similar can be done in ArangoDB ? (I assume that it would be hard, as it would involve to "interlace" edges and nodes, just an intuition)

Quandary answered 9/1, 2014 at 12:32 Comment(0)
S
7

i am a Core Developer of ArangoDB. Could you give me a bit more information ob the dimensions of data you are using?

  • Amount of vertices
  • Amount of edges

Then we can create our own setup with equal dimensions and optimize it.

Snappish answered 10/1, 2014 at 10:54 Comment(14)
thanks for your comment, here are the infos : arangosh [_system]> db.edges.count() 185426 arangosh [_system]> db.vertices.count() 78797Lagrange
hi i tried a similar query to yours using imdb data set db.vertices.count() = 63027, db.edges.count() = 225060. So dimensions are quite similar. (The count returns up to 3000 depending on the starting node.). In my time measurements i get request times below 0.3s (If i do not load collections beforehand it is about 3s, but in production collections are always loaded, only default indices are set). Could you try our dataset on your machine and tell us if you get same results?Snappish
link to dataset: dropbox.com/s/fec6bii624c2lfy/imdbdata.tar.gz In your query replace "likes" with "ACTS_IN" and Starting node "858" for Bruce Willis. To Import the data you have to create a document collection "imdb_vertices" and edge collection "imdb_edges" and you can then use arangoimp to load the data into arangodb.Snappish
thx a lot for your time and tests. Here are my results : I got similar delays on your dataset and I launch query on my own dataset getting also similar results (~ 0.3s). I think I was running the query while other (not optimized) query was running - my bad :(. Then, to go further, I tried to add another level of iteration (3rd FOR) and got ~ 2s response time which is not bad but given the fact that it returns 23665, I think that it could be optimized ("only" 23665 paths are traversed), is there some way to optimize it such as specific graph index or "clustering", cache tuning ?Lagrange
I read index related documentation (which is quite good) but didn't see anything graph specific. I also looked for graph optimization on SO and Google, but didn't find some "tips" (thus my initial question). Maybe I missed sthLagrange
Except of the default indexing for graphs we do not yet offer other graph-specific indices but have plans to add them in the future. E.g. a vertex-centric index is on our roadmap that allows to store an index for pathes of length n for each vertex, where the maximal size of n is configurable. This will give a large performance boost for traversals. If you require something or have other ideas for indices please let us know so than we can add them to the database.Snappish
Nope, vertex-centric index should indeed give some nice performance boosts :). As you support Master-Slave replication since 1.4, do you plan to support distributed path traversal ? Thx for your feedback, I'm looking forward to get those vertex-centric index :)Lagrange
Indeed distributed graphs (and traversals) are on our roadmap for this year. We have to finish "general" sharding first though.Snappish
i am highly interested to know where this leads. Thank you both for the thorough testing data. I want to adopt ArangoDB but need the performance characteristics described in the neo4j test cases. I will keep an eye on the roadmapPushed
@mchacki: ArangoDB 2.4 – christmas release: vertex-centric indexes. :D What a nice Christmas present this would be! Will it be released on schedule? And will old data be able to benefit from this update or does it need to be recreated?Reticle
Are the vertex-centric indexes in 2.4? I don't see them in the documentationMackinaw
It doesn't look like they made the 2.4 release. arangodb.com/roadmap but they are still on the 2.x roadmap. I really want this feature too.Biron
Also today it isn't part of version 2.6 and I couldn't find posts about 2.7 will have this ;(Vaccinia
Looks like they're there now... arangodb.com/docs/stable/indexing-vertex-centric.htmlCordwood

© 2022 - 2024 — McMap. All rights reserved.