Fast way to count all vertices (with property x)

I'm using Titan with Cassandra and have several (related) questions about querying the database with Gremlin:

1.) Is there a faster way to count all vertices than

g.V.count()

Titan suggests using an index. But how can I use an index without a property?

WARN  c.t.t.g.transaction.StandardTitanTx - Query requires iterating over all vertices [<>]. For better performance, use indexes

2.) Is there a faster way to count all vertices with property 'myProperty' than

g.V.has('myProperty').count()

Again, Titan warns:

WARN  c.t.t.g.transaction.StandardTitanTx - Query requires iterating over all vertices [(myProperty<> null)]. For better performance, use indexes

But again, how can I do this? I already have an index on 'myProperty', but it needs a value to query quickly.

3.) And the same questions for edges...

Brain answered 31/1, 2014 at 9:40 Comment(1)
Duplicate of #17215459 – Cilium

Iterating all vertices with g.V.count() is the only way to get the count. It can't be done "faster". If your graph is so large that it takes hours to get an answer or your query just never returns at all, you should consider using Faunus. However, even with Faunus you can expect to wait for your answer (such is the nature of Hadoop...no sub-second response here), but at least you will get one.

Any time you do a table scan (i.e. iterate all vertices) you get that "iterating over all vertices" warning. Generally speaking, you don't want to do that, as you may never get a response. Adding an index won't help you count all vertices any faster.

Edges have the same answer. Use g.E.count() in Gremlin if you can. If it takes too long, then try Faunus so you can at least get an answer.
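To make the scan explicit, here is a minimal Python sketch (a toy in-memory vertex list, not Titan's API) of why both counts are full iterations: an index on 'myProperty' helps look vertices up by a value, not enumerate all of them.

```python
# Toy model: vertices as dicts of properties (illustration only, not Titan's API).
vertices = [
    {"name": "a", "myProperty": 1},
    {"name": "b"},
    {"name": "c", "myProperty": 2},
]

# g.V.count() -- must touch every vertex.
total = sum(1 for _ in vertices)

# g.V.has('myProperty').count() -- still a full scan over all vertices,
# filtering as it goes.
with_prop = sum(1 for v in vertices if "myProperty" in v)

print(total, with_prop)  # 3 2
```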

Pepi answered 31/1, 2014 at 13:45 Comment(5)
Do you mean there is no way to perform an efficient count with Titan? Should we consider updating a counter? Does Titan allow us to perform atomic operations? – Around
In Titan, there is no internal counter for graph elements (i.e. vertices/edges) that could efficiently return the count. I'm not aware of other Blueprints implementations that do either (I'm thinking "no", but could be wrong with respect to the latest developments of OrientDB, Neo4j, etc.). Regarding the "atomic operation" question, you should probably read this section of the Titan docs if you intend to use the Cassandra or HBase backends: s3.thinkaurelius.com/docs/titan/0.5.4/eventual-consistency.html – Pepi
The data is already present in the Cassandra backend. How can I implement Faunus on top of it? – Bernt
The link about "Faunus" is dead. – Romero
These days the answer is probably more related to spark-gremlin: tinkerpop.apache.org/docs/3.5.2/reference/#sparkgraphcomputer – Pepi

Doing a count is expensive in big distributed graph databases. You can have a node that keeps track of many of the database's frequent aggregate numbers and update it from a cron job so you have it handy. Usually, if you have millions of vertices, having the count from the previous hour is not such a disaster.
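A minimal Python sketch of the cached-aggregate idea this answer describes, assuming a scheduled job recomputes the expensive count; all names here are made up for illustration:

```python
import time

class CachedCount:
    """Serve a possibly-stale count; a cron/background job calls refresh()."""

    def __init__(self, compute, max_age_s=3600):
        self._compute = compute        # the expensive full count, e.g. g.V.count()
        self._max_age_s = max_age_s
        self._value = None
        self._stamp = 0.0

    def refresh(self):
        # The hourly cron job would call this.
        self._value = self._compute()
        self._stamp = time.time()

    def get(self):
        # Recompute only when the cached value is missing or stale.
        if self._value is None or time.time() - self._stamp > self._max_age_s:
            self.refresh()
        return self._value

counter = CachedCount(lambda: 1_000_000)   # stand-in for the real scan
print(counter.get())  # computes once, then serves the cached value
```

Reads between refreshes pay nothing; the trade-off is that the number can be up to `max_age_s` seconds out of date.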

Charissecharita answered 20/4, 2014 at 16:12 Comment(1)
But in most operations the client side needs a count. For example, let's say I want to visualize elements satisfying certain conditions. I cannot visualize 1 million elements; I have to show them chunk by chunk. I have to tell the user "I'm just showing you 15 of the results, but there are 1,563 more in the database". To say this, I always have to count the results. – Romero
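The chunked display this comment describes can be sketched in plain Python (a hypothetical helper, not a Gremlin API); note that producing the "X more" message still requires the full, expensive count:

```python
def page_summary(results, offset=0, page_size=15):
    """Return one page of results plus a 'showing X of Y' summary string."""
    total = len(results)                       # the expensive count
    page = results[offset:offset + page_size]  # the cheap part: one chunk
    remaining = max(total - offset - len(page), 0)
    return page, f"Showing {len(page)} of {total} results ({remaining} more)"

page, summary = page_summary(list(range(1578)))
print(summary)  # Showing 15 of 1578 results (1563 more)
```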
