How to handle vertices with a large number of edges?
In our graph, many vertices have more than 100k outgoing edges. I would like to know what approaches exist for handling the range of situations that arise from this.

Let's say we have a group_1 defined in our graph, and group_1 has 100k members. We have a few traversals which start from a member_x vertex and compute some stuff. These traversals are quite fast; each finishes within ~2 s.

But times have changed, and now we have a requirement to aggregate the results of all the individual small traversals into one number. The aggregation has to cover the results from all of group_1's members.

At first our approach was to create traversals which emit a bundle of member_x vertices by using skip and limit and then, using parallel processing at the application level, compute the sum of our stuff. There are a few problems with this approach, however (see the sketch after this list):

  • g.V().has('group',y).out('member_of').skip(0).limit(10) - according to the documentation, this traversal can return different results each time, because without an explicit order the result order is not guaranteed. Creating bundles this way would simply be incorrect.
  • g.V().has('group',y).out('member_of').skip(100_000).limit(10) takes too long because, as we found out, the database still has to visit all 100k vertices before it can skip them.
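
For what it's worth, pagination can at least be made deterministic by imposing an explicit order before paging. This is only a minimal sketch, not production code, and it assumes the members carry a unique, indexed member_id property (hypothetical name):

// Deterministic paging: an explicit order makes range() stable across calls.
// Note: the server still has to order and walk all 100k vertices per page,
// so this fixes correctness, not cost.
List<Vertex> page = g.V().has('group', y).out('member_of').
                      order().by('member_id', asc).
                      range(100_000, 100_010).
                      toList()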

So, our next approach was to keep a single traversal open that emits batches of members and then, in separate threads, execute parallel traversals which compute the sum over each previously fetched batch:

// Keep a single traversal open and drain it in batches of 100;
// re-running g.V()... on every iteration would restart from scratch.
Traversal<Vertex, Vertex> members = g.V().has('group',y).out('member_of')
while (members.hasNext()) {
   List<Vertex> batch = members.next(100)
   addMembersToExecutorThread(batch) // done in an async way
}
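
For completeness, this is roughly what addMembersToExecutorThread could look like; the pool size, the helper itself, and the numeric 'stuff' property are all hypothetical:

ExecutorService pool = Executors.newFixedThreadPool(8)   // hypothetical pool size
List<Future<Long>> partials = new ArrayList<>()

// hypothetical helper from the loop above: one small traversal per batch
void addMembersToExecutorThread(List<Vertex> batch) {
   partials.add(pool.submit(() ->
      g.V(batch.toArray()).values('stuff').sum().next().longValue()))
}

// once the producer loop ends, combine the partial sums
long total = 0
for (Future<Long> f : partials) total += f.get()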

So, what are the approaches for such scenarios? Basically, we could solve the problem if there were a way to quickly fetch all the ancestors of some vertex, in our case group_1. But it takes a lot of time just to fetch the ids using g.V().has('group',y).out('member_of').properties('members_id').
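
One variant worth noting: if the value being summed lives in a numeric property (hypothetically called 'stuff' here), the whole aggregation can be pushed into a single traversal so that only one number crosses the wire:

// Server-side aggregation: one round trip, no paging.
// The backend still has to visit all 100k members, but the paging
// and network overhead disappear.
Number total = g.V().has('group', y).
                 out('member_of').
                 values('stuff').
                 sum().next()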

Is there a way to work around this problem? Or maybe we should try to execute such queries on GraphComputer?

Uticas answered 15/1, 2019 at 9:12

It sounds like your use case requires almost (if not exactly) a full graph scan. This is a very common requirement for graphs, with degree centrality being one of the more popular examples.

If you push the aggregation logic to the application layer, then you're missing out on one of the biggest benefits of TinkerPop: its OLAP traversals are very fast.

Please note: in practice, if you do use the GraphComputer/OLAP traversals, you should do so in an environment where the graph is relatively static. This is because TinkerPop's OLAP traversals serialise the graph into an in-memory structure, so any change to the graph has to be re-serialised. In a fast-changing environment this can substantially slow things down.
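
To make that concrete, here is a minimal sketch of what such an OLAP traversal could look like from the Gremlin Console; it assumes a provider with GraphComputer support (e.g. SparkGraphComputer on JanusGraph) and the hypothetical numeric 'stuff' property from the question:

// Same aggregation, executed as an OLAP traversal over the whole graph.
g = graph.traversal().withComputer()   // provider-specific computer, e.g. Spark
g.V().has('group', y).out('member_of').
  values('stuff').sum().next()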

Hope that helps.

Diminish answered 21/1, 2019 at 7:35

Your use case seems to be an OLAP case, as mentioned by @Filipe.

There are multiple ways of doing it. One way is to use TinkerPop's OLAP tooling, as above, but this works by consuming data hosted in the storage system (the JanusGraph backend), which might eventually slow down other real-time graph queries.

For a similar use case, where the scale was ~20B members, we took the data out of the JanusGraph storage backend and used a MapReduce approach on Spark.

Spark GraphX is another tool that can be used, loading the data via Spark. After multiple iterations of tests and failures, we finally solved our use case using the Connected Components algorithm from "Connected Components in MapReduce and Beyond", a research paper by Google.
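
If you would rather stay inside TinkerPop, note that version 3.4+ ships a connectedComponent() step that runs on a GraphComputer; a minimal sketch (not what we used at our ~20B scale):

// TinkerPop 3.4+: connected components via GraphComputer.
g = graph.traversal().withComputer()
g.V().connectedComponent().
  group().by(ConnectedComponent.component).   // component id -> member vertices
  next()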

Idiopathy answered 18/5, 2020 at 12:15
