In our graph there are many vertices with more than 100k outgoing edges, and I would like to know which approaches exist for handling the situations that arise from this.
Let's say we have a `group_1` defined in our graph, and `group_1` has 100k members. We have a few traversals which start from a `member_x` vertex and compute some stuff. These traversals are quite fast, each finishing within ~2s. But times changed, and we now have a requirement to aggregate the results of all the individual small traversals into one number. The aggregate has to cover the results for all of `group_1`'s members.
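Conceptually, the number we need is what a single server-side aggregation would produce. In the sketch below, `weight` is a hypothetical property standing in for the per-member result; the other names are the ones used above:

```
g.V().has('group', y).out('member_of').values('weight').sum()
```

The question is how to compute this kind of aggregate efficiently when the fan-out is 100k.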
At first our approach was to create traversals which each emit a bundle of `member_x` vertices by using `skip` and `limit`, and then, with parallel processing at the application level, sum our stuff. There are a few problems with this approach, however:

```
g.V().has('group', y).out('member_of').skip(0).limit(10)
```

According to the documentation, this traversal can return different results on each execution, so creating bundles this way would simply be incorrect.

```
g.V().has('group', y).out('member_of').skip(100_000).limit(10)
```

takes too long because, as we found out, the database still has to visit all 100k skipped vertices.
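If we stick with paging, one workaround that seems plausible to us (an assumption, not something we have verified at scale) is to force a deterministic order before `skip`/`limit`, e.g. by id, so that repeated executions see the same pages. Note that this would only fix correctness, not the cost of deep skips:

```
g.V().has('group', y).out('member_of').
  order().by(id, asc).
  skip(100_000).limit(10)
```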
So our next approach is to have a single traversal that emits bundles of members and then, in separate threads, execute parallel traversals which sum over the previously fetched members:

```
while (isNotTheEnd) {
    List<Vertex> members = g.V().has('group', y).out('member_of').next(100);
    addMembersToExecutorThread(members); // done asynchronously
}
```
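Separated from Gremlin, the executor pattern itself looks like this in plain Java. This is only a sketch: the traversal is replaced by a synthetic id range (since the graph is not available here), and the batch size, pool size, and per-member work are placeholders:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.LongAdder;

public class BatchSum {

    // Sums the ids 1..n: a single-threaded producer pages through the id range
    // in fixed-size batches (standing in for traversal.next(batchSize)) and
    // each batch is summed on a worker thread.
    static long parallelSum(long n, int batchSize, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        LongAdder total = new LongAdder();            // contention-friendly accumulator
        List<Future<?>> pending = new ArrayList<>();
        for (long start = 1; start <= n; start += batchSize) {
            final long lo = start;
            final long hi = Math.min(start + batchSize - 1, n);
            pending.add(pool.submit(() -> {
                for (long id = lo; id <= hi; id++) {
                    total.add(id);                    // placeholder for per-member work
                }
            }));
        }
        for (Future<?> f : pending) f.get();          // wait for every batch
        pool.shutdown();
        return total.sum();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelSum(100_000, 100, 4)); // prints 5000050000
    }
}
```

In the real version, the producer would call `next(100)` on the traversal and submit each returned batch to the pool, as in the loop above.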
So, what are the approaches for scenarios like this? Basically, we could solve the problem if there were a way to quickly fetch everything under a given ancestor vertex, in our case `group_1`. But it takes a lot of time just to fetch the ids with `g.V().has('group',y).out('member_of').properties('members_id')`.
Is there a way to work around this problem? Or maybe we should try to execute such queries on GraphComputer?
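For reference, the GraphComputer variant we are considering would look roughly like this. Assumptions on our side: TinkerPop 3.x and a provider that supports OLAP execution (e.g. SparkGraphComputer for large graphs); `weight` is a hypothetical property standing in for the per-member result:

```
g.withComputer().
  V().has('group', y).out('member_of').
  values('weight').
  sum()
```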