ArangoDB: Aggregating counts via graph traversal

In my ArangoDB graph, I have a subject, message threads associated with that subject, and messages inside those message threads. I would like to traverse the graph in such a way that I return the data associated with the message thread as well as the count of messages inside the message thread.

The data is structured fairly simply: I have the subject node, an edge extending to the thread node with the date and category associated, and an edge from the thread node to the message node.

I would like to return the data stored in the thread node and the count of messages attached to the thread.

I'm not sure how to do this with the FOR v, e, p IN 1..2 OUTBOUND syntax. Should I just do FOR v, e, p IN OUTBOUND with a nested traversal inside it? Is that still performant?

Decury answered 21/9, 2016 at 19:42 Comment(0)

Sorry for the delay, we are working hard on the 3.1 release ;)

I think you are already at the correct solution: It is not easily possible to express what you would like to achieve in a 1..2 OUTBOUND statement. It is way easier to formulate in two 1..1 OUTBOUND statements.

From your explanation, I think the following query is what you would use:

FOR thread IN 1 OUTBOUND @start @@threadEdges
  LET nr = COUNT(FOR message IN 1 OUTBOUND thread @@messageEdges RETURN 1)
  RETURN {
    date: thread.date,
    category: thread.category,
    messages: nr
  }

Some explanation: I first select the associated threads. Next, I run a subquery that simply counts the messages for one thread. Finally, I return the information I need.

In terms of performance: for data access (which is most likely the "bottleneck" operation), there is no difference between FOR x IN 1..2 OUTBOUND [...] and FOR x IN 1 OUTBOUND [...] FOR y IN 1 OUTBOUND x [...]; both have to look at exactly the same documents. Query optimization might be a bit slower in the latter case, but the difference is way below 1ms.
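To make that comparison concrete, the two shapes could be written out roughly as follows. This is only a sketch: it reuses the bind parameters from the query above (@start, @@threadEdges, @@messageEdges) and assumes that thread edges only lead from the subject to threads and that message edges only lead from threads to messages. The variable names x and y follow the placeholders in the text.

```aql
// Variant A: one traversal of depth 1 to 2 over both edge collections.
// Under the assumption above, depth 1 yields threads and depth 2 yields messages.
FOR x IN 1..2 OUTBOUND @start @@threadEdges, @@messageEdges
  RETURN x

// Variant B: two nested depth-1 traversals.
// The outer loop visits the threads, the inner loop visits each thread's messages.
FOR x IN 1 OUTBOUND @start @@threadEdges
  FOR y IN 1 OUTBOUND x @@messageEdges
    RETURN y
```

Note that the results differ in shape: Variant A returns threads and messages mixed together in one list, while Variant B keeps each thread available as x when its messages are processed, which is what makes the per-thread count straightforward.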

Raymundorayna answered 21/10, 2016 at 9:44 Comment(2)
This is effectively what my team has been doing. Right now, these aggregations take about 5 seconds each, though when six are run at once, the server slows down significantly and the queries begin taking 30-40 seconds. This is for about 60 threads with up to 70,000 messages. Presumably when we go to a cluster, we'll see this go back to around 5 seconds, but we'd really like to get it faster. - Decury
OK, understood ;) Could you give us an anonymized dataset so that we can try to optimize what is going on? For us it is always easier to work with a "real" dataset than to generate one. We are willing to sign an NDA for that. (I am not informed in detail about all the communication going on, so if we have already received such a dataset from you, I will get my hands on it and make your query faster.) I am also unhappy with anything above 1s. - Raymundorayna
