INTERSECTION of (n) arrays in ArangoDB AQL
Asked Answered
P

3

6

The scenario is this: I have an ArangoDB collection containing items, and another collection containing tags. I am using a graph, and I have an edge collection called "Contains" connecting the items and tags. An item has multiple tags.

Now I am trying to do a search for items containing multiple tags. E.g. items containing the tags "photography", "portrait" and "faces".

My general approach is to start a graph traversal from each of the tag vertices and find the items that relate to that tag. That part works fine. I get a list of items.

But the last part of my task is to make an intersection of all the lists in order to find the items that contain ALL the tags specified. And I cannot work out how to do this.

What I wanted to do was something like this:

let tagnames = SPLIT(@tagnames,',')
let tagcollections = (
    FOR tagname IN tagnames
    LET atag = (FOR t IN tags FILTER LOWER(t.text)==LOWER(tagname) RETURN t)
    let collections = (FOR v IN 1..1 INBOUND atag[0] Contains RETURN v)

    RETURN { tag: atag, collections: collections }
)

RETURN INTERSECTION(tagcollections)

However, it doesn't work: The INTERSECTION function does not work on a single list, but on multiple items, like this: INTERSECTION(listA, listB, listC...).

How can I make an intersection of the lists found in the FOR .. RETURN block?

Persevere answered 26/2, 2016 at 20:35 Comment(3)
If you have an array of lists and want to get the intersection, you may want to use APPLY() to spread the array and pass each list as separate argument: APPLY("INTERSECTION", [listA, listB, listC]). It is identical to INTERSECTION(listA, listB, listC), but the input array can have variable length.Tarweed
I think your comment is the answer I was looking for. Even though the other comments are very useful, your answer directly answers my question. But I cannot mark it as the correct answer when it is a comment ...Arrio
I posted an extended answer: https://mcmap.net/q/1677967/-intersection-of-n-arrays-in-arangodb-aql Please accept and up-vote if it resolves your issue, thanks!Tarweed
T
7

ArangoDB 3.0 introduced special array comparison operators (ANY, ALL, NONE). ALL IN can be used to test if every element in the left-hand side array are also in the right-hand side array:

[ "red", "green", "blue" ] ALL IN [ "purple", "red", "blue", "green" ]
// true

Note that these operators can not use indexes yet. Given a data model that embeds the tags directly into the documents, a workaround is to use an index to find all documents that contain one of the tags (e.g. take the first element, ["red","green","blue"][0]) to reduce the result set without a full collection scan, then post-filter with ALL IN if the other tags are also in the list:

LET tagsToSearchFor = [ "red", "green", "blue" ]
FOR doc IN coll
  FILTER tagsToSearchFor[0] IN doc.tags[*] // array index
  FILTER tagsToSeachFor ALL IN doc.tags
  RETURN doc

ALL IN can also be used for your data model with a separate collection for tags, but you will not be able to make use of an index like above. For instance:

FOR doc IN documents
    LET tags = (
        FOR v IN INBOUND doc contains
            RETURN v._key
    )
    FILTER ["red", "green", "blue"] ALL IN tags
    RETURN MERGE(doc, {tags})

Or if you want to start the traversal with the tags and use an intersection-based approach:

LET startTags = ["red", "green", "blue"] // must exist
LET ids = (
    FOR startTag IN DOCUMENT("tags", startTags)
        RETURN (
            FOR v IN OUTBOUND startTag contains
                RETURN v._id
        )
)
LET docs = APPLY("INTERSECTION", ids)

FOR doc IN DOCUMENT(docs)
    RETURN MERGE(doc, {
        tags: (FOR tag IN INBOUND doc contains RETURN tag._key)

    })
Tarweed answered 20/7, 2016 at 14:45 Comment(3)
Can these operations use the index now?Concert
Looks like the link 404s now. This looks like the replacement: docs.arangodb.com/3.11/aql/operators/…Farceuse
> Can these operations use the index now? < According to the link above, noFarceuse
L
3

I would consider storing your tags as attributes on the on your items. ArangoDB 2.8 includes array indexes which are exactly aimed at your scenario. From their blog post:

{ 
  text: "Here's what I want to retrieve...",
  tags: [ "graphdb", "ArangoDB", "multi-model" ]   
}

FOR doc IN documents 
  FILTER "graphdb" IN doc.tags[*] 
    RETURN doc

This should be both more performant and eliminate the need for the AQL above, simplifying your app.

Leith answered 26/2, 2016 at 22:40 Comment(8)
Hmm. I am using tags widely in my system, using them to tag a lot of different item types, which exist in different collections. So since tags seem to me to be a "major" data type in my system, I have a gut feeling that demoting them to attributes might be a bad idea :-( However, this might just be old RDBMS-thinking...? The idea of relationships seems to me to be the core of the graph DB principle. Given a tag, I can quickly find the items that relate to that tag. For this use case, this is working fine, I just need to be able to make an intersection on two or more result sets...Arrio
If you think about the repetitive work a traversal does (gathering all edges, filtering them and then walking over them), for traversals to be/stay fast you want to create the minimum number of edges. "Hubs" (high degree vertices) are basically the speed bumps of the graph world.Leith
Tags essentially create hubs. But like everything else, sometimes you want that. It's all tradeoffs right? There is a blog post that might help here: mikewilliamson.wordpress.com/2015/07/16/…Leith
are you looking for the way to do a join?Defame
@dothebart: No, I am not looking to do a join - I only need a list of documents, they do not need to be joined.Arrio
@MikeWilliamson: I am not sure that the Array Index solution is a good way to go for me, because I am going to need to do both union and intersection queries, i.e. "Give me all items that have either tag A or tag B", as well as "Give me all items that have both tag A and tag B". How would you implement that with array indexes?Arrio
@MikeWilliamson: Thanks for the link to your blog. Lots of interesting stuff you have there :-) I am using Arango with yield and function*()'s, nice to read about those experiences as well :-)Arrio
... and still: My original question seeks a solution to a specific question: How to have an AQL query that returns an array of arrays do an intersection on those arrays so that only the elements that exist in all the arrays make it into the resultset?Arrio
D
0

You can ensure that you don't get documents twice in a result of an AQL-Query using the DISTINCT keyword.

Lets demonstate this in a graph query using the knows graph example:

var examples = require("org/arangodb/graph-examples/example-graph.js");

var g = examples.loadGraph("knows_graph");
db._query("FOR oneGuy IN persons " +
  "FOR v IN 1..1 OUTBOUND oneGuy GRAPH 'knows_graph' RETURN v.name").toArray()
[ 
  "Charlie", 
  "Dave", 
  "Alice", 
  "Bob", 
  "Bob" 
]

We see your situation, Bob is returned twice. Now lets add the distinct keyword:

db._query("FOR oneGuy IN persons " +
  "FOR v IN 1..1 OUTBOUND oneGuy GRAPH 'knows_graph' RETURN DISTINCT v.name"
  ).toArray()
[ 
  "Bob", 
  "Alice", 
  "Dave", 
  "Charlie" 
]
Defame answered 29/3, 2016 at 9:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.