Part of my graph is constructed using a giant join between two large collections, and I run it every time I add documents to either collection. The query is based on an older post.
FOR fromItem IN fromCollection
    FOR toItem IN toCollection
        FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
        INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {} } INTO edgeCollection
This takes about 55,000 seconds to complete for my dataset. I would absolutely welcome suggestions for making that faster.
But I have two related issues:
- I need an upsert. Normally, upsert would be fine, but in this case, since I have no way of knowing the key up front, it wouldn't help me. To get the key up front, I would need to query by example to find the key of the otherwise identical, existing edge. That seems reasonable as long as it doesn't kill my performance, but I don't know how in AQL to construct my query conditionally so that it inserts an edge if the equivalent edge does not exist yet, but does nothing if the equivalent edge does exist. How can I do this?
- I need to run this every time data gets added to either collection. I need a way to run this only on the newest data so that it doesn't try to join the entire collection. How can I write AQL that allows me to join only the newly inserted records? They're added with Arangoimp, and I have no guarantees on which order they'll be updated in, so I cannot create the edges at the same time as I create the nodes. How can I join only the new data? I don't want to spend 55k seconds every time a record is added.
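For the first point, AQL's UPSERT statement matches on a search example rather than on a key, so the edge key does not have to be known up front. A minimal sketch, assuming an edge counts as "equivalent" when its _from and _to both match (adjust the search example if otherAttributes should also be part of the identity):

```aql
// Sketch only: same join as above, but each edge is looked up by example first.
FOR fromItem IN fromCollection
    FOR toItem IN toCollection
        FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
        // Search example: an existing edge with the same endpoints (assumed identity rule).
        UPSERT { _from: fromItem._id, _to: toItem._id }
            // No matching edge found: create it.
            INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {} }
            // Matching edge found: an empty update leaves its attributes unchanged.
            UPDATE {}
            IN edgeCollection
```

The lookup on _from and _to should be able to use the edge index that edge collections get automatically, but each UPSERT is still one extra lookup per matched pair, so this does not make the full join itself any cheaper.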
linked = false in both the fromCollection and toCollection collections. – Gehrke
linked to false. When you link the documents, you also go back and set linked to true. To speed it up, you'll also want to put an index on linked. You will find this greatly speeds up your processing, though it will still be slow the FIRST time you do it, as everything will have the value linked = false. – Gehrke
linked = false. – Gehrke
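The linked flag idea from these comments might look roughly like the following. This is a sketch under assumptions, not a tested solution: it assumes every imported document carries linked: false explicitly (a missing attribute evaluates to null and would not match == false), that linked is indexed in both collections, and that each statement is run as its own query.

```aql
// Query 1: join only the unlinked fromCollection documents against all of toCollection.
FOR fromItem IN fromCollection
    FILTER fromItem.linked == false
    FOR toItem IN toCollection
        FILTER toItem.toAttributeValue == fromItem.fromAttributeValue
        UPSERT { _from: fromItem._id, _to: toItem._id }
            INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {} }
            UPDATE {}
            IN edgeCollection

// Query 2: the mirror image, for unlinked toCollection documents.
FOR toItem IN toCollection
    FILTER toItem.linked == false
    FOR fromItem IN fromCollection
        FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
        UPSERT { _from: fromItem._id, _to: toItem._id }
            INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {} }
            UPDATE {}
            IN edgeCollection

// Query 3: mark the processed fromCollection documents.
FOR fromItem IN fromCollection
    FILTER fromItem.linked == false
    UPDATE fromItem WITH { linked: true } IN fromCollection

// Query 4: mark the processed toCollection documents.
FOR toItem IN toCollection
    FILTER toItem.linked == false
    UPDATE toItem WITH { linked: true } IN toCollection
```

Both directions are needed because a new document in either collection may match old documents in the other; the UPSERT keeps a pair that is new on both sides from producing two edges. Documents imported between the join queries and the marking queries would be flagged as linked without having been joined, so the import and this sequence should not overlap. An index on fromAttributeValue and toAttributeValue would also help the inner loops, since each inner FILTER is an equality lookup.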