ArangoDB: Insert as function of query by example
Asked Answered
T

1

6

Part of my graph is constructed using a giant join between two large collections, and I run it every time I add documents to either collection. The query is based on an older post.

FOR fromItem IN fromCollection
    FOR toItem IN toCollection
        FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
        INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {}} INTO edgeCollection

This takes about 55,000 seconds to complete for my dataset. I would absolutely welcome suggestions for making that faster.

But I have two related issues:

  1. I need an upsert. Normally, upsert would be fine, but in this case, since I have no way of knowing the key up front, it wouldn't help me. To get the key up front, I would need to query by example to find the key of the otherwise identical, existing edge. That seems reasonable as long as it doesn't kill my performance, but I don't know how in AQL to construct my query conditionally so that it inserts an edge if the equivalent edge does not exist yet, but does nothing if the equivalent edge does exist. How can I do this?
  2. I need to run this every time data gets added to either collection. I need a way to run this only on the newest data so that it doesn't try to join the entire collection. How can I write AQL that allows me to join only the newly inserted records? They're added with Arangoimp, and I have no guarantees on which order they'll be updated in, so I cannot create the edges at the same time as I create the nodes. How can I join only the new data? I don't want to spend 55k seconds every time a record is added.
Theisen answered 18/10, 2016 at 23:47 Comment(3)
I've done queries in other databases with the same challenge, how do you reduce the size of the data set when re-linking it. The solution that worked for me is to add a field called something like linked = false in both the fromCollection and toCollection collections.Gehrke
...Then when you insert new documents into either collection, you always set linked to false. When you link the documents, you also go back and set linked to true. To speed it up, you'll also want to put an index on linked. You will find this greatly speeds up your processing though it will be still slow for the FIRST time you do it, as everything will have the value linked = false.Gehrke
You could write a Foxx app to do it for you, I documented an example Foxx app for someone else's question, it's available here on StackOverflow. It's worth taking some time to learn Foxx as it can be nice and quick and a function like what you're describing is a perfect use case. The function doesn't even need any parameters, it just runs, and it will only scan those records with linked = false.Gehrke
P
8

If you run your query as written without any indexes, then it will have to do two nested full collection scans, as can be seen by looking at the output of

db._explain(<your query here>);

which shows something like:

  1   SingletonNode                1   * ROOT
  2   EnumerateCollectionNode      3     - FOR fromItem IN fromCollection   /* full collection scan */
  3   EnumerateCollectionNode      9       - FOR toItem IN toCollection   /* full collection scan */
  4   CalculationNode              9         - LET #3 = (fromItem.`fromAttributeValue` == toItem.`toAttributeValue`)   /* simple expression */   /* collections used: fromItem : fromCollection, toItem : toCollection */
  5   FilterNode                   9         - FILTER #3
  ...

If you do

db.toCollection.ensureIndex({"type":"hash", fields ["toAttributeValue"], unique:false})`

Then there will be a single full table collection scan in fromCollection and for each item found there is a hash lookup in the toCollection, which will be much faster. Everything will happen in batches, so this should already improve the situation. The db._explain() will show this:

  1   SingletonNode                1   * ROOT
  2   EnumerateCollectionNode      3     - FOR fromItem IN fromCollection   /* full collection scan */
  8   IndexNode                    3       - FOR toItem IN toCollection   /* hash index scan */

To only work on recently inserted items in fromCollection is relatively easy: Simply add a timestamp of the import time to all vertices, and use:

FOR fromItem IN fromCollection
    FILTER fromItem.timeStamp > @lastRun
    FOR toItem IN toCollection
        FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
        INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {}} INTO edgeCollection

and of course put a skiplist index on the timeStamp attribute in fromCollection.

This should work beautifully to discover new vertices in the fromCollection. It will "overlook" new vertices in the toCollection that are linked to old vertices in fromCollection.

You can discover these by interchanging the roles of the fromCollection and the toCollection in your query (do not forget the index on fromAttributeValue in fromCollection) and remembering to only put in edges if the from vertex is old, like in:

FOR toItem IN toCollection
    FILTER toItem.timeStamp > @lastRun
    FOR fromItem IN fromCollection
        FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
        FILTER fromItem.timeStamp <= @lastRun 
        INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {}} INTO edgeCollection

These two together should do what you want. Please find the fully worked example here.

Pedraza answered 20/10, 2016 at 8:47 Comment(1)
Thanks Max! One potential issue with using the timestamp is that the various collections are imported at different rates, so that data in the fromCollection may have been imported last night, but data in the toCollection was imported an hour ago. Additionally, sometimes new data needs to be related to data that was imported a long time ago. This would work if both fromItem and toItem were previously imported, but not for only one. My team since came up with a deterministic key rules for edges, so duplication is not an issue- now it's purely the performance of the insert.Theisen

© 2022 - 2024 — McMap. All rights reserved.