ArangoDB: Insert as function of query by example

FOR fromItem IN fromCollection FOR toItem IN toCollection FILTER fromItem.fromAttributeValue == toItem.toAttributeValue INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {}} INTO edgeCollection

If you run your query as written without any indexes, then it will have to do two nested full collection scans, as can be seen by looking at the output of

db._explain(<your query here>);

which shows something like:

  1   SingletonNode                1   * ROOT
  2   EnumerateCollectionNode      3     - FOR fromItem IN fromCollection   /* full collection scan */
  3   EnumerateCollectionNode      9       - FOR toItem IN toCollection   /* full collection scan */
  4   CalculationNode              9         - LET #3 = (fromItem.`fromAttributeValue` == toItem.`toAttributeValue`)   /* simple expression */   /* collections used: fromItem : fromCollection, toItem : toCollection */
  5   FilterNode                   9         - FILTER #3
  ...

If you do

db.toCollection.ensureIndex({"type":"hash", fields ["toAttributeValue"], unique:false})`

Then there will be a single full table collection scan in fromCollection and for each item found there is a hash lookup in the toCollection, which will be much faster. Everything will happen in batches, so this should already improve the situation. The db._explain() will show this:

  1   SingletonNode                1   * ROOT
  2   EnumerateCollectionNode      3     - FOR fromItem IN fromCollection   /* full collection scan */
  8   IndexNode                    3       - FOR toItem IN toCollection   /* hash index scan */

To only work on recently inserted items in fromCollection is relatively easy: Simply add a timestamp of the import time to all vertices, and use:

FOR fromItem IN fromCollection
    FILTER fromItem.timeStamp > @lastRun
    FOR toItem IN toCollection
        FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
        INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {}} INTO edgeCollection

and of course put a skiplist index on the timeStamp attribute in fromCollection.

This should work beautifully to discover new vertices in the fromCollection. It will "overlook" new vertices in the toCollection that are linked to old vertices in fromCollection.

You can discover these by interchanging the roles of the fromCollection and the toCollection in your query (do not forget the index on fromAttributeValue in fromCollection) and remembering to only put in edges if the from vertex is old, like in:

FOR toItem IN toCollection
    FILTER toItem.timeStamp > @lastRun
    FOR fromItem IN fromCollection
        FILTER fromItem.fromAttributeValue == toItem.toAttributeValue
        FILTER fromItem.timeStamp <= @lastRun 
        INSERT { _from: fromItem._id, _to: toItem._id, otherAttributes: {}} INTO edgeCollection

These two together should do what you want. Please find the fully worked example here.

Recommended topics

Hot tags