deduplicating ArangoDB document collection

I'm sure there is an easy and fast way to do this but it's escaping me. I have a large dataset that has some duplicate records, and I want to get rid of the duplicates. (the duplicates are uniquely identified by one property, but the rest of the document should be identical as well).

I've attempted to create a new collection that only has unique values a few different ways, but they are all quite slow. For example:

FOR doc IN Documents
    COLLECT docId = doc.myId, doc2 = doc
    INSERT doc2 IN Documents2

FOR doc IN Documents
    LET existing = (FOR doc2 IN Documents2
        FILTER doc.myId == doc2.myId
        RETURN doc2)
    UPDATE existing WITH doc IN Documents2

or (this gives me a "violated unique constraint" error)

FOR doc IN Documents
    UPSERT {myId: doc.myId}}]}
    INSERT doc
    UPDATE doc IN Documents2

TL;DR

It does not take that long to de-duplicate the records and write them to another collection (less than 60 seconds), at least on my desktop machine (Windows 10, Intel 6700K 4x4.0GHz, 32GB RAM, Evo 850 SSD).

Certain queries require proper indexing however, or they will last forever. Indexes require some memory, but compared to the needed memory during query execution for grouping the records, it is negligible. If you're short of memory, performance will suffer because the operating system needs to swap data between memory and mass storage. This is especially a problem with spinning disks, not so much with fast flash storage devices.

Preparation

I generated 2.2 million records with 5-20 random attributes and 160 chars of gibberish per attribute. In addition, every record has an attribute myid. 187k records have a unique id, 60k myids exist twice, and 70k three times. The collection size was reported as 4.83GB:

// 1..2000000: 300s
// 1..130000: 20s
// 1..70000: 10s
FOR i IN 1..2000000
    LET randomAttributes = MERGE(
        FOR j IN 1..FLOOR(RAND() * 15) + 5
            RETURN { [CONCAT("attr", j)]: RANDOM_TOKEN(160) }
    )
    INSERT MERGE(randomAttributes, {myid: i}) INTO test1

Memory consumption before starting ArangoDB was at 3.4GB, after starting 4.0GB, and around 8.8GB after loading the test1 source collection.

Baseline

Reading from test1 and inserting all documents (2.2m) into test2 took 20s on my system, with a memory peak of ~17.6GB:

FOR doc IN test1
    INSERT doc INTO test2

Grouping by myid without writing took approx. 9s for me, with 9GB RAM peak during query:

LET result = (
    FOR doc IN test1
        COLLECT myid = doc.myid
        RETURN 1
)
RETURN LENGTH(result)

Failed grouping

I tried your COLLECT docId = doc.myId, doc2 = doc approach on a dataset with just 3 records and one duplicate myid. It showed that the query does not actually group/remove duplicates. I therefore tried to find alternative queries.

Grouping with INTO

To group duplicate myids together but retain the possibility to access the full documents, COLLECT ... INTO can be used. I simply picked the first document of every group to remove redundant myids. The query took about 40s for writing the 2m records with unique myid attribute to test2. I didn't measure memory consumption accurately, but I saw different memory peaks spanning 14GB to 21GB. Maybe truncating the test collections and re-running the queries increases the required memory because of some stale entries that get in the way somehow (compaction / key generation)?

FOR doc IN test1
    COLLECT myid = doc.myid INTO groups
    INSERT groups[0].doc INTO test2

Grouping with subquery

The following query showed a more stable memory consumption, peaking at 13.4GB:

FOR doc IN test1
    COLLECT myid = doc.myid
    LET doc2 = (
        FOR doc3 IN test1
            FILTER doc3.myid == myid
            LIMIT 1
            RETURN doc3
    )
    INSERT doc2[0] INTO test2

Note however that it required a hash index on myid in test1 to achieve a query execution time of ~38s. Otherwise the subquery will cause millions of collection scans and take ages.

Grouping with INTO and KEEP

Instead of storing the whole documents that fell into a group, we can assign just the _id to a variable and KEEP it so that we can look up the document bodies using DOCUMENT():

FOR doc IN test1
    LET d = doc._id
    COLLECT myid = doc.myid INTO groups KEEP d
    INSERT DOCUMENT(groups[0].d) INTO test2

Memory usage: 8.1GB after loading the source collection, 13.5GB peak during the query. It only took 30 seconds for the 2m records!

Grouping with INTO and projection

Instead of KEEP I also tried a projection out of curiosity:

FOR doc IN test1
    COLLECT myid = doc.myid INTO groups = doc._id
    INSERT DOCUMENT(groups[0]) INTO test2

RAM was at 8.3GB after loading test1, and the peak at 17.8GB (there were actually two heavy spikes during the query execution, both going over 17GB). It took 35s to complete for the 2m records.

Upsert

I tried something with UPSERT, but saw some strange results. It turned out to be an oversight in ArangoDB's upsert implementation. v3.0.2 contains a fix and I get correct results now:

FOR doc IN test1
    UPSERT {myid: doc.myid}
    INSERT doc
    UPDATE {} IN test2

It took 40s to process with a (unique) hash index on myid in test2, with a RAM peak around 13.2GB.

Delete duplicates in-place

I first copied all documents from test1 to test2 (2.2m records), then I tried to REMOVE just the duplicates in test2:

FOR doc IN test2
    COLLECT myid = doc.myid INTO keys = doc._key
    LET allButFirst = SLICE(keys, 1) // or SHIFT(keys)
    FOR k IN allButFirst
        REMOVE k IN test2

Memory was at 8.2GB (with only test2 loaded) and went up to 13.5GB during the query. It took roughly 16 seconds to delete the duplicates (200k).

Verification

The following query groups myid together and aggregates how often every id occurs. Run against the target collection test2, the result should be {"1": 2000000}, otherwise there are still duplicates. I double-checked the query results above and everything checked out.

FOR doc IN test2
    COLLECT myid = doc.myid WITH COUNT INTO count
    COLLECT c = count WITH COUNT INTO cc
    RETURN {[c]: cc}

Conclusion

The performance appears to be reasonable with ArangoDB v3.0, although it may degrade if not enough RAM is available. The different queries completed roughly within the same time, but showed different RAM usage characteristics. For certain queries, indexes are necessary to avoid high computational complexity (here: full collection scans; 2,200,000,000,000 reads in the worst case?).

Can you try my presented solutions on your data and check what the performance is on your machine?