How to find Duplicate documents in Cosmos DB
I have seen a huge amount of data written to Cosmos DB from a Stream Analytics job on a particular day. It was not supposed to write that many documents in a single day, so I need to check whether documents were duplicated on that day.

Is there any query, or any other way, to find duplicate records in Cosmos DB?

Metallophone answered 6/12, 2019 at 13:6 Comment(0)
It is possible if you know which properties define a duplicate. We had a nasty production issue that caused many duplicate records as well. When we contacted MS Support to help us identify the duplicate documents, they gave us the following query:

Bear in mind: properties A and B together define uniqueness in our case. So if two documents have the same values for A and B, they are duplicates. You can then use the output of this query to, for example, delete the older ones and keep the most recent (based on _ts).

SELECT d.A, d.B FROM
   (SELECT c.A, c.B, COUNT(1) AS counts FROM c
    GROUP BY c.A, c.B) AS d
WHERE d.counts > 1
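Once the duplicate (A, B) pairs are identified, you still have to decide which physical documents to delete. A minimal client-side Python sketch of the "keep the newest, delete the rest" selection logic (assuming documents are plain dicts with the hypothetical uniqueness properties `A` and `B`, an `id`, and the system `_ts` field; no Cosmos SDK involved):

```python
from collections import defaultdict

def find_docs_to_delete(docs):
    """Group documents by the (A, B) uniqueness key and return the ids of
    every document except the most recent one (highest _ts) in each group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[(doc["A"], doc["B"])].append(doc)

    to_delete = []
    for members in groups.values():
        if len(members) > 1:
            # Keep the newest document, mark the older ones for deletion.
            members.sort(key=lambda d: d["_ts"])
            to_delete.extend(d["id"] for d in members[:-1])
    return to_delete

docs = [
    {"id": "1", "A": "x", "B": "y", "_ts": 100},  # older duplicate, deleted
    {"id": "2", "A": "x", "B": "y", "_ts": 200},  # newest duplicate, kept
    {"id": "3", "A": "p", "B": "q", "_ts": 150},  # unique, kept
]
print(find_docs_to_delete(docs))  # ['1']
```

The returned ids can then be fed to whatever delete mechanism you use (point deletes or a bulk delete library).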
Fluorinate answered 9/4, 2020 at 8:8 Comment(1)
How would you list the unique ids of these documents, if you wanted to find the duplicates and then delete the duplicate items? – Isoprene
Is there any query/any way to find out duplicate records in cosmos DB?

The quick answer is yes. Use the DISTINCT keyword in the Cosmos DB SQL query, and filter on _ts (the system-generated Unix timestamp: https://learn.microsoft.com/en-us/azure/cosmos-db/databases-containers-items#properties-of-an-item).

Something like:

SELECT DISTINCT c.X, c.Y, c.Z ... (all properties you want to check) FROM c WHERE c._ts >= <start of day> AND c._ts < <end of day>

Then you could delete the duplicate data using this bulk delete library: https://github.com/Azure/azure-cosmosdb-bulkexecutor-dotnet-getting-started/tree/master/BulkDeleteSample.
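Because _ts is stored as a Unix timestamp in seconds, filtering to "a particular day" means computing that day's start and end as epoch seconds and using a range predicate rather than an equality. A small sketch of that conversion (assuming the day boundaries are taken in UTC; the function name is illustrative):

```python
from datetime import datetime, timezone, timedelta

def day_ts_range(year, month, day):
    """Return (start, end) Unix timestamps covering one UTC day, suitable
    for a `WHERE c._ts >= @start AND c._ts < @end` filter in Cosmos DB."""
    start = datetime(year, month, day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return int(start.timestamp()), int(end.timestamp())

start, end = day_ts_range(2019, 12, 6)
print(start, end)  # 1575590400 1575676800
```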

Hornbook answered 9/12, 2019 at 1:18 Comment(6)
Suppose I have 3000 records written into Cosmos DB in one day, and 2000 of those records are duplicates. If I run the above DISTINCT query, will I get only the 1000 distinct records? The 2000 duplicates will not appear in the output? @Jay Gong – Metallophone
@Metallophone I'm afraid the output is limited by the RU settings, so only 1000 records will show up; the remaining records can be retrieved with the continuation token. Don't increase the RU settings, because that will increase your cost. Could you process the data in two or three parts? Since _ts records the specific time the data streamed into the DB, the data can be divided into several parts. – Hornbook
But the DISTINCT keyword will retrieve only the unique (non-duplicate) records. How will I find the duplicate records from that? @Jay Gong – Metallophone
@Antony, did you manage to find the query? I am currently facing the same issue... – Hiett
@GaelleSou I've posted an answer that will help you identify the duplicates. – Fluorinate
@Fluorinate Really appreciate your sharing, will check that. – Hornbook
