Maintain a distributed incremental counter in Azure Cosmos DB

I am fairly new to Cosmos DB and am trying to understand the increment operation that the Azure Cosmos DB Java SDK provides for patching a document. I need to maintain an incrementing counter in one of the documents in a container. The document looks like this:

{"counter": 1}

From my application, I want to increment this counter by 1 every time an action happens. For this I am using CosmosPatchOperations: I add an increment with cosmosPatch.increment("/counter", 1), which works fine.

This application can have multiple instances running, all of them talking to the same document in the Cosmos DB container, so App1 and App2 could both trigger an increment at the same time. The SDK method returns the updated document, and I need to use that updated value.

My question: does Cosmos DB employ some locking mechanism to make sure the two patches are applied one after the other? And in that case, what updated value would App1 and App2 each get back from the SDK method? Will it be 2 in one of them and 3 in the other?

Couchbase supports such a counter at the cluster level, as explained here, and it has been working perfectly for me without any concurrency issues. I am now migrating to Cosmos DB and have been struggling to find out how this can be achieved.

Update 1:

I decided to test this. I set up the Cosmos DB emulator on my local Mac and created a database and a container with RUs set to scale automatically from 1 to 10K. Then I added a document like this to the container:

{
"id": "randomId",
"counter": 0
}

After this, I created a simple API whose only responsibility is to increment the counter by 1 every time it is invoked. I then used Locust to invoke this API repeatedly to mimic a small load scenario. Initially the test ran fine, with each invocation receiving the counter value it was supposed to (in incremental order). On increasing the load I saw some errors, namely RequestTimeOutException with status code 408. Other requests kept working and returned the correct counter value. I do not understand what caused the request timeouts here. The stack trace (posted as an image, not reproduced here) hints at something to do with concurrency, but I cannot get my head around it.

Update 2: The test in Update 1 was run on my local machine, and I realised that resource constraints there might have caused those errors. I decided to repeat the test in a pre-prod environment against an actual Cosmos DB account rather than the emulator.

Test configuration:

  1. Cosmos DB container with RUs set to scale automatically from 400 to 4000
  2. Two instances of the application sharing the load
  3. A Locust script to generate load on the application

Findings:

Up to about 170 TPS, everything ran smoothly. Beyond that I saw errors falling into two buckets:

  1. "exception": "["Request rate is large. More Request Units may be needed, so no changes were made. Please retry this request later. Learn more: http://aka.ms/cosmosdb-error-429"]".

I am not sure how roughly 170 patch operations per second would have exhausted 4000 RUs, but that is a separate discussion.

  2. "exception": "["Conflicting request to resource has been attempted. Retry to avoid conflicts."]", with status code 449.

This error seems to indicate that Cosmos DB does not handle concurrent requests to the same document. I want to understand whether it maintains a queue internally to handle some of these requests, or whether it does not handle concurrent writes at all.

Cretinism answered 17/3, 2022 at 12:27 Comment(2)
Patch is not supported on the Mongo API as of now. You have tagged Mongo API? Also, the SDK is applicable only to the SQL API. - Keyway
That was a wrong tag. Removed. Apologies. - Cretinism

PATCH is no different from other operations. Fundamentally, Cosmos DB implements optimistic concurrency control (OCC), unlike relational databases, which rely on locking mechanisms. OCC lets you prevent lost updates and keep your data correct. It can be implemented using the ETag of a document: each document in Azure Cosmos DB has an _etag property.
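
To make the OCC idea concrete, here is a minimal in-memory sketch of a conditional write guarded by a version tag. It does not use the Cosmos DB SDK; the class `OccCounterSketch` and its methods `read`, `replaceIfMatch` and `incrementWithRetry` are hypothetical names made up for illustration, and the numeric `etag` field only stands in for the real `_etag` value.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical in-memory stand-in for a document store; NOT the Cosmos DB SDK.
final class OccCounterSketch {
    // Immutable snapshot of the document: the counter plus a version tag.
    static final class Doc {
        final long counter;
        final long etag;
        Doc(long counter, long etag) {
            this.counter = counter;
            this.etag = etag;
        }
    }

    private final AtomicReference<Doc> stored = new AtomicReference<>(new Doc(0, 0));

    Doc read() {
        return stored.get();
    }

    // Conditional write: succeeds only if the stored etag still matches
    // the etag of the snapshot the caller read earlier.
    boolean replaceIfMatch(Doc expected, long newCounter) {
        Doc current = stored.get();
        if (current.etag != expected.etag) {
            return false; // someone else wrote in between
        }
        return stored.compareAndSet(current, new Doc(newCounter, current.etag + 1));
    }

    // Read-modify-write loop: re-read and try again when the write is rejected.
    long incrementWithRetry(int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Doc seen = read();
            if (replaceIfMatch(seen, seen.counter + 1)) {
                return seen.counter + 1;
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }
}
```

A writer holding a stale snapshot gets `false` back from `replaceIfMatch` instead of silently overwriting the newer value, which is the lost-update protection described above.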

In your scenario, yes: it will return 2 in one of them and 3 in the other, provided both succeed, because the SDK has a retry mechanism, which is explained here. Also have a look at this sample.
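
If you also want a bounded retry at the application level, on top of whatever the SDK does internally, a rough sketch could look like the following. It uses only the Java standard library; `RetrySketch`, `withRetry` and `RetryableException` are hypothetical names, and mapping a 429 or 449 response to `RetryableException` is an assumption left to the caller.

```java
import java.util.function.Supplier;

// Hypothetical application-level retry helper; NOT part of the Cosmos DB SDK.
final class RetrySketch {
    // Hypothetical marker for a failure that is worth retrying
    // (for example, a 429 or 449 response mapped by the caller).
    static final class RetryableException extends RuntimeException {
        RetryableException(String message) {
            super(message);
        }
    }

    // Runs op up to maxAttempts times, doubling the delay after each failed try.
    static <T> T withRetry(Supplier<T> op, int maxAttempts, long baseDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RetryableException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RetryableException e) {
                last = e;
                if (attempt < maxAttempts - 1) {
                    try {
                        // Exponential backoff, with the exponent capped at 10.
                        Thread.sleep(baseDelayMillis << Math.min(attempt, 10));
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while backing off", ie);
                    }
                }
            }
        }
        throw last; // every attempt failed with a retryable error
    }
}
```

The patch call would go inside the `Supplier`, so each retry re-issues the increment and uses the value returned by the attempt that finally succeeds.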

If your Azure Cosmos DB account is configured with multiple write regions, conflicts and conflict resolution policies apply at the document level, with Last Write Wins (LWW) being the default conflict resolution policy.

Keyway answered 17/3, 2022 at 13:28 Comment(11)
Doesn't seem the same to me. Normally you specify the exact value you want to set, but with increment you send a 'function' and let the database side handle modifying the value. I tried this on a single write region with 1000 'concurrent' requests and it gave me the expected value back without using an If-Match header. - Definition
I guess you misunderstood, or my explanation was not that clear. I mentioned the ETag property only for other operations and for how Cosmos DB handles conflicts. For Patch you don't need to do the checking. - Keyway
My use case is specific to single-region writes. - Cretinism
@404 when you say you got the exact value back, do you mean that the server handled the concurrent requests, probably using optimistic locking, and made sure that only one request executed at a time while the later ones waited for the lock acquired by the first one to be released? - Cretinism
@Kunalgupta Although I don't see it mentioned in the documentation, I have a strong feeling the server does something like that. Incrementing a value 1000 times yielded exactly +1000 for me, over and over again, in my test. - Definition
@404, that's where my confusion was. I didn't find it mentioned anywhere explicitly. I am certain that it would return +1000, but I am not sure about the value it would return across apps, just as I asked in the question. Would it return 1, 2, 3, 4 and so on in response to each invocation? I will try setting up a test. - Cretinism
@404 and Sajeetharan, I have edited the original question with the test results. Does it make any sense? - Cretinism
408 is a generic error, not related to patch. What is the error code you have? - Keyway
@Sajeetharan, I did another test, this time against an actual Cosmos DB instance and not the emulator, and this is what I got: "exception": "["Conflicting request to resource has been attempted. Retry to avoid conflicts."]", with status code 449. More details have been added to the question. - Cretinism
Looks like you have exceeded the default number of retries in the SDK (usually 9). Either you need to increase the number of retries, or you need to change the design so that the same document is not being accessed concurrently when patching. - Keyway
For patch, yes, you get a 449 error if there are concurrent attempts to patch the same document, but how does Cosmos DB determine that it needs to send that error? Does it internally lock the document to ensure only one request can be active and just return a 449 for the second and subsequent requests? - Telescopium
