I am fairly new to cosmos DB and was trying to understand the increment operation that azure cosmos DB SDK provides for Java for patching a document. I have a requirement to maintain an incremental counter in one of the Documents in the container. The document looks like this-
{"counter": 1}
Now from my application I want to increment this counter by a value of 1 every time an action happens. For this I am using CosmosPatchOperations. I add an increment here like this cosmosPatch.increment("/counter", 1)
which works fine.
Now this application can have multiple instances running, all of them talking to same document in the cosmos container. So App1 and App2 both could trigger an increment at the same time. The SDK method returns the updated document and I need to use that updated value.
My question here would be that does cosmos DB here employ some locking mechanism to make sure both the patches happen one after another and also in this case what would be the updated value that I would get in App1 and App2 (The SDK method returns the updated document). Will it be 2 in one of them and 3 in the other one?
Couchbase supports such a counter at cluster level as explained here and it has been working perfectly for me without any concurrency issues. I am now migrating to cosmos Db and have been struggling to find how can this be achieved.
Update 1:
I decided to test this. I set up the cosmos emulator in my local mac and created a DB and container with automatically increasing RUs starting from 1 to 10K. Then in this container I added a document like this -
{
"id": "randomId",
"counter": 0
}
Post this I created a simple API whose responsibility is just to increment the counter by 1 every-time it is invoked. Then I used locust to invoke this API multiple times to mimic a small load-like scenario. Initially the test ran fine with each invocation receiving a counter like it is supposed to (in an incremental manner). On increasing the load I saw some errors namely RequestTimeOutException with status code 408. Other requests were still working fine with them getting the correct counter value. I do not understand what caused RequestTimeOut exceptions here. The stack trace hints something to do with concurrency but I am not able to get my head around it. Here's the stack trace-
Update 2: The test run in Update 1 was done on my local machine and I realised I might have resource issues on my local leading to those errors. Decided to test this in a Pre-Prod environment with actual cosmos DB and not emulator.
Test configuration-
- Cosmos DB container with RUs to automatically scale from 400 to 4000
- 2 instances of application sharing the load.
- Locust script to ingest load on the application
Findings-
Up until ~170 TPS, everything was running smoothly. Beyond that I noticed errors belonging to 2 different buckets-
- "exception": "["Request rate is large. More Request Units may be needed, so no changes were made. Please retry this request later. Learn more: http://aka.ms/cosmosdb-error-429"]".
I am not sure how 170 odd patch operations would have exhausted 4000 RUs but that's a different discussion altogether.
- "exception": "["Conflicting request to resource has been attempted. Retry to avoid conflicts."]", with status code 449.
This error clearly indicates that cosmos DB doesn't handle concurrent requests. I want to understand if they maintain a queue internally to handle some requests or they don't handle any concurrent writes at all.