Azure Data Factory copy from Azure Blob to Cosmos DB is slow

I have a BlockBlob in Premium Azure Storage. It's a 500 MB zip file containing around 280 million phone numbers in CSV format.

I've created a Pipeline in ADF to unzip this and copy the entries into Cosmos DB SQL API, but it took 40 hours to complete. The goal is to update the DB nightly with a diff in the information.

My Storage Account and Cosmos DB are located in the same region. The Cosmos DB partition key is the area code and that seems to distribute well.

I'm currently at 20,000 RUs; I've scaled up a few times, but the portal keeps telling me to scale more. It's suggesting 106,000 RUs, which is about $6K a month. Any ideas on practical ways I can speed this up?
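
(For context on that $6K figure: assuming the standard single-region provisioned-throughput rate of roughly $0.008 per 100 RU/s per hour, which is worth double-checking against the current Azure pricing page, the math works out roughly like this.)

```python
# Back-of-envelope monthly cost of provisioned throughput.
# The rate below is an assumption (standard single-region pricing);
# verify it against the current Azure Cosmos DB pricing page.
RATE_PER_100_RUS_PER_HOUR = 0.008  # USD
HOURS_PER_MONTH = 730

def monthly_cost(ru_per_second: int) -> float:
    return ru_per_second / 100 * RATE_PER_100_RUS_PER_HOUR * HOURS_PER_MONTH

print(f"${monthly_cost(20_000):,.0f}")   # ~$1,168/month at the current 20,000 RU/s
print(f"${monthly_cost(106_000):,.0f}")  # ~$6,190/month at the suggested 106,000 RU/s
```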

[Image: Azure Data Factory pipeline]

-- Update.

I've tried importing the unzipped file, but it doesn't appear to be any faster; slower, in fact, despite reporting more peak connections. [Image: importing the unzipped file]

I'm now trying to dynamically scale the RUs up to a very high number when it's time to start the transfer and back down afterwards. I'm still playing with the numbers; I'm not sure of the formula to determine how many RUs I need to transfer this 10.5 GB in X minutes.
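
For anyone following along, a back-of-envelope way to size this, assuming each small (~1 KB) insert costs on the order of 10 RU (the real charge depends on document size and indexing policy, so measure it from an actual insert's response first):

```python
# Rough RU/s needed to bulk-load N documents within a target window.
DOCS = 280_000_000       # ~280 million phone-number rows
RU_PER_WRITE = 10        # assumed for a small (~1 KB) document; measure the real charge
TARGET_MINUTES = 120     # e.g. finish the nightly load in 2 hours

writes_per_second = DOCS / (TARGET_MINUTES * 60)
required_rus = writes_per_second * RU_PER_WRITE
print(f"{writes_per_second:,.0f} writes/s -> ~{required_rus:,.0f} RU/s")
# prints: 38,889 writes/s -> ~388,889 RU/s
```

Even with generous assumptions, rewriting all 280 million rows doesn't fit in a short window at a sane RU budget, which is another argument for loading only the nightly diff rather than the whole file.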

Lendlease, answered 20/6/2019 at 18:53 (3 comments)
Thanks Greg. Will kick that off tonight. – Lendlease
The suggestion for RU is based on your ingestion rate (and related throttling that's occurring). And it's only a suggestion, not something required. But if you are exceeding allotted RU, then you will be throttled; no other way to increase perf, aside from making sure your content fits across multiple physical partitions, to allow for in-parallel writes (RU is divided across your physical partitions). If you're only running such a job for a short period of time, you can scale up, do your import, then scale down again. You wouldn't need to keep your RU rate constant. – Penn
Thanks David. This is what I've learned. I'm working on finding the upper limit of what I need to scale. However, I'm trying to figure out how long these scale requests take to process. I've also learned that writes are way more expensive than reads, so I'm hoping once the DB is loaded I can perform reads on subsequent runs to just maintain the delta instead of always writing out the entire DB. – Lendlease

I ended up dynamically scaling the throughput with Azure Functions. Yes, Cosmos DB would have been very expensive if I'd left the RUs that high, but I only need them that high during the data ingestion and can scale back down afterwards. I use a Logic App that calls an Azure Function to scale the RUs up and then kicks off my Azure Data Factory pipeline; when the pipeline is done, it calls the Azure Function again to scale back down.
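
The scaling function itself can be quite small. A minimal sketch, assuming the Python azure-cosmos v4 SDK, an HTTP-triggered Function, and throughput provisioned at the container level; the app-setting names, database/container names, and the throughput query parameter here are placeholders:

```python
import os

import azure.functions as func
from azure.cosmos import CosmosClient


def main(req: func.HttpRequest) -> func.HttpResponse:
    """HTTP-triggered function the Logic App calls before/after the ADF pipeline
    to change the container's provisioned throughput."""
    # App settings and database/container names below are placeholders.
    client = CosmosClient(os.environ["COSMOS_ENDPOINT"], os.environ["COSMOS_KEY"])
    container = client.get_database_client("phonebook").get_container_client("numbers")

    # e.g. ?throughput=100000 before the load, ?throughput=20000 when it's done.
    target = int(req.params.get("throughput", "20000"))
    container.replace_throughput(target)

    return func.HttpResponse(f"Throughput set to {target} RU/s", status_code=200)
```

If throughput is provisioned at the database level instead, the replace_throughput call goes on the database proxy. Also note that scaling far beyond what the container has previously been provisioned at can trigger physical partition splits, so the scale-up can take a while to finish before the pipeline should start.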

Lendlease, answered 19/7/2019 at 7:58 (1 comment)
Which Cosmos DB library do you use? – Erastian
