Is it better to have many small Azure storage blob containers (each with some blobs) or one really large container with tons of blobs?
K

5

101

So the scenario is the following:

I have multiple instances of a web service, each of which writes blobs of data to Azure Storage. I need to be able to group blobs into a container (or a virtual directory) depending on when they were received. Once in a while (every day at worst), older blobs will get processed and then deleted.

I have two options:

Option 1

I make one container called "blobs" (for example) and then store all the blobs in that container. Each blob uses a directory-style name, with the directory name being the time it was received (e.g. "hr0min0/data.bin", "hr0min0/data2.bin", "hr0min30/data3.bin", "hr1min45/data.bin", ... , "hr23min0/dataN.bin", etc.; a new directory every X minutes). The thing that processes these blobs will process hr0min0 blobs first, then hr0minX, and so on (and new blobs are still being written while older ones are being processed).

Option 2

I have many containers, each with a name based on the arrival time (so the first will be a container called blobs_hr0min0, then blobs_hr0minX, etc.), and all the blobs in a container are the ones that arrived at the named time. The thing that processes these blobs will process one container at a time.
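To make the two naming schemes concrete, here is a minimal Python sketch. The function names and the default bucket size are made up for illustration and are not part of any Azure API. One caveat: Azure container names may only contain lowercase letters, digits and hyphens, so the sketch uses "blobs-hr0min0" rather than "blobs_hr0min0".

```python
from datetime import datetime


def bucket_label(received: datetime, bucket_minutes: int = 30) -> str:
    """Round the arrival time down to the start of its bucket, e.g. 'hr0min30'."""
    minute = (received.minute // bucket_minutes) * bucket_minutes
    return f"hr{received.hour}min{minute}"


def blob_name(received: datetime, filename: str, bucket_minutes: int = 30) -> str:
    """Option 1: one shared container, bucket label used as a virtual directory."""
    return f"{bucket_label(received, bucket_minutes)}/{filename}"


def container_name(received: datetime, bucket_minutes: int = 30) -> str:
    """Option 2: one container per bucket (hyphen, since underscores are not allowed)."""
    return f"blobs-{bucket_label(received, bucket_minutes)}"
```

With a 15-minute bucket, a blob received at 01:47 would be named "hr1min45/data.bin" under option 1, or would go into the container "blobs-hr1min45" under option 2.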

So my question is: which option is better? Does option 2 give me better parallelization (since containers can be on different servers), or is option 1 better because many containers can cause other unknown issues?

Kratz answered 16/11, 2011 at 20:47 Comment(0)
C
67

I don't think it really matters (from a scalability/parallelization perspective), because partitioning in Windows Azure blob storage is done at the blob level, not the container level. Reasons to spread out across different containers have more to do with access control (e.g. SAS) or total storage size.

See here for more details: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx

(Scroll down to "Partitions").

Quoting:

Blobs – Since the partition key is down to the blob name, we can load balance access to different blobs across as many servers in order to scale out access to them. This allows the containers to grow as large as you need them to (within the storage account space limit). The tradeoff is that we don’t provide the ability to do atomic transactions across multiple blobs.

Clothing answered 16/11, 2011 at 22:10 Comment(2)
Please, is there any need to keep the blob name as short as possible? (I have "one really large container with tons of blobs", option 1 in the question.)Teens
Link is broken.Slab
C
71

Everyone has given you excellent answers around accessing blobs directly. However, if you need to list blobs in a container, you will likely see better performance with the many-container model. I just talked with a company that has been storing a massive number of blobs in a single container. They frequently list the objects in the container and then perform actions against a subset of those blobs. They're seeing a performance hit, as the time to retrieve a full listing has been growing.

This might not apply to your scenario, but it's something to consider...
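If you stay with a single container, one way to avoid enumerating everything is to list by name prefix, so that only one time bucket comes back. A rough sketch, assuming the azure-storage-blob v12 Python SDK, whose ContainerClient.list_blobs accepts a name_starts_with argument. The helper name is hypothetical, and whether this actually closes the performance gap for a very large container is something you would need to measure.

```python
from typing import Any, List


def names_in_bucket(container_client: Any, bucket: str) -> List[str]:
    """Return the names of blobs under the virtual directory '<bucket>/'.

    `container_client` is assumed to behave like
    azure.storage.blob.ContainerClient (only list_blobs is used).
    """
    # The trailing slash keeps a prefix such as 'hr0min0' from also
    # matching a longer label that merely starts with the same characters.
    prefix = bucket.rstrip("/") + "/"
    return [blob.name for blob in container_client.list_blobs(name_starts_with=prefix)]
```

The processor from the question would call this once per bucket (for example names_in_bucket(client, "hr0min0")) instead of listing the whole container and filtering on the client side.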

Cuneate answered 16/11, 2011 at 23:40 Comment(3)
This is a good point. At the time of writing (June 2016) I believe there is still no way to obtain a count of the number of blobs in a container other than by getting a list of all blobs in that container and checking the list's Count property.Cuckooflower
Is there any need to keep the blob name as short as possible? (I have "one really large container with tons of blobs", option 1 in the question.)Teens
Exactly the scenario we are trying to avoidRang
D
21

Theoretically speaking, there should be no difference between lots of containers or fewer containers with more blobs. The extra containers can be nice as additional security boundaries (for public anonymous access or different SAS signatures for instance). Extra containers can also make housekeeping a bit easier when pruning (deleting a single container versus targeting each blob). I tend to use more containers for these reasons (not for performance).

Theoretically, the performance impact should not exist. The blob itself (full URL) is the partition key in Windows Azure (and has been for a long time). That is the smallest unit that will be load-balanced across partition servers. So, you could (and often will) have two different blobs in the same container being served by different servers.

Jeremy indicates there is a performance difference between more and fewer containers. I have not dug into those benchmarks enough to explain why that might be the case, but I would suspect other factors (like size, duration of test, etc.) to explain any discrepancies.
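To illustrate the housekeeping point, here is a rough sketch of the two cleanup paths. It assumes a client that behaves like ContainerClient from the azure-storage-blob v12 Python SDK (list_blobs, delete_blob and delete_container are its methods); the two helper names are made up for the example.

```python
from typing import Any


def prune_bucket_in_shared_container(container_client: Any, bucket: str) -> int:
    """Single shared container: list the bucket's blobs, then delete each one."""
    prefix = bucket.rstrip("/") + "/"
    # Materialize the listing first so deletion does not interleave with paging.
    names = [blob.name for blob in container_client.list_blobs(name_starts_with=prefix)]
    for name in names:
        container_client.delete_blob(name)
    return len(names)


def prune_bucket_container(container_client: Any) -> None:
    """One container per bucket: a single call removes the container and its blobs."""
    container_client.delete_container()
```

The first path costs one delete request per blob plus the listing calls; the second is a single request, which is the convenience described above.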

Derision answered 16/11, 2011 at 22:11 Comment(0)
S
8

There is also one more factor that comes into this: price!

Currently, the List and Create Container operations cost the same: US$0.054 per 10,000 calls.

Writing a blob actually costs the same.

So in an extreme case you can pay a lot more if you create and delete many containers.

  • delete is free

You can see the calculator here: https://azure.microsoft.com/en-us/pricing/calculator/
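As a back-of-the-envelope check, the arithmetic looks like this. The default rate is just the figure quoted in this answer and has very likely changed since, so treat it as a placeholder and take the real number from the calculator.

```python
def operation_cost(calls: int, price_per_10k: float = 0.054) -> float:
    """Cost in US$ for `calls` billable operations at a per-10,000-calls rate.

    The default rate is the one quoted in this answer, not a current price.
    """
    return calls / 10_000 * price_per_10k


# Hypothetical workload: one new container per minute for 30 days.
containers_per_month = 60 * 24 * 30  # 43,200 Create Container calls
print(f"create containers: ${operation_cost(containers_per_month):.4f}")

# Hypothetical workload: one million blob writes.
print(f"blob writes:       ${operation_cost(1_000_000):.2f}")
```

At this rate, the container-creation overhead is small next to the blob writes themselves unless you create containers far more often than you write blobs.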

Sadden answered 13/10, 2017 at 9:46 Comment(0)
D
1

https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist#partitioning

Understanding how Azure Storage partitions your blob data is useful for enhancing performance. Azure Storage can serve data in a single partition more quickly than data that spans multiple partitions. By naming your blobs appropriately, you can improve the efficiency of read requests.

Blob storage uses a range-based partitioning scheme for scaling and load balancing. Each blob has a partition key comprised of the full blob name (account+container+blob). The partition key is used to partition blob data into ranges. The ranges are then load-balanced across Blob storage.
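A small illustration of what "range-based" means for the naming schemes in the question. This is only a sketch: the "/" separator, the account name "myaccount" and the plain lexical sort are assumptions made for the example, and the real range boundaries are chosen by the service, not by you.

```python
from typing import Iterable, List


def partition_key(account: str, container: str, blob: str) -> str:
    """Compose account + container + blob into one key (separator is illustrative)."""
    return f"{account}/{container}/{blob}"


def sorted_keys(account: str, container: str, blobs: Iterable[str]) -> List[str]:
    """Order keys lexically, the way a range-partitioned key space is laid out."""
    return sorted(partition_key(account, container, b) for b in blobs)


for key in sorted_keys(
    "myaccount",
    "blobs",
    ["hr1min45/data.bin", "hr0min0/data2.bin", "hr0min30/data3.bin", "hr0min0/data.bin"],
):
    print(key)
```

Blobs that share a prefix such as "hr0min0/" end up next to each other in the key space, whether that prefix is a virtual directory inside one container or part of a per-bucket container name.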

Dilution answered 30/9, 2021 at 0:50 Comment(0)
