Getting blob count in an Azure Storage container
What is the most efficient way to get the count on the number of blobs in an Azure Storage container?

Right now I can't think of any way other than the code below:

CloudBlobContainer container = GetContainer("mycontainer");
var count = container.ListBlobs().Count();
Kattiekatuscha answered 28/7, 2011 at 15:49 Comment(0)
11

The API doesn't contain a container count method or property, so you'd need to do something like what you posted. However, you'll need to deal with NextMarker if the listing exceeds 5,000 items (or if you specify a maximum number to return and the list exceeds that number). You'd then make additional calls based on NextMarker and add up the counts.

EDIT: Per smarx: the SDK should take care of NextMarker for you. You'll need to deal with NextMarker if you're working at the API level, calling List Blobs through REST.
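At the REST level, the NextMarker loop can be simulated in a few lines of Python (a minimal sketch: `fetch_page` and `make_fake_service` are hypothetical stand-ins for the List Blobs call, not SDK methods):

```python
# Sketch of the REST-level counting loop. fetch_page stands in for one
# "List Blobs" round trip and is a hypothetical stub, not an Azure SDK call.
def count_blobs(fetch_page):
    """Sum item counts across pages, following NextMarker until exhausted."""
    count = 0
    marker = None
    while True:
        items, marker = fetch_page(marker)  # one "List Blobs" request
        count += len(items)
        if not marker:  # an empty NextMarker means the listing is complete
            return count

# Simulated service: pages of up to 5,000 items plus a final partial page.
def make_fake_service(total, page_size=5000):
    blobs = [f"blob-{i}" for i in range(total)]
    def fetch_page(marker):
        start = int(marker) if marker else 0
        page = blobs[start:start + page_size]
        next_marker = str(start + page_size) if start + page_size < total else None
        return page, next_marker
    return fetch_page

print(count_blobs(make_fake_service(15042)))  # 15042
```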

Alternatively, if you control the blob insertions/deletions (through a WCF service, for example), you can use the blob container's metadata area to store a cached container count that you update with each insert or delete. You'll just need to deal with write concurrency to the container.

Leathery answered 28/7, 2011 at 16:13 Comment(3)
I'm pretty sure ListBlobs will automatically follow the continuation tokens. (So I don't think you need to do anything explicit with NextMarker to get this to work.)Kai
Oops! I'm spending too much time at the API level, it seems... :)Leathery
Similarly, how do you get the total size of a container without iterating each blob and summing blob.Properties.ContentLength.Value? My use case is to get the high-level total count and size of each container, each containing around 10M files and 5 TB of data.Autotype
48

If you just want to know how many blobs are in a container without writing code you can use the Microsoft Azure Storage Explorer application.

  1. Open the desired BlobContainer
  2. Click the Folder Statistics icon
  3. Observe the count of blobs in the Activities window
Subtangent answered 14/2, 2018 at 21:37 Comment(3)
The statistics are only available for normal storage accounts. They are not available if ADLS Gen2 is activated.Elianaelianora
@Elianaelianora OTOH listing files is blazingly fast with the ADLSgen2 API, compared to the blob API.Ferocity
This takes forever on large blob storages that have a lot of files.Evzone
16

I tried counting blobs using ListBlobs(), and for a container with about 400,000 items it took well over 5 minutes.

If you have complete control over the container (that is, you control when writes occur), you could cache the size information in the container metadata and update it every time an item gets removed or inserted. Here is a piece of code that would return the container blob count:

static int CountBlobs(string storageAccount, string containerId)
{
    CloudStorageAccount cloudStorageAccount = CloudStorageAccount.Parse(storageAccount);
    CloudBlobClient blobClient = cloudStorageAccount.CreateCloudBlobClient();
    CloudBlobContainer cloudBlobContainer = blobClient.GetContainerReference(containerId);

    cloudBlobContainer.FetchAttributes();

    string count = cloudBlobContainer.Metadata["ItemCount"];
    string countUpdateTime = cloudBlobContainer.Metadata["CountUpdateTime"];

    bool recountNeeded = false;

    if (String.IsNullOrEmpty(count) || String.IsNullOrEmpty(countUpdateTime))
    {
        recountNeeded = true;
    }
    else
    {
        DateTime dateTime = new DateTime(long.Parse(countUpdateTime));

        // Are we close to the last modified time?
        if (Math.Abs(dateTime.Subtract(cloudBlobContainer.Properties.LastModifiedUtc).TotalSeconds) > 5)
        {
            recountNeeded = true;
        }
    }

    int blobCount;
    if (recountNeeded)
    {
        blobCount = 0;
        BlobRequestOptions options = new BlobRequestOptions();
        options.BlobListingDetails = BlobListingDetails.Metadata;

        foreach (IListBlobItem item in cloudBlobContainer.ListBlobs(options))
        {
            blobCount++;
        }

        cloudBlobContainer.Metadata.Set("ItemCount", blobCount.ToString());
        cloudBlobContainer.Metadata.Set("CountUpdateTime", DateTime.Now.Ticks.ToString());
        cloudBlobContainer.SetMetadata();
    }
    else
    {
        blobCount = int.Parse(count);
    }

    return blobCount;
}

This, of course, assumes that you update ItemCount/CountUpdateTime every time the container is modified. CountUpdateTime is a heuristic safeguard (if the container did get modified without someone updating CountUpdateTime, this will force a re-count) but it's not reliable.

Custos answered 21/12, 2011 at 0:9 Comment(3)
If this approach is used in a system where calls can be executed in parallel, e.g. a web API, then you run into a race condition around who last updated the values. Another approach might be to store the file names in an Azure Storage table as an index.Pigmy
Ok, maybe not Storage Tables because it doesn't have a native count method, only "get all items". Maybe a DocumentDB table or a relatively more expensive SQL table.Pigmy
Or, since the blobs and table entities have ETags for detecting concurrency issues you could have 1 blob/entity with the count or list of file names.Pigmy
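The ETag idea from the last comment can be illustrated with a toy in-memory store (pure Python, no Azure calls; `MetadataStore` and `increment_count` are made-up names for illustration — the real equivalents would be FetchAttributes/SetMetadata guarded by an If-Match access condition):

```python
import itertools

# Toy in-memory "container metadata" with an ETag, illustrating the
# optimistic-concurrency pattern: read the count with its ETag, then
# write back conditionally and retry if another writer got there first.
class MetadataStore:
    def __init__(self):
        self._etags = itertools.count(1)
        self.etag = next(self._etags)
        self.metadata = {"ItemCount": "0"}

    def read(self):
        return self.etag, dict(self.metadata)

    def write(self, metadata, if_match):
        if if_match != self.etag:  # someone else wrote in between
            raise RuntimeError("precondition failed (412)")
        self.metadata = dict(metadata)
        self.etag = next(self._etags)

def increment_count(store, delta, max_retries=5):
    """Read-modify-write the cached count, retrying on ETag mismatch."""
    for _ in range(max_retries):
        etag, md = store.read()
        md["ItemCount"] = str(int(md["ItemCount"]) + delta)
        try:
            store.write(md, if_match=etag)
            return int(md["ItemCount"])
        except RuntimeError:
            continue  # lost the race; re-read and try again
    raise RuntimeError("too many concurrent writers")

store = MetadataStore()
increment_count(store, +1)
increment_count(store, +1)
increment_count(store, -1)
print(store.metadata["ItemCount"])  # 1
```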
3

Example using PHP API and getNextMarker.

Counts the total number of blobs in an Azure container. It takes a long time: about 30 seconds for 100,000 blobs.

(assumes we have a valid $connectionString and a $container_name)

$blobRestProxy = ServicesBuilder::getInstance()->createBlobService($connectionString);
$opts = new ListBlobsOptions();
$nblobs = 0;
$cont = true;

while ($cont) {
    $blob_list = $blobRestProxy->listBlobs($container_name, $opts);
    $nblobs += count($blob_list->getBlobs());
    $nextMarker = $blob_list->getNextMarker();

    if (!$nextMarker || strlen($nextMarker) == 0) {
        $cont = false;
    } else {
        $opts->setMarker($nextMarker);
    }
}
echo $nblobs;
Kerwon answered 23/7, 2013 at 8:35 Comment(0)
2

If you are not using virtual directories, the following will work as previously answered.

CloudBlobContainer container = GetContainer("mycontainer");
var count = container.ListBlobs().Count();

However, the above code snippet may not have the desired count if you are using virtual directories.

For instance, if your blobs are stored like /container/directory/filename.txt (where the blob name is directory/filename.txt), container.ListBlobs().Count() will only count how many top-level virtual directories (e.g. "/directory") you have. If you want to count the blobs contained within virtual directories, you need to set useFlatBlobListing = true in the ListBlobs() call.

CloudBlobContainer container = GetContainer("mycontainer");
var count = container.ListBlobs(null, true).Count();

Note: the ListBlobs() call with useFlatBlobListing = true is a much more expensive/slow call...
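The undercount is easy to see with a small simulation of delimiter-based listing (plain Python, no Azure calls; `list_hierarchical` is a made-up helper for illustration):

```python
# Pure-Python illustration of why a hierarchical (delimiter-based) listing
# undercounts: every name containing '/' collapses into one virtual directory.
def list_hierarchical(names, delimiter="/"):
    """Top-level view: blobs without the delimiter plus one entry per prefix."""
    top = set()
    for name in names:
        head, sep, _ = name.partition(delimiter)
        top.add(head + delimiter if sep else head)
    return sorted(top)

blobs = ["directory/a.txt", "directory/b.txt", "other/c.txt", "root.txt"]

print(len(list_hierarchical(blobs)))  # 3  (two virtual dirs + one root blob)
print(len(blobs))                     # 4  (flat listing counts every blob)
```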

Adrell answered 6/10, 2015 at 14:23 Comment(1)
I have completely different experience: it's much faster to do the flat listing than listing root, then list each subfolder separately with root prefix.Paraphrastic
2

Bearing in mind all the performance concerns from the other answers, here is a version for v12 of the Azure SDK leveraging IAsyncEnumerable. This requires a package reference to System.Linq.Async.

public async Task<int> GetBlobCount()
{
    var container = await GetBlobContainerClient();
    var blobsPaged = container.GetBlobsAsync();
    return await blobsPaged
        .AsAsyncEnumerable()
        .CountAsync();
}
Vibraculum answered 18/3, 2021 at 22:14 Comment(3)
Similarly, do you know how to get the total size of a container without iterating each blob and summing blob.Properties.ContentLength.Value? My use case is to get the high-level total count and size of each container, each containing around 10M files and 5 TB of data.Autotype
Perhaps AggregateAsync? https://mcmap.net/q/409364/-how-to-aggregate-results-of-an-iasyncenumerable I guess it will be pretty slow.Vibraculum
I think the correct value would be returned not by await blobsPaged.AsAsyncEnumerable().CountAsync() but rather by await blobsPaged.SumAsync(page => page.Values.Count) since it's a paged result and each page might contain 1 or more blobs.Idioblast
1

With Python API of Azure Storage it is like:

from azure.storage import *
blob_service = BlobService(account_name='myaccount', account_key='mykey')
blobs = blob_service.list_blobs('mycontainer')
len(blobs)  # number of blobs returned (note: a single list_blobs call returns at most 5,000 results)
Calen answered 22/8, 2014 at 16:36 Comment(3)
This isn't correct. list_blobs has an upper limit of 5,000Cohesive
For first request it usually returns all blobs but @Cohesive is right for subsequent requests you still have limit of 5,000.Factory
So.... what's the answer? Does the python API just return the first 5000, or does it return everything? Is there a way to return everything from the Python API or is it buggy?Coypu
1

If you are using Azure.Storage.Blobs library, you can use something like below:

public int GetBlobCount(string containerName)
{
    int count = 0;
    BlobContainerClient container = new BlobContainerClient(blobConnctionString, containerName);
    container.GetBlobs().ToList().ForEach(blob => count++);
    return count;
}
Euripus answered 9/3, 2022 at 13:36 Comment(1)
You could shorten that to just container.GetBlobs().Count() I believeDrily
1

With azure-cli it would be as follows:

az storage blob list --account-name <name> --container-name <name> --num-results "*" --query "length(@)"
Tacita answered 6/6, 2023 at 21:47 Comment(0)
1

This answer is for someone with large blob storage with millions of blobs.

The top-rated answer in this thread is pretty much unusable with large blob storages. The Azure Storage Explorer application simply calls the List Blobs API under the hood, which is paginated and returns at most 5,000 records at a time. If you have millions of blobs, this will take forever to return the blob count.

If you are OK with an approximate value, the storage browser option in the Azure portal is extremely useful. However, note that this value is not very accurate on blob storages with high write/delete rates.

This data should be visible by default. If not, enable the diagnostics metrics under Monitoring -> Diagnostic settings (classic): turn the status on and enable the hour metrics.

If you want more accurate results, then the only option is to enable the blob storage inventory report. The downside is that this is a background job, and the report can be generated only once per day. For large blob storages, my suggestion is to generate a parquet report every day and, when you need to inspect or read it, use DBeaver (along with DuckDB), Databricks, or Synapse.

If you do not wish to use the inventory report, a PowerShell script can achieve something similar. However, this can take many hours to return the blob count on large blob storages.

Evzone answered 26/6, 2023 at 5:39 Comment(0)
0

Another Python example; it works slowly but correctly with more than 5,000 files:

from azure.storage.blob import BlobServiceClient

constr="Connection string"
container="Container name"

blob_service_client = BlobServiceClient.from_connection_string(constr)
container_client = blob_service_client.get_container_client(container)
blobs_list = container_client.list_blobs()

num = 0
size = 0
for blob in blobs_list:
    num += 1
    size += blob.size
    print(blob.name,blob.size)

print("Count: ", num)
print("Size: ", size)
Tobey answered 18/6, 2020 at 13:27 Comment(0)
0

I spent quite some time finding the solution below, and I don't want someone like me to waste that time, so I'm replying here even after 9 years.

package com.sai.koushik.gandikota.test.app;

import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.blob.*;


public class AzureBlobStorageUtils {


    public static void main(String[] args) throws Exception {
        AzureBlobStorageUtils getCount =  new AzureBlobStorageUtils();
        String storageConn = "<StorageAccountConnection>";
        String blobContainerName = "<containerName>";
        String subContainer =  "<subContainerName>";
        Integer fileContainerCount = getCount.getFileCountInSpecificBlobContainersSubContainer(storageConn,blobContainerName, subContainer);
        System.out.println(fileContainerCount);
    }

    public Integer getFileCountInSpecificBlobContainersSubContainer(String storageConn, String blobContainerName, String subContainer) throws Exception {
        try {
            CloudStorageAccount storageAccount = CloudStorageAccount.parse(storageConn);
            CloudBlobClient blobClient = storageAccount.createCloudBlobClient();
            CloudBlobContainer blobContainer = blobClient.getContainerReference(blobContainerName);
            CloudBlobDirectory directory = (CloudBlobDirectory) blobContainer
                    .listBlobsSegmented()
                    .getResults()
                    .stream()
                    .filter(listBlobItem -> listBlobItem.getUri().toString().contains(subContainer))
                    .findFirst()
                    .get();
            return directory.listBlobsSegmented().getResults().size();
        } catch (Exception e) {
            throw new Exception(e.getMessage());
        }
    }

}


Expressman answered 29/7, 2020 at 0:25 Comment(1)
listBlobsSegmented, gets the first 5000, at least in the earlier SDKs.Canvas
0

Count all blobs in a classic and new blob storage account. Building on @gandikota-saikoushik, this solution works for blob containers with a very large number of blobs.

// Setup: set these values from the Azure Portal.
var accountName = "<ACCOUNTNAME>";
var accountKey = "<ACCOUNTKEY>";
var containerName = "<CONTAINERNAME>";
var uristr = $"DefaultEndpointsProtocol=https;AccountName={accountName};AccountKey={accountKey}";

var storageAccount = Microsoft.WindowsAzure.Storage.CloudStorageAccount.Parse(uristr);
var client = storageAccount.CreateCloudBlobClient();
var container = client.GetContainerReference(containerName);
var blobcount = CountBlobs(container).ConfigureAwait(false).GetAwaiter().GetResult();
Console.WriteLine($"blobcount:{blobcount}");


public static async Task<int> CountBlobs(CloudBlobContainer container)
{
    BlobContinuationToken continuationToken = null;
    var result = 0;
    do
    {
        var response = await container.ListBlobsSegmentedAsync(continuationToken);
        continuationToken = response.ContinuationToken;
        result += response.Results.Count();
    }
    while (continuationToken != null);

    return result;
}
Canvas answered 4/3, 2022 at 7:35 Comment(0)
0

The list-blobs approach is accurate but slow if you have millions of blobs. Another way, which works in a few cases and is relatively fast, is to query the MetricsHourPrimaryTransactionsBlob table. It is at the account level, and metrics are aggregated hourly.

https://learn.microsoft.com/en-us/azure/storage/common/storage-analytics-metrics

Tenishatenn answered 14/8, 2022 at 8:7 Comment(0)
0

You can use this

public static async Task<List<IListBlobItem>> ListBlobsAsync()
{
    BlobContinuationToken continuationToken = null;
    List<IListBlobItem> results = new List<IListBlobItem>();
    CloudBlobContainer container = GetContainer("containerName");
    do
    {
        var response = await container.ListBlobsSegmentedAsync(null,
            true, BlobListingDetails.None, 5000, continuationToken, null, null);

        continuationToken = response.ContinuationToken;
        results.AddRange(response.Results);
    } while (continuationToken != null);

    return results;
}

and then call

var count = (await ListBlobsAsync()).Count;

hope it will be useful

Structure answered 22/10, 2022 at 19:9 Comment(0)
