How do I list all the top-level folders in a given GCS bucket?

I start with

    client = storage.Client()
    bucket = client.get_bucket(BUCKET_NAME)

    <what's next? Need something like client.list_folders(path)>

I know how to:

  1. list all the blobs (including blobs in sub-sub-sub-folders, of any depth) with bucket.list_blobs()

  2. or how to list all the blobs recursively in a given folder with bucket.list_blobs(prefix=<path to subfolder>) (see the snippet below)

But what if my file system structure has 100 top-level folders, each holding thousands of files? Is there an efficient way to get only those 100 top-level folder names without listing all the blobs inside?
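For reference, here is what those two calls look like (the subfolder path below is just a placeholder):

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket(BUCKET_NAME)

    # 1. every blob in the bucket, at any depth
    all_blobs = list(bucket.list_blobs())

    # 2. every blob under one folder, recursively (placeholder path)
    subfolder_blobs = list(bucket.list_blobs(prefix="path/to/subfolder/"))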

Herd answered 30/12, 2019 at 5:47 Comment(3)
Yes, by only processing the prefixes returned. I do not have an example to post. Google Cloud Storage uses prefix and separator to facilitate listing objects. Hopefully, this tip will help you. – Taryntaryne
@JohnHanley excellent tip regarding "prefixes". This won't load, however, until you iterate over the first element of list_blobs. PS: it's actually delimiter, not separator, on GCS, but we know what you mean. – Troublous
Official documentation: cloud.google.com/storage/docs/json_api/v1/objects/list – Regularly

I do not think you can get the 100 top-level folders without listing all the inside blobs. Google Cloud Storage does not have folders or subdirectories; the library just creates the illusion of a hierarchical file tree.

I used this simple code:

from google.cloud import storage

storage_client = storage.Client()
blobs = storage_client.list_blobs('my-project')
res = []

for blob in blobs:
    if blob.name.split('/')[0] not in res:
        res.append(blob.name.split('/')[0])

print(res)
Summerwood answered 30/12, 2019 at 8:49 Comment(5)
"Google Cloud Storage does not have folders or subdirectories" -> that what I thought as well. thanks!Herd
This is not true - you missed the delimiter parameter.Yapok
Please add a working solution for this case if my statement is not trueSummerwood
Not true. It can be done without listing all the inside blobs.Troublous
Without using delimiter argument, this can be very inefficient. Also, the OP's question was how to list directories only, and that can be done using: if not isinstance(blob, str) (Also, checking if the name is already in res is a bit superfluous, because you can't have multiple folders with the same name.)Knightly

All the responses here have a piece of the answer, but you need to combine them: the prefix, the delimiter, and the prefixes attribute of a loaded list_blobs(...) iterator. Let me throw down the code to get the 100 top-level folders, and then we'll walk through it.

import google.cloud.storage as gcs
client = gcs.Client()
blobs = client.list_blobs(
    bucket_or_name=BUCKET_NAME, 
    prefix="", 
    delimiter="/", 
    max_results=1
)
next(blobs, ...) # Force list_blobs to make the api call (lazy loading)
# prefixes is now a set, convert to list
print(list(blobs.prefixes)[:100])

In the first eight lines we build the GCS client and make the client.list_blobs(...) call. In your question you mention the bucket.list_blobs(...) method - as of version 1.43 this still works, but the page on Buckets in the docs says it is now deprecated. The only difference is the keyword arg bucket_or_name, on line 4.

We want folders at the top level, so we don't actually need to specify prefix at all. However, it is useful for other readers to know that if you wanted to list the folders inside a top-level directory stuff, you should specify a trailing slash: the kwarg would then become prefix="stuff/".
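A quick sketch of that variant, reusing the client from above (stuff is a hypothetical folder name):

# Hypothetical example: list the sub-folders of the top-level directory "stuff/"
blobs = client.list_blobs(
    bucket_or_name=BUCKET_NAME,
    prefix="stuff/",   # note the trailing slash
    delimiter="/",
)
next(blobs, ...)       # force the lazy API call, as before
print(blobs.prefixes)  # e.g. {'stuff/sub0/', 'stuff/sub1/'}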

Someone already mentioned the delimiter kwarg but, to reiterate, you should specify it so GCS knows how to interpret the blob names as directories. Simple enough.

The max_results=1 is for efficiency. Remember that we don't want blobs here, we want only folder names. Therefore, if we tell GCS to stop looking once it finds a single blob, it might be faster. In practice, I have not found this to be the case, but it could easily be if you have vast numbers of blobs, or if the storage class is Coldline, or whatever. YMMV. Consider it optional.

The blobs object returned is a lazy-loading iterator, which means that it won't load - it won't even populate its members - until the first API call is made. To trigger that first call, we ask for the next element in the iterator. In your case, you know you have at least one file, so simply calling next(blobs) will work. It fetches the blob that is next in line (at the front of the line) and then throws it away.

However, if you cannot guarantee at least one blob, then next(blobs), which needs to return something from the iterator, will raise a StopIteration exception. To get around this, we pass the ellipsis ... as the default value.

Now that the member of blobs we want, prefixes, is loaded, we print out the first 100. The output will be something like:

{'dir0/','dir1/','dir2/', ...}
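If you want bare folder names rather than prefixes, stripping the trailing slash is enough:

folders = sorted(p.rstrip("/") for p in blobs.prefixes)
print(folders[:100])  # e.g. ['dir0', 'dir1', 'dir2', ...]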
Troublous answered 19/10, 2021 at 9:19 Comment(3)
This is awesome, thank you. Just a note: when I tried this exact code I got TypeError: 'set' object is not subscriptable for blobs.prefixes[:100], so it seems like prefixes is now a set. No big deal, since the results can just be converted to a list. I couldn't find anything in the docs specifying why this changed. – Simile
@Simile thanks for the pointer, I have updated my code with your suggestion. Hero! – Troublous
Be careful: in my GCP bucket the proposed method can give a different number of results when changing the max_results param to a higher number. And it has nothing to do with limiting the output to 100. – Menhaden

I have tried the following solution proposed in another answer:

import google.cloud.storage as gcs
client = gcs.Client()
blobs = client.list_blobs(
    bucket_or_name=BUCKET_NAME, 
    prefix="", 
    delimiter="/", 
    max_results=1
)
next(blobs) 
print(blobs.prefixes)

But without success. As others have noted, only some results are fetched when using max_results=1, and it seems blobs.prefixes only returns the folders that were part of those results. Without max_results set, I got the correct list of folders, but the call was rather slow (> 2 s in my case). I assume it had to send a list of all blobs to the client (~10k blobs in my case).

Using the following code, I was able to get the top-level folders in reasonable time (~0.3 s). I assume the match_glob is evaluated server-side, hence more efficient (note that match_glob requires a reasonably recent version of the google-cloud-storage client):

blobs = client.list_blobs(
    bucket_or_name=BUCKET_NAME,
    match_glob="**/",
    delimiter="/",
)
list(blobs)  # dummy call to trigger evaluation of lazy iterator
print(blobs.prefixes)
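Wrapped up as a small helper (the same calls as above, just packaged; BUCKET_NAME is whatever your bucket is called):

import google.cloud.storage as gcs

def list_top_level_folders(bucket_name):
    """Return sorted top-level folder names via a server-side glob match."""
    client = gcs.Client()
    blobs = client.list_blobs(
        bucket_or_name=bucket_name,
        match_glob="**/",
        delimiter="/",
    )
    list(blobs)  # consume the lazy iterator so .prefixes is populated
    return sorted(p.rstrip("/") for p in blobs.prefixes)

print(list_top_level_folders(BUCKET_NAME))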
Pomfret answered 26/9, 2024 at 16:00 Comment(0)

You can get the top-level prefixes by using a delimited listing. See the list_blobs documentation:

delimiter (str) – (Optional) Delimiter, used with prefix to emulate hierarchy.

Something like this:

from google.cloud import storage
storage_client = storage.Client()
storage_client.list_blobs(BUCKET_NAME, delimiter='/')
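Note that the prefixes only populate once the iterator has been consumed, so in full this would look something like:

from google.cloud import storage

storage_client = storage.Client()
iterator = storage_client.list_blobs(BUCKET_NAME, delimiter='/')
for _ in iterator:
    pass  # with delimiter='/', this yields only blobs at the bucket root
print(iterator.prefixes)  # the top-level "folders"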
Yapok answered 30/12, 2019 at 17:53 Comment(3)
For me your code is not working; this is what works: storage_client.list_blobs('my-bucket', prefix='source/', delimiter='/') – Summerwood
@Summerwood - are you saying that delimiter doesn't work without a prefix? That should not be the case. – Yapok
Doesn't work for me; it somehow still lists all the blobs in the subdirectory. The documentation that you posted is also very vague. – Metabolism
