How to get list_blobs to behave like gsutil

I would like to only get the first level of a fake folder structure on GCS.

If I run e.g.:

gsutil ls 'gs://gcp-public-data-sentinel-2/tiles/'

I get a list like this:

gs://gcp-public-data-sentinel-2/tiles/01/
gs://gcp-public-data-sentinel-2/tiles/02/
gs://gcp-public-data-sentinel-2/tiles/03/
gs://gcp-public-data-sentinel-2/tiles/04/
gs://gcp-public-data-sentinel-2/tiles/05/
gs://gcp-public-data-sentinel-2/tiles/06/
gs://gcp-public-data-sentinel-2/tiles/07/
gs://gcp-public-data-sentinel-2/tiles/08/
gs://gcp-public-data-sentinel-2/tiles/09/
gs://gcp-public-data-sentinel-2/tiles/10/
gs://gcp-public-data-sentinel-2/tiles/11/
gs://gcp-public-data-sentinel-2/tiles/12/
gs://gcp-public-data-sentinel-2/tiles/13/
gs://gcp-public-data-sentinel-2/tiles/14/
gs://gcp-public-data-sentinel-2/tiles/15/
...

Running code like the following with the Python API gives me an empty result:

from google.cloud import storage
bucket_name = 'gcp-public-data-sentinel-2'
prefix = 'tiles/'
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
for blob in bucket.list_blobs(max_results=10, prefix=prefix,
                              delimiter='/'):
    print(blob.name)

If I don't use the delimiter option, I get all the results in the bucket, which is not very useful.

Sessile answered 17/7, 2018 at 10:27 Comment(0)

Maybe not the best way, but, inspired by this comment on the official repository:

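# Note: _get_next_page_response() is a private method and may change between releases.
iterator = bucket.list_blobs(delimiter='/', prefix=prefix)
response = iterator._get_next_page_response()
# Only the first page of results is inspected here.
for sub_prefix in response['prefixes']:
    print('gs://' + bucket_name + '/' + sub_prefix)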

Gives:

gs://gcp-public-data-sentinel-2/tiles/01/
gs://gcp-public-data-sentinel-2/tiles/02/
gs://gcp-public-data-sentinel-2/tiles/03/
gs://gcp-public-data-sentinel-2/tiles/04/
gs://gcp-public-data-sentinel-2/tiles/05/
gs://gcp-public-data-sentinel-2/tiles/06/
gs://gcp-public-data-sentinel-2/tiles/07/
gs://gcp-public-data-sentinel-2/tiles/08/
gs://gcp-public-data-sentinel-2/tiles/09/
gs://gcp-public-data-sentinel-2/tiles/10/
...
Wingo answered 17/7, 2018 at 12:30 Comment(1)
What if tiles/ contains a file such as abc.txt? With this approach I see only folders returned in prefixes, whereas gsutil ls also returns tiles/abc.txt as part of the results.
Voltameter
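
One possible way to cover that case is sketched below. It assumes the client exposes list_blobs(bucket_name, prefix=..., delimiter=...) as in google-cloud-storage 1.17.0 and later. Files directly under the prefix come back as blobs when the iterator is consumed, and sub-folders show up in iterator.prefixes afterwards. The helper name ls_one_level is made up for illustration.

```python
def ls_one_level(client, bucket_name, prefix):
    """Return files and sub-folders directly under prefix as gs:// URLs.

    `client` is assumed to behave like google.cloud.storage.Client.
    """
    iterator = client.list_blobs(bucket_name, prefix=prefix, delimiter="/")
    # Iterating yields only the objects at this level (e.g. tiles/abc.txt).
    files = [blob.name for blob in iterator]
    # prefixes is filled in while pages are fetched, so read it after the loop.
    folders = list(iterator.prefixes)
    return ["gs://{}/{}".format(bucket_name, name)
            for name in sorted(files + folders)]
```

With a real client this would be called as ls_one_level(storage.Client(), 'gcp-public-data-sentinel-2', 'tiles/'), after from google.cloud import storage.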

For anyone who, like me, finds this question a long time after it was asked: currently (google-cloud-storage 2.1.0) one can list the bucket contents using '//' instead of '/' as the delimiter. However, this lists "recursively" down to the actual blobs, since GCS is not a real file system.

Throughput answered 10/6, 2022 at 11:27 Comment(0)

Here is a faster way (found in a GitHub thread, posted by @evanj: https://github.com/GoogleCloudPlatform/google-cloud-python/issues/920):

def list_gcs_directories(bucket, prefix):
    """Return the set of sub-folder prefixes directly under prefix."""
    iterator = bucket.list_blobs(prefix=prefix, delimiter='/')
    prefixes = set()
    # Each page carries the prefixes found in that page of results.
    for page in iterator.pages:
        prefixes.update(page.prefixes)
    return prefixes

You want to call this function as follows:

from google.cloud import storage

client = storage.Client()
bucket_name = 'my_bucket_name'
bucket_obj = client.bucket(bucket_name)
list_folders = list_gcs_directories(bucket_obj, prefix='my/prefix/path/within/bucket/')

# Keep only the folder name. Each prefix ends with '/', so strip it
# before splitting; otherwise the last element is an empty string.
list_folders = [indiv_folder.rstrip('/').split('/')[-1]
                for indiv_folder in list_folders]
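
A detail worth checking when reducing the returned prefixes to folder names: each prefix ends with the delimiter, so splitting on '/' and taking the last element gives an empty string unless the trailing slash is removed first. A small sketch of this, with a hypothetical helper last_component and made-up prefixes:

```python
def last_component(prefix):
    """Return the final path component of a '/'-terminated GCS prefix."""
    return prefix.rstrip("/").split("/")[-1]


# Hypothetical prefixes, shaped like the values list_gcs_directories returns.
prefixes = {
    "my/prefix/path/within/bucket/a/",
    "my/prefix/path/within/bucket/b/",
}
folder_names = sorted(last_component(p) for p in prefixes)
print(folder_names)
```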

Oshea answered 1/9, 2022 at 9:22 Comment(0)

Here is the right answer that works

This achieves a simple, one-level listing of a "directory" in a Google Cloud Storage bucket. A directory there is not a real object, only a name prefix shared by blobs.

Sample Link: 'gs://BUCKET_A/FOLDER_1/FOLDER_2/FILE_10.txt'

Function to be used: list_blobs.

Parameters to pass to list_blobs:

  1. bucket_name - Name of the storage bucket. Example: "BUCKET_A"
  2. prefix - Example: "FOLDER_1/FOLDER_2/"
  3. delimiter - The listing does not go beyond the character passed here. For a simple listing, the delimiter has to be '/'. Any object whose name contains a further '/' after the prefix belongs to the next level of the hierarchy, so the API does not return it as a blob and reports its sub-folder under prefixes instead.
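
To make the effect of prefix and delimiter concrete, here is a local emulation in plain Python. It is not the library's implementation, only a sketch of the grouping rule described above. The function name emulate_listing and the object names other than FILE_10.txt are made up for illustration.

```python
def emulate_listing(object_names, prefix="", delimiter="/"):
    """Split object names into (items, prefixes) like a delimited listing."""
    items, prefixes = [], set()
    for name in object_names:
        if not name.startswith(prefix):
            continue  # outside the requested "directory"
        rest = name[len(prefix):]
        cut = rest.find(delimiter) if delimiter else -1
        if cut == -1:
            # No delimiter after the prefix: the object sits at this level.
            items.append(name)
        else:
            # Deeper object: report only its first-level sub-folder.
            prefixes.add(prefix + rest[:cut + len(delimiter)])
    return items, sorted(prefixes)


names = [
    "FOLDER_1/FOLDER_2/FILE_10.txt",
    "FOLDER_1/FOLDER_2/FOLDER_3/FILE_11.txt",  # hypothetical
    "FOLDER_1/FILE_1.txt",                     # hypothetical
]
items, prefixes = emulate_listing(names, prefix="FOLDER_1/FOLDER_2/")
print(items)
print(prefixes)
```

Under this rule a delimiter that never occurs in the names, such as '//', groups nothing, so every object under the prefix is returned as an item.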

Sample code

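from google.cloud import storage

# Example values matching the sample link above.
bucket_name = "BUCKET_A"
prefix = "FOLDER_1/FOLDER_2/"
delimiter = "/"

storage_client = storage.Client()

# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(bucket_name, prefix=prefix, delimiter=delimiter)

# Note: The call returns a response only when the iterator is consumed.
print("Blobs:")
for blob in blobs:
    print(blob.name)

# blobs.prefixes is populated only after the loop above has run.
if delimiter:
    print("Prefixes:")
    for sub_prefix in blobs.prefixes:
        print(sub_prefix)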

To achieve what we need:

  1. Pass the prefix with a trailing slash "/".
  2. Pass the delimiter as "/" so that the listing does not go beyond the current directory.
  3. Process the results in two forms. Say blobs is the return value of list_blobs. Simply iterating over blobs returns the files available at that level. If one wants the subdirectories at that level, iterate over blobs.prefixes.

In summary:

Access the files by iterating over blobs. Access the sub-folders by iterating over blobs.prefixes once blobs has been consumed.

Microscopic answered 6/6 at 6:56 Comment(0)
