Is it possible to limit the depth of a recursive directory listing in S3 bucket?

I used the following command:

aws s3 ls s3://mybucket/mydir --recursive > bigfile

The resulting file was too large (9.5 MB) to work with conveniently, since I need to eyeball the info I'm looking for.

All I really need is the information three levels down. Is it possible to adjust this command so that it only recurses down N levels instead of all the way down every directory? I don't see anything like -maxdepth for the S3 CLI ls command.

Update: Here is the command I ended up using to get the info I needed, though I'm not satisfied with it. It still gave me 77,000 results when I only wanted the 40 or so unique values, but it was short enough to pull into Excel and whittle down with Text to Columns and Remove Duplicates.

 aws s3 ls s3://mybucket/mydir --human-readable --summarize --recursive | egrep '*_keytext_*' | tr -s ' ' | cut -d' ' -f5 >smallerfile
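
If you'd rather skip the Excel step, the deduplication can also be done in the shell by appending sort -u; a sketch of the same pipeline (keeping _keytext_ as a placeholder for the pattern of interest):

 aws s3 ls s3://mybucket/mydir --human-readable --summarize --recursive | egrep '_keytext_' | tr -s ' ' | cut -d' ' -f5 | sort -u > smallerfile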
Edgy answered 24/1, 2019 at 21:17 Comment(0)

Amazon S3 does not have the concept of 'levels'. It is a flat storage system, with the path being part of the object name (Key). Some API calls, however, support the ability to specify a Prefix, which can operate like looking in a particular directory.

An alternative to using aws s3 ls is to use Amazon S3 Inventory, which can provide a daily CSV file listing the contents of a bucket.
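
As a rough sketch of what a Prefix-based listing looks like (aws s3 ls without --recursive does essentially the same thing, one level at a time; the bucket and prefix names here are placeholders):

aws s3api list-objects-v2 --bucket mybucket --prefix mydir/level1/level2/ --delimiter / --query 'Contents[].Key' --output text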

Issi answered 25/1, 2019 at 1:36 Comment(0)

While the accepted answer is strictly true, it's still very useful to have this feature, as evidenced by the bug report on the aws-cli (https://github.com/aws/aws-cli/issues/2683).

I worked around this with a bash script and an awk script. The bash script gets a single level; the awk script parses the output and recursively calls the bash script to get the next level.

#!/bin/bash
# Save as ./s3-tree.sh
bucket=$1; max_depth=$2; path=${3:-}; depth=${4:-1};
[ $depth -gt $max_depth ] || \
  aws s3 ls "s3://$bucket/$path" | \
  awk -v bucket="$bucket" -v path="$path" -v depth="$depth" -v max_depth="$max_depth" -f s3-tree.awk

#!/usr/bin/awk -f
# Save as: ./s3-tree.awk
# Note: FIELDWIDTHS requires GNU awk (gawk); see the comments below.
BEGIN  { FIELDWIDTHS = "10 1 8 1 10 1 600" }
$5 == 0 { next } # Ignore zero-size files
{ print $1 " " $3 " " $5 " " path $7 }
$5 == "       PRE" && depth <= max_depth { system("./s3-tree.sh " bucket " " max_depth " " path $7 " " depth+1); next }

invoke as:

./s3-tree.sh <my-bucket-name> <max-depth> [<starting-path>]

Share and enjoy!

Netsuke answered 2/10, 2021 at 1:3 Comment(4)
it didn't work for me (macOS); it shows results like this: PRE <my-prefix> and that's it.Surra
@KhaledAbuShqear - macOS ships quite different versions of the POSIX tools. You could try brew-installing GNU awk and changing awk to gawk. That's just a very quick guess, but worth isolating first. It's also possible that a change in the tooling has resulted in a different output format: worth checking whether the FIELDWIDTHS still apply correctly.Netsuke
Also not working for me on RHEL (Linux). awk --version is GNU Awk 4.0.2. Any idea how we could figure out the appropriate FIELDWIDTHS?Monarchist
@Monarchist Just run aws s3 ls on one of the levels of the bucket and hopefully the FIELDWIDTHS can be determined from that. This could be a change in the aws cli output, or it may be that it's trying to use the full screen width.Netsuke

Since aws s3 cp provides --include, --exclude, and --dryrun flags, we can use those to achieve an ls limited to a given depth.

For example, for a recursive ls for depth=1:

aws s3 cp --recursive --exclude '*/*' s3://bucket/folder . --dryrun | awk '{print $3}'

For depth=2:

aws s3 cp --recursive --exclude '*/*/*' s3://bucket/folder . --dryrun | awk '{print $3}'

List only .py files for depth=2:

aws s3 cp --recursive --exclude '*' --include '*.py' --exclude '*/*/*' s3://bucket/folder . --dryrun | awk '{print $3}'

Note that the order of the --exclude and --include flags matters.
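
To illustrate: filters that appear later in the command take precedence, so flipping the first two filters of the .py example would exclude everything (same hypothetical bucket/folder as above):

# Prints nothing: the trailing --exclude '*' overrides the earlier --include '*.py'
aws s3 cp --recursive --include '*.py' --exclude '*' s3://bucket/folder . --dryrun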

Rotherham answered 16/2 at 10:32 Comment(1)
Thank you, this is actually a better answer than the accepted one.Lingo

While it isn't a pure shell + aws-cli workaround, the Python package s3fs has a reasonably fast recursive globbing function, and s3fs is pretty mature.

For example:

# pip install s3fs
import s3fs
s3 = s3fs.S3FileSystem()
BUCKET = 'my-bucket-name'  # placeholder bucket name

# one '*/' per directory level keeps the search depth bounded (here: one level down)
s3.glob(f'{BUCKET}/*/*findme*')
# much slower of course
s3.glob(f'{BUCKET}/**/*findme*')

# returns
'''
['BUCKET/checkpoints/findme',
 'BUCKET/checkpoints/temp_findme_other',
 'BUCKET/tables/findme',
 'BUCKET/tables/temp_findme_other']
'''
Sauna answered 23/1 at 18:9 Comment(0)
