Google Cloud Storage: How to get list of new files in bucket/folder using gsutil
I have a bucket/folder into which a lot of files arrive every minute. How can I read only the new files, based on file timestamp?

e.g. list all files with timestamp > my_timestamp

Chanson answered 17/5, 2017 at 6:45 Comment(0)

This is not a feature that gsutil or the GCS API provides, as there is no way to list objects by timestamp.

Instead, you could subscribe to new objects using the GCS Cloud Pub/Sub feature.
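A minimal sketch of that approach (bucket and topic names are placeholders): attach a Pub/Sub notification config to the bucket so that every newly finalized object publishes a message, instead of polling with `gsutil ls`.

```shell
# Create a notification config on the bucket: every OBJECT_FINALIZE event
# (a new or overwritten object) publishes a JSON message to the given
# Cloud Pub/Sub topic. Bucket and topic names are placeholders.
gsutil notification create \
    -t your-topic \
    -f json \
    -e OBJECT_FINALIZE \
    gs://your-bucket-name

# Verify the notification config was attached
gsutil notification list gs://your-bucket-name
```

A subscriber on `your-topic` then receives one message per new object, with the object name and creation time in the payload, which avoids the listing-by-timestamp problem entirely.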

Zwiebel answered 17/5, 2017 at 18:16 Comment(1)
+1. See this question for an example of how to set this up using gsutil: #43075334Porphyrin

You could use some bash-fu:

gsutil ls -l gs://<your-bucket-name> | sort -k2n | tail -n1 | awk 'END {$1=$2=""; sub(/^[ \t]+/, ""); print }'

breaking that down:

# grab detailed list of objects in bucket
gsutil ls -l gs://your-bucket-name 

# sort by number on the date field
sort -k2n

# grab the last row returned 
tail -n1

# delete first two cols (size and date) and ltrim to remove whitespace
awk 'END {$1=$2=""; sub(/^[ \t]+/, ""); print }'

Tested with Google Cloud SDK v186.0.0, gsutil v4.28
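To see what the pipeline does without touching a real bucket, you can feed it canned `gsutil ls -l` output (the object names below are made up). This variant also drops the trailing `TOTAL` summary line first, and relies on the fact that ISO-8601 timestamps sort chronologically even as plain text:

```shell
# Demo of the pipeline on canned `gsutil ls -l` output (made-up objects)
printf '%s\n' \
  '   1453783  2020-03-02T19:25:16Z  gs://bucket/a.txt' \
  '   2276224  2020-03-05T11:02:00Z  gs://bucket/b.txt' \
  'TOTAL: 2 objects, 3730007 bytes (3.6 MiB)' |
  sed '$d' |    # drop the trailing TOTAL summary line
  sort -k2 |    # ISO-8601 timestamps sort chronologically as text
  tail -n1 |    # keep the newest entry
  awk '{$1=$2=""; sub(/^[ \t]+/, ""); print}'   # strip size and date
# prints: gs://bucket/b.txt
```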

Congregate answered 17/1, 2018 at 10:25 Comment(4)
this solution is very brittle, as it will break easily if google decides to change the format a little bitDamselfly
@remisharoon You never asked for super robust in your original question. I'll add the gsutil version number to my answer so that people get an idea.Congregate
Update that works for me on newer Google CloudSDK 190.0.1 and gsutil 4.28: gsutil ls -l gs://your-bucket-name | sort -k2 | tail -n2 | head -n1 | awk 'END {$1=$2=""; sub(/^[ \t]+/, ""); print }'Lustreware
is it possible to choose asc or dsc order while ouput the sort resultsPurpura

To find the list of files with a timestamp greater than a specified timestamp, use the command below, replacing <bucket-name/folder> and <my_timestamp in epoch> with the required values:

Mac:

gsutil ls -l gs://<bucket-name/folder> | sed \$d | grep -v '/$' | awk '{ split($0,a," "); extracttimestamp="date -jf '%Y-%m-%dT%H:%M:%SZ' " a[2] " +%s"; extracttimestamp | getline $1; close(extracttimestamp); { if ( $1 > <my_timestamp in epoch> ) { print a[3],a[2],$1 }}}'

Linux:

gsutil ls -l gs://<bucket-name/folder> | grep -v '/$' | grep -v '^TOTAL' | awk -F, '{ split($0,a," "); gsub("-","/",a[2]); gsub("T"," ",a[2]); gsub("Z","",a[2]); extracttimestamp="date -d " "\"" a[2] "\""  " +%s"; extracttimestamp | getline $1; close(extracttimestamp); { if ( $1 > <my_timestamp in epoch> ) { print a[3],a[2],$1 }}}' 

Final Output

FileName Timestamp Epoch

gs://bucket/test/obj2.html 2020-03-02T19:30:27Z 1583157627
gs://bucket/test/obj3.txt 2020-03-02T19:37:45Z 1583158065

Understanding the Command Above

gsutil ls -l gs://bucket/test/

#If you specify the -l option, 
#gsutil outputs additional information about each matching object e.g. file timestamp
#output
   1453783  2020-03-02T19:25:16Z  gs://bucket/test/
   2276224  2020-03-02T19:25:17Z  gs://bucket/test/obj1.html
   3914624  2020-03-02T19:30:27Z  gs://bucket/test/obj2.html
       131  2020-03-02T19:37:45Z  gs://bucket/test/obj3.txt
TOTAL: 3 objects, 6190979 bytes (5.9 MiB)

sed \$d

#all lines except the last
#output
   1453783  2020-03-02T19:25:16Z  gs://bucket/test/
   2276224  2020-03-02T19:25:17Z  gs://bucket/test/obj1.html
   3914624  2020-03-02T19:30:27Z  gs://bucket/test/obj2.html
       131  2020-03-02T19:37:45Z  gs://bucket/test/obj3.txt

grep -v '/$'

#lines that do not end with "/", thereby removing folders
#output
   2276224  2020-03-02T19:25:17Z  gs://bucket/test/obj1.html
   3914624  2020-03-02T19:30:27Z  gs://bucket/test/obj2.html
       131  2020-03-02T19:37:45Z  gs://bucket/test/obj3.txt

awk '{ split($0,a," "); extracttimestamp="date -jf '%Y-%m-%dT%H:%M:%SZ' " a[2] " +%s"; extracttimestamp | getline $1; close(extracttimestamp); { if ( $1 > <my_timestamp in epoch> ) { print a[3],a[2],$1 }}}'

#split each line into an array of tokens
#split($0,a," ")
    #split("2276224  2020-03-02T19:25:17Z  gs://bucket/test/obj1.html", a, " ")
        a=("2276224" "2020-03-02T19:25:17Z" "gs://bucket/test/obj1.html")

#extract epoch time from second token(file timestamp)
#extracttimestamp="date -jf '%Y-%m-%dT%H:%M:%SZ' " a[2] " +%s"
    #date -jf '%Y-%m-%dT%H:%M:%SZ' 2020-03-02T19:25:17Z +%s = 1583157317
        #extracttimestamp = 1583157317

#read extracttimestamp and bind value to $1
#extracttimestamp | getline $1;
    #$1 = 1583157317

#compare file timestamp with my_timestamp and print filename
#if ( $1 > <my_timestamp in epoch> ) { print a[3],a[2],$1 }
    #if ( 1583157317 > 1583157320 ) { print a[3],a[2],$1 }

#final output
gs://bucket/test/obj2.html 2020-03-02T19:30:27Z 1583157627
gs://bucket/test/obj3.txt 2020-03-02T19:37:45Z 1583158065
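A simpler variant of the same filter, shown here on canned data with a made-up cutoff value: because ISO-8601 timestamps compare chronologically as plain strings, awk can compare them directly, without shelling out to `date` per line at all. This sidesteps the Mac/Linux `date` incompatibility above:

```shell
# Filter objects newer than a cutoff by comparing ISO-8601 strings directly.
# Input lines mimic `gsutil ls -l` output; the cutoff value is a placeholder.
cutoff='2020-03-02T19:30:00Z'
printf '%s\n' \
  '   2276224  2020-03-02T19:25:17Z  gs://bucket/test/obj1.html' \
  '   3914624  2020-03-02T19:30:27Z  gs://bucket/test/obj2.html' \
  '       131  2020-03-02T19:37:45Z  gs://bucket/test/obj3.txt' |
  awk -v cutoff="$cutoff" '$2 > cutoff { print $3, $2 }'
# prints:
# gs://bucket/test/obj2.html 2020-03-02T19:30:27Z
# gs://bucket/test/obj3.txt 2020-03-02T19:37:45Z
```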
Siusan answered 27/2, 2023 at 20:23 Comment(0)

If by "new files" you mean files that are not yet present in your destination bucket, you can alternatively use gsutil cp with the -n option, which copies only those files that are not already present in the destination.

From the documentation: https://cloud.google.com/storage/docs/gsutil/commands/cp?hl=ru

No-clobber. When specified, existing files or objects at the destination will not be overwritten. Any items that are skipped by this option will be reported as being skipped. This option will perform an additional GET request to check if an item exists before attempting to upload the data. This will save retransmitting data, but the additional HTTP requests may make small object transfers slower and more expensive.

The downside of this approach is that it issues a check request for every file present in your source bucket.
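A minimal sketch of that invocation (bucket names and paths are placeholders):

```shell
# Copy only objects not already present at the destination.
# -n = no-clobber (skip existing), -m = parallel, -r = recursive.
# Bucket names below are placeholders.
gsutil -m cp -n -r gs://source-bucket/folder gs://dest-bucket/folder
```

Skipped objects are reported in the output, so the log itself doubles as a list of files that were already synced.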

Nativeborn answered 21/7, 2019 at 1:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.