I have a bucket/folder into which a lot of files arrive every minute. How can I read only the new files, based on the file timestamp?
e.g. list all files with timestamp > my_timestamp
This is not a feature that gsutil or the GCS API provides, as there is no way to list objects by timestamp.
Instead, you could subscribe to new objects using the GCS Cloud Pub/Sub feature.
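A minimal sketch of that approach (topic and subscription names are placeholders): create a Pub/Sub notification configuration on the bucket so every newly finalized object publishes a message, then consume those messages instead of re-listing the bucket.
# create a notification config; OBJECT_FINALIZE fires when a new object is written,
# and -f json puts the object metadata (name, timeCreated, ...) in the message payload
gsutil notification create -t new-objects-topic -f json -e OBJECT_FINALIZE gs://your-bucket-name
# consume the messages, e.g. with a pull subscription via the gcloud CLI
gcloud pubsub subscriptions create new-objects-sub --topic new-objects-topic
gcloud pubsub subscriptions pull new-objects-sub --auto-ack --limit 5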
You could use some bash-fu:
gsutil ls -l gs://<your-bucket-name> | grep -v '^TOTAL' | sort -k2 | tail -n1 | awk 'END {$1=$2=""; sub(/^[ \t]+/, ""); print }'
breaking that down:
# grab detailed list of objects in bucket
gsutil ls -l gs://your-bucket-name
# drop the trailing TOTAL summary line
grep -v '^TOTAL'
# sort on the date field (ISO 8601 timestamps sort chronologically as plain text)
sort -k2
# grab the last row returned
tail -n1
# delete first two cols (size and date) and ltrim to remove whitespace
awk 'END {$1=$2=""; sub(/^[ \t]+/, ""); print }'
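As a usage sketch (bucket name and local path are placeholders), the resulting object URI can be captured in a shell variable and handed to a follow-up gsutil call:
# store the newest object's URI, then download it
latest=$(gsutil ls -l gs://your-bucket-name | grep -v '^TOTAL' | sort -k2 | tail -n1 | awk 'END {print $3}')
gsutil cp "$latest" ./newest-object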
Tested with Google Cloud SDK v186.0.0, gsutil v4.28.
Comment: an alternative that sorts the date field as text and takes the second-to-last row, skipping the TOTAL line:
gsutil ls -l gs://your-bucket-name | sort -k2 | tail -n2 | head -n1 | awk 'END {$1=$2=""; sub(/^[ \t]+/, ""); print }'
To list files with a timestamp greater than a specified timestamp, use the command below, replacing <bucket-name/folder> and <my_timestamp in epoch> with the required values:
Mac:
gsutil ls -l gs://<bucket-name/folder> | sed \$d | grep -v '/$' | awk '{ split($0,a," "); extracttimestamp="date -jf '%Y-%m-%dT%H:%M:%SZ' " a[2] " +%s"; extracttimestamp | getline $1; close(extracttimestamp); { if ( $1 > <my_timestamp in epoch> ) { print a[3],a[2],$1 }}}'
Linux:
gsutil ls -l gs://<bucket-name/folder> | grep -v '/$' | grep -v '^TOTAL' | awk -F, '{ split($0,a," "); gsub("-","/",a[2]); gsub("T"," ",a[2]); gsub("Z","",a[2]); extracttimestamp="date -d " "\"" a[2] "\"" " +%s"; extracttimestamp | getline $1; close(extracttimestamp); { if ( $1 > <my_timestamp in epoch> ) { print a[3],a[2],$1 }}}'
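For reference, the epoch value to substitute for <my_timestamp in epoch> can be produced with the same date invocations the commands above use (the cutoff shown is just an example):
# Linux (GNU date), same local-time interpretation as the awk command above
date -d "2020/03/02 19:30:00" +%s
# Mac (BSD date), same format string as the awk command above
date -jf '%Y-%m-%dT%H:%M:%SZ' 2020-03-02T19:30:00Z +%s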
Final Output
FileName Timestamp Epoch
gs://bucket/test/obj2.html 2020-03-02T19:30:27Z 1583157627
gs://bucket/test/obj3.txt 2020-03-02T19:37:45Z 1583158065
Understanding the Above Command
gsutil ls -l gs://bucket/test/
#If you specify the -l option,
#gsutil outputs additional information about each matching object e.g. file timestamp
#output
1453783 2020-03-02T19:25:16Z gs://bucket/test/
2276224 2020-03-02T19:25:17Z gs://bucket/test/obj1.html
3914624 2020-03-02T19:30:27Z gs://bucket/test/obj2.html
131 2020-03-02T19:37:45Z gs://bucket/test/obj3.txt
TOTAL: 3 objects, 6190979 bytes (5.9 MiB)
sed \$d
#print all lines except the last one (drops the TOTAL summary row)
#output
1453783 2020-03-02T19:25:16Z gs://bucket/test/
2276224 2020-03-02T19:25:17Z gs://bucket/test/obj1.html
3914624 2020-03-02T19:30:27Z gs://bucket/test/obj2.html
131 2020-03-02T19:37:45Z gs://bucket/test/obj3.txt
grep -v '/$'
#keep only lines that do not end with "/", thereby removing the folder entry
#output
2276224 2020-03-02T19:25:17Z gs://bucket/test/obj1.html
3914624 2020-03-02T19:30:27Z gs://bucket/test/obj2.html
131 2020-03-02T19:37:45Z gs://bucket/test/obj3.txt
awk '{ split($0,a," "); extracttimestamp="date -jf '%Y-%m-%dT%H:%M:%SZ' " a[2] " +%s"; extracttimestamp | getline $1; close(extracttimestamp); { if ( $1 > <my_timestamp in epoch> ) { print a[3],a[2],$1 }}}'
#split each line into an array of tokens
#split($0,a," ")
#split("2276224 2020-03-02T19:25:17Z gs://bucket/test/obj1.html", a, " ")
#a = ("2276224" "2020-03-02T19:25:17Z" "gs://bucket/test/obj1.html")
#build a date command that converts the second token (file timestamp) to epoch seconds
#extracttimestamp="date -jf '%Y-%m-%dT%H:%M:%SZ' " a[2] " +%s"
#date -jf '%Y-%m-%dT%H:%M:%SZ' 2020-03-02T19:25:17Z +%s = 1583157317
#run that command and read its output into $1
#extracttimestamp | getline $1
#$1 = 1583157317
#compare the file timestamp with my_timestamp and print the filename
#if ( $1 > <my_timestamp in epoch> ) { print a[3],a[2],$1 }
#if ( 1583157317 > 1583157320 ) { print a[3],a[2],$1 }
#final output
gs://bucket/test/obj2.html 2020-03-02T19:30:27Z 1583157627
gs://bucket/test/obj3.txt 2020-03-02T19:37:45Z 1583158065
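A simpler sketch of the same idea (assuming the cutoff is supplied as an RFC 3339 string rather than an epoch value): because these timestamps sort lexicographically in chronological order, awk can compare them as plain strings and skip the per-file date subprocess entirely.
# cutoff is a placeholder; prints "filename timestamp" for objects newer than the cutoff
gsutil ls -l gs://<bucket-name/folder> | grep -v '/$' | grep -v '^TOTAL' | awk -v cutoff="2020-03-02T19:30:00Z" '$2 > cutoff { print $3, $2 }'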
If by "new files" you mean files that are not yet present in your destination bucket, you can alternatively use the gsutil cp -n option, since it copies only files that are not already present at the destination.
From the documentation https://cloud.google.com/storage/docs/gsutil/commands/cp?hl=ru:
No-clobber. When specified, existing files or objects at the destination will not be overwritten. Any items that are skipped by this option will be reported as being skipped. This option will perform an additional GET request to check if an item exists before attempting to upload the data. This will save retransmitting data, but the additional HTTP requests may make small object transfers slower and more expensive.
The downside of this approach is that it makes an existence-check request for every file present in your source bucket.
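A minimal usage sketch (bucket names and prefix are placeholders):
# -n: no-clobber, skip objects that already exist at the destination
# -m: copy in parallel; -r: recurse into the prefix
gsutil -m cp -n -r gs://source-bucket/incoming gs://destination-bucket/incoming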