Moving multiple files with gsutil
B

7

16

Let's say I've got the following files in a Google Cloud Storage bucket:

file_A1.csv
file_B2.csv
file_C3.csv

Now I want to move a subset of these files, let's say file_A1.csv and file_B2.csv. Currently I do it like this:

gsutil mv gs://bucket/file_A1.csv gs://bucket/file_A11.csv
gsutil mv gs://bucket/file_B2.csv gs://bucket/file_B22.csv

This approach requires two calls of more or less the same command and moves each file separately. I know that if I move a complete directory I can add the -m option to accelerate the process. Unfortunately, I only want to move a subset of the files and keep the rest untouched in the bucket.

When moving 100 files this way I need to execute around 100 commands, which becomes quite time consuming. Is there a way to combine all 100 moves into just one command, additionally with the -m option?

Bison answered 29/4, 2015 at 13:50 Comment(1)
Do you have a rule for what the destination's name is? Is that also in a file, or is it "repeat the last letter of the existing file", or something more elaborate?Beige
M
4

gsutil does not support this currently but what you could do is create a number of shell scripts, each performing a portion of the moves, and run them concurrently.

Note that gsutil mv is based on the syntax of the unix mv command, which also doesn't support the feature you're asking for.
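
A minimal sketch of that idea, assuming GNU split and a file moves.txt that holds one complete gsutil mv command per line (both the file name and the chunk_ prefix are hypothetical):

split -n l/4 moves.txt chunk_   # split the move commands into 4 roughly equal pieces
for f in chunk_*; do
  bash "$f" &                   # run each piece as its own concurrent shell script
done
wait                            # block until all pieces have finished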

Mockery answered 29/4, 2015 at 14:56 Comment(3)
Yeah, I already thought about that. However, is there a limit on how many commands are allowed to run simultaneously?Bison
Only normal operating system limitations would apply; the tool itself can be executed any number of times concurrently.Kingsly
Okay, I wrote a small script that moves 100 files in parallel. The result was that just 25 files were moved and the whole process took 10 minutes. Definitely not a solution.Bison
M
8

If you have a list of the files you want to move, you can use the -I option of the cp command which, according to the docs, is also valid for the mv command:

cat filelist | gsutil -m mv -I gs://my-bucket
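
As a sketch, filelist just needs one source object URL per line; -I makes gsutil read the sources from stdin, and the destination must name a bucket or folder, so each object keeps its base name. With the file names from the question (the destination folder is hypothetical):

printf '%s\n' gs://bucket/file_A1.csv gs://bucket/file_B2.csv > filelist
cat filelist | gsutil -m mv -I gs://bucket/moved_files/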
Manque answered 10/8, 2020 at 15:57 Comment(1)
that's what I came here for!Trinh
D
7

That worked for me for moving all .txt files from gs://config to gs://config/new_folder:

gsutil mv 'gs://config/*.txt' gs://config/new_folder/

I had some problems with the wildcard * in zsh, which is the reason for the quotes around the source path.

Dortheydorthy answered 24/6, 2022 at 11:36 Comment(1)
works perfectly in the cloud console shell as well +1Gusgusba
M
4

You can achieve that using bash by iterating over the gsutil ls output, for example with:

  • source folder name: old_folder
  • new folder name: new_folder

for x in $(gsutil ls "gs://<bucket_name>/old_folder"); do y=$(basename -- "$x"); gsutil mv "$x" "gs://<bucket_name>/new_folder/$y"; done

If you have a huge number of files, you can run the moves in parallel using:

N=8 # number of parallel workers
(
for x in $(gsutil ls "gs://<bucket_name>/old_folder"); do
   ((i=i%N)); ((i++==0)) && wait   # after every N started jobs, wait for that batch to finish
   y=$(basename -- "$x")
   gsutil mv "$x" "gs://<bucket_name>/new_folder/$y" &
done
wait # wait for the final batch before leaving the subshell
)
Michalmichalak answered 18/12, 2020 at 8:51 Comment(0)
I
3

Not widely documented, but this works every time.

To move the contents of the third folder to the root or to any folder above it:

gsutil ls gs://my-bucket/first/second/third/ | gsutil -m mv -I gs://my-bucket/first/

and to copy

gsutil ls gs://my-bucket/first/second/third/ | gsutil -m cp -I gs://my-bucket/first/
Intercontinental answered 7/11, 2022 at 13:14 Comment(0)
V
1

To do this you can run the following gsutil command:

gsutil mv gs://bucket_name/common_file_name* gs://bucket_destiny_name/

In your case, common_file_name is "file_".
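
As a sketch with the file names from the question (the renamed/ folder is hypothetical): a wildcard move can only relocate the matching objects under a destination bucket or folder, it cannot give each object an individually chosen new name.

gsutil -m mv gs://bucket/file_A* gs://bucket/renamed/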

Virgenvirgie answered 27/7, 2021 at 20:42 Comment(0)
B
0

The lack of the -m flag is the real hang-up here. Facing the same issue, I originally managed this by using Python multiprocessing and os.system to call gsutil. I had 60k files and it was going to take hours. With some experimenting I found that using the Python client gave a 20x speed-up!

If you are willing to move away from gsutil, it's a better approach.

Here is a copy (or move) method. If you create a list of source keys/URIs, you can call it using multi-threading for fast results.

Note: the method returns a tuple of (destination_name, exception), which you can pop into a dataframe or something to look for failures.

import re
from google.cloud import storage  # pip install google-cloud-storage

BUCKET_NAME = "my-bucket"  # placeholder default bucket; replace with your own

def cp_blob(key=None, bucket=BUCKET_NAME, uri=None, delete_src=False):
    """Copy a single blob; set delete_src=True to turn the copy into a move."""
    try:
        if uri:
            # accept a full gs:// URI and split it into bucket and key
            uri = re.sub('gs://', '', uri)
            bucket, key = uri.split('/', maxsplit=1)
        client = storage.Client()
        bucket = client.get_bucket(bucket)
        blob = bucket.blob(key)
        dest = re.sub(THING1, THING2, blob.name)  ## OR SOME OTHER WAY TO GET NEW DESTINATIONS
        out = bucket.copy_blob(blob, bucket, dest)
        if delete_src:
            blob.delete()
        return out.name, None
    except Exception as e:
        return None, str(e)
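
One way to call it, as a sketch: build a list of source URIs (for example via storage.Client().list_blobs), map cp_blob over that list with concurrent.futures.ThreadPoolExecutor, and collect the returned (name, error) tuples into a dataframe to spot failures.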
Backfire answered 16/2, 2021 at 0:12 Comment(0)
