How to copy folders with 'gsutil'?

I've read the documentation on the gsutil cp command, but still don't understand how to copy folders while keeping the same permissions. I tried this command:

gsutil cp gs://bucket-name/folder1/folder_to_copy gs://bucket-name/folder1/new_folder

But it resulted in the error:

CommandException: No URLs matched

When I tried it with a trailing slash at the end of each name, though, it didn't show any error:

gsutil cp gs://bucket-name/folder1/folder_to_copy/ gs://bucket-name/folder1/new_folder/

However, there was no new folder in the bucket when I checked with gsutil ls. What am I doing wrong?

Latency answered 24/12, 2019 at 9:16 Comment(1)
What do you mean by folder permissions? Are you talking about ACLs?Indocile

Using cp

You should use the -r option to copy a folder and its contents recursively:

gsutil cp -r gs://bucket-name/folder1/folder_to_copy gs://bucket-name/folder1/new_folder

Note that this will only work if folder_to_copy contains files. This is because Cloud Storage doesn't really have "folders" in the way a typical GUI would suggest; it provides the illusion of a hierarchical file tree on top of a flat namespace, as explained here. In other words, the files within a folder are simply objects whose names begin with the folder's path as a prefix. So when you run gsutil cp, it expects actual objects to copy, and an empty directory is not something the CLI understands.
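
To make this concrete, here is a hypothetical listing (the file names are made up) illustrating that a "folder" is nothing more than a shared name prefix on the objects inside it:

gsutil ls gs://bucket-name/folder1/folder_to_copy/

    gs://bucket-name/folder1/folder_to_copy/file1.txt
    gs://bucket-name/folder1/folder_to_copy/file2.txt

If no objects carry that prefix, there is nothing for cp to match, which is why the command in the question fails with "No URLs matched".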


Using rsync

Another approach is to use rsync instead, which tolerates empty folders and also synchronizes the contents of the source and destination folders:

gsutil rsync -r gs://bucket-name/folder1/folder_to_copy gs://bucket-name/folder1/new_folder

If you also want to preserve the ACL (permissions) of the objects, use the -p option:

gsutil rsync -p -r gs://bucket-name/folder1/folder_to_copy gs://bucket-name/folder1/new_folder
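
If you want to verify that the permissions were carried over, you can inspect the ACL of one of the copied objects (the object name below is just an example):

gsutil acl get gs://bucket-name/folder1/new_folder/file1.txt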
Paltry answered 24/12, 2019 at 9:47 Comment(6)
@praytic - To add to this answer, which is correct, you can also use wildcards (*): cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNamesPiscatorial
But I don't want to copy the contents of the folder. I need to copy the permissions of the folder and create a new folder with the same permissions.Latency
@Latency Well, folders don't really exist in Cloud Storage as mentioned here, so you can't assign permissions to them. You can give an ACL to individual objects though, and preserve it with the -p option when doing gsutil rsync or other gsutil commands.Paltry
Note that gs://bucket-name/folder1/folder_to_copy should not contain a trailing slash like gs://bucket-name/folder1/folder_to_copy/Expanded
There is now gcloud storage cp which is "fast by default".Circumlocution
Adding to the cp command with the -r option: if you're copying a large number of files, I recommend using gsutil -m cp -r to perform multiple requests in parallel, which results in a huge speedup.Dope

To add to @Maxim's answer, you might consider passing the -m argument when calling gsutil:

gsutil -m cp -r gs://bucket-name/folder1/folder_to_copy gs://bucket-name/folder1/new_folder

The -m arg enables parallelism.

As noted in the gsutil documentation, the -m argument might not yield better performance over a slow network (e.g., a typical home connection). But for inter-bucket copies (running between machines in a data center), performance is likely to "significantly improve", to quote the gsutil manual. See below:

 -m          Causes supported operations (acl ch, acl set, cp, mv, rm, rsync,
              and setmeta) to run in parallel. This can significantly improve
              performance if you are performing operations on a large number of
              files over a reasonably fast network connection.

              gsutil performs the specified operation using a combination of
              multi-threading and multi-processing, using a number of threads
              and processors determined by the parallel_thread_count and
              parallel_process_count values set in the boto configuration
              file. You might want to experiment with these values, as the
              best values can vary based on a number of factors, including
              network speed, number of CPUs, and available memory.

              Using the -m option may make your performance worse if you
              are using a slower network, such as the typical network speeds
              offered by non-business home network plans. It can also make
              your performance worse for cases that perform all operations
              locally (e.g., gsutil rsync, where both source and destination
              URLs are on the local disk), because it can "thrash" your local
              disk.

              If a download or upload operation using parallel transfer fails
              before the entire transfer is complete (e.g. failing after 300 of
              1000 files have been transferred), you will need to restart the
              entire transfer.

              Also, although most commands will normally fail upon encountering
              an error when the -m flag is disabled, all commands will
              continue to try all operations when -m is enabled with multiple
              threads or processes, and the number of failed operations (if any)
              will be reported as an exception at the end of the command's
              execution.
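
If you want to experiment with the parallel_thread_count and parallel_process_count values mentioned above without editing the boto configuration file, they can also be overridden for a single invocation with gsutil's top-level -o option; the counts below are only illustrative:

gsutil -o "GSUtil:parallel_process_count=4" -o "GSUtil:parallel_thread_count=8" -m cp -r gs://bucket-name/folder1/folder_to_copy gs://bucket-name/folder1/new_folder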

Note: at the time of this writing, Python 3.8 seems to cause problems with the -m flag; use Python 3.7. More info in this GitHub issue.
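
To check which Python version your gsutil installation actually runs on, gsutil version -l prints it along with other environment details:

gsutil version -l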

Baring answered 2/8, 2020 at 8:53 Comment(1)
If we want to copy files from GCS buckets as shown below, gsutil cp -r gs://ukbb-exome-public/300k/results/results.mt ./ and gsutil cp -r gs://ukbb-exome-public/300k/results/variant_results.mt ./, is there a way to enable parallelism?Gaseous

For people who don't want to install the whole SDK and prefer to use Docker instead, here is the series of commands I used to download a bucket to a Docker volume named googledata. (Replace gs://assets with the name of your bucket.)

docker pull google/cloud-sdk:latest
docker run -ti --name gcloud-config google/cloud-sdk gcloud auth login
docker run --rm -ti -v googledata:/tmp --volumes-from gcloud-config google/cloud-sdk gsutil cp -r gs://assets /tmp

See here for the Docker container.

Quite some effort just to get your data...

Fluoroscopy answered 5/2, 2022 at 20:58 Comment(0)
