What should I do about this gsutil "parallel composite upload" warning?

I am running a Python script and using the os library to execute a gsutil command, which would normally be run at the Windows command prompt. I have a file on my local computer that I want to upload to a Google Cloud Storage bucket, so I do:

import os

command = 'gsutil -m cp myfile.csv gs://my/bucket/myfile.csv'
os.system(command)

I get a message like:

==> NOTE: You are uploading one or more large file(s), which would run significantly faster if you enable parallel composite uploads. This feature can be enabled by editing the "parallel_composite_upload_threshold" value in your .boto configuration file. However, note that if you do this large files will be uploaded as composite objects (https://cloud.google.com/storage/docs/composite-objects), which means that any user who downloads such objects will need to have a compiled crcmod installed (see "gsutil help crcmod"). This is because without a compiled crcmod, computing checksums on composite objects is so slow that gsutil disables downloads of composite objects.

I want to get rid of this message, either by hiding it if it's irrelevant or by actually doing what it suggests, but I can't find the .boto file. What should I do?

Polygon answered 31/10, 2017 at 19:46 Comment(1)
You've got bigger problems with this code than performance: if you don't have tight control of your filenames, it can also be exploited as a security hole (as a concrete example, trying to upload a file created with touch '$(rm -rf ~).csv' wouldn't go well). It's much safer to use subprocess.Popen or one of its wrappers without shell=True, passing each piece of the command line as a separate list element.Submersible

The Parallel Composite Uploads section of the gsutil documentation describes how to resolve this (assuming, as the warning says, that the clients downloading this content will have a compiled crcmod available):

gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp bigfile gs://your-bucket

Doing this safely from Python looks like:

import subprocess

filename = 'myfile.csv'
gs_bucket = 'my/bucket'
parallel_threshold = '150M'  # minimum size for parallel upload; 0 to disable

# Passing the command as a list (no shell) keeps a hostile filename
# from being interpreted as shell syntax.
subprocess.check_call([
    'gsutil',
    '-o', 'GSUtil:parallel_composite_upload_threshold=%s' % (parallel_threshold,),
    'cp', filename, 'gs://%s/%s' % (gs_bucket, filename),
])

Note that here you're explicitly providing argument vector boundaries, and not relying on a shell to do this for you; this prevents a malicious or buggy filename from performing undesired operations.


If you don't know that the clients accessing content in this bucket will have the crcmod module, consider setting parallel_threshold='0' above, which will disable this support.

Submersible answered 31/10, 2017 at 19:54 Comment(6)
Thanks for your reply; do you agree that if the size of bigfile is less than 150M, the upload command will still work?Polygon
Yes; it simply won't parallelize in that case.Submersible
If you're in Python anyway, wouldn't it be better to use the Python API rather than shelling out to the CLI? (A sketch follows these comments.) googleapis.github.io/google-cloud-python/latest/storage/…Submission
@CosminLehene, sounds like a good reason to add your own answer.Submersible
So you should not use the -m flag if you set GSUtil:parallel_composite_upload_threshold, right?Serra
Also, from the documentation: A file can be broken into as many as 32 component pieces; until this piece limit is reached, the maximum size of each component piece is determined by the variable "parallel_composite_upload_component_size".Serra
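A minimal sketch of the client-library approach mentioned in the comments above, assuming the google-cloud-storage package is installed and application default credentials are configured (the bucket name is hypothetical):

from google.cloud import storage

filename = 'myfile.csv'
bucket_name = 'my-bucket'  # hypothetical; adjust to your bucket

client = storage.Client()            # uses application default credentials
bucket = client.bucket(bucket_name)
blob = bucket.blob(filename)

# Uploads the local file directly; no shell and no gsutil subprocess involved.
blob.upload_from_filename(filename)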

Another way is to set the configuration the warning suggests in a file on the BOTO_PATH, usually $HOME/.boto:

[GSUtil]
parallel_composite_upload_threshold = 150M

For maximum speed, also install the compiled crcmod C library (see "gsutil help crcmod").
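If you're invoking gsutil from Python as in the question, you can also point it at a specific configuration file through the BOTO_CONFIG environment variable, which gsutil honors; a minimal sketch, assuming the [GSUtil] settings above are saved at ~/.boto:

import os
import subprocess

# Point gsutil at the boto file containing the
# parallel_composite_upload_threshold setting shown above.
env = dict(os.environ, BOTO_CONFIG=os.path.expanduser('~/.boto'))

subprocess.check_call(
    ['gsutil', 'cp', 'myfile.csv', 'gs://my/bucket/myfile.csv'],
    env=env,
)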

Fattal answered 23/8, 2020 at 18:58 Comment(0)
