Which is better to use for production transfers: gsutil or the Google Cloud Storage API?
gsutil uses the Google Cloud Storage API to transfer data, specifically the JSON API by default (this can be changed). Its main advantage over using the API directly is that it has been tuned to transfer data quickly. For example, it can open multiple simultaneous connections to GCS, each uploading or downloading part of the file concurrently, which in many cases provides a significant boost to total throughput.
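For instance, both the parallelism and the API choice can be controlled from the command line. The lines below are a sketch: the bucket and file names are placeholders, and the option names come from gsutil's boto configuration, so check them against your installed version.

# Copy a directory tree using multiple parallel threads/processes:
gsutil -m cp -r ./localdir gs://my-bucket/

# Use the XML API instead of the default JSON API for one invocation:
gsutil -o "GSUtil:prefer_api=xml" ls gs://my-bucket

# Split large files into pieces uploaded in parallel above a size threshold:
gsutil -o "GSUtil:parallel_composite_upload_threshold=150M" cp bigfile gs://my-bucket/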
There's no reason that programming against the API directly could not achieve the same or even better performance, but I would expect gsutil to be at least somewhat faster on average if you implement things in the simplest possible manner.
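For context, the "simplest possible manner" with the Python client library is a single-stream transfer along these lines (a minimal sketch, assuming the google-cloud-storage package and application default credentials; the bucket, object, and file names are placeholders):

from google.cloud import storage

# Placeholder names; assumes credentials are available in the environment.
client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("path/to/object.bin")

blob.upload_from_filename("local-file.bin")    # one connection, one stream
blob.download_to_filename("local-copy.bin")    # likewise for downloads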
I'm not sure this adds much to what Brandon has said. I'm very new to Google Cloud Storage and Python, but I've quickly found that I prefer the gsutil command line over the Python client library wherever possible. I create compute instances that copy a few GB of input data from Cloud Storage after they have booted. I found that it's both neater and faster to do this with the gsutil command line, so in my Python code I use:
import subprocess

# -m parallelizes the copy; gsutil expands the gs:// wildcard itself.
# shell=True lets the whole command be written as a single string.
subprocess.call("gsutil -m cp gs://my-uberdata-archive/* /home/<username>/rawdata/", shell=True)
The main reasons are that I can do the copy in a single line, whereas it takes several lines with the client library, and, as Brandon points out, gsutil parallelizes the transfer with the '-m' flag. I haven't found an equivalent way to do this with the Python client library yet.
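For reference, a rough multi-threaded equivalent can be put together with the client library and a thread pool, though it is noticeably more code. This is only a sketch: it assumes the google-cloud-storage package, flat object names, and placeholder bucket and destination values.

import os
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

# Rough stand-in for `gsutil -m cp`; bucket name and destination directory
# are placeholders, and object names are assumed not to contain slashes.
client = storage.Client()
dest = "/tmp/rawdata"
os.makedirs(dest, exist_ok=True)

def download(blob):
    # Each object is written under its own name in the destination directory.
    blob.download_to_filename(os.path.join(dest, blob.name))

with ThreadPoolExecutor(max_workers=8) as pool:
    # list_blobs yields Blob objects; each download runs on a pool thread.
    # list() forces the lazy map so any exception is raised here.
    list(pool.map(download, client.list_blobs("my-uberdata-archive")))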