gsutil: Argument list too long

I am trying to upload many thousands of files to Google Cloud Storage with the following command:

gsutil -m cp *.json gs://mybucket/mydir

But I get this error:

-bash: Argument list too long

What is the best way to handle this? I can obviously write a bash script to iterate over different filename prefixes:

gsutil -m cp 92*.json gs://mybucket/mydir
gsutil -m cp 93*.json gs://mybucket/mydir
gsutil -m cp ...*.json gs://mybucket/mydir

But the problem is that I don't know in advance what my filenames are going to be, so writing that command isn't trivial.

Is there a way to handle this natively with gsutil (I don't think so, from the documentation), or a way to handle it in bash where I can list, say, 10,000 files at a time and pipe them to the gsutil command?

Dvandva answered 27/6, 2017 at 12:16 Comment(0)

Eric's answer should work, but another option would be to rely on gsutil's built-in wildcarding, by quoting the wildcard expression:

gsutil -m cp "*.json" gs://mybucket/mydir

To explain more: the "Argument list too long" error comes from the shell, which runs up against the operating system's limit on the total size of a command's argument list when it expands the wildcard. By quoting the wildcard you prevent the shell from expanding it, and the literal string is passed to gsutil instead. gsutil then expands the wildcard in a streaming fashion, i.e., it expands while performing the operations, so it never needs to buffer an unbounded amount of expanded text. As a result you can use gsutil wildcards over arbitrarily large expressions. The same is true when using gsutil wildcards over object names, so for example this would work:

gsutil -m cp "gs://my-bucket1/*" gs://my-bucket2

even if there are a billion objects at the top-level of gs://my-bucket1.
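
As a rough check of the limit the unquoted form runs into (assuming a Linux or macOS shell), you can compare the system's argument-size limit with the size the expanded file list would take:

getconf ARG_MAX                # maximum combined size of argv and the environment for an exec'd command
printf '%s\0' *.json | wc -c   # approximate size of the expanded *.json list (printf is a shell builtin, so it isn't subject to that limit)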

Builder answered 27/6, 2017 at 13:24 Comment(3)
As good practice, you should still quote gs://my-bucket1/*. The shell will still treat that string as a pattern to match, and although it will almost certainly fail to match anything, it is possible to set a shell option to treat non-matching patterns as an error rather than as a literal string.Hadfield
Thanks chepner - I added quotes to my answer per your suggestion.Builder
Thanks, saved me some time!Quadrant

If your filenames are free of newlines, you can use gsutil cp's ability to read the list of files from stdin:

find . -maxdepth 1 -type f -name '*.json' | gsutil -m cp -I gs://mybucket/mydir

Or, if you're not sure your names are safe and your find and xargs support it, you could do:

find . -maxdepth 1 -type f -name '*.json' -print0 | xargs -0 -I {} gsutil -m cp {} gs://mybucket/mydir
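
A minimal variation on the same idea, assuming the names are first collected into a (hypothetical) filelist.txt with one name per line and no newlines inside the names:

find . -maxdepth 1 -type f -name '*.json' > filelist.txt   # build the list without hitting the shell's argument limit
gsutil -m cp -I gs://mybucket/mydir < filelist.txt          # -I reads the source paths from stdin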
Heal answered 27/6, 2017 at 12:42 Comment(5)
The last example would be simpler as find ... -exec gsutil -m cp {} gs://mybucket/mydir \;. (In either case, I think -m is unnecessary, since you are only passing a single file/URL to each instance of gsutil.)Hadfield
@Hadfield does xargs then spawn a new instance of gsutil for each file when the argument isn't at the end, as -exec does with \; instead of +? Your version of the find command would at least be portable, though.Heal
@Hadfield yup, reading the man page does indeed confirm what you said: -I implies -L 1Heal
I think there are ways of using xargs to similarly batch like -exec ... +, but I think the issue here is that gsutil either takes a pattern or a single file. I didn't see a way to write something like -exec cp -T src_dir {} + like you could with GNU cp.Hadfield
@Hadfield I worked out a way to do it, shame I didn't refresh the page before posting!Recrement

Here's a way you could do it, using xargs to limit the number of files that are passed to gsutil at once. Null bytes are used to prevent problems with spaces or newlines in the filenames.

printf '%s\0' *.json | xargs -0 sh -c 'copy_all () {
    gsutil -m cp "$@" gs://mybucket/mydir
}
copy_all "$@"' sh

Here we define a function that puts the file arguments in the right place in the gsutil command, ahead of the destination. The trailing sh supplies $0 for the inline script, so none of the filenames are consumed as $0. xargs invokes the script as few times as necessary to process all the arguments, passing as many filename arguments as will fit each time.
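
If you want to cap the batch size (say at the 10,000 files mentioned in the question), xargs's -n option limits how many arguments go to each invocation; a minimal sketch along the same lines:

printf '%s\0' *.json | xargs -0 -n 10000 sh -c 'gsutil -m cp "$@" gs://mybucket/mydir' sh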

Alternatively, you can define the function separately and export it so the bash started by xargs can see it (this is bash-specific):

copy_all () {
    gsutil -m cp "$@" gs://mybucket/mydir
}
export -f copy_all
printf '%s\0' *.json | xargs -0 bash -c 'copy_all "$@"' bash
Recrement answered 27/6, 2017 at 17:6 Comment(0)
