As Aniket mentions, `pig sh` would itself be considered the script-runner for Dataproc jobs; instead of having to turn your wrapper script into a Pig script in itself, just use Pig to bootstrap any bash script you want to run. For example, suppose you have an arbitrary bash script `hello.sh`:
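The contents don't matter to Dataproc; a hypothetical minimal `hello.sh`, just to make the example concrete:

```
#!/bin/bash
# Illustrative only -- any bash you would normally run on the cluster works here.
echo "Hello from $(hostname) at $(date)"
```

Upload it to GCS and submit it through a Pig job: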
```
gsutil cp hello.sh gs://${BUCKET}/hello.sh
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
    -e 'fs -cp -f gs://${BUCKET}/hello.sh file:///tmp/hello.sh; sh chmod 750 /tmp/hello.sh; sh /tmp/hello.sh'
```
The Pig `fs` command uses Hadoop paths, so to copy your script from GCS you must copy it to a destination specified as `file:///` to make sure it ends up on the local filesystem instead of HDFS; the `sh` commands afterwards reference the local filesystem automatically, so you don't use `file:///` there.
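To make the distinction concrete, here's a hedged illustration (same `${BUCKET}` and `${CLUSTER}` assumptions as above): the first copy, with no scheme on the destination, resolves against the cluster's default filesystem (HDFS), while the `file:///` copy lands on the local disk of the node running the job, which is what `sh` sees:

```
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
    -e 'fs -cp -f gs://${BUCKET}/hello.sh /tmp/hello.sh; fs -ls /tmp; fs -cp -f gs://${BUCKET}/hello.sh file:///tmp/hello.sh; sh ls -l /tmp/hello.sh'
```

The `fs -ls /tmp` listing shows the HDFS copy, while `sh ls -l /tmp/hello.sh` shows the local copy, which is the one the earlier example actually executes.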
Alternatively, you can take advantage of the way `--jars` works to automatically stage a file into the temporary directory created just for your Pig job, rather than explicitly copying from GCS into a local directory; you simply specify your shell script itself as a `--jars` argument:
```
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
    --jars hello.sh \
    -e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'
```
Or:
```
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
    --jars gs://${BUCKET}/hello.sh \
    -e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'
```
In these cases, the script would only temporarily be downloaded into a directory that looks like `/tmp/59bc732cd0b542b5b9dcc63f112aeca3` and which only exists for the lifetime of the Pig job.
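If you want to see that staging directory for yourself, a quick sketch (again assuming `${CLUSTER}` is set; the exact directory name will differ from the one above):

```
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
    --jars hello.sh \
    -e 'sh pwd; sh ls -l'
```

The `sh pwd` output is the per-job directory that `${PWD}` expands to in the examples above, and `sh ls -l` shows `hello.sh` staged alongside the job's other files.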
(Comment from Macronucleus: you may need the `-r` flag for this `fs -cp -f`.)