As Aniket mentions, `pig sh` would itself be considered the script-runner for Dataproc jobs; instead of having to turn your wrapper script into a Pig script in itself, just use Pig to bootstrap any bash script you want to run. For example, suppose you have an arbitrary bash script `hello.sh`:
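The contents don't matter to Dataproc; a hypothetical minimal `hello.sh`, just to make the example concrete:

```
#!/bin/bash
# Illustrative only -- any bash you would normally run on the cluster works here.
echo "Hello from $(hostname) at $(date)"
```

Upload it to GCS and submit it through a Pig job: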
```
gsutil cp hello.sh gs://${BUCKET}/hello.sh
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
    -e 'fs -cp -f gs://${BUCKET}/hello.sh file:///tmp/hello.sh; sh chmod 750 /tmp/hello.sh; sh /tmp/hello.sh'
```
The Pig `fs` command uses Hadoop paths, so to copy your script from GCS you must copy it to a destination specified as `file:///` to make sure it ends up on the local filesystem instead of HDFS; the `sh` commands afterwards reference the local filesystem automatically, so you don't use `file:///` there.
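To make the distinction concrete, here's a hedged illustration (same `${BUCKET}` and `${CLUSTER}` assumptions as above): the first copy, with no scheme on the destination, resolves against the cluster's default filesystem (HDFS), while the `file:///` copy lands on the local disk of the node running the job, which is what `sh` sees:

```
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
    -e 'fs -cp -f gs://${BUCKET}/hello.sh /tmp/hello.sh; fs -ls /tmp; fs -cp -f gs://${BUCKET}/hello.sh file:///tmp/hello.sh; sh ls -l /tmp/hello.sh'
```

The `fs -ls /tmp` listing shows the HDFS copy, while `sh ls -l /tmp/hello.sh` shows the local copy, which is the one the earlier example actually executes.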
Alternatively, you can take advantage of the way `--jars` works to automatically stage a file into the temporary directory created just for your Pig job, rather than explicitly copying from GCS into a local directory; you simply specify your shell script itself as a `--jars` argument:
```
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
    --jars hello.sh \
    -e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'
```
Or:
```
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
    --jars gs://${BUCKET}/hello.sh \
    -e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'
```
In these cases, the script would only temporarily be downloaded into a directory that looks like `/tmp/59bc732cd0b542b5b9dcc63f112aeca3` and which only exists for the lifetime of the Pig job.
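If you want to see that staging directory for yourself, a quick sketch (again assuming `${CLUSTER}` is set; the exact directory name will differ from the one above):

```
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
    --jars hello.sh \
    -e 'sh pwd; sh ls -l'
```

The `sh pwd` output is the per-job directory that `${PWD}` expands to in the examples above, and `sh ls -l` shows `hello.sh` staged alongside the job's other files.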
(Comment from Macronucleus: you may need the `-r` flag for this `fs -cp -f`.)