What is a Spark job?

I have already finished the Spark installation and executed a few test cases after setting up master and worker nodes. That said, I am very confused about what exactly a job means in the Spark context (not SparkContext). I have the questions below:

  • How is a job different from a driver program?
  • Is the application itself a part of the driver program?
  • Is a spark-submit, in a way, a job?

I have read the Spark documentation, but this is still not clear to me.

That said, what I need to implement is to write Spark jobs programmatically, which would then do a spark-submit.

Kindly help with an example if possible. It would be very helpful.

Note: Kindly do not post Spark documentation links, because I have already tried them. Even though the questions sound naive, I still need more clarity in understanding.

Perspective answered 10/3, 2015 at 20:5 Comment(0)

Well, terminology can always be difficult since it depends on context. In many cases, you may be used to "submitting a job to a cluster", which for Spark would mean submitting a driver program.

That said, Spark has its own definition of "job", taken directly from the glossary:

Job A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
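
In other words, transformations are lazy and do nothing by themselves; only an action makes a job appear. A tiny PySpark sketch of that (all names here are purely illustrative):

# Transformations are lazy; only the action at the end spawns a job.
from pyspark import SparkContext

sc = SparkContext(appName="what-is-a-job")

nums = sc.parallelize(range(100))     # no job yet
doubled = nums.map(lambda x: x * 2)   # still no job: map is a lazy transformation
print(doubled.collect())              # collect is an action -> one job shows up
                                      # in the driver's log and in the Spark UI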

So in this context, let's say you need to do the following (a PySpark sketch of these steps follows the list):

  1. Load a file with people names and addresses into RDD1
  2. Load a file with people names and phones into RDD2
  3. Join RDD1 and RDD2 by name, to get RDD3
  4. Map on RDD3 to get a nice HTML presentation card for each person as RDD4
  5. Save RDD4 to file.
  6. Map RDD1 to extract zipcodes from the addresses to get RDD5
  7. Aggregate on RDD5 to get a count of how many people live in each zipcode as RDD6
  8. Collect RDD6 and print these stats to stdout.
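
For concreteness, here is a minimal PySpark sketch of those 8 steps; the file names, record layouts and the HTML formatting are made up purely for illustration:

from pyspark import SparkContext

sc = SparkContext(appName="people-cards")

# 1. (name, address) pairs, assuming lines like "name,address"
rdd1 = sc.textFile("people_addresses.txt").map(lambda line: tuple(line.split(",", 1)))
# 2. (name, phone) pairs, assuming lines like "name,phone"
rdd2 = sc.textFile("people_phones.txt").map(lambda line: tuple(line.split(",", 1)))

rdd1.persist()  # rdd1 is reused in steps 3 and 6, so cache it

# 3. Join by name -> (name, (address, phone))
rdd3 = rdd1.join(rdd2)

# 4. A (pretend) HTML presentation card per person
rdd4 = rdd3.map(lambda kv: "<div><h1>%s</h1><p>%s / %s</p></div>"
                % (kv[0], kv[1][0], kv[1][1]))

# 5. Save: this action triggers the first job
rdd4.saveAsTextFile("cards_output")

# 6. Extract zipcodes, assuming the zipcode is the last token of the address
rdd5 = rdd1.map(lambda kv: kv[1].split()[-1])

# 7. Count people per zipcode
rdd6 = rdd5.map(lambda z: (z, 1)).reduceByKey(lambda a, b: a + b)

# 8. Collect: this action triggers the second job
for zipcode, count in rdd6.collect():
    print(zipcode, count)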

So,

  1. The driver program is this entire piece of code, running all 8 steps.
  2. Producing the entire HTML card set in step 5 is a job (clear because we are using the save action, not a transformation). The same goes for the collect in step 8.
  3. Other steps will be organized into stages, with each job being the result of a sequence of stages. For simple things a job can have a single stage, but the need to repartition data (for instance, the join in step 3) or anything that breaks the locality of the data usually causes more stages to appear. You can think of stages as computations that produce intermediate results, which can in fact be persisted. For instance, we can persist RDD1 since we'll be using it more than once, avoiding recomputation (as done in the sketch above).
  4. All 3 above basically talk about how the logic of a given algorithm will be broken down. In contrast, a task is a particular piece of data that will go through a given stage, on a given executor (see the small sketch after this list).
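
And to make the last point concrete: a stage is split into one task per partition of the RDD it computes. A tiny, self-contained sketch (the numbers are made up for illustration):

from pyspark import SparkContext

sc = SparkContext(appName="tasks-example")

data = sc.parallelize(range(1000), 8)   # 8 partitions
print(data.getNumPartitions())          # -> 8

# sum() is an action, so it spawns a job; with no shuffle involved the job
# has a single stage, and that stage runs 8 tasks, one per partition,
# spread over the available executors.
print(data.map(lambda x: x * 2).sum())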

Hope it makes things clearer ;-)

Karnes answered 10/3, 2015 at 21:23 Comment(5)
It is clear to me now :) but nonetheless I have a query on how to write job scheduling. I have read the docs but am unable to get a hook on the code.Perspective
Well, that depends a lot on the kind of infrastructure you have (are you using Spark on YARN, for instance?). Not my strong suit, but in principle I launch all my driver programs from Bash scripts (in order to remember parameters, create output folders, etc.). Any normal scheduling tool able to run a console command should work, IMHO. If each job uses all resources in the cluster, then you can just submit programs and they'll wait for resources to be freed.Karnes
@DanielLangdon Step 1, loading a file into RDD1, is also a job??Homely
@akashpatel No, it's not. A job means a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action.Lawsuit
Sorry, could anyone confirm that the relation is always 1:1 (1 action = 1 job), i.e. for each Spark action one and only one job would be triggered, and the other way around? Thank you.Isomerism

Hey, here's something I did before; hope it works for you:

#!/bin/bash
# Hadoop and Server Variables
HADOOP="hadoop fs"
HDFS_HOME="hdfs://ha-edge-group/user/max"
LOCAL_HOME="/home/max"

# Cluster Variables
DRIVER_MEM="10G"
EXECUTOR_MEM="10G"
CORES="5"
EXECUTORS="15"

# Script Arguments
SCRIPT="availability_report.py" # Arg[0]
APPNAME="Availability Report" # arg[1]

DAY=`date -d yesterday +%Y%m%d`

for HOUR in 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23
do
        #local directory to getmerge to
        LOCAL_OUTFILE="$LOCAL_HOME/availability_report/data/$DAY/$HOUR.txt"
        mkdir -p "$(dirname "$LOCAL_OUTFILE")" # make sure the local target directory exists

        # Script arguments
        HDFS_SOURCE="webhdfs://1.2.3.4:0000/data/lbs_ndc/raw_$DAY'_'$HOUR" # arg[2]
        HDFS_CELLS="webhdfs://1.2.3.4:0000/data/cells/CELLID_$DAY.txt" # arg[3]
        HDFS_OUT_DIR="$HDFS_HOME/availability/$DAY/$HOUR" # arg[4]

        spark-submit \
        --master yarn-cluster \
        --driver-memory $DRIVER_MEM \
        --executor-memory $EXECUTOR_MEM \
        --executor-cores $CORES \
        --num-executors $EXECUTORS \
        --conf spark.scheduler.mode=FAIR \
        "$SCRIPT" "$APPNAME" "$HDFS_SOURCE" "$HDFS_CELLS" "$HDFS_OUT_DIR"

        $HADOOP -getmerge $HDFS_OUT_DIR $LOCAL_OUTFILE
done
Dev answered 21/10, 2016 at 3:50 Comment(0)
