Spark: get number of cluster cores programmatically

Asked 20/11, 2017 at 18:50 Answered 2/12, 2022 at 18:33

Solved java apache-spark dataset hadoop-yarn cpu-cores

I run my Spark application on a YARN cluster. In my code I use the number of cores available to the queue to decide how many partitions my dataset should have:

Dataset ds = ...
ds = ds.coalesce(config.getNumberOfCores()); // coalesce returns a new Dataset

My question: how can I get the number of cores available to the queue programmatically, rather than from configuration?

Dialectal answered 20/11, 2017 at 18:50 Comment(3)

Which resource manager are you using, YARN or Mesos? – Radiotherapy 20/11, 2017 at 19:6

I'm using yarn. – Dialectal 20/11, 2017 at 19:17

Extract the required queue parameters from the YARN cluster API, then use them in coalesce. – Radiotherapy 20/11, 2017 at 20:42

There are ways to get both the number of executors and the number of cores in a cluster from Spark. Here is a bit of Scala utility code that I've used in the past. You should easily be able to adapt it to Java. There are two key ideas:

The number of workers is the number of storage-status entries minus one (one entry is the driver), i.e. sc.getExecutorStorageStatus.length - 1.
The number of cores per worker can be obtained by executing java.lang.Runtime.getRuntime.availableProcessors on a worker.

The rest of the code is boilerplate for adding convenience methods to SparkContext using Scala implicits. I wrote the code for 1.x years ago, which is why it is not using SparkSession.

One final point: it is often a good idea to coalesce to a multiple of your cores as this can improve performance in the case of skewed data. In practice, I use anywhere between 1.5x and 4x, depending on the size of data and whether the job is running on a shared cluster or not.

import org.apache.spark.SparkContext

import scala.language.implicitConversions


class RichSparkContext(val sc: SparkContext) {

  def executorCount: Int =
    sc.getExecutorStorageStatus.length - 1 // one is the driver

  def coresPerExecutor: Int =
    RichSparkContext.coresPerExecutor(sc)

  def coreCount: Int =
    executorCount * coresPerExecutor

  def coreCount(coresPerExecutor: Int): Int =
    executorCount * coresPerExecutor

}


object RichSparkContext {

  trait Enrichment {
    implicit def enrichMetadata(sc: SparkContext): RichSparkContext =
      new RichSparkContext(sc)
  }

  object implicits extends Enrichment

  private var _coresPerExecutor: Int = 0

  def coresPerExecutor(sc: SparkContext): Int =
    synchronized {
      if (_coresPerExecutor == 0)
        // run a one-element job so the count is taken on an executor, then cache it
        _coresPerExecutor = sc.range(0, 1).map(_ => java.lang.Runtime.getRuntime.availableProcessors).collect.head
      _coresPerExecutor
    }

}
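
For readers working in Java, as in the question, the suggestion above to coalesce to a multiple of the core count could be sketched roughly as follows. PartitionSizing and targetPartitions are hypothetical names, not part of Spark, and the core count is assumed to have been obtained already (for example from executorCount * coresPerExecutor).

```java
// Hypothetical helper, not part of Spark: turns a core count and a
// partitions-per-core factor into a partition count for coalesce/repartition.
final class PartitionSizing {

    private PartitionSizing() {}

    /**
     * @param totalCores cluster-wide core count, assumed to be known already
     * @param factor     partitions per core; the answer suggests 1.5 to 4
     * @return at least 1, rounded up so a fractional factor never drops a partition
     */
    static int targetPartitions(int totalCores, double factor) {
        if (totalCores < 0 || factor <= 0) {
            throw new IllegalArgumentException("totalCores must be >= 0 and factor > 0");
        }
        return Math.max(1, (int) Math.ceil(totalCores * factor));
    }

    public static void main(String[] args) {
        // 5 executors x 8 cores each, 2 partitions per core
        System.out.println(targetPartitions(5 * 8, 2.0));
    }
}
```

The result would then be passed to ds.coalesce(...) or ds.repartition(...). The lower bound of 1 is a design choice so the helper still returns a usable value when no executors have been counted.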

Update

Recently, getExecutorStorageStatus has been removed. We have switched to using SparkEnv's blockManager.master.getStorageStatus.length - 1 (the minus one is for the driver again). The normal way to get to it, via SparkContext's env, is not accessible outside the org.apache.spark package. Therefore, we use an encapsulation violation pattern:

package org.apache.spark

object EncapsulationViolator {
  def sparkEnv(sc: SparkContext): SparkEnv = sc.env
}

Paratyphoid answered 21/11, 2017 at 4:44 Comment(12)

sc.getExecutorStorageStatus.length - 1 is good for me. Thank you – Dialectal 22/11, 2017 at 17:14

sometimes executor cores are overprovisioned or underprovisioned, which means JVM runtime function may be inaccurate. – Lacrimator 26/7, 2018 at 4:21

@Lacrimator absolutely true and also true in the case of complex dynamic pool provisioning in various cluster management systems. This is for the common/easy case and needs to be adjusted for complex scenarios. – Paratyphoid 28/7, 2018 at 20:19

In this code sample, _coresPerExecutor is not assigned to. – Antakiya 27/8, 2019 at 5:49

@AndyKershaw I think you are mistaken private var _coresPerExecutor: Int = 0. – Paratyphoid 12/9, 2019 at 4:21

FYI getExecutorStorageStatus is no longer available as of Spark 2.4.4 – Photometer 21/2, 2020 at 23:40

Thanks for the reminder @DenisMakarenko. We updated our code long ago but I did not update this answer. – Paratyphoid 23/2, 2020 at 2:5

Since you're hacking around anyway, you don't really need that violator. Just do val env: org.apache.spark.SparkEnv = sc.getClass.getMethod("env").invoke(sc).asInstanceOf[org.apache.spark.SparkEnv]. Handy from a spark-shell, too. – Isochronism 18/4, 2020 at 15:43

@JamesMoore I prefer compile-time errors to runtime errors in case APIs change. – Paratyphoid 19/4, 2020 at 4:40

@Paratyphoid I assumed _coresPerExecutor was meant to hold the result of the calculation to avoid repeating it and that's why it's a var? Subtracting one for the driver doesn't work if you run single threaded when testing. I currently use Math.max(sparkSession.sparkContext.getExecutorMemoryStatus.size - 1, 1) as getExecutorMemoryStatus is still available. – Antakiya 10/9, 2020 at 11:38

@AndyKershaw I'd strongly recommend against testing with a single worker thread: turning off parallelism can hide certain types of problems until production. We test with 2+ executors and 2+ partitions. – Paratyphoid 29/10, 2020 at 14:47

@Paratyphoid Correct. Debugging would have been a better word for me to use as sometimes it is helpful to do that single threaded. – Antakiya 9/12, 2020 at 15:29

Found this while looking for the answer to pretty much the same question.

I found that:

Dataset ds = ...
ds = ds.coalesce(sc.defaultParallelism()); // coalesce returns a new Dataset

does exactly what the OP was looking for.

For example, my 5 node x 8 core cluster returns 40 for the defaultParallelism.

Evacuate answered 16/3, 2020 at 23:46 Comment(0)

According to Databricks, if the driver and executors are of the same node type, this is the way to go:

java.lang.Runtime.getRuntime.availableProcessors * (sc.statusTracker.getExecutorInfos.length - 1)

Exclaim answered 17/3, 2020 at 15:11 Comment(5)

java.lang.Runtime.getRuntime.availableProcessors tells you how many cpus are on the current machine. Can't assume that's true for all machines in the cluster. – Isochronism 15/4, 2020 at 20:36

@JamesMoore you are correct. This works only in the case the Driver and Worker nodes are of the same node type. – Exclaim 16/4, 2020 at 10:17

@Exclaim I ran this and only got a value for java.lang.Runtime.getRuntime.availableProcessors, and get 0 for sc.statusTracker.getExecutorInfos.length. Is there an appropriate time to request this? I am using with EMR cluster, not databricks. Thanks! – Up 13/7, 2022 at 16:4

@Up I haven't worked with EMR but maybe this would be useful for you - https://mcmap.net/q/748805/-getting-number-of-cores-for-emr-cluster – Exclaim 13/7, 2022 at 20:43

@Exclaim thanks, this looks like I would just manually feed in the info, suggested threads appear to give number of processors, which I can currently get. I need a way for executors to provide a value (sc.statusTracker.getExecutorInfos.length returns 0), which would allow me to programmatically tell my jobs how many partitions to give without having to set up a config for it – Up 18/7, 2022 at 14:51

You could run jobs on every machine and ask it for the number of cores, but that's not necessarily what's available for Spark (as pointed out by @tribbloid in a comment on another answer):

import spark.implicits._
import scala.collection.JavaConverters._
import sys.process._
val procs = (1 to 1000).toDF.map(_ => "hostname".!!.trim -> java.lang.Runtime.getRuntime.availableProcessors).collectAsList().asScala.toMap
val nCpus = procs.values.sum

Running it in the shell (on a tiny test cluster with two workers) gives:

scala> :paste
// Entering paste mode (ctrl-D to finish)

    import spark.implicits._
    import scala.collection.JavaConverters._
    import sys.process._
    val procs = (1 to 1000).toDF.map(_ => "hostname".!!.trim -> java.lang.Runtime.getRuntime.availableProcessors).collectAsList().asScala.toMap
    val nCpus = procs.values.sum

// Exiting paste mode, now interpreting.

import spark.implicits._                                                        
import scala.collection.JavaConverters._
import sys.process._
procs: scala.collection.immutable.Map[String,Int] = Map(ip-172-31-76-201.ec2.internal -> 2, ip-172-31-74-242.ec2.internal -> 2)
nCpus: Int = 4

Add zeros to your range if you typically have lots of machines in your cluster, so that every machine is likely to receive at least one task. Even on my two-machine cluster, a range of 10000 completes in a couple of seconds.

This is probably only useful if you want more information than sc.defaultParallelism() will give you (as in @SteveC's answer).

Isochronism answered 18/4, 2020 at 16:40 Comment(0)

For those who aren't using YARN clusters: if you are working in Python on Databricks, here is a function I wrote that may help. It reads the number of worker nodes and the number of CPUs per worker, and returns their product as the total CPU count across your workers.

import pandas as pd

def GetDistCPUCount():
    # Worker count and node type are read from the Databricks cluster usage tags
    nWorkers = int(spark.sparkContext.getConf().get('spark.databricks.clusterUsageTags.clusterTargetWorkers'))
    GetType = spark.sparkContext.getConf().get('spark.databricks.clusterUsageTags.clusterNodeType')
    # Take the second '_'-separated field of the node type and pull out its first run of digits
    GetSubString = pd.Series(GetType).str.split(pat='_', expand=True)
    GetNumber = GetSubString[1].str.extract(r'(\d+)')
    ParseOutString = GetNumber.iloc[0, 0]
    WorkerCPUs = int(ParseOutString)
    nCPUs = nWorkers * WorkerCPUs
    return nCPUs
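
Since the question is about Java, the parsing step in the function above might look roughly like this on the JVM. This is only a sketch: NodeTypeCores and its methods are hypothetical names, and it assumes, as the Python code does, that the node type name carries the CPU count in its second underscore-separated field (a name shaped like Standard_D8s_v3 is used purely as an illustration). The worker count and node type string would still have to come from the cluster configuration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper mirroring the parsing step of the Python function.
final class NodeTypeCores {

    private static final Pattern DIGITS = Pattern.compile("\\d+");

    private NodeTypeCores() {}

    /** Returns the first run of digits in the second '_'-separated field of the node type. */
    static int coresFromNodeType(String nodeType) {
        String[] fields = nodeType.split("_");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Unexpected node type: " + nodeType);
        }
        Matcher m = DIGITS.matcher(fields[1]);
        if (!m.find()) {
            throw new IllegalArgumentException("No core count in node type: " + nodeType);
        }
        return Integer.parseInt(m.group());
    }

    /** Multiplies the per-worker core count by the number of workers. */
    static int totalWorkerCores(int workers, String nodeType) {
        return workers * coresFromNodeType(nodeType);
    }

    public static void main(String[] args) {
        // Assumed example values: 4 workers of an illustrative node type
        System.out.println(totalWorkerCores(4, "Standard_D8s_v3"));
    }
}
```

Node type names that do not follow this pattern raise an IllegalArgumentException instead of silently producing a wrong count.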

Hollington answered 2/12, 2022 at 18:33 Comment(1)

What is "test" in the 4th line? It is throwing error due to this – Banneret 12/12, 2022 at 11:0
