How to find the size of a DataFrame in PySpark

How can I replicate this Scala code in PySpark to get the size of a DataFrame?

scala> val df = spark.range(10)
scala> print(spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats)
Statistics(sizeInBytes=80.0 B, hints=none)

What I would like to do is get the sizeInBytes value into a variable.

Ramrod asked 3/6, 2020 at 13:31 Comment(1)
Essentially I am trying to work out the size of the dataframe from the optimized plan, like so: val sizeInBytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats.sizeInBytesRamrod

In Spark 2.4 you can do

df = spark.range(10)
df.createOrReplaceTempView('myView')
spark.sql('explain cost select * from myView').show(truncate=False)

== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(8)), Statistics(sizeInBytes=80.0 B, hints=none)

In Spark 3.0.0-preview2 you can use explain with the cost mode:

df = spark.range(10)
df.explain(mode='cost')

== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(8)), Statistics(sizeInBytes=80.0 B)
Clef answered 3/6, 2020 at 20:6 Comment(4)
How can I get the sizeInBytes into a variable?Ramrod
@DinoG Well, getting it into a variable is trickier, but you can always parse the string from spark.sql('explain cost ...').collect()[0]['plan']; it will not be pretty, but it is certainly possible (see the sketch after these comments)Clef
This seems to work but isn't as neat as the following code, which can be used in Scala: spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats.sizeInBytes. How can I replicate this in PySpark?Ramrod
I would like to ask how this would look in Java. Would it be correct to just write System.out.println(df.logicalPlan().stats().sizeInBytes())?Davie
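
A rough sketch of that parsing idea in PySpark (it assumes the optimized plan text contains a sizeInBytes=... token, whose exact formatting varies between Spark versions):

import re

df = spark.range(10)
df.createOrReplaceTempView('myView')

# Capture the full plan text instead of printing it
plan = spark.sql('explain cost select * from myView').collect()[0]['plan']

# Pull the human-readable size (e.g. '80.0 B') out of the optimized plan
match = re.search(r'sizeInBytes=([\d.]+\s*\w+)', plan)
size_in_bytes = match.group(1) if match else None
print(size_in_bytes)  # e.g. '80.0 B'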

See if this helps:

Read the JSON file source and compute statistics such as size in bytes, number of rows, etc. These statistics also help Spark make intelligent decisions while optimizing the execution plan. The same approach works in PySpark too (see the PySpark sketch after the output below).

/**
      * file content
      * spark-test-data.json
      * --------------------
      * {"id":1,"name":"abc1"}
      * {"id":2,"name":"abc2"}
      * {"id":3,"name":"abc3"}
      */
    val fileName = "spark-test-data.json"
    val path = getClass.getResource("/" + fileName).getPath

    spark.catalog.createTable("df", path, "json")
      .show(false)

    /**
      * +---+----+
      * |id |name|
      * +---+----+
      * |1  |abc1|
      * |2  |abc2|
      * |3  |abc3|
      * +---+----+
      */
    // Collect only statistics that do not require scanning the whole table (that is, size in bytes).
    spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")
    spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)

    /**
      * +----------+---------+-------+
      * |col_name  |data_type|comment|
      * +----------+---------+-------+
      * |Statistics|68 bytes |       |
      * +----------+---------+-------+
      */
    spark.sql("ANALYZE TABLE df COMPUTE STATISTICS")
    spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)

    /**
      * +----------+----------------+-------+
      * |col_name  |data_type       |comment|
      * +----------+----------------+-------+
      * |Statistics|68 bytes, 3 rows|       |
      * +----------+----------------+-------+
      */
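
For PySpark, roughly the same sequence might look like this (a sketch: it assumes spark is an active SparkSession and path points at the same spark-test-data.json file as above):

# Register the JSON file as a table, then compute and read back its statistics
spark.catalog.createTable("df", path=path, source="json").show(truncate=False)

# Collect only statistics that do not require a full scan (size in bytes)
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")
spark.sql("DESCRIBE EXTENDED df").filter("col_name = 'Statistics'").show(truncate=False)

# Full statistics (size in bytes and row count)
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS")
spark.sql("DESCRIBE EXTENDED df").filter("col_name = 'Statistics'").show(truncate=False)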

More info: the Databricks SQL documentation for ANALYZE TABLE.

Mcnary answered 4/6, 2020 at 4:10 Comment(0)

Typically, you can access the Scala methods through Py4J. I just tried this in the PySpark shell:

>>> spark._jsparkSession.sessionState().executePlan(df._jdf.queryExecution().logical()).optimizedPlan().stats().sizeInBytes()
716
Hindi answered 17/5, 2021 at 22:56 Comment(1)
When I try this I get a Py4JException: Method executePlan([class org.apache.spark.sql.catalyst.analysis.UnresolvedRelation]) does not existBesotted
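
On Spark versions where that executePlan signature differs (as in the comment above), a shorter Py4J path that reads the already-optimized plan may work instead. A sketch, not verified on every version:

# Go through the DataFrame's own QueryExecution instead of calling executePlan
stats = df._jdf.queryExecution().optimizedPlan().stats()

# sizeInBytes() comes back as a Scala BigInt wrapped by Py4J, so convert via its string form
size_in_bytes = int(stats.sizeInBytes().toString())
print(size_in_bytes)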

You can use RepartiPy to get an accurate size for your DataFrame, as follows:

import repartipy

# Use this if you have enough (executor) memory to cache the whole DataFrame
# If you do NOT have enough memory (i.e. the DataFrame is too large), use 'repartipy.SamplingSizeEstimator' instead (see the sketch further down).
with repartipy.SizeEstimator(spark=spark, df=df) as se:
    df_size_in_bytes = se.estimate()

RepartiPy leverages the executePlan method internally, as you already mentioned, in order to calculate the in-memory size of your DataFrame.

Please see the docs for more details.
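
If the DataFrame is too large to cache, the sampling-based estimator mentioned in the code comment above might be used roughly like this (a sketch: the sample_count parameter name is an assumption here, so check the RepartiPy docs for the exact signature):

import repartipy

# Estimate from a sample instead of caching the whole DataFrame
with repartipy.SamplingSizeEstimator(spark=spark, df=df, sample_count=10) as se:  # sample_count is assumed
    df_size_in_bytes = se.estimate()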

Upbringing answered 11/3 at 13:48 Comment(0)
