How to find the size or shape of a DataFrame in PySpark?

I am trying to find out the size/shape of a DataFrame in PySpark. I do not see a single function that can do this.

In Python, I can do this:

data.shape()

Is there a similar function in PySpark? This is my current solution, but I am looking for a more elegant one:

row_number = data.count()
column_number = len(data.dtypes)

The computation of the number of columns is not ideal...

Salic answered 23/9, 2016 at 4:42 Comment(3)
Put this in a function ?Rhyne
You mean data.shape for NumPy and Pandas? shape is not a function.Tailored
What is not ideal? I am not sure what else you would like to accomplish than what you already have (except for replacing data.dtypes with data.columns, but it makes little difference).Accra

You can get its shape with:

print((df.count(), len(df.columns)))
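
For example, a quick check on a toy DataFrame might look like this (a minimal sketch; spark is assumed to be an existing SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])

# (rows, columns) -> prints (3, 2)
print((df.count(), len(df.columns)))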
Roxie answered 11/8, 2017 at 17:28 Comment(3)
Will this work fine for larger datasets spread across nodes?Thermolabile
Why doesn't Pyspark Dataframe simply store the shape values like pandas dataframe does with .shape? Having to call count seems incredibly resource-intensive for such a common and simple operation.Redwing
@THISUSERNEEDSHELP I suspect it is because Pyspark DFs are lazy and do not do operations like filter() and flatMap() immediately, and these operations change the shape of the dataframe in an unpredictable way. The row count could be cached for operations that do not change the number of rows, but this would give an inconsistent API and cost some extra engineering effort.Elba

Use df.count() to get the number of rows.
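
Combined with len(df.columns) for the column count, a minimal sketch (df is assumed to be an existing DataFrame):

n_rows = df.count()        # triggers a Spark job over the data
n_cols = len(df.columns)   # uses only the schema, no job is run
print(n_rows, n_cols)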

Girt answered 18/8, 2017 at 13:33 Comment(0)

Add this to your code:

from pyspark.sql.dataframe import DataFrame

def spark_shape(self):
    # Return (row_count, column_count), mimicking pandas' .shape
    return (self.count(), len(self.columns))

# Monkey-patch the helper onto the DataFrame class
DataFrame.shape = spark_shape

Then you can do

>>> df.shape()
(10000, 10)

But bear in mind that .count() can be very slow for a very large table that has not been persisted.
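
If the shape is needed repeatedly, one option (a sketch, assuming a DataFrame named df and the shape helper above) is to persist it first so only the first count() pays for a full scan:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # lazy; the first action materializes the cache
print(df.shape())   # first call scans the data and fills the cache
print(df.shape())   # later calls count the cached data, which is much faster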

Papilla answered 19/12, 2018 at 19:20 Comment(2)
I really think it's a bad idea to change the DataFrame API without a valid reason to do so. Just call spark_shape(my_df)... Moreover, consider naming the function something clearer, like compute_dataframe_shape...Faintheart
This tip is for data scientists or data analysts who have to type these commands many times a day when analyzing data. It is not meant from an engineering perspective or for production code.Papilla
print((df.count(), len(df.columns)))

is easier for smaller datasets.

However, if the dataset is huge, an alternative approach is to use pandas and Arrow to convert the DataFrame to a pandas DataFrame and call shape:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.crossJoin.enabled", "true")
print(df.toPandas().shape)
Climacteric answered 31/3, 2020 at 11:51 Comment(2)
Isn't .toPandas an action? Meaning: isn't this going to collect the data to your master, and then call shape on it? If so, it would be inadvisable to do that, unless you're sure it's going to fit in master's memory.Coumarin
If the dataset is huge, collecting to Pandas is exactly what you do NOT want to do. Btw: Why do you enable cross join for this? And does the arrow configuration help collecting to pandas?Accra

I don't think there is a function similar to data.shape in Spark. But I would use len(data.columns) rather than len(data.dtypes).
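
For illustration, both expressions return the same column count (a small sketch; data is assumed to be an existing DataFrame):

n_cols = len(data.columns)      # list of column names
n_cols_alt = len(data.dtypes)   # list of (name, dtype) tuples
assert n_cols == n_cols_alt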

Retarded answered 30/10, 2016 at 13:34 Comment(1)
that just gives you number of columns. What about number of rows?Intemperate
