How to estimate a DataFrame's real size in PySpark?

How can I determine the size of a DataFrame?

Right now I estimate the real size of a DataFrame as follows:

headers_size = sum(len(key) for key in df.first().asDict())
rows_size = df.rdd.map(lambda row: sum(len(str(value)) for value in row.asDict().values())).sum()
total_size = headers_size + rows_size

It is too slow and I'm looking for a better way.

Veratridine answered 6/5, 2016 at 16:38 Comment(4)
You have to collect the RDD to determine its size, so of course it'll be slow for a large dataset. – Giuditta
I was thinking of using the SizeEstimator object to estimate a sample of the RDD. Unfortunately, I could not find a way to do it in Python. – Veratridine
I think this addresses what you are asking: spark.apache.org/docs/latest/… – Giuditta
I am actually looking for a Python implementation, as I stated. @cricket_007 – Veratridine

Currently I am using the approach below, but I'm not sure if it is the best way:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)
df.count()

On the Spark web UI, under the Storage tab, you can check the size, which is displayed in MB. Then I unpersist to clear the memory:

df.unpersist()
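
If you would rather read those numbers programmatically than by eye from the Storage tab, the same figures are exposed through the JVM SparkContext. A minimal sketch, assuming a SparkSession named spark; note that getRDDStorageInfo is a developer API reached through py4j internals (_jsc), so treat it as unofficial and subject to change between Spark versions:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)
df.count()  # force materialization so the cache is populated

# Read the same figures the Storage tab shows, for every cached RDD.
for info in spark.sparkContext._jsc.sc().getRDDStorageInfo():
    print(info.name(), info.memSize(), info.diskSize())

df.unpersist()
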
Marietta answered 11/8, 2016 at 23:54 Comment(4)
Thanks, I can check the size in the Storage tab. Great help. – Gruff
This is probably a bad idea if you have a very large dataset. – Menderes
If you have a very large dataset, it's just a matter of sampling (e.g. df.sample(.01)) and following the same steps (see the sketch after these comments). Then you can approximate the size of the whole dataset. – Bessette
Use df.persist(StorageLevel.MEMORY_AND_DISK). – Hostility
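
Following up on the sampling suggestion in the comment above, a minimal sketch of the extrapolation. The 0.01 fraction is illustrative, sample() only approximates that fraction, and the readout assumes nothing else is cached at the time, so the result is a rough estimate:

from pyspark import StorageLevel

fraction = 0.01
sample_df = df.sample(fraction)

sample_df.persist(StorageLevel.MEMORY_ONLY)
sample_df.count()  # materialize the cache for the sample only

# Same Storage-tab readout as above, applied to the cached sample,
# then scaled back up by the sampling fraction.
sample_bytes = sum(
    info.memSize() + info.diskSize()
    for info in spark.sparkContext._jsc.sc().getRDDStorageInfo()
)
estimated_total_bytes = sample_bytes / fraction

sample_df.unpersist()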

Nice post from Tamas Szuromi: http://metricbrew.com/how-to-estimate-rdd-or-dataframe-real-size-in-pyspark/

from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.
    It converts each Python object into a Java object via Pyrolite,
    whether or not the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

JavaObj = _to_java_object_rdd(df.rdd)

nbytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
Horgan answered 20/7, 2017 at 19:4 Comment(5)
How is this supposed to work? I have tested this code and, in my opinion, the results are more of a "random function" than an estimation. Or maybe I misinterpreted them? I am using Spark 1.6 on CDH 5.11.2. – Beekeeper
This always returns the same size for me, no matter the DataFrame: 216 MB. – Anting
I saw very little change: from 185,704,232 to 186,020,448 to 187,366,176. However, the number of records changed from 5 to 2,000,000 to 1,500,000,000. – Torey
I use PySpark 2.4.4 and it does not work: TypeError: 'JavaPackage' object is not callable. – Aldershot
Do not use this. This is not true memory usage. It reports similar numbers for a DataFrame with 1B records and another with 10M records. – Holly

You can instead use RepartiPy to get an accurate size for your DataFrame, as follows:

import repartipy

# Use this if you have enough (executor) memory to cache the whole DataFrame.
# If you do NOT have enough memory (i.e. the DataFrame is too large), use 'repartipy.SamplingSizeEstimator' instead.
with repartipy.SizeEstimator(spark=spark, df=df) as se:
    df_size_in_bytes = se.estimate()

RepartiPy leverages the caching approach internally (as described in Kiran Thati & David C.'s answer as well) in order to calculate the in-memory size of your DataFrame. Please see the docs for more details.
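
Once you have the byte estimate, a typical next step (and, judging by the name, what RepartiPy is built around) is to turn it into a partition count. A minimal sketch, where the 128 MB per-partition target is an illustrative choice rather than anything RepartiPy mandates:

import math

# Illustrative target: roughly 128 MB of data per partition.
target_partition_bytes = 128 * 1024 * 1024

num_partitions = max(1, math.ceil(df_size_in_bytes / target_partition_bytes))
df = df.repartition(num_partitions)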

Imposture answered 11/3 at 1:26 Comment(0)
