Optimizing PySpark code by calculating DataFrame size

I'm using the following function, built partly from a code snippet I got from this post: Compute size of Spark dataframe - SizeEstimator gives unexpected results.

I added my own calculation of the partition count, which, as I understood from this tutorial (https://www.youtube.com/watch?v=hvF7tY2-L3U), will be distributed within the workers, and ended up with the function below:

def get_partitions(df):
    partitions = 1
    # Cache and force materialization so the plan's size statistics are populated
    df.cache().foreach(lambda x: x)
    # Ask the optimized logical plan for its estimated size in bytes
    df_size_in_bytes = spark._jsparkSession.sessionState() \
        .executePlan(df._jdf.queryExecution().logical(),
                     df._jdf.queryExecution().mode()) \
        .optimizedPlan() \
        .stats() \
        .sizeInBytes()
    kilo_bytes = int(df_size_in_bytes / 1024)
    mega_bytes = int(kilo_bytes / 1024)
    # Target roughly 128 MB per partition
    parts = int(mega_bytes / 128)
    if parts <= 0:
        parts = partitions
    else:
        partitions = parts
    return partitions

However, it gives me the following error:

Py4JError: An error occurred while calling o8705.mode. Trace:
py4j.Py4JException: Method mode([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

I'm working with AWS Glue jobs and interactive sessions. In Glue jobs I get an empty trace, and in Glue interactive sessions I get the error above. What is missing in the calculation? I'm currently using Spark 3.1 and Glue 3.0. Any help will be appreciated!

Incisive asked 28 Sep 2023 at 22:28

The Spark version that Glue 3.0 uses (Spark 3.1) does not accept the "mode" parameter for executePlan(); that parameter was introduced in a later commit to Spark core.
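If the helper also has to run on newer runtimes, where executePlan() does take the mode argument, one option is to try the newer signature first and fall back when Py4J reports that the method does not exist. This is only a sketch built on an assumption: it uses the same internal, underscore-prefixed objects as the original function, which are not a stable public API:

from py4j.protocol import Py4JError

def plan_size_in_bytes(spark, df):
    """Optimizer-estimated size of df in bytes, tolerant of the
    executePlan() signature change between Spark versions."""
    qe = df._jdf.queryExecution()
    state = spark._jsparkSession.sessionState()
    try:
        # Newer Spark: QueryExecution exposes mode(), and executePlan() takes it
        plan = state.executePlan(qe.logical(), qe.mode())
    except Py4JError:
        # Older Spark (e.g. Glue 3.0 / Spark 3.1): single-argument signature
        plan = state.executePlan(qe.logical())
    return plan.optimizedPlan().stats().sizeInBytes()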

So, if you only target Glue 3.0, you just need to remove the parameter from the method call:

from math import ceil

def get_partitions(df):
    # Cache and force materialization so the size statistics are populated
    df.cache().foreach(lambda x: x)
    # No mode argument here: the Spark build in Glue 3.0 lacks that signature
    df_size_in_bytes = spark._jsparkSession.sessionState() \
        .executePlan(
            df._jdf.queryExecution().logical()
        ).optimizedPlan() \
        .stats() \
        .sizeInBytes()

    # Convert bytes to megabytes and aim for roughly 128 MB per partition
    mega_bytes = df_size_in_bytes / 1000**2
    parts = ceil(mega_bytes / 128)
    partitions = 1 if parts == 0 else parts

    return partitions
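
As a quick usage sketch (the output path here is hypothetical), you can repartition to the computed count before writing, so each output file lands near the 128 MB target:

num_partitions = get_partitions(df)

# Hypothetical output location; adjust to your job's sink
df.repartition(num_partitions) \
    .write.mode("overwrite") \
    .parquet("s3://my-bucket/optimized-output/")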
Billiton answered 29 Sep 2023 at 21:33
@100chou Thank you so much for the reference, I'm testing it right now. This will save me days of debugging the issue I'm facing with my code. - Incisive
You're welcome. I adjusted the code because I realized the conversion from bytes to megabytes was wrong; nothing too problematic, just to be accurate. - Billiton
