Pyspark dataframe: Summing over a column while grouping over another

I have a dataframe such as the following:

In [94]: prova_df.show()


order_item_order_id order_item_subtotal
1                   299.98             
2                   199.99             
2                   250.0              
2                   129.99             
4                   49.98              
4                   299.95             
4                   150.0              
4                   199.92             
5                   299.98             
5                   299.95             
5                   99.96              
5                   299.98             

What I would like to do is to compute, for each different value of the first column, the sum over the corresponding values of the second column. I've tried doing this with the following code:

from pyspark.sql import functions as func
prova_df.groupBy("order_item_order_id").agg(func.sum("order_item_subtotal")).show()

This gives the following output:

SUM('order_item_subtotal)
129.99000549316406       
579.9500122070312        
199.9499969482422        
634.819995880127         
434.91000747680664 

I'm not sure this is doing the right thing. Why isn't it also showing the information from the first column? Thanks in advance for your answers.

Bedcover answered 27/11, 2015 at 16:57 Comment(0)

Why isn't it also showing the information from the first column?

Most likely because you're running an outdated Spark 1.3.x release. If that's the case, you have to repeat the grouping columns inside agg, as follows:

from pyspark.sql import functions as func

# Spark 1.3.x: the grouping column must be selected explicitly inside agg
(df
    .groupBy("order_item_order_id")
    .agg(func.col("order_item_order_id"), func.sum("order_item_subtotal"))
    .show())
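
On Spark 1.4 or later the grouping column is kept in the result automatically, so the extra func.col call is no longer needed. A minimal sketch along those lines, using the question's prova_df and an illustrative alias for the aggregate column:

from pyspark.sql import functions as func

(prova_df
    .groupBy("order_item_order_id")
    .agg(func.sum("order_item_subtotal").alias("subtotal_sum"))
    .show())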
Burroughs answered 28/11, 2015 at 5:35 Comment(0)

A similar solution for your problem using a more recent PySpark (the 2.x line) would look like this:

df = spark.createDataFrame(
    [(1, 299.98),
    (2, 199.99),
    (2, 250.0),
    (2, 129.99),
    (4, 49.98),
    (4, 299.95),
    (4, 150.0),
    (4, 199.92),
    (5, 299.98),
    (5, 299.95),
    (5, 99.96),
    (5, 299.98)],
    ['order_item_order_id', 'order_item_subtotal'])

df.groupBy('order_item_order_id').sum('order_item_subtotal').show()

Which results in the following output:

+-------------------+------------------------+
|order_item_order_id|sum(order_item_subtotal)|
+-------------------+------------------------+
|                  5|       999.8700000000001|
|                  1|                  299.98|
|                  2|                  579.98|
|                  4|                  699.85|
+-------------------+------------------------+
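
If the generated sum(order_item_subtotal) column name and the floating-point noise (e.g. 999.8700000000001) are unwanted, a small variation of the same aggregation, sketched here with an illustrative alias order_total, rounds and renames the result:

from pyspark.sql import functions as F

(df
    .groupBy('order_item_order_id')
    .agg(F.round(F.sum('order_item_subtotal'), 2).alias('order_total'))
    .orderBy('order_item_order_id')
    .show())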
Bombay answered 26/9, 2019 at 17:14 Comment(0)

You can use a window function partitioned by the order id for that. Note that this keeps every row and adds the per-order sum as a new column instead of collapsing each group to a single row:

from pyspark.sql import Window
from pyspark.sql import functions as f

# adds a column holding the sum over all rows with the same order id
df.withColumn("value_field", f.sum("order_item_subtotal") \
  .over(Window.partitionBy("order_item_order_id"))) \
  .show()
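
If only one row per order is needed afterwards, a follow-up sketch (assuming the same df and the imports above, with order_total as an illustrative column name) selects the two columns and deduplicates; a plain groupBy is simpler when the individual item rows are not required:

(df
  .withColumn("order_total",
              f.sum("order_item_subtotal").over(Window.partitionBy("order_item_order_id")))
  .select("order_item_order_id", "order_total")
  .distinct()
  .show())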
Garbanzo answered 19/7, 2018 at 10:27 Comment(2)
what is value_field here? – Nisbet
Just an arbitrary string that you want the column to be named. – Enabling
