I have a dataframe such as the following:
In [94]: prova_df.show()
order_item_order_id order_item_subtotal
1                   299.98
2                   199.99
2                   250.0
2                   129.99
4                   49.98
4                   299.95
4                   150.0
4                   199.92
5                   299.98
5                   299.95
5                   99.96
5                   299.98
What I would like to do is compute, for each distinct value in the first column, the sum of the corresponding values in the second column. I've tried doing this with the following code:
from pyspark.sql import functions as func
prova_df.groupBy("order_item_order_id").agg(func.sum("order_item_subtotal")).show()
This gives the following output:
SUM('order_item_subtotal)
129.99000549316406
579.9500122070312
199.9499969482422
634.819995880127
434.91000747680664
I'm not sure it's doing the right thing, though. Why isn't it also showing the information from the first column? Thanks in advance for your answers.
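In case it helps, here is a minimal self-contained version of what I'm running. The SparkSession setup and the inline sample data are my reconstruction of the session above (on older Spark 1.x you would build the DataFrame through sqlContext.createDataFrame instead), and the alias() is only there to give the sum column a readable name:

from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.appName("groupby-sum-repro").getOrCreate()

# Same sample data as shown above, re-typed by hand.
data = [
    (1, 299.98),
    (2, 199.99), (2, 250.0), (2, 129.99),
    (4, 49.98), (4, 299.95), (4, 150.0), (4, 199.92),
    (5, 299.98), (5, 299.95), (5, 99.96), (5, 299.98),
]
prova_df = spark.createDataFrame(
    data, ["order_item_order_id", "order_item_subtotal"]
)

# Group on the first column and sum the second; alias() just renames
# the aggregate column so the output header is readable.
(prova_df
 .groupBy("order_item_order_id")
 .agg(func.sum("order_item_subtotal").alias("subtotal_sum"))
 .show())

(The precision noise in the sums above, e.g. 129.99000549316406, makes me think the subtotal column in my real data is stored as a 32-bit float, so this double-based reconstruction may print cleaner numbers.)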