How to sum values of one column based on other columns in pandas?

I am working with a dataframe that looks like this (text version below):

I am supposed to calculate which country has scored the most goals since 2010 in tournaments. So far I have managed to manipulate the dataframe by filtering out friendlies like this:

no_friendlies = df[df.tournament != "Friendly"]

Then I set the date column to be the index in order to filter out all matches before 2010:

no_friendlies_indexed = no_friendlies.set_index('date')
since_2010 = no_friendlies_indexed.loc['2010-01-01':]

I am pretty lost from this point onward, as I can't figure out how to sum the goals scored by each country both home and away.

Any help/advice is appreciated!

EDIT:

Text version of sample data:

date    home_team   away_team   home_score  away_score  tournament  city    country     neutral
0   1872-11-30  Scotland    England     0   0       Friendly    Glasgow     Scotland    False
1   1873-03-08  England     Scotland    4   2       Friendly    London  England     False
2   1874-03-07  Scotland    England     2   1       Friendly    Glasgow     Scotland    False
3   1875-03-06  England     Scotland    2   2       Friendly    London  England     False
4   1876-03-04  Scotland    England     3   0       Friendly    Glasgow     Scotland    False
5   1876-03-25  Scotland    Wales       4   0       Friendly    Glasgow     Scotland    False
6   1877-03-03  England     Scotland    1   3       Friendly    London  England     False
7   1877-03-05  Wales       Scotland    0   2       Friendly    Wrexham     Wales   False
8   1878-03-02  Scotland    England     7   2       Friendly    Glasgow     Scotland    False
9   1878-03-23  Scotland    Wales       9   0       Friendly    Glasgow     Scotland    False
10  1879-01-18  England     Wales       2   1       Friendly    London  England     False

EDIT 2:

I have just tried doing this:

since_2010.groupby(['home_team', 'home_score']).sum()

But it doesn't return the sum of home goals scored by the home teams (if this worked, I would just repeat it for the away teams to get the total).

Afford answered 23/7, 2020 at 0:46 Comment(5)
Paste text version of sample dataGuerin
and expected outputQuantify
You will want to reshape this so that there are 3 (or 4) columns something like ['date', 'team', 'Home_or_Away', 'score']. pd.wide_to_long or melt can accomplish that.Ravishment
What @Ravishment said and then .groupby the Team_name and get the sum() of the Score.Quantify
i just pasted the text version of the dataAfford
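
As a rough sketch of the reshape suggested in the comments above, the following uses DataFrame.melt on three rows taken from the sample data. The variable names (teams, scores, long_df) and the column labels 'home_or_away', 'team' and 'score' are illustrative choices, not something the original data defines.

```python
import pandas as pd

# Three rows copied from the question's sample data
df = pd.DataFrame({
    "date": ["1873-03-08", "1877-03-05", "1879-01-18"],
    "home_team": ["England", "Wales", "England"],
    "away_team": ["Scotland", "Scotland", "Wales"],
    "home_score": [4, 0, 2],
    "away_score": [2, 2, 1],
})

# Melt the team columns and the score columns separately; melt stacks the
# value_vars in the order given (all home rows first, then all away rows),
# so the two results line up row by row.
teams = df.melt(id_vars="date", value_vars=["home_team", "away_team"],
                var_name="home_or_away", value_name="team")
scores = df.melt(id_vars="date", value_vars=["home_score", "away_score"],
                 value_name="score")

long_df = teams.assign(
    home_or_away=teams["home_or_away"].str.replace("_team", "", regex=False),
    score=scores["score"],
)

# One row per (match, side), so a single groupby gives goals per team
totals = long_df.groupby("team")["score"].sum()
print(totals)
```

This relies on both melt calls returning rows in the same order, which should hold here because they share the same id_vars and the same home-then-away ordering.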
Q
4

Use .groupby and .sum() for the home team, then do the same for the away team and add the two together:

df_new = df.groupby('home_team')['home_score'].sum() + df.groupby('away_team')['away_score'].sum()

output:

England     12
Scotland    34
Wales        1
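
For anyone who wants to reproduce these numbers, here is a minimal self-contained sketch that rebuilds the eleven sample rows from the question (only the columns needed for the sum) and applies the same expression. The intermediate names home_goals and away_goals are just for readability.

```python
import pandas as pd

# The eleven sample rows from the question (date, teams and scores only)
rows = [
    ("1872-11-30", "Scotland", "England", 0, 0),
    ("1873-03-08", "England", "Scotland", 4, 2),
    ("1874-03-07", "Scotland", "England", 2, 1),
    ("1875-03-06", "England", "Scotland", 2, 2),
    ("1876-03-04", "Scotland", "England", 3, 0),
    ("1876-03-25", "Scotland", "Wales", 4, 0),
    ("1877-03-03", "England", "Scotland", 1, 3),
    ("1877-03-05", "Wales", "Scotland", 0, 2),
    ("1878-03-02", "Scotland", "England", 7, 2),
    ("1878-03-23", "Scotland", "Wales", 9, 0),
    ("1879-01-18", "England", "Wales", 2, 1),
]
df = pd.DataFrame(
    rows,
    columns=["date", "home_team", "away_team", "home_score", "away_score"],
)

home_goals = df.groupby("home_team")["home_score"].sum()  # goals scored at home
away_goals = df.groupby("away_team")["away_score"].sum()  # goals scored away

df_new = home_goals + away_goals  # aligned on the team name in the index
print(df_new)
# England     12
# Scotland    34
# Wales        1
```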

More detailed explanation (per comment):

  1. You only need to .groupby one column, home_team. In your attempt, you were grouping by ['home_team', 'home_score']. Your goal (no pun intended) is to get the .sum() of home_score, so you should NOT .groupby() it. As you can see, ['home_score'] comes after the .groupby call; it selects the column that .sum() is applied to. That gets you set for the home teams.
  2. Then, you do the same for the away_team.
  3. At that point, both results are Series indexed by country name, and pandas aligns them on that index, so you can simply add them together.
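
To make the difference in point 1 concrete, here is a small sketch on five rows from the sample data. Grouping by both columns creates one group per distinct (team, score) pair, whereas grouping by the team alone and then selecting home_score gives one total per team. The names by_pair and by_team are only for illustration.

```python
import pandas as pd

# Five rows copied from the question's sample data
df = pd.DataFrame({
    "home_team": ["England", "England", "England", "England", "Scotland"],
    "away_team": ["Scotland", "Scotland", "Scotland", "Wales", "England"],
    "home_score": [4, 2, 1, 2, 7],
    "away_score": [2, 2, 3, 1, 2],
})

# Grouping by team AND score: one group per distinct (team, score) pair.
# England scored 2 at home twice, so that pair holds two rows.
by_pair = df.groupby(["home_team", "home_score"]).size()
print(by_pair)

# Grouping by team only, then summing the score column: one total per team.
by_team = df.groupby("home_team")["home_score"].sum()
print(by_team)
# England     9
# Scotland    7
```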
Quantify answered 23/7, 2020 at 1:11 Comment(6)
Hi, thank you very much for your help. This seems to be working. If you have the time, would you be able to explain the difference between your code and my code as I was trying to use groupby as well before posting here but it was not working: df.groupby(['home_team', 'home_score']).sum()Afford
@Afford You only need to groupby one column, home_team. In your attempt, you were grouping by ['home_team', 'home_score']. Your goal (no pun intended) is to get the sum() of the home_score, so you should NOT groupby it. As you can see, ['home_score'] comes after the part where I use .groupby, so that I can get the .sum() of it. That gets you set for the home teams. Then, you do the same for the away teams. At that point, because the home and away results are both indexed by country, pandas aligns them on that index and you can simply add them together.Quantify
Smart and simple :D. Though it's highly unlikely that a team plays only as a home team or only as an away team, you might consider using Series.add with fill_value=0 (df.groupby('home_team')['home_score'].sum().add(df.groupby('away_team')['away_score'].sum(), fill_value=0)), as normal + addition will cause the sum for a team missing in one Series to be NaN, regardless of the goals scored in the other.Ravishment
This is a good point, although I agree it would be highly unlikely for a team to be permanently away / home. As you may know in all or most sports, teams have a schedule that is 50% home and 50% away :). For other types of data I agree though, you might want a more robust approach.Quantify
thanks again for the explanations! this discussion is also helping me understand working with this type of dataAfford
no problem sak, please accept one of the answers as solution by clicking the checkmark next to the solution.Quantify
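
To illustrate the fill_value point raised in the comments above, here is a minimal sketch using three rows from the sample data in which Wales appears only as an away team. The names naive and safe are illustrative.

```python
import pandas as pd

# Three rows from the sample data; Wales never plays at home here
df = pd.DataFrame({
    "home_team": ["England", "Scotland", "Scotland"],
    "away_team": ["Scotland", "England", "Wales"],
    "home_score": [2, 3, 4],
    "away_score": [2, 0, 0],
})

home_goals = df.groupby("home_team")["home_score"].sum()
away_goals = df.groupby("away_team")["away_score"].sum()

# Plain + : Wales is missing from home_goals, so its total becomes NaN
naive = home_goals + away_goals
print(naive)

# Series.add with fill_value=0 treats the missing side as 0 instead
safe = home_goals.add(away_goals, fill_value=0).astype(int)
print(safe)
# England     2
# Scotland    9
# Wales       0
```

The astype(int) at the end is optional; the aligned result comes back as floats, and the cast only restores whole-number goal counts.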
R
2

Use pd.wide_to_long to reshape. The benefit is it automatically creates a 'home_or_away' indicator, but we will first change the columns so that they are 'score_home' (as opposed to 'home_score').

import pandas as pd

# Swap column stubs around `'_'`, e.g. 'home_score' -> 'score_home'
df.columns = ['_'.join(x[::-1]) for x in df.columns.str.split('_')]

# Your filter (non-friendlies since 2010) would drop everything in your provided example
# df['date'] = pd.to_datetime(df['date'])
# df = df[df['date'].dt.year.ge(2010) & df['tournament'].ne('Friendly')]

df = pd.wide_to_long(df, i='date', j='home_or_away',
                     stubnames=['team', 'score'], sep='_', suffix='.*')

#                          country  neutral tournament     city      team  score
#date       home_or_away                                                        
#1872-11-30 home          Scotland    False   Friendly  Glasgow  Scotland      0
#1873-03-08 home           England    False   Friendly   London   England      4
#1874-03-07 home          Scotland    False   Friendly  Glasgow  Scotland      2
#...
#1878-03-02 away          Scotland    False   Friendly  Glasgow   England      2
#1878-03-23 away          Scotland    False   Friendly  Glasgow     Wales      0
#1879-01-18 away           England    False   Friendly   London     Wales      1

So now, regardless of home or away, you can get the goals scored:

df.groupby('team')['score'].sum()
#team
#England     12
#Scotland    34
#Wales        1
#Name: score, dtype: int64
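
One caveat when applying this to the full dataset: pd.wide_to_long requires the i column(s) to uniquely identify each row, and on real data several matches are usually played on the same date, so i='date' would likely raise an error there. Below is a hedged sketch of one way around that, using a made-up match_id column as the identifier. The five rows, the team names A to D and the 'Cup' tournament label are invented purely for illustration.

```python
import pandas as pd

# Invented rows for illustration only (not from the real dataset)
df = pd.DataFrame({
    "date": ["2009-06-01", "2010-06-11", "2010-06-11", "2011-03-01", "2012-07-01"],
    "home_team": ["A", "A", "C", "B", "B"],
    "away_team": ["B", "B", "D", "C", "C"],
    "home_score": [5, 1, 2, 4, 0],
    "away_score": [0, 3, 2, 4, 2],
    "tournament": ["Cup", "Cup", "Cup", "Friendly", "Cup"],
})

# Keep non-friendly matches played in 2010 or later
df["date"] = pd.to_datetime(df["date"])
recent = df[df["date"].dt.year.ge(2010) & df["tournament"].ne("Friendly")].copy()

# Swap the stubs ('home_score' -> 'score_home') and add a unique row id,
# since two of the remaining matches share the same date
recent.columns = ["_".join(c.split("_")[::-1]) for c in recent.columns]
recent.insert(0, "match_id", range(len(recent)))

long_df = pd.wide_to_long(recent, stubnames=["team", "score"],
                          i="match_id", j="home_or_away",
                          sep="_", suffix=r"\w+")

totals = long_df.groupby("team")["score"].sum()
print(totals)
print(totals.idxmax())  # team with the most goals in the filtered matches
```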
Ravishment answered 23/7, 2020 at 1:15 Comment(1)
this is great, had never used the wide to long function before, thank you so much for the detailed reply!Afford
