How can repeated rows of data be combined into a single row in pandas?
I have a dataset that contains NBA players' average statistics per game. Some players' statistics appear in several rows because they played for different teams during the season.

For example:

      Player       Pos  Age Tm    G     GS   MP      FG
8   Jarrett Allen   C   22  TOT  28     10  26.2     4.4
9   Jarrett Allen   C   22  BRK  12     5   26.7     3.7
10  Jarrett Allen   C   22  CLE  16     5   25.9     4.9

I want to average Jarrett Allen's stats and put them into a single row. How can I do this?

Footling answered 15/8, 2021 at 15:24 Comment(4)
Did you try groupby? – Dardanelles
I'm a beginner, I don't know how to do it. Can you send me some sources about this, please? – Footling
The syntax is: df.groupby([ <list of columns to keep> ]).mean(). Added an example in an answer. – Dardanelles
Are you sure you want to average all those rows? Mathematically it doesn't make sense, since (a) the "TOT" row is already his stats averaged over all the teams he's played for during the season, and (b) you'd be taking an average of an average for MP and FG. – Wera
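
Following up on that last comment: if one would rather keep the season-total row than re-average, a minimal sketch (assuming the column names from the sample above and pandas imported as pd) would be:

import pandas as pd

# keep the 'TOT' row for players who appear multiple times,
# and the single existing row for everyone else
tot = df[df['Tm'] == 'TOT']
rest = df[~df['Player'].isin(tot['Player'])]
result = pd.concat([tot, rest])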
You can group by Player and use agg to get the mean. For the non-numeric columns, let's take the first value:

df.groupby('Player').agg({k: 'mean' if v in ('int64', 'float64') else 'first'
                          for k,v in df.dtypes[1:].items()})

output:

              Pos  Age   Tm          G        GS         MP        FG
Player                                                               
Jarrett Allen   C   22  TOT  18.666667  6.666667  26.266667  4.333333

NB. content of the dictionary comprehension:

{'Pos': 'first',
 'Age': 'mean',
 'Tm': 'first',
 'G': 'mean',
 'GS': 'mean',
 'MP': 'mean',
 'FG': 'mean'}
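
An equivalent way to build the same mapping, if one prefers to be explicit about which columns are numeric, is sketched below (same df, using select_dtypes; the variable names are just for illustration):

# numeric columns get 'mean', everything else (except the grouping key) gets 'first'
numeric_cols = df.select_dtypes('number').columns
agg_map = {c: 'mean' if c in numeric_cols else 'first' for c in df.columns.drop('Player')}
df.groupby('Player').agg(agg_map)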
Shaylynn answered 15/8, 2021 at 15:31 Comment(3)
Excuse me, can I ask for an explanation of this code, please? What's happening inside agg? Or can you send me links about this subject? – Footling
agg aggregates the values with functions. Here I used a dictionary to tell it which aggregate to compute per column (mean or first). I added a link to the docs in the answer. The dictionary is built from the column types: if a column is numeric (int/float) we aggregate with 'mean', otherwise we take the first value. Let me know if you want more details. – Shaylynn
If one wants to use the same method but specify explicitly what to do with each column, one can do the following: df = df.groupby('Player').agg({'Pos': 'first', 'Age': 'mean', 'Tm': 'first', 'G': 'mean', 'GS': 'mean', 'MP': 'mean', 'FG': 'mean'}). – Haunt
import pandas as pd

# toy example: rows sharing the same (name, age) are averaged into one row
x = [['a', 12, 5], ['a', 12, 7], ['b', 15, 10], ['b', 15, 12], ['c', 20, 1]]
df = pd.DataFrame(x, columns=['name', 'age', 'score'])
print(df)
print('-----------')

df2 = df.groupby(['name', 'age']).mean()
print(df2)

Output:

  name  age  score
0    a   12      5
1    a   12      7
2    b   15     10
3    b   15     12
4    c   20      1
-----------
          score
name age       
a    12       6
b    15      11
c    20       1
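
If one prefers name and age as ordinary columns rather than a MultiIndex on the result, a small variation of the same example (same df as above) is:

# keep the grouping keys as regular columns instead of an index
df3 = df.groupby(['name', 'age'], as_index=False).mean()
print(df3)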
Dardanelles answered 15/8, 2021 at 15:34 Comment(0)
Option 1

If one considers the dataframe df that the OP shares in the question, the following will do the work

df_new = df.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())

[Out]:
              Pos   Age   Tm          G        GS         MP        FG
Player                                                                
Jarrett Allen   C  22.0  TOT  18.666667  6.666667  26.266667  4.333333

This one uses:

  • pandas.DataFrame.groupby
  • pandas.DataFrame.agg
  • pandas.api.types.is_string_dtype
  • pandas.Series.mean

Let's test it with a new dataframe, df2, with more elements in the Player column.

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'Player': ['John Collins', 'John Collins', 'John Collins', 'Trae Young', 'Trae Young', 'Clint Capela', 'Jarrett Allen', 'Jarrett Allen', 'Jarrett Allen'],
                    'Pos': ['PF', 'PF', 'PF', 'PG', 'PG', 'C', 'C', 'C', 'C'],
                    'Age': np.random.randint(0, 100, 9),
                    'Tm': ['ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'TOT', 'BRK', 'CLE'],
                    'G': np.random.randint(0, 100, 9),
                    'GS': np.random.randint(0, 100, 9),
                    'MP': np.random.uniform(0, 100, 9),
                    'FG': np.random.uniform(0, 100, 9)})

[Out]:
          Player Pos  Age   Tm   G  GS         MP         FG
0   John Collins  PF   71  ATL  75  39  16.123225  77.949756
1   John Collins  PF   60  ATL  49  49  30.308092  24.788401
2   John Collins  PF   52  ATL  33  92  11.087317  58.488575
3     Trae Young  PG   72  ATL  20  91  62.862313  60.169282
4     Trae Young  PG   85  ATL  61  77  30.248551  85.169038
5   Clint Capela   C   73  ATL   5  67  45.817690  21.966777
6  Jarrett Allen   C   23  TOT  60  51  93.076624  34.160823
7  Jarrett Allen   C   12  BRK   2  77  74.318568  78.755869
8  Jarrett Allen   C   44  CLE  82  81   7.375631  40.930844

If one tests the operation on df2, one will get something like the following (the exact numbers vary from run to run because df2 is generated randomly)

df_new2 = df2.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())

[Out]:
              Pos        Age   Tm          G         GS         MP         FG
Player                                                                       
Clint Capela    C  95.000000  ATL  30.000000  98.000000  46.476398  17.987104
Jarrett Allen   C  60.000000  TOT  48.666667  19.333333  70.050540  33.572896
John Collins   PF  74.333333  ATL  50.333333  52.666667  78.181457  78.152235
Trae Young     PG  57.500000  ATL  44.500000  47.500000  46.602543  53.835455

Option 2

Depending on the desired output, and assuming that one only wants to group by player (independently of Age or Tm), a simpler solution is to just group by and call .mean(), as follows

df_new3 = df.groupby('Player').mean()

[Out]:

                Age          G        GS         MP        FG
Player                                                       
Jarrett Allen  22.0  18.666667  6.666667  26.266667  4.333333

Notes:

  • The output of this previous operation won't display non-numerical columns (apart from the Player name).
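  • On recent pandas versions (2.0 and later), .mean() raises a TypeError when non-numeric columns are present instead of silently dropping them, so the equivalent call would need numeric_only, roughly:

# restrict the aggregation to numeric columns explicitly (pandas >= 2.0)
df_new3 = df.groupby('Player').mean(numeric_only=True)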
Blockhead answered 26/9, 2022 at 8:46 Comment(0)
