How can repeated rows of data be combined into a single row in pandas?
I have a dataset that contains NBA players' average statistics per game. Some players' statistics appear in several rows because they played for different teams during the season.

For example:

      Player       Pos  Age Tm    G     GS   MP      FG
8   Jarrett Allen   C   22  TOT  28     10  26.2     4.4
9   Jarrett Allen   C   22  BRK  12     5   26.7     3.7
10  Jarrett Allen   C   22  CLE  16     5   25.9     4.9

I want to average Jarrett Allen's stats and put them into a single row. How can I do this?

Footling answered 15/8, 2021 at 15:24 Comment(4)
Did you try groupby? – Dardanelles
I'm a beginner, I don't know how to do it. Can you send me some sources about this, please? – Footling
The syntax is: df.groupby([ <list of columns to keep> ]).mean(). Added an example in an answer. – Dardanelles
Are you sure you want to average all those rows? Mathematically it doesn't make sense, since (a) the "TOT" row is already his stats averaged over all the teams he's played for during the season, and (b) you'd be taking an average of an average for MP and FG. – Wera
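
Following up on that last comment: if one would rather keep the season-total row than re-average, a minimal sketch (assuming the column names from the sample above and pandas imported as pd) would be:

import pandas as pd

# keep the 'TOT' row for players who appear multiple times,
# and the single existing row for everyone else
tot = df[df['Tm'] == 'TOT']
rest = df[~df['Player'].isin(tot['Player'])]
result = pd.concat([tot, rest])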
You can group by Player and use agg to get the mean. For the non-numeric columns, let's take the first value:

df.groupby('Player').agg({k: 'mean' if v in ('int64', 'float64') else 'first'
                          for k,v in df.dtypes[1:].items()})

output:

              Pos  Age   Tm          G        GS         MP        FG
Player                                                               
Jarrett Allen   C   22  TOT  18.666667  6.666667  26.266667  4.333333

NB. content of the dictionary comprehension:

{'Pos': 'first',
 'Age': 'mean',
 'Tm': 'first',
 'G': 'mean',
 'GS': 'mean',
 'MP': 'mean',
 'FG': 'mean'}
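
An equivalent way to build the same mapping, if one prefers to be explicit about which columns are numeric, is sketched below (same df, using select_dtypes; the variable names are just for illustration):

# numeric columns get 'mean', everything else (except the grouping key) gets 'first'
numeric_cols = df.select_dtypes('number').columns
agg_map = {c: 'mean' if c in numeric_cols else 'first' for c in df.columns.drop('Player')}
df.groupby('Player').agg(agg_map)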
Shaylynn answered 15/8, 2021 at 15:31 Comment(3)
Excuse me, can I ask for an explanation of this code, please? What's happening inside agg? Or can you send me links about this subject? – Footling
agg aggregates the values with functions. Here I used a dictionary to tell it which aggregate to compute per column (mean or first). I added a link to the docs in the answer. The dictionary is built from the column types: if a column is numeric (int/float) we aggregate with 'mean', otherwise we take the first value. Let me know if you want more details. – Shaylynn
If one wants to use the same method but specify explicitly what to do with each column, one can do the following: df = df.groupby('Player').agg({'Pos': 'first', 'Age': 'mean', 'Tm': 'first', 'G': 'mean', 'GS': 'mean', 'MP': 'mean', 'FG': 'mean'}). – Haunt
import pandas as pd

# toy example: rows sharing the same (name, age) are averaged into one row
x = [['a', 12, 5], ['a', 12, 7], ['b', 15, 10], ['b', 15, 12], ['c', 20, 1]]
df = pd.DataFrame(x, columns=['name', 'age', 'score'])
print(df)
print('-----------')

df2 = df.groupby(['name', 'age']).mean()
print(df2)

Output:

  name  age  score
0    a   12      5
1    a   12      7
2    b   15     10
3    b   15     12
4    c   20      1
-----------
          score
name age       
a    12       6
b    15      11
c    20       1
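
If one prefers name and age as ordinary columns rather than a MultiIndex on the result, a small variation of the same example (same df as above) is:

# keep the grouping keys as regular columns instead of an index
df3 = df.groupby(['name', 'age'], as_index=False).mean()
print(df3)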
Dardanelles answered 15/8, 2021 at 15:34 Comment(0)
Option 1

If one considers the dataframe df that the OP shares in the question, the following will do the work

df_new = df.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())

[Out]:
              Pos   Age   Tm          G        GS         MP        FG
Player                                                                
Jarrett Allen   C  22.0  TOT  18.666667  6.666667  26.266667  4.333333

This one uses:

  • pandas.DataFrame.groupby
  • pandas.DataFrame.agg
  • pandas.api.types.is_string_dtype
  • pandas.Series.mean

Let's test it with a new dataframe, df2, with more elements in the Player column.

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'Player': ['John Collins', 'John Collins', 'John Collins', 'Trae Young', 'Trae Young', 'Clint Capela', 'Jarrett Allen', 'Jarrett Allen', 'Jarrett Allen'],
                    'Pos': ['PF', 'PF', 'PF', 'PG', 'PG', 'C', 'C', 'C', 'C'],
                    'Age': np.random.randint(0, 100, 9),
                    'Tm': ['ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'TOT', 'BRK', 'CLE'],
                    'G': np.random.randint(0, 100, 9),
                    'GS': np.random.randint(0, 100, 9),
                    'MP': np.random.uniform(0, 100, 9),
                    'FG': np.random.uniform(0, 100, 9)})

[Out]:
          Player Pos  Age   Tm   G  GS         MP         FG
0   John Collins  PF   71  ATL  75  39  16.123225  77.949756
1   John Collins  PF   60  ATL  49  49  30.308092  24.788401
2   John Collins  PF   52  ATL  33  92  11.087317  58.488575
3     Trae Young  PG   72  ATL  20  91  62.862313  60.169282
4     Trae Young  PG   85  ATL  61  77  30.248551  85.169038
5   Clint Capela   C   73  ATL   5  67  45.817690  21.966777
6  Jarrett Allen   C   23  TOT  60  51  93.076624  34.160823
7  Jarrett Allen   C   12  BRK   2  77  74.318568  78.755869
8  Jarrett Allen   C   44  CLE  82  81   7.375631  40.930844

If one tests the operation on df2, one will get something like the following (the exact numbers vary from run to run because df2 is generated randomly)

df_new2 = df2.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())

[Out]:
              Pos        Age   Tm          G         GS         MP         FG
Player                                                                       
Clint Capela    C  95.000000  ATL  30.000000  98.000000  46.476398  17.987104
Jarrett Allen   C  60.000000  TOT  48.666667  19.333333  70.050540  33.572896
John Collins   PF  74.333333  ATL  50.333333  52.666667  78.181457  78.152235
Trae Young     PG  57.500000  ATL  44.500000  47.500000  46.602543  53.835455

Option 2

Depending on the desired output, and assuming that one only wants to group by player (independently of Age or Tm), a simpler solution is to just group by and call .mean(), as follows

df_new3 = df.groupby('Player').mean()

[Out]:

                Age          G        GS         MP        FG
Player                                                       
Jarrett Allen  22.0  18.666667  6.666667  26.266667  4.333333

Notes:

  • The output of this previous operation won't display non-numerical columns (apart from the Player name).
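  • On recent pandas versions (2.0 and later), .mean() raises a TypeError when non-numeric columns are present instead of silently dropping them, so the equivalent call would need numeric_only, roughly:

# restrict the aggregation to numeric columns explicitly (pandas >= 2.0)
df_new3 = df.groupby('Player').mean(numeric_only=True)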
Blockhead answered 26/9, 2022 at 8:46 Comment(0)
