Get total of Pandas column
Asked Answered
E

6

230

I have a Pandas data frame, as shown below, with multiple columns and would like to get the total of column, MyColumn.

           X           MyColumn      Y              Z   
0          A           84         13.0           69.0   
1          B           76         77.0          127.0   
2          C           28         69.0           16.0   
3          D           28         28.0           31.0   
4          E           19         20.0           85.0   
5          F           84        193.0           70.0   

Expected Output

I'd have expected the output to be the total of this column: 319.

Or alternatively, I would like df to be edited with a new row entitled TOTAL containing the total:

           X           MyColumn      Y              Z   
0          A           84         13.0           69.0   
1          B           76         77.0          127.0   
2          C           28         69.0           16.0   
3          D           28         28.0           31.0   
4          E           19         20.0           85.0   
5          F           84        193.0           70.0   
TOTAL                  319

I have attempted to get the sum of the column using groupby and .sum():

Total = df.groupby['MyColumn'].sum()

This causes the following error:

TypeError: 'instancemethod' object has no attribute '__getitem__'
Eskew answered 22/12, 2016 at 15:29 Comment(1)
For an illustration of why pandas is not pythonic, look no further than the confusion over how to simply sum a column.Legpull
P
396

You should use sum:

Total = df['MyColumn'].sum()
print(Total)
319

Then you use loc with Series, in that case the index should be set as the same as the specific column you need to sum:

df.loc['Total'] = pd.Series(df['MyColumn'].sum(), index=['MyColumn'])
print(df)
         X  MyColumn      Y      Z
0        A      84.0   13.0   69.0
1        B      76.0   77.0  127.0
2        C      28.0   69.0   16.0
3        D      28.0   28.0   31.0
4        E      19.0   20.0   85.0
5        F      84.0  193.0   70.0
Total  NaN     319.0    NaN    NaN

because if you pass scalar, the values of all rows will be filled:

df.loc['Total'] = df['MyColumn'].sum()
print(df)
         X  MyColumn      Y      Z
0        A        84   13.0   69.0
1        B        76   77.0  127.0
2        C        28   69.0   16.0
3        D        28   28.0   31.0
4        E        19   20.0   85.0
5        F        84  193.0   70.0
Total  319       319  319.0  319.0

Two other solutions are with at, and ix see the applications below:

df.at['Total', 'MyColumn'] = df['MyColumn'].sum()
print(df)
         X  MyColumn      Y      Z
0        A      84.0   13.0   69.0
1        B      76.0   77.0  127.0
2        C      28.0   69.0   16.0
3        D      28.0   28.0   31.0
4        E      19.0   20.0   85.0
5        F      84.0  193.0   70.0
Total  NaN     319.0    NaN    NaN

df.ix['Total', 'MyColumn'] = df['MyColumn'].sum()
print(df)
         X  MyColumn      Y      Z
0        A      84.0   13.0   69.0
1        B      76.0   77.0  127.0
2        C      28.0   69.0   16.0
3        D      28.0   28.0   31.0
4        E      19.0   20.0   85.0
5        F      84.0  193.0   70.0
Total  NaN     319.0    NaN    NaN

Note: Since Pandas v0.20, ix has been deprecated. Use loc or iloc instead.

Profit answered 22/12, 2016 at 15:30 Comment(7)
That's great :) Thanks for the explanation, may I ask what .loc does in the above example?Eskew
loc is for setting with enlargement.Profit
at works for setting with enlargement too, see last edit.Profit
Thanks, Is there any preferred method?Eskew
Hmmm, docs says The .loc/.ix/[] operations can perform enlargement when setting a non-existant key for that axis., so loc or ix or []. in next section is writes at may enlarge the object in-place as above if the indexer is missing. So all methods are good, but at is fastest I think.Profit
This is great. Is it possible to put "Total" in column X by modifying df.loc['Total'] = df['MyColumn'].sum()Torray
What's the equivalent of df['col'].sum() in pyspark ?Peccant
S
42

Another option you can go with here:

df.loc["Total", "MyColumn"] = df.MyColumn.sum()

#         X  MyColumn      Y       Z
#0        A     84.0    13.0    69.0
#1        B     76.0    77.0   127.0
#2        C     28.0    69.0    16.0
#3        D     28.0    28.0    31.0
#4        E     19.0    20.0    85.0
#5        F     84.0   193.0    70.0
#Total  NaN    319.0     NaN     NaN

You can also use append() method:

df.append(pd.DataFrame(df.MyColumn.sum(), index = ["Total"], columns=["MyColumn"]))

enter image description here


Update:

In case you need to append sum for all numeric columns, you can do one of the followings:

Use append to do this in a functional manner (doesn't change the original data frame):

# select numeric columns and calculate the sums
sums = df.select_dtypes(pd.np.number).sum().rename('total')

# append sums to the data frame
df.append(sums)
#         X  MyColumn      Y      Z
#0        A      84.0   13.0   69.0
#1        B      76.0   77.0  127.0
#2        C      28.0   69.0   16.0
#3        D      28.0   28.0   31.0
#4        E      19.0   20.0   85.0
#5        F      84.0  193.0   70.0
#total  NaN     319.0  400.0  398.0

Use loc to mutate data frame in place:

df.loc['total'] = df.select_dtypes(pd.np.number).sum()
df
#         X  MyColumn      Y      Z
#0        A      84.0   13.0   69.0
#1        B      76.0   77.0  127.0
#2        C      28.0   69.0   16.0
#3        D      28.0   28.0   31.0
#4        E      19.0   20.0   85.0
#5        F      84.0  193.0   70.0
#total  NaN     638.0  800.0  796.0
Saylor answered 22/12, 2016 at 15:39 Comment(4)
How about the sum of all columns?Titus
1.st exampe: FutureWarning: The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly insteadCovell
How to plot this as a % on bar chart?Vip
this solution works well for me thanks. How do I select certain columns to compute totals ?Lothair
S
12

Similar to getting the length of a dataframe, len(df), the following worked for pandas and blaze:

Total = sum(df['MyColumn'])

or alternatively

Total = sum(df.MyColumn)
print Total
Sculpsit answered 12/10, 2017 at 15:2 Comment(0)
U
5

There are two ways to sum of a column

dataset = pd.read_csv("data.csv")

1: sum(dataset.Column_name)

2: dataset['Column_Name'].sum()

If there is any issue in this the please correct me..

Unbated answered 20/1, 2020 at 6:51 Comment(0)
E
2

As other option, you can do something like below

Group   Valuation   amount
    0   BKB Tube    156
    1   BKB Tube    143
    2   BKB Tube    67
    3   BAC Tube    176
    4   BAC Tube    39
    5   JDK Tube    75
    6   JDK Tube    35
    7   JDK Tube    155
    8   ETH Tube    38
    9   ETH Tube    56

Below script, you can use for above data

import pandas as pd    
data = pd.read_csv("daata1.csv")
bytreatment = data.groupby('Group')
bytreatment['amount'].sum()
Emotional answered 27/6, 2018 at 13:57 Comment(0)
A
1

Total of multiple columns

You can select the columns you want the total of and call sum() on them. To add a new row, use loc[].

df.loc['Total'] = df[['Y', 'Z']].sum()

Using the example in the OP, it makes the following transformation (note the new added row):

result


Python's built-in sum vs pandas's sum method

For a single column, we can sum in two ways: use Python's built-in sum() function and use pandas' sum() method. It should be noted that pandas' method is optimized and much faster than Python's sum(). For example, to sum values in a column with 1mil rows, pandas' sum method is ~160 times faster than Python's built-in sum() function.

df = pd.DataFrame({'Y': range(1000000)})

%timeit a = df['Y'].sum()
# 1 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit b = sum(df['Y'])
# 160 ms ± 6.1 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

So use pandas' methods wherever possible.

Assumed answered 6/2 at 18:35 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.