sum values of columns starting with the same string in pandas dataframe
Asked Answered
P

3

13

I have a dataframe with about 100 columns that looks like this:

   Id  Economics-1  English-107  English-2  History-3  Economics-zz  Economics-2  \
0  56          1            1          0        1       0           0   
1  11          0            0          0        0       1           0   
2   6          0            0          1        0       0           1   
3  43          0            0          0        1       0           1   
4  14          0            1          0        0       1           0   

   Histo      Economics-51      Literature-re         Literatureu4  
0           1            0           1                0  
1           0            0           0                1  
2           0            0           0                0  
3           0            1           1                0  
4           1            0           0                0  

My goal is to leave only global categories -- English, History, Literature -- and write the sum of the value of their components, respectively, in this dataframe. For instance, "English" would be the sum of "English-107" and "English-2":

    Id  Economics      English    History  Literature  
0  56          1            1          2        1                     
1  11          1            0          0        1                    
2   6          0            1          1        0                     
3  43          2            0          1        1                     
4  14          0            1          1        0          

For this purpose, I have tried two methods. First method:

df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]

Second method:

df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
    df['History'] = df[filter_col].sum(axes=1)
    print df['History', df[filter_col]]

However, both gives the error:

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed

My question is either: how can I debug this error or is there another solution for my problem. Notice that I have a rather large dataframe with about 100 columns and 400000 rows, so I'm looking for an optimized solution, like using loc in pandas.

Pigeonhearted answered 2/3, 2016 at 12:9 Comment(0)
C
14

I'd suggest that you do something different, which is to perform a transpose, groupby the prefix of the rows (your original columns), sum, and transpose again.

Consider the following:

df = pd.DataFrame({
        'a_a': [1, 2, 3, 4],
        'a_b': [2, 3, 4, 5],
        'b_a': [1, 2, 3, 4],
        'b_b': [2, 3, 4, 5],
    })

Now

[s.split('_')[0] for s in df.T.index.values]

is the prefix of the columns. So

>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
    a   b
0   3   3
1   5   5
2   7   7
3   9   9

does what you want.

In your case, make sure to split using the '-' character.

Cassandra answered 2/3, 2016 at 12:19 Comment(2)
groupby accepts an axis=1 argument, so something like df.groupby(df.columns.str.split("-").str[0],axis=1).sum() might work. (That Histo column makes me wonder if we'd have to use a different grouping condition, but you get the idea.)Copperplate
Adding to this great answer that one can also incorporate RegEx here in case the column names include multiple _ characters: import re then [re.split('_\d{2}', s)[0] for s in df.T.index.values]Dragrope
F
13

You can use these to create sum of columns starting with specific name,

df['Economics']= df[list(df.filter(regex='Economics'))].sum(axis=1)
Fluid answered 1/3, 2021 at 13:43 Comment(1)
This is a much more elegant solution. Thanks for contributing to this 5 year old postEndarch
O
4

Using brilliant DSM's idea:

from __future__ import print_function

import pandas as pd

categories = set(['Economics', 'English', 'Histo', 'Literature'])

def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]    

df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')

#print(df)
print(df.groupby(correct_categories(df.columns),axis=1).sum())

Output:

    Economics  English  Histo  Literature
Id
56          1        1      2           1
11          1        0      0           1
6           1        1      0           0
43          2        0      1           1
14          1        1      1           0

Here is another version, which takes care of "Histo/History" problematic..

from __future__ import print_function

import pandas as pd

#categories = set(['Economics', 'English', 'Histo', 'Literature'])

#
# mapping: common starting pattern: desired name
#
categories = {
    'Histo': 'History',
    'Economics': 'Economics',
    'English': 'English',
    'Literature': 'Literature'
}

def correct_categories(cols):
    return [categories[cat] for col in cols for cat in categories.keys() if col.startswith(cat)]

df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df.columns, len(df.columns))
#print(correct_categories(df.columns), len(correct_categories(df.columns)))
#print(df.groupby(pd.Index(correct_categories(df.columns)),axis=1).sum())

rslt = df.groupby(correct_categories(df.columns),axis=1).sum()
print(rslt)
print('History\n', rslt['History'])

Output:

    Economics  English  History  Literature
Id
56          1        1        2           1
11          1        0        0           1
6           1        1        0           0
43          2        0        1           1
14          1        1        1           0
History
 Id
56    2
11    0
6     0
43    1
14    1
Name: History, dtype: int64

PS You may want to add missing categories to categories map/dictionary

Officiary answered 2/3, 2016 at 12:45 Comment(4)
Thanks for your answer, awesome solution, but I get an issue with accessing the columns : with your methods I have always keyError, even after trying different solutions of encoding while reading CSV. Do even when I try to take df['History'] I have an error, but no error with df.get('history')Pigeonhearted
@Amanda, one of your columns had 'Histo' as a name, so i decided to use "Histo" as a common name, otherwise you would have both: Histo and History. So try df['Histo'] insteadOfficiary
@Amanda, what would you like to do with Histo column? Do you want to have both "Histo" and "History" in the result data frame?Officiary
thank you for your answer, actually the problem is not only in history, but in accessin the column name in general : #35764672 ! your answer is good, but when there a multiple dataframes that should be treated, the KeyError blocks everythingPigeonhearted

© 2022 - 2024 — McMap. All rights reserved.