I have a dataframe with about 100 columns that looks like this:
   Id  Economics-1  English-107  English-2  History-3  Economics-zz  Economics-2  \
0  56            1            1          0          1             0            0
1  11            0            0          0          0             1            0
2   6            0            0          1          0             0            1
3  43            0            0          0          1             0            1
4  14            0            1          0          0             1            0

   Histo  Economics-51  Literature-re  Literatureu4
0      1             0              1             0
1      0             0              0             1
2      0             0              0             0
3      0             1              1             0
4      1             0              0             0
My goal is to keep only the global categories (Economics, English, History, Literature) and store, for each one, the sum of the values of its component columns. For instance, "English" would be the sum of "English-107" and "English-2":
   Id  Economics  English  History  Literature
0  56          1        1        2           1
1  11          1        0        0           1
2   6          0        1        1           0
3  43          2        0        1           1
4  14          0        1        1           0
For this purpose, I have tried two methods. First method:
df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]
Second method:
df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
    df['History'] = df[filter_col].sum(axes=1)
print df['History', df[filter_col]]
However, both give the error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
My question is either: how can I debug this error, or is there another solution to my problem? Note that the dataframe is rather large, with about 100 columns and 400000 rows, so I'm looking for an optimized solution, such as using loc in pandas.
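A possible reading of the error, offered as a sketch rather than a confirmed diagnosis: in the second method, the expression df['History', df[filter_col]] uses a tuple that contains a DataFrame as a column key, and pandas needs to hash a key like that. The example below uses a small made-up frame instead of the real file, assumes axis=1 was intended where the question has axes=1, and matches on the prefix "Histo" so that the truncated Histo column is included.

```python
import pandas as pd

# Small made-up frame standing in for the real file (an assumption).
df = pd.DataFrame({
    "Id": [56, 11, 6],
    "History-3": [1, 0, 0],
    "Histo": [1, 0, 0],
    "English-2": [0, 0, 1],
})

# Columns belonging to the category; "Histo" also matches the truncated name.
filter_col = [col for col in df.columns if col.startswith("Histo")]

# Row-wise sum. The keyword is axis, and no loop or initialization is needed.
df["History"] = df[filter_col].sum(axis=1)

# Select several columns with a flat list of labels,
# not with a tuple that contains a DataFrame.
result = df[["History"] + filter_col]
print(result)

# A DataFrame cannot be hashed, which is why it cannot be part of a key.
try:
    hash(df[filter_col])
    unhashable = False
except TypeError:
    unhashable = True
print(unhashable)
```

The sum is computed once for all rows, so the for loop in the second method only repeats the same assignment.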
groupby accepts an axis=1 argument, so something like df.groupby(df.columns.str.split("-").str[0], axis=1).sum() might work. (That Histo column makes me wonder if we'd have to use a different grouping condition, but you get the idea.) – Copperplate
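A minimal sketch of the grouping idea from this comment, under a few assumptions: the list of category names is known in advance (CATEGORIES is a hypothetical name), truncated names such as Histo should fall under the category they are a prefix of, and Id is kept out of the sums by moving it to the index. Because recent pandas versions deprecate groupby(axis=1), the sketch transposes and groups rows instead. The frame is rebuilt by hand from the five sample rows in the question, and the totals it prints follow from those values, which do not agree with the expected table in every cell.

```python
import pandas as pd

# The five sample rows from the question, rebuilt by hand.
df = pd.DataFrame({
    "Id": [56, 11, 6, 43, 14],
    "Economics-1": [1, 0, 0, 0, 0],
    "English-107": [1, 0, 0, 0, 1],
    "English-2": [0, 0, 1, 0, 0],
    "History-3": [1, 0, 0, 1, 0],
    "Economics-zz": [0, 1, 0, 0, 1],
    "Economics-2": [0, 0, 1, 1, 0],
    "Histo": [1, 0, 0, 0, 1],
    "Economics-51": [0, 0, 0, 1, 0],
    "Literature-re": [1, 0, 0, 1, 0],
    "Literatureu4": [0, 1, 0, 0, 0],
})

# Assumed list of global categories (hypothetical name, not from the question).
CATEGORIES = ["Economics", "English", "History", "Literature"]


def category_of(column):
    """Map a column name to its category, tolerating truncated names like 'Histo'."""
    for cat in CATEGORIES:
        if column.startswith(cat) or cat.startswith(column):
            return cat
    return column  # unknown columns keep their own name


# Keep Id out of the sums by making it the index.
values = df.set_index("Id")

# After transposing, the original column names are the row labels, so the
# function passed to groupby is called on each of them.
totals = values.T.groupby(category_of).sum().T.reset_index()
print(totals)
```

For a frame with 400000 rows the transpose copies the data twice. If that turns out to be a concern, summing each category's columns with sum(axis=1) in a short loop over CATEGORIES is an alternative that avoids it.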