Pandas dataframes with multi-level columns:rename a specific level of column so that it's same as another level
Asked Answered
E

2

1

Sorry for the seemingly confusing title. I was reading Excel data using Pandas. However, the original Excel data has multiple rows for header and some of the cells are merged. It sort of looks like this:

excel

It shows in my Jupyter Notebook like this

have1

My plan is to just the 2nd level as my column names and drop the level0. But the original data has about 15 columns that shows as "Unnamed...", I wonder if I can rename those before dropping the level0 column names.

The desirable output looks like:

output

I may do this repeatedly so I didn't save it as CSV first and then read it in Pandas. Now I have spent longer than I care to admit on fixing the column names. I wonder if there is a way to do this with a function instead of renaming every individual column of interest.

Thanks.

Entoderm answered 8/4, 2020 at 22:56 Comment(0)
D
2

I think simpliest here is use list comprehension - get values of MultiIndex only if no Unnamed text:

df.columns = [first if 'Unnamed' in second else second for first, second in df.columns]
print (df)
   Purchase/sell_time  Quantity  Price Side
0 2020-04-09 15:22:00        20     43    B
1 2020-04-09 16:22:00        30     56    S

But if more levels in real data is possible some columns should be duplicated, so cannot select them (if select by duplicated column get all columns, not only one, e.g. by df['dup_column_name']).

You can test it:

print (df.columns[df.columns.duplicated(keep=False)])

Then I suggest join all unnamed levels for prevent it:

df.columns = ['_'.join(y for y in x if 'Unnamed' not in y) for x in df.columns]
print (df)
   Purchase/sell_time  Purchase/sell_time_Quantity  Purchase/sell_time_Price  \
0 2020-04-09 15:22:00                           20                        43   
1 2020-04-09 16:22:00                           30                        56   

  Side  
0    B  
1    S  
Disused answered 9/4, 2020 at 6:21 Comment(1)
Hey, sorry for the late response. Really appreciate your reply. This is so detailed and more than I expected. Thanks for the additional advice to join all unnamed levels to prevent duplicate columns. It did become a problem later on. Thanks again.Entoderm
F
2

Your columns are multiindex, and index are immutable, meaning you can't change only a part of them. This is why I suggest to retrieve both levels of the multiindex, then to create an array with your desired columns and to replace the DataFrame column with this, as follows:

# First I reproduce your dataframe
df1 = pd.DataFrame({("Purchase/sell_time","Unnamed:"):  pd.date_range("2020-04-09 15:22:00", 
                                                        freq="H", periods = 2),
                    ("Purchase/sell_time", "Quantity"): [20,30],
                    ("Purchase/sell_time", "Price"): [43, 56],
                    ("Side", "Unnamed:") : ["B", "S"]})
df1 = df1.sort_index()

It looks like this:

 Purchase/sell_time                    Side
             Unnamed: Quantity Price Unnamed:
0 2020-04-09 15:22:00       20    43        B
1 2020-04-09 16:22:00       30    56        S

The column is a multiindex as you can see:

MultiIndex([('Purchase/sell_time', 'Unnamed:'),
            ('Purchase/sell_time', 'Quantity'),
            ('Purchase/sell_time',    'Price'),
            (              'Side', 'Unnamed:')],
           )
# I retrieve the first and second level of the multiindex then create an array conditionally 
# on the second level not starting with "Unnamed" 
first_header = df1.columns.get_level_values(0)
second_header = df1.columns.get_level_values(1)
merge_header = np.where(second_header.str.startswith("Unnamed:"),
                        first_header, second_header)
df1.columns = merge_header

Here is the result:

 Purchase/sell_time  Quantity  Price Side
0 2020-04-09 15:22:00        20     43    B
1 2020-04-09 16:22:00        30     56    S

Hope it helps

Florencia answered 9/4, 2020 at 6:18 Comment(2)
Hey Raphaele, thanks a lot for your help. I didn't think of the np.where method. And I like the way you used get_level_values(0) and get_level_values(1) to clearly show the values of each column level. Very clean and readable. Thank you for not only showing why your method works, but also presenting it in a way that's easy for me to understand. Really appreciate it.Entoderm
Just a simple question to add regarding your code: what is the purpose of df1 = df1.sort_index()? I assume you are trying to sort the order of columns here, in that case, don't we need to add axis = 1? Thanks.Entoderm
D
2

I think simpliest here is use list comprehension - get values of MultiIndex only if no Unnamed text:

df.columns = [first if 'Unnamed' in second else second for first, second in df.columns]
print (df)
   Purchase/sell_time  Quantity  Price Side
0 2020-04-09 15:22:00        20     43    B
1 2020-04-09 16:22:00        30     56    S

But if more levels in real data is possible some columns should be duplicated, so cannot select them (if select by duplicated column get all columns, not only one, e.g. by df['dup_column_name']).

You can test it:

print (df.columns[df.columns.duplicated(keep=False)])

Then I suggest join all unnamed levels for prevent it:

df.columns = ['_'.join(y for y in x if 'Unnamed' not in y) for x in df.columns]
print (df)
   Purchase/sell_time  Purchase/sell_time_Quantity  Purchase/sell_time_Price  \
0 2020-04-09 15:22:00                           20                        43   
1 2020-04-09 16:22:00                           30                        56   

  Side  
0    B  
1    S  
Disused answered 9/4, 2020 at 6:21 Comment(1)
Hey, sorry for the late response. Really appreciate your reply. This is so detailed and more than I expected. Thanks for the additional advice to join all unnamed levels to prevent duplicate columns. It did become a problem later on. Thanks again.Entoderm

© 2022 - 2024 — McMap. All rights reserved.