I have a dataframe like below where first column contains dates and other columns contain data on those dates:
date k1-v1 k1-v2 k2-v1 k2-v2 k1k3-v1 k1k3-v2 k4-v1 k4-v2
0 2021-01-05 2.0 7.0 NaN NaN NaN NaN 9.0 6.0
1 2021-01-31 NaN NaN 8.0 5.0 NaN NaN 7.0 6.0
2 2021-02-15 9.0 5.0 NaN 3.0 4.0 NaN NaN NaN
3 2021-02-28 NaN 9.0 0.0 1.0 NaN NaN 8.0 8.0
4 2021-03-20 7.0 NaN NaN NaN NaN NaN NaN NaN
5 2021-03-31 NaN NaN 8.0 NaN 3.0 NaN 8.0 0.0
6 2021-04-10 NaN NaN 7.0 6.0 NaN NaN NaN 9.0
7 2021-04-30 NaN 6.0 NaN NaN NaN NaN 1.0 NaN
8 2021-05-14 8.0 NaN 3.0 3.0 4.0 NaN NaN NaN
9 2021-05-31 NaN NaN 2.0 1.0 NaN NaN NaN NaN
The columns are always in pair: (
k1-v1
,
k1-v2
)
;(
k2-v1
,
k2-v2
)
;(
k1k3-v1
,
k1k3-v2
)
and so on N pairs. But the pair columns are not always in that order. So k1-v1 will not necessarily be followed by k1-v2 only, but there will be k1-v2 column somewhere in the dataframe. For simplicity, I showed them side-by-side.
I need to find last valid data date in each pair columns, and summarize it as below:
keys v1-last v2-last
0 k1 2021-05-14 2021-04-30
1 k2 2021-05-31 2021-05-31
2 k1k3 2021-05-14 NaN
3 k4 2021-04-30 2021-04-10
So for (
k1-v1
)
last valid data is 8.0
on date 2021-05-14
, for (
k2-v2
)
its 6.0
on 2021-04-30
. The columns v1-last
and v2-last
in above dataframe are then filled accordingly for k1, and similarly for others.
Currently I am doing it as below which is not very efficient on larger datasets:
df.set_index('date', inplace=True)
unique_cols = set([col[0] for col in df.columns.str.split('-')])
summarized_data = []
for col in unique_cols:
pair_df = df.loc[:,[col+'-v1',col+'-v2']].dropna(how='all')
v1_last_valid = pair_df.iloc[:,0].last_valid_index()
v2_last_valid = pair_df.iloc[:,1].last_valid_index()
summarized_data.append([col, v1_last_valid, v2_last_valid])
summarized_df = pd.DataFrame(summarized_data, columns=['keys','v1-last','v2-last'])
This works for now and gives me expected result but takes considerable amount of time when running on big datasets. Can the loop be avoided and this be done in a different and efficient manner?