Pick last valid data dates from pair columns in a large dataset
I have a dataframe like the one below, where the first column contains dates and the other columns contain data for those dates:

         date  k1-v1  k1-v2  k2-v1  k2-v2  k1k3-v1  k1k3-v2  k4-v1  k4-v2
0  2021-01-05    2.0    7.0    NaN    NaN      NaN      NaN    9.0    6.0
1  2021-01-31    NaN    NaN    8.0    5.0      NaN      NaN    7.0    6.0
2  2021-02-15    9.0    5.0    NaN    3.0      4.0      NaN    NaN    NaN
3  2021-02-28    NaN    9.0    0.0    1.0      NaN      NaN    8.0    8.0
4  2021-03-20    7.0    NaN    NaN    NaN      NaN      NaN    NaN    NaN
5  2021-03-31    NaN    NaN    8.0    NaN      3.0      NaN    8.0    0.0
6  2021-04-10    NaN    NaN    7.0    6.0      NaN      NaN    NaN    9.0
7  2021-04-30    NaN    6.0    NaN    NaN      NaN      NaN    1.0    NaN
8  2021-05-14    8.0    NaN    3.0    3.0      4.0      NaN    NaN    NaN
9  2021-05-31    NaN    NaN    2.0    1.0      NaN      NaN    NaN    NaN

The columns always come in pairs: (k1-v1, k1-v2); (k2-v1, k2-v2); (k1k3-v1, k1k3-v2) and so on, N pairs in total. But the paired columns are not always adjacent or in that order, so k1-v1 will not necessarily be followed by k1-v2, though a k1-v2 column will exist somewhere in the dataframe. For simplicity, I showed them side by side.
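
For reproducibility, the sample dataframe above can be rebuilt like this (columns listed in pair order purely for readability):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2021-01-05', '2021-01-31', '2021-02-15', '2021-02-28', '2021-03-20',
             '2021-03-31', '2021-04-10', '2021-04-30', '2021-05-14', '2021-05-31'],
    'k1-v1':   [2, np.nan, 9, np.nan, 7, np.nan, np.nan, np.nan, 8, np.nan],
    'k1-v2':   [7, np.nan, 5, 9, np.nan, np.nan, np.nan, 6, np.nan, np.nan],
    'k2-v1':   [np.nan, 8, np.nan, 0, np.nan, 8, 7, np.nan, 3, 2],
    'k2-v2':   [np.nan, 5, 3, 1, np.nan, np.nan, 6, np.nan, 3, 1],
    'k1k3-v1': [np.nan, np.nan, 4, np.nan, np.nan, 3, np.nan, np.nan, 4, np.nan],
    'k1k3-v2': [np.nan] * 10,
    'k4-v1':   [9, 7, np.nan, 8, np.nan, 8, np.nan, 1, np.nan, np.nan],
    'k4-v2':   [6, 6, np.nan, 8, np.nan, 0, 9, np.nan, np.nan, np.nan],
})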

I need to find the last valid data date in each pair of columns, and summarize it as below:

   keys     v1-last     v2-last
0    k1  2021-05-14  2021-04-30
1    k2  2021-05-31  2021-05-31
2  k1k3  2021-05-14         NaN
3    k4  2021-04-30  2021-04-10

So for (k1-v1) the last valid data is 8.0 on 2021-05-14, and for (k1-v2) it's 6.0 on 2021-04-30. The v1-last and v2-last columns in the above dataframe are then filled accordingly for k1, and similarly for the others.

Currently I am doing it as below, which is not very efficient on larger datasets:

df.set_index('date', inplace=True)
unique_cols = set([col[0] for col in df.columns.str.split('-')])
summarized_data = []
for col in unique_cols:
    pair_df = df.loc[:,[col+'-v1',col+'-v2']].dropna(how='all')
    v1_last_valid = pair_df.iloc[:,0].last_valid_index()
    v2_last_valid = pair_df.iloc[:,1].last_valid_index()
    summarized_data.append([col, v1_last_valid, v2_last_valid])

summarized_df = pd.DataFrame(summarized_data, columns=['keys','v1-last','v2-last'])

This works for now and gives me the expected result, but it takes a considerable amount of time on big datasets. Can the loop be avoided and this be done in a more efficient manner?

Hypethral answered 5/6, 2021 at 10:25

Solution

s = df.set_index('date').stack()                                 # long format; stack drops the NaNs
s = s.reset_index().drop_duplicates('level_1', keep='last')      # rows are in date order, so 'last' = latest date per column
s[['keys', 'val']] = s['level_1'].str.split('-', expand=True)    # 'k1-v1' -> key 'k1', val 'v1'
s = s.pivot(index='keys', columns='val', values='date').add_suffix('-last')
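
The result has keys as the index. If keys should be an ordinary column, as in the expected output, one extra step (my addition, not part of the original answer) finishes the job:

summarized_df = s.reset_index()    # 'keys' index -> regular column
summarized_df.columns.name = None  # drop the leftover 'val' axis label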

Explanations

Set the index of the dataframe to date and stack to reshape:

date               
2021-01-05  k1-v1      2.0
            k1-v2      7.0
            k4-v1      9.0
            k4-v2      6.0
2021-01-31  k2-v1      8.0
            k2-v2      5.0
            k4-v1      7.0
            k4-v2      6.0
...
2021-05-31  k2-v1      2.0
            k2-v2      1.0
dtype: float64

Reset the index and drop rows with duplicate values in level_1, keeping the last (latest) occurrence:

          date  level_1    0
24  2021-04-10    k4-v2  9.0
25  2021-04-30    k1-v2  6.0
26  2021-04-30    k4-v1  1.0
27  2021-05-14    k1-v1  8.0
30  2021-05-14  k1k3-v1  4.0
31  2021-05-31    k2-v1  2.0
32  2021-05-31    k2-v2  1.0

Split the strings in the level_1 column to create two additional columns, keys and val:

          date  level_1    0  keys val
24  2021-04-10    k4-v2  9.0    k4  v2
25  2021-04-30    k1-v2  6.0    k1  v2
26  2021-04-30    k4-v1  1.0    k4  v1
27  2021-05-14    k1-v1  8.0    k1  v1
30  2021-05-14  k1k3-v1  4.0  k1k3  v1
31  2021-05-31    k2-v1  2.0    k2  v1
32  2021-05-31    k2-v2  1.0    k2  v2

Pivot the dataframe to reshape it and add the suffix -last to the column names:

val      v1-last     v2-last
keys                        
k1    2021-05-14  2021-04-30
k1k3  2021-05-14         NaN
k2    2021-05-31  2021-05-31
k4    2021-04-30  2021-04-10
Frisch answered 5/6, 2021 at 11:10 Comment(2)
Thank you. This one worked and gave me the expected results almost instantaneously!Hypethral
@Hypethral Happy coding!Frisch

We can reverse the column names and use pd.wide_to_long, where the stubnames will be the v_j's, the identifier will be date, and the k's become keys in the result. Then we can group by keys and aggregate with DataFrame.last_valid_index:

# reverse the column names
df.columns = df.columns.str.replace(r"(\w+)-(\w+)", r"\2-\1", regex=True)

# wide to long (and then make `keys` a column with reset_index)
long = pd.wide_to_long(df, stubnames=["v1", "v2"], i="date", j="keys",
                       sep="-", suffix=r"\w+").reset_index("keys")

# get the last valid dates & add a suffix
result = (long.groupby("keys")
              .agg(pd.DataFrame.last_valid_index)
              .add_suffix("-last"))

to get

>>> result

         v1-last     v2-last
keys
k1    2021-05-14  2021-04-30
k1k3  2021-05-14        None
k2    2021-05-31  2021-05-31
k4    2021-04-30  2021-04-10


To make the stubnames more generic (covering all the v_j's automatically):

stubnames = df.columns.str.extract(r"^(\w+)-", expand=False).dropna().unique()
# Index(["v1", "v2"], dtype="object")
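
These can then be passed straight into the same wide_to_long call (a small usage sketch; it assumes the column-reversal rename above has already been applied):

long = pd.wide_to_long(df, stubnames=list(stubnames), i="date", j="keys",
                       sep="-", suffix=r"\w+").reset_index("keys")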
Alive answered 5/6, 2021 at 10:56 Comment(2)
Thanks. I tried this on my original dataframe, which currently has 706 pair columns but only 50 rows. The last result step took around 15-20 secs to compute; my original solution takes ~25 secs. Is it due to aggregating with pd.DataFrame.last_valid_index?Hypethral
@Hypethral Yes, it is probably that. But you found a better, faster answer, so that's the solution :)Acidosis

Rename the columns and then use wide_to_long to restructure the dataframe. Stack to remove the NaNs, then use groupby and agg to extract the last date.

df2 = (
    pd.wide_to_long(
        # 'k1-v1' -> 'v1k1', so 'v1'/'v2' act as stubnames with the key as suffix
        df.rename(columns=lambda x: ''.join(x.split('-')[::-1])),
        stubnames=['v2', 'v1'],
        i='date',
        j='keys',
        suffix='.*'
    ).stack()               # drops NaNs; index becomes (date, keys, v-name)
    .reset_index(0)         # move date back into a column
    .groupby(level=[0, 1])  # group by (keys, v-name)
    .agg({'date': 'last'})  # rows are in date order, so 'last' = last valid date
    .unstack(-1)
).add_suffix('-last')

df2.columns = df2.columns.droplevel()

OUTPUT:

         v2-last     v1-last
keys                        
k1    2021-04-30  2021-05-14
k1k3         NaN  2021-05-14
k2    2021-05-31  2021-05-31
k4    2021-04-10  2021-04-30
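
Note the output has v2-last before v1-last, an artifact of the stubnames order. If the column order from the question is needed, a simple reindex (a small addition of mine, not from the original answer) restores it:

df2 = df2[['v1-last', 'v2-last']]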
Schmooze answered 5/6, 2021 at 10:57 Comment(1)
Your solution is a bit complex for me to understand :) but it works anyways!Hypethral

Combining ideas from @MustafaAydin and @ShubhamSharma, we could do this:

temp = df.set_index('date')

Apply pd.Series.last_valid_index:

# you could use `agg` since it is a reducer
temp = temp.apply(pd.Series.last_valid_index)

Convert index to MultiIndex:

temp.index = temp.index.str.split("-", expand=True)

Unstack and add the suffix:

temp.unstack().add_suffix('_last')

         v1_last     v2_last
k1    2021-05-14  2021-04-30
k1k3  2021-05-14        None
k2    2021-05-31  2021-05-31
k4    2021-04-30  2021-04-10
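
If even the column-wise apply becomes a bottleneck, the reduction can be fully vectorized (a sketch of mine, not part of the original answer; it assumes the date column is sorted ascending, as in the sample, and temp2 is a hypothetical fresh copy since temp was overwritten above):

temp2 = df.set_index('date')
mask = temp2.notna()
last = mask[::-1].idxmax()   # first True scanning from the bottom = last valid date
last = last.where(mask.any())  # all-NaN columns get NaN instead of the first date
last.index = last.index.str.split("-", expand=True)
last.unstack().add_suffix('_last')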
Elsy answered 5/6, 2021 at 23:31 Comment(2)
This is much better and should be faster than rest of the answers, great idea!Frisch
Works nicely! Tested on my original df and found it only a little slower than @ShubhamSharma's solution. But this is more self-documenting for what I needed to do.Hypethral

First, define a function to get the required date for any given column name:

import numpy as np

def last_date(col):
    idx = df[col].last_valid_index()  # row label of the last non-NaN entry
    return df.loc[idx, 'date'] if idx is not None else np.nan

Then group the keys in a dictionary, and use that to build the resulting dataframe:

keys = {}
for col in df.columns[1:]:
    key, vn = col.split('-')
    keys.setdefault(key, [None]*2)[int(vn[-1])-1] = last_date(col)

pd.DataFrame.from_records([[k] + dt_list for k, dt_list in keys.items()],
                          columns=['keys', 'v1-last', 'v2-last'])

This produces the required output:

   keys     v1-last     v2-last
0    k1  2021-05-14  2021-04-30
1    k2  2021-05-31  2021-05-31
2  k1k3  2021-05-14         NaN
3    k4  2021-04-30  2021-04-10
Slop answered 5/6, 2021 at 12:15 Comment(1)
Thanks for your solution. It is comparable to my original one and takes around 25 secs to run. But it is good to know another way of doing it :)Hypethral
