Search for matches in a row and column name

Asked 3/9, 2024 at 14:17 Answered 3/9, 2024 at 14:50

I have a dataframe with 'execution_date' columns containing dates. So there are also >150 columns which are named as dates '01.03.2023', '02.03.2023' etc. The dataframe itself is empty. The data is populated only in the 'execution_date' column, and there are just column names. Next, I need to put numbers 0 or 1 in the empty columns. If the date in the 'execution_date' column coincides with the name of another column, then I need to put 1, otherwise 0.

execution_date	01.03.2023	02.03.2023	03.03.2023	04.03.2023	...
01.03.2023	NaN	NaN	NaN	NaN	...
01.03.2023	NaN	NaN	NaN	NaN	...
02.03.2023	NaN	NaN	NaN	NaN	...
...	...	...	...	...	...
Cell 3	NaN	NaN	NaN	NaN	...
Cell 3	NaN	NaN	NaN	NaN	...

That is, the output dataframe should be like this

execution_date	01.03.2023	02.03.2023	03.03.2023	04.03.2023	...
01.03.2023	1	0	0	0	...
01.03.2023	1	0	0	0	...
02.03.2023	0	1	0	0	...
...	...	...	...	...	...
03.03.2023	0	0	1	0	...
04.03.2023	0	0	0	1	...

I have written the following code, but it is running very slow. Couse about 600k rows in excel and >150 columns

import pandas as pd
from tqdm import tqdm

df = pd.read_excel(test.xlsx)
column_names = [col for col in df.columns if (col != 'execution_date')]
for index, row in tqdm(df.iterrows()):
    for j in column_names:
        if row['execution_date'] == j:
            df.iloc[index, df.columns.get_loc(j)] = 1
        else:
            df.iloc[index, df.columns.get_loc(j)] = 0

Could you please tell me, what is the best way to speed up the process of such an algorithm?

Blinding answered 3/9, 2024 at 14:17 Comment(0)

You can convert the columns and the date to numpy and compare with broadcasting like this:

df.iloc[:, 1:] = (df['execution_date '].to_numpy()[:,None] == df.columns[1:].to_numpy()[None, :]).astype(int)

Output:

  execution_date  01.03.2023  02.03.2023  03.03.2023  04.03.2023
0     01.03.2023         1.0         0.0         0.0         0.0
1     01.03.2023         1.0         0.0         0.0         0.0
2     02.03.2023         0.0         1.0         0.0         0.0
3     03.03.2023         0.0         0.0         1.0         0.0
4     04.03.2023         0.0         0.0         0.0         1.0

Hexane answered 3/9, 2024 at 14:30 Comment(1)

Broadcasting wizardary! nice... +1 long time no see. – Peristalsis 3/9, 2024 at 14:33

IIUC, you can try this using unpacking and assign column headers to the original dataframe with a pd.Series attribute name for the column headers to check row against the contents of 'execution_date':

df.assign(**df.iloc[:, 1:].apply(lambda x: df['execution_date'] == x.name).astype(int))

Output:

  execution_date  01.03.2023  02.03.2023  03.03.2023  04.03.2023
0     01.03.2023           1           0           0           0
1     01.03.2023           1           0           0           0
2     02.03.2023           0           1           0           0
4     03.03.2023           0           0           1           0
5     04.03.2023           0           0           0           1

Or you can use join:

df[['execution_date']].join(df.iloc[:, 1:].apply(lambda x: df['execution_date'] == x.name).astype(int))

Output:

  execution_date  01.03.2023  02.03.2023  03.03.2023  04.03.2023
0     01.03.2023           1           0           0           0
1     01.03.2023           1           0           0           0
2     02.03.2023           0           1           0           0
4     03.03.2023           0           0           1           0
5     04.03.2023           0           0           0           1

Peristalsis answered 3/9, 2024 at 14:25 Comment(0)

Instead of using nested loops, use a vectorized operation with apply to compare each value in the execution_date column against the column names and create a boolean mask where dates in 'execution_date' match the column names :

import pandas as pd
import numpy as np

data = {
    'execution_date': ['01.03.2023', '01.03.2023', '02.03.2023', '03.03.2023', '04.03.2023'],
    '01.03.2023': [np.nan, np.nan, np.nan, np.nan, np.nan],
    '02.03.2023': [np.nan, np.nan, np.nan, np.nan, np.nan],
    '03.03.2023': [np.nan, np.nan, np.nan, np.nan, np.nan],
    '04.03.2023': [np.nan, np.nan, np.nan, np.nan, np.nan]
}

df = pd.DataFrame(data)

column_names = [col for col in df.columns if col != 'execution_date']

mask = df['execution_date'].apply(lambda x: pd.Series(x == np.array(column_names), index=column_names))

df[column_names] = mask.fillna(0).astype(int)

print("\nProcessed DataFrame:")
print(df)

You can see the result here : https://www.online-python.com/PtVenhdo6r

Snavely answered 3/9, 2024 at 14:27 Comment(0)

Try this loop:

import pandas as pd

df = pd.read_excel(test.xlsx)
for j in df.iloc[:,1:].columns:
    df[j] = df.apply(lambda row: 1 if j == row["execution_date"] else 0, axis=1)

Airlee answered 3/9, 2024 at 14:50 Comment(0)

Recommended topics

Hot tags