Search for matches in a row and column name
Asked Answered
B

4

5

I have a dataframe with 'execution_date' columns containing dates. So there are also >150 columns which are named as dates '01.03.2023', '02.03.2023' etc. The dataframe itself is empty. The data is populated only in the 'execution_date' column, and there are just column names. Next, I need to put numbers 0 or 1 in the empty columns. If the date in the 'execution_date' column coincides with the name of another column, then I need to put 1, otherwise 0.

execution_date 01.03.2023 02.03.2023 03.03.2023 04.03.2023 ...
01.03.2023 NaN NaN NaN NaN ...
01.03.2023 NaN NaN NaN NaN ...
02.03.2023 NaN NaN NaN NaN ...
... ... ... ... ... ...
Cell 3 NaN NaN NaN NaN ...
Cell 3 NaN NaN NaN NaN ...

That is, the output dataframe should be like this

execution_date 01.03.2023 02.03.2023 03.03.2023 04.03.2023 ...
01.03.2023 1 0 0 0 ...
01.03.2023 1 0 0 0 ...
02.03.2023 0 1 0 0 ...
... ... ... ... ... ...
03.03.2023 0 0 1 0 ...
04.03.2023 0 0 0 1 ...

I have written the following code, but it is running very slow. Couse about 600k rows in excel and >150 columns

import pandas as pd
from tqdm import tqdm

df = pd.read_excel(test.xlsx)
column_names = [col for col in df.columns if (col != 'execution_date')]
for index, row in tqdm(df.iterrows()):
    for j in column_names:
        if row['execution_date'] == j:
            df.iloc[index, df.columns.get_loc(j)] = 1
        else:
            df.iloc[index, df.columns.get_loc(j)] = 0

Could you please tell me, what is the best way to speed up the process of such an algorithm?

Blinding answered 3/9, 2024 at 14:17 Comment(0)
H
5

You can convert the columns and the date to numpy and compare with broadcasting like this:

df.iloc[:, 1:] = (df['execution_date '].to_numpy()[:,None] == df.columns[1:].to_numpy()[None, :]).astype(int)

Output:

  execution_date  01.03.2023  02.03.2023  03.03.2023  04.03.2023
0     01.03.2023         1.0         0.0         0.0         0.0
1     01.03.2023         1.0         0.0         0.0         0.0
2     02.03.2023         0.0         1.0         0.0         0.0
3     03.03.2023         0.0         0.0         1.0         0.0
4     04.03.2023         0.0         0.0         0.0         1.0
Hexane answered 3/9, 2024 at 14:30 Comment(1)
Broadcasting wizardary! nice... +1 long time no see.Peristalsis
P
3

IIUC, you can try this using unpacking and assign column headers to the original dataframe with a pd.Series attribute name for the column headers to check row against the contents of 'execution_date':

df.assign(**df.iloc[:, 1:].apply(lambda x: df['execution_date'] == x.name).astype(int))

Output:

  execution_date  01.03.2023  02.03.2023  03.03.2023  04.03.2023
0     01.03.2023           1           0           0           0
1     01.03.2023           1           0           0           0
2     02.03.2023           0           1           0           0
4     03.03.2023           0           0           1           0
5     04.03.2023           0           0           0           1

Or you can use join:

df[['execution_date']].join(df.iloc[:, 1:].apply(lambda x: df['execution_date'] == x.name).astype(int))

Output:

  execution_date  01.03.2023  02.03.2023  03.03.2023  04.03.2023
0     01.03.2023           1           0           0           0
1     01.03.2023           1           0           0           0
2     02.03.2023           0           1           0           0
4     03.03.2023           0           0           1           0
5     04.03.2023           0           0           0           1
Peristalsis answered 3/9, 2024 at 14:25 Comment(0)
S
2

Instead of using nested loops, use a vectorized operation with apply to compare each value in the execution_date column against the column names and create a boolean mask where dates in 'execution_date' match the column names :

import pandas as pd
import numpy as np

data = {
    'execution_date': ['01.03.2023', '01.03.2023', '02.03.2023', '03.03.2023', '04.03.2023'],
    '01.03.2023': [np.nan, np.nan, np.nan, np.nan, np.nan],
    '02.03.2023': [np.nan, np.nan, np.nan, np.nan, np.nan],
    '03.03.2023': [np.nan, np.nan, np.nan, np.nan, np.nan],
    '04.03.2023': [np.nan, np.nan, np.nan, np.nan, np.nan]
}

df = pd.DataFrame(data)

column_names = [col for col in df.columns if col != 'execution_date']

mask = df['execution_date'].apply(lambda x: pd.Series(x == np.array(column_names), index=column_names))

df[column_names] = mask.fillna(0).astype(int)

print("\nProcessed DataFrame:")
print(df)

You can see the result here : https://www.online-python.com/PtVenhdo6r

Snavely answered 3/9, 2024 at 14:27 Comment(0)
A
0

Try this loop:

import pandas as pd

df = pd.read_excel(test.xlsx)
for j in df.iloc[:,1:].columns:
    df[j] = df.apply(lambda row: 1 if j == row["execution_date"] else 0, axis=1)
Airlee answered 3/9, 2024 at 14:50 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.