I have a dataframe with 'execution_date' columns containing dates. So there are also >150 columns which are named as dates '01.03.2023', '02.03.2023' etc. The dataframe itself is empty. The data is populated only in the 'execution_date' column, and there are just column names. Next, I need to put numbers 0 or 1 in the empty columns. If the date in the 'execution_date' column coincides with the name of another column, then I need to put 1, otherwise 0.
execution_date | 01.03.2023 | 02.03.2023 | 03.03.2023 | 04.03.2023 | ... |
---|---|---|---|---|---|
01.03.2023 | NaN | NaN | NaN | NaN | ... |
01.03.2023 | NaN | NaN | NaN | NaN | ... |
02.03.2023 | NaN | NaN | NaN | NaN | ... |
... | ... | ... | ... | ... | ... |
Cell 3 | NaN | NaN | NaN | NaN | ... |
Cell 3 | NaN | NaN | NaN | NaN | ... |
That is, the output dataframe should be like this
execution_date | 01.03.2023 | 02.03.2023 | 03.03.2023 | 04.03.2023 | ... |
---|---|---|---|---|---|
01.03.2023 | 1 | 0 | 0 | 0 | ... |
01.03.2023 | 1 | 0 | 0 | 0 | ... |
02.03.2023 | 0 | 1 | 0 | 0 | ... |
... | ... | ... | ... | ... | ... |
03.03.2023 | 0 | 0 | 1 | 0 | ... |
04.03.2023 | 0 | 0 | 0 | 1 | ... |
I have written the following code, but it is running very slow. Couse about 600k rows in excel and >150 columns
import pandas as pd
from tqdm import tqdm
df = pd.read_excel(test.xlsx)
column_names = [col for col in df.columns if (col != 'execution_date')]
for index, row in tqdm(df.iterrows()):
for j in column_names:
if row['execution_date'] == j:
df.iloc[index, df.columns.get_loc(j)] = 1
else:
df.iloc[index, df.columns.get_loc(j)] = 0
Could you please tell me, what is the best way to speed up the process of such an algorithm?