pandas dataframe create a new dataframe by duplicating n times rows of the previous dataframe and change date
Asked Answered
V

3

5

I have a dataframe with around around 9k rows and 57 cols, this is 'df'.

I need to have a new dataframe: 'df_final' - for each row of 'df' i have to replicate each row 'x' times and increase the day in each row one by one, also 'x' times. While i can do this for a couple of iterations, when i do it for the full length of 'df' ' len(df)' the loop it takes so long (>3 hours) that i actually had to cancel it. I have never seen the end of it. Here's the current code:

df.shape
output: (9454, 57)

df_int = df[0:0]
df_final = df_int[0:0]
range_df = len(df)
for x in range(0,2):
    df_int = df.iloc[0+x:x+1]
    if abs(df_int.iat[-1,3]) > 0:
        df_int = pd.concat([df_int]*abs(df_int.iat[-1,3]), ignore_index=True)
        for i in range(1, abs(df_int.iat[-1,3])):
            df_int['Consumption Date'][i] = df_int['Consumption Date'][i-1] + datetime.timedelta(days = 1)
            i += 1
       df_final = df_final.append(df_int, ignore_index=True)
    x += 1

The result of the loops for the first two rows of ' df' are below.

First two rows of df: enter image description here

Desired result: enter image description here

Is there another way to get to the desired output. It seems pandas do not deal very well with loops. In VBA excel the same loop takes around 3/4 minutes...i am trying to change a process which is currently in excel to python, however, if there's no way to make this work i guess i will stick to the old ways...

Vexatious answered 23/9, 2017 at 15:13 Comment(0)
K
4

Use repeat and cumcount

In [2972]: dff = df.loc[df.index.repeat(3)]

In [2973]: dff
Out[2973]:
        date   name
0 2017-05-03    bob
0 2017-05-03    bob
0 2017-05-03    bob
1 2017-06-13  sally
1 2017-06-13  sally
1 2017-06-13  sally

In [2974]: dff.loc[:, 'date'] += pd.to_timedelta(dff.groupby(level=0).cumcount(), 'D')

In [2975]: dff
Out[2975]:
        date   name
0 2017-05-03    bob
0 2017-05-04    bob
0 2017-05-05    bob
1 2017-06-13  sally
1 2017-06-14  sally
1 2017-06-15  sally

Details

In [2976]: df
Out[2976]:
        date   name
0 2017-05-03    bob
1 2017-06-13  sally

In [2977]: dff.groupby(level=0).cumcount()
Out[2977]:
0    0
0    1
0    2
1    0
1    1
1    2
dtype: int64
Kila answered 23/9, 2017 at 15:45 Comment(1)
really nice solNickola
P
1

Let's use this toy DataFrame:

df = pd.DataFrame({
    'date': pd.to_datetime(['2017-05-03', '2017-06-13']),
    'name': ['bob', 'sally'],
})

It looks like this:

        date   name
0 2017-05-03    bob
1 2017-06-13  sally

Then:

x = 3 # repeat count
ind = np.repeat(np.arange(len(df)), x) # 0,0,0,1,1,1
df_final = df.iloc[ind].copy()

That gives you the repeats:

        date   name
0 2017-05-03    bob
0 2017-05-03    bob
0 2017-05-03    bob
1 2017-06-13  sally
1 2017-06-13  sally
1 2017-06-13  sally

Now you just need to increment the dates:

inc = np.tile(np.arange(x), len(df)) # 0,1,2,0,1,2
df_final.date += pd.to_timedelta(inc, 'D')

And you get:

        date   name
0 2017-05-03    bob
0 2017-05-04    bob
0 2017-05-05    bob
1 2017-06-13  sally
1 2017-06-14  sally
1 2017-06-15  sally
Piano answered 23/9, 2017 at 15:34 Comment(0)
D
1

Here is one solution

df1=df.reset_index().set_index('date').groupby('index').\
    apply(lambda x :x.reindex(pd.date_range(start=x.index[0],periods=3,freq='D'))).ffill()
df1
Out[202]: 
                  index   name
index                         
0     2017-05-03    0.0    bob
      2017-05-04    0.0    bob
      2017-05-05    0.0    bob
1     2017-06-13    1.0  sally
      2017-06-14    1.0  sally
      2017-06-15    1.0  sally

Then

df1.drop('index',1).reset_index().rename(columns={'level_1':'date'}).drop('index',1)

Out[212]: 
        date   name
0 2017-05-03    bob
1 2017-05-04    bob
2 2017-05-05    bob
3 2017-06-13  sally
4 2017-06-14  sally
5 2017-06-15  sally
Delvalle answered 23/9, 2017 at 15:57 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.