Filling gaps for cumulative sum with Pandas

I'm trying to calculate the inventory of stocks from a table in monthly buckets in Pandas. This is the table:

Goods  |  Incoming  | Date
-------+------------+-----------
'a'    |         0  | 2014-01-10
'a'    |        20  | 2014-02-01
'b'    |        30  | 2014-01-02
'b'    |        40  | 2014-05-13
'a'    |        20  | 2014-06-30
'c'    |        10  | 2014-02-10
'c'    |        50  | 2014-05-10
'b'    |        70  | 2014-03-10
'a'    |        10  | 2014-02-10

This is my code so far:

import pandas as pd
df = pd.DataFrame({
  'goods': ['a', 'a', 'b', 'b', 'a', 'c', 'c', 'b', 'a'], 
  'incoming': [0, 20, 30, 40, 20, 10, 50, 70, 10], 
  'date': ['2014-01-10', '2014-02-01', '2014-01-02', '2014-05-13', '2014-06-30', '2014-02-10', '2014-05-10', '2014-03-10', '2014-02-10']})

df['date'] = pd.to_datetime(df['date'])
# we don't care about year in this example
df['month'] = df['date'].map(lambda x: x.month)
dfg = df.groupby(['goods', 'month'])['incoming'].sum()
# flatten multi-index
dfg = dfg.reset_index()
dfg['level'] = dfg.groupby(['goods'])['incoming'].cumsum()
dfg

which returns

    goods   month   incoming    level
0   a       1              0    0
1   a       2             30    30
2   a       6             20    50
3   b       1             30    30
4   b       3             70    100
5   b       5             40    140
6   c       2             10    10
7   c       5             50    60

While this is good, the visualisation method that I use requires (1) the same number of data points per group ('goods'), (2) the same extent of the time series (i.e. the earliest and latest month are the same for all time series) and (3) no "gaps" in any time series (i.e. no month between min(month) and max(month) without a data point).

How can I do this with Pandas? Note, even though this structure may be a bit inefficient, I'd like to stick with the general flow of things. Perhaps it's possible to insert some "post-processing" to fill in the gaps.

Update

To summarise the response below, I chose to do this:

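import numpy as np

# one column per good, one row per month that has data
piv = dfg.pivot_table(["level"], "month", "goods")
# add rows for the months that have no data at all
piv = piv.reindex(np.arange(piv.index[0], piv.index[-1] + 1))
# carry the last known level forward; months before the first entry become 0
piv = piv.ffill(axis=0)
piv = piv.fillna(0)
piv.index.name = 'month'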

I also added

piv = piv.stack()
print(piv.reset_index())

to get a table similar to the input table:

   month goods  level
0       1     a      0
1       1     b     30
2       1     c      0
3       2     a     30
4       2     b     30
5       2     c     10
6       3     a     30
7       3     b    100
8       3     c     10
9       4     a     30
10      4     b    100
11      4     c     10
12      5     a     30
13      5     b    140
14      5     c     60
15      6     a     50
16      6     b    140
17      6     c     60
Blaeberry answered 14/11, 2014 at 3:11 Comment(2)
This feels a bit like the XY Problem; might it be helped by changing how you're plotting? - Penknife
I thought long and hard about alternatives before posting. The plotting cannot be changed, unfortunately. - Blaeberry

I think you want to use pivot_table:

In [11]: df.pivot_table(values="incoming", index="month", columns="goods", aggfunc="sum")
Out[11]:
goods   a   b   c
month
1       0  30 NaN
2      30 NaN  10
3     NaN  70 NaN
5     NaN  40  50
6      20 NaN NaN

To fill in the missing months, you can reindex the result above (called res here; this feels a little hacky, and there may be a neater way):

In [12]: res.reindex(np.arange(res.index[0], res.index[-1] + 1))
Out[12]:
goods   a   b   c
1       0  30 NaN
2      30 NaN  10
3     NaN  70 NaN
4     NaN NaN NaN
5     NaN  40  50
6      20 NaN NaN
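
For reference, here is a self-contained sketch of the two steps above. It assumes the DataFrame from the question and stores the pivot table in a variable named res; the name full is only illustrative.

```python
import numpy as np
import pandas as pd

# Data as defined in the question's code
df = pd.DataFrame({
    "goods": ["a", "a", "b", "b", "a", "c", "c", "b", "a"],
    "incoming": [0, 20, 30, 40, 20, 10, 50, 70, 10],
    "date": pd.to_datetime([
        "2014-01-10", "2014-02-01", "2014-01-02", "2014-05-13", "2014-06-30",
        "2014-02-10", "2014-05-10", "2014-03-10", "2014-02-10"]),
})
df["month"] = df["date"].dt.month

# One row per month that has data, one column per good
res = df.pivot_table(values="incoming", index="month",
                     columns="goods", aggfunc="sum")

# Reindex over every month between the first and last observed month;
# months without any data (here: month 4) become all-NaN rows
full = res.reindex(np.arange(res.index.min(), res.index.max() + 1))
print(full)
```

Using index.min() and index.max() instead of index[0] and index[-1] is a small precaution in case the index is not sorted.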

One issue here is that the month is independent of the year; it may be preferable to have a period index:

In [21]: df.pivot_table(values="incoming", index=pd.DatetimeIndex(df.date).to_period("M"), columns="goods", aggfunc="sum")
Out[21]:
goods     a   b   c
2014-01   0  30 NaN
2014-02  30 NaN  10
2014-03 NaN  70 NaN
2014-05 NaN  40  50
2014-06  20 NaN NaN

and then you can reindex by the period range (with res2 being the result above):

In [22]: res2.reindex(pd.period_range(res2.index[0], res2.index[-1], freq="M"))
Out[22]:
goods     a   b   c
2014-01   0  30 NaN
2014-02  30 NaN  10
2014-03 NaN  70 NaN
2014-04 NaN NaN NaN
2014-05 NaN  40  50
2014-06  20 NaN NaN
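
A runnable sketch of the period-based variant, assuming the data from the question. The helper column period and the names incoming and level are illustrative. The last two lines are not part of the steps above: they treat a missing month as zero incoming stock and take the cumulative sum, which is one possible way to get gap-free levels directly.

```python
import pandas as pd

# Data as defined in the question's code
df = pd.DataFrame({
    "goods": ["a", "a", "b", "b", "a", "c", "c", "b", "a"],
    "incoming": [0, 20, 30, 40, 20, 10, 50, 70, 10],
    "date": pd.to_datetime([
        "2014-01-10", "2014-02-01", "2014-01-02", "2014-05-13", "2014-06-30",
        "2014-02-10", "2014-05-10", "2014-03-10", "2014-02-10"]),
})
# Monthly period that keeps the year, unlike a bare month number
df["period"] = df["date"].dt.to_period("M")

res2 = df.pivot_table(values="incoming", index="period",
                      columns="goods", aggfunc="sum")

# Every month between the first and last observed period
full_range = pd.period_range(res2.index.min(), res2.index.max(), freq="M")

# Assumption: a month with no rows means nothing came in, so fill with 0
incoming = res2.reindex(full_range).fillna(0)
# Running total per good over the gap-free monthly index
level = incoming.cumsum()
print(level)
```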

Which is to say, you can do the same with your dfg:

In [31]: dfg.pivot_table(["incoming", "level"], "month", "goods")
Out[31]:
      incoming         level
goods        a   b   c     a    b   c
month
1            0  30 NaN     0   30 NaN
2           30 NaN  10    30  NaN  10
3          NaN  70 NaN   NaN  100 NaN
5          NaN  40  50   NaN  140  60
6           20 NaN NaN    50  NaN NaN

and then reindex in the same way.
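
As a sketch of how this could look end to end for the level column, assuming dfg is built as in the question: the forward fill and the final fillna(0) are assumptions about the desired result (a level persists until the next delivery, and is 0 before the first one), and the names piv and tidy are illustrative.

```python
import numpy as np
import pandas as pd

# Data and dfg as defined in the question's code
df = pd.DataFrame({
    "goods": ["a", "a", "b", "b", "a", "c", "c", "b", "a"],
    "incoming": [0, 20, 30, 40, 20, 10, 50, 70, 10],
    "date": pd.to_datetime([
        "2014-01-10", "2014-02-01", "2014-01-02", "2014-05-13", "2014-06-30",
        "2014-02-10", "2014-05-10", "2014-03-10", "2014-02-10"]),
})
df["month"] = df["date"].dt.month
dfg = df.groupby(["goods", "month"])["incoming"].sum().reset_index()
dfg["level"] = dfg.groupby("goods")["incoming"].cumsum()

# Wide layout: one row per month with data, one column per good
piv = dfg.pivot(index="month", columns="goods", values="level")
# Insert the months that have no data for any good
piv = piv.reindex(np.arange(piv.index.min(), piv.index.max() + 1))
piv = piv.rename_axis("month")
# Carry the last known level forward, then use 0 before the first entry
piv = piv.ffill().fillna(0)

# Back to a long table with one row per (month, goods) pair
tidy = piv.stack().rename("level").reset_index()
print(tidy)
```

Every good then has the same six months, which covers the three requirements listed in the question.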

Penknife answered 14/11, 2014 at 7:6 Comment(7)
I wonder if it should be easier to do this reindexing at the end; it feels like a very natural thing to do. Perhaps worth filing a feature request. - Penknife
Thanks for the solution. This at least fills the gaps with NaN. How would I go about filling a gap with the previous value when there is no value (NaN)? For instance, good 'a' would have a level [0, 30, 30, 30, 50]. - Blaeberry
You can do res.ffill() to "fill forward". - Penknife
Regarding a neater way for the reindexing: This does the same, but is a little shorter: res.reindex(res.count()). Not sure if that's neater :-) - Blaeberry
I just realised that in your dfg.pivot_table example, month #4 is missing. Anything I can do about this? - Blaeberry
@Blaeberry that's what the reindex is needed for! :) - Penknife
I see. Makes sense now :) - Blaeberry