Python with DataFrame merge of aggregations ...error: '' is both an index level and a column label, which is ambiguous
Asked Answered
P

1

8
import pandas as pd

data1 = [
{'File': '7396ee8aea09.json', 'Last': '2019-04-25 20:24:28+00:00', 'Size': 78796, 'COLA': 'STANDARD'}, 
{'File': '7777682e4/ip/done', 'Last': '2019-04-24 21:45:01+00:00', 'Size': 4, 'COLA': 'CUSTOM'}, 
{'File': 'erparameters.json', 'Last': '2019-04-24 21:45:01+00:00', 'Size': 351, 'COLA': 'STANDARD'}
]
data2 = [
{'File': '3343e4/ip/ip.json', 'Last': '2019-04-24 21:45:01+00:00', 'Size': 20, 'COLA': 'STANDARD'}, 
{'File': 'ta-00000-of-00001', 'Last': '2019-04-24 22:43:17+00:00', 'Size': 140216236, 'COLA': 'STANDARD'}, 
{'File': '-20514.ckpt.index', 'Last': '2019-04-24 22:43:17+00:00', 'Size': 3168, 'COLA': 'STANDARD'}, 
]
# right now 2 for the prova but could be much more..
datas = [data1, data2]
results = None 
for data in datas:
  df = pd.DataFrame(data, columns=['COLA','Size','Last'])\
              .groupby(['COLA'])\
              .agg({
                  'COLA': [('COLA', 'max'), ('Count', 'count')],
                  'Size': [('Size', 'sum')],
                  'Last': [('Last', 'max')]
                  })
  df.columns = ['COLA', 'Count', 'Size', 'Last']

  if results is None:
    results = df
  else:
    results = results.merge(df,
                   on=['COLA', 'Count', 'Size', 'Last'],
                   how='outer',
                   )\
             .groupby(['COLA'], as_index=False)\
             .agg({
                   'COLA': [('COLA', 'max'), ('Count', 'sum')],
                   'Size': [('Size', 'sum')],
                   'Last': [('Last', 'max')]
                 })

print (results)

expected results is in that format:

  • COLA have all unique COLA labels
  • Size is the sum of all 'Size' for that COLA label group across all sets.
  • Count is the total count of that COLA label group across all sets.
  • Last has the greatest date of that COLA label group across all sets.

Example:

results = [
     {'COLA': 'STANDARD', 'Size': 140298571, 'Count': 5, 'Last': '2019-04-25 20:24:28+00:00'}, 
     {'COLA': 'CUSTOM',   'Size': 4,         'Count': 1, 'Last': '2019-04-24 21:45:01+00:00'}
]
Panzer answered 4/3, 2020 at 13:31 Comment(0)
I
4

You can concatenate all the data and do groupby:

df = pd.concat(pd.DataFrame(d) for d in datas)

(df.groupby('COLA').agg({'COLA':'count',
                        'Size':'max',
                        'Last':'max'})
  .rename(columns={'COLA':'Count'})
  .reset_index()
  .to_dict('rows')
)

Output:

[{'COLA': 'CUSTOM',
  'Count': 1,
  'Size': 4,
  'Last': '2019-04-24 21:45:01+00:00'},
 {'COLA': 'STANDARD',
  'Count': 5,
  'Size': 140216236,
  'Last': '2019-04-25 20:24:28+00:00'}]

Update: I haven't tried this, but this should work and require less memory:

df = pd.concat(pd.DataFrame(d)
                 .groupby('COLA')
                 .agg({'COLA':'count',
                       'Size':'max',
                       'Last':'max'})
                 .rename(columns={'COLA':'Count'})
               for data in datas
               )

df.groupby('COLA').agg({'Count':'sum', 'Size':'max', 'last':'max'})
Inclinometer answered 4/3, 2020 at 13:36 Comment(8)
Would that cause a memory issue if I have millions of rows in each data set?Panzer
As long as you loaded all the datasets in to datas, I think you would be fine. If not, you can groupby each dataset in your for loop with the same command, concatenate the results and groupby again, this time aggregate with 'Count':'sum' and without rename.Inclinometer
Thanks! That really simplify my code and reduce many loops.. One more issue is when the .to_dict('rows') ... it give me the lables as ('COLA',''): and ('Count','Count'): ...Panzer
That I'm not sure, the code works for the sample data. Maybe your data is a bit different.Inclinometer
The issue was in the way I was doing the agg... I did not noticed the change you did.. All good now!! Many thanks!!Panzer
can you give me the sample code for the first comment about the loop? I'm not sure I get the "loop with the same command" and the change to 'Count':'sum'..Panzer
getting: pandas.core.base.SpecificationError: nested renamer is not supported, I f I remove the rename I loose my entry for 'COLA': 'CUSTOM' or 'COLA':'STANDARD' as part of the resultsPanzer
Got it! I had to add "as_index=False" in the groupby'sPanzer

© 2022 - 2024 — McMap. All rights reserved.