Pandas v0.25 groupby with many columns gives memory error

After updating to pandas v0.25.2, a script doing a groupby over many columns on a large dataframe no longer works. I get a memory error:

MemoryError: Unable to allocate array with shape (some huge number...,) and data type int64

Doing a bit of research, I found issue #14942 reported on GitHub for an earlier version:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cat': np.random.randint(0, 255, size=3000000),
    'int_id': np.random.randint(0, 255, size=3000000),
    'other_id': np.random.randint(0, 10000, size=3000000),
    'foo': 0
}) 
df['cat'] = df.cat.astype(str).astype('category')

# killed after 6 minutes of 100% cpu and  90G maximum main memory usage
grouped = df.groupby(['cat', 'int_id', 'other_id']).count()

Running this code (on version 0.25.2) also gives a memory error. Am I doing something wrong (has the groupby syntax changed in pandas v0.25?), or has this issue, which is marked as resolved, returned?

Odious answered 6/1, 2020 at 18:49 Comment(11)
The problem I see with the code provided is that you are going to get 255x255x10000 = 650,250,000 rows, that is, more than 650 million. I think that is huge for your system. However, I don't know why the same code was working with earlier versions of pandas and now is not. Good question. I wish you could provide more details.Kahle
Eliminate the df['cat'] = df.cat.astype(str).astype('category') line and it works. Still trying to understand why.Erik
@Kahle But the dataframe is 'only' 3000000 rows. The groupby cannot be bigger than that... unless it actually creates the empty categories as well.Odious
@CodeDifferent I looked at the issue I mentioned, and there the problem was that the categorical created all possible combinations (including the empty ones). It seems this issue might be back.Odious
@Odious Yes, it has 3000000 rows, but there are likely to be 255 unique values for 'cat', 255 unique values for 'int_id' and 10000 unique values for 'other_id'. When you are grouping, each combination is unique, so you have to multiply these numbers: 255x255x10000, and that many combinations will give you over 650 million rows. I have a question: why are you converting the df.cat dtype first to string and then to categorical?Kahle
@Kahle I understand your point about the number of POSSIBLE combinations. My point is that as long as those combinations are not actually present in the dataframe (they cannot all be, since there are 255x255x10000 combinations and the frame only has 3000000 rows), there is no need for the groupby object to allocate memory for all of them. That is probably what is happening, but as I state in my post, I suspect this is a bug; a quick check is sketched after the comments. As for converting the df.cat dtype, that is done in order to provoke the bug.Odious
Yes, you are right. I was wrong. The grouped dataframe cannot be bigger than the dataframe itself, otherwise the very concept of aggregation fails. I am really sorry. Tired perhaps.Kahle
@Kahle no need to apologize. I appreciate your inputs.Odious
Same error for me: after an upgrade from pandas 0.24 to 0.25 I get a memory error on a groupby.agg() operation with string and categorical variables in the index. Is there a link to a bug report?Choking
Same problem here, after upgrading to 0.25.2 at work new MemoryErrors are popping up where there were none before. It seems like a regression.Suckling
This has been an interesting thread! I've been stuck on 0.24.2 due to this issue and never could fully figure it out. Mine always occurred when using the new named-aggregation syntax in 0.25.x (or my own workaround code that does the same thing as named aggregation). I only started looking back into it today after testing again on 1.0.0rc1 and hitting the same problem. I think the new functionality in 0.25.0 definitely introduced a change that is impacting this.Deviationism
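
A quick way to verify the point made in the comments, assuming the repro dataframe df from the question has been built; this only counts combinations and does not itself trigger the blow-up:

# Combinations actually present in the data vs. the full Cartesian product.
observed = df[['cat', 'int_id', 'other_id']].drop_duplicates().shape[0]
possible = df['cat'].nunique() * df['int_id'].nunique() * df['other_id'].nunique()
print(observed)   # at most 3,000,000 (one per row of df)
print(possible)   # roughly 255 * 255 * 10000 = 650,250,000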

Use observed=True to fix it and prevent the groupby from expanding all possible combinations of the categorical variables:

df.groupby(index, observed=True)

There is a related GitHub Issue: PERF: groupby with many empty groups memory blowup. The issue has been closed and the default will be changed from observed=False to observed=True in a future version of pandas.
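
For reference, a minimal sketch of the fix applied to the repro from the question (the observed keyword is accepted by groupby in the pandas versions discussed here; the dataframe construction is copied from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cat': np.random.randint(0, 255, size=3000000),
    'int_id': np.random.randint(0, 255, size=3000000),
    'other_id': np.random.randint(0, 10000, size=3000000),
    'foo': 0
})
df['cat'] = df.cat.astype(str).astype('category')

# observed=True creates only the groups that actually occur in the data,
# so the result has at most 3,000,000 rows instead of blowing up toward
# the ~650 million possible combinations discussed in the comments.
grouped = df.groupby(['cat', 'int_id', 'other_id'], observed=True).count()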

Choking answered 14/1, 2020 at 13:38 Comment(0)

While the proposed solution addresses the issue, another problem is likely to arise when dealing with larger datasets. pandas groupby is slow and memory hungry; it may need 5-10x the memory of the dataset. A more effective solution is to use a tool that is an order of magnitude faster and less memory hungry, and that integrates seamlessly with pandas: it reads directly from the dataframe's memory, so there is no data round trip and typically no need for extensive data chunking.

My new tool of choice for quick data aggregation is https://duckdb.org. It takes your existing dataframe df and queries it directly, without even importing it into a database. Here is an example of the final result using your dataframe generation code; notice that the total time was 0.45 sec. I am not sure why pandas does not use DuckDB for the groupby under the hood.

[screenshot: DuckDB aggregation result on the generated dataframe, total time 0.45 sec]
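
As a rough sketch of the kind of query behind that result (duckdb installed via pip; df is the dataframe generated by the question's code, which DuckDB's Python client can scan in place by its variable name):

import duckdb

con = duckdb.connect()   # in-memory database
agg = con.execute("""
    SELECT cat, int_id, other_id, COUNT(*) AS foo
    FROM df
    GROUP BY cat, int_id, other_id
""").fetchdf()
# If the dataframe is not picked up by name, expose it explicitly first:
# con.register('df', df)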

The db object is created using this small wrapper class, which lets you simply type db = DuckDB() and you are ready to explore the data in any project. You can expand it further, or even simplify it with the %sql magic described in the DuckDB documentation. By the way, sql() returns a dataframe, so you can also chain calls such as db.sql(...).pivot_table(...); it is that simple.

import duckdb

class DuckDB:
    def __init__(self, db=None):
        # In-memory database by default; pass a file path to persist it.
        self.db_loc = db or ':memory:'
        self.db = duckdb.connect(self.db_loc)

    def sql(self, sql=""):
        # Run a query and return the result as a pandas dataframe.
        return self.db.execute(sql).fetchdf()

    def __del__(self):
        # Close the connection when the wrapper is garbage collected.
        self.db.close()
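
A short usage sketch with the wrapper above (again assuming the question's df is in scope; if it is not found by name, it can be registered via db.db.register('df', df)):

db = DuckDB()
res = db.sql("""
    SELECT cat, other_id, COUNT(*) AS n
    FROM df
    GROUP BY cat, other_id
""")
# res is a plain pandas dataframe, so it can be chained further, e.g.:
wide = res.pivot_table(index='cat', columns='other_id', values='n', aggfunc='sum')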

Note: DuckDB is good but not perfect; still, it has turned out to be far more stable than Dask or even PySpark, with a much simpler setup. For larger datasets you may need a real database, but for datasets that fit in memory this is great. Regarding memory usage, if you have a larger dataset make sure you limit DuckDB with pragmas, as it will otherwise eat all the memory in no time; the limit simply spills the excess to disk without you having to deal with data chunking. Also, do not treat it as a persistent database. Treat it as an in-memory database: if you need results stored, export them to Parquet instead of saving the database file, because the storage format is not stable between releases and you would have to export to Parquet anyway to move from one version to the next.
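
On the memory cap and Parquet export mentioned above, a small sketch (the setting names are as documented by DuckDB; the values and paths are just examples):

import duckdb

con = duckdb.connect()
# Cap DuckDB's memory usage; work that exceeds the cap spills to the temp directory.
con.execute("SET memory_limit = '8GB'")
con.execute("SET temp_directory = '/tmp/duckdb_spill'")   # example path

# Persist results as Parquet instead of relying on the database file format:
con.execute("COPY (SELECT 1 AS example) TO 'results.parquet' (FORMAT PARQUET)")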

I expanded this dataframe to 300 million records (around 1.2 billion values in total, or roughly 9 GB). It still completed your groupby and other summary stats on a 32 GB machine, with 18 GB still free.

[screenshot: the same aggregation and summary stats on the 300 million row dataframe]

Carn answered 25/1, 2023 at 20:11 Comment(0)
