I have a dataset that fits into RAM but causes an out-of-memory error when I run certain methods, such as df.unique(). My laptop has 16GB of RAM, and I am running WSL with 14GB of it available. I am using Polars version 0.18.4. Running df.estimated_size() says that my dataset is around 6GB when I read it in. The schema of my data is:
index: Int64
first_name: Utf8
last_name: Utf8
race: Utf8
pct_1: Float64
pct_2: Float64
pct_3: Float64
pct_4: Float64
size = pl.read_parquet("data.parquet").estimated_size()  # eager read, just to measure the size
df = pl.scan_parquet("data.parquet")  # use LazyFrames
However, I am unable to perform tasks such as .unique(), .drop_nulls(), and so on without getting SIGKILLed, even though I am using LazyFrames. For example,
df = df.drop_nulls().collect(streaming=True)
results in an out-of-memory error. I am able to sidestep this by writing a custom function:
def iterative_drop_nulls(lf: pl.LazyFrame, subset: list[str]) -> pl.LazyFrame:
    # Filter out nulls one column at a time instead of all at once.
    for col in subset:
        lf = lf.filter(~pl.col(col).is_null())
    return lf
df = df.pipe(iterative_drop_nulls, ["col1", "col2"]).collect()
I am quite curious why the latter works but not the former, given that even the full dataset (as initially read in) fits into RAM.
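One way to see where the two approaches diverge should be to compare the query plans Polars builds for each. A sketch (describe_optimized_plan is the LazyFrame plan-printing method in this version; iterative_drop_nulls is the function above):

lf = pl.scan_parquet("data.parquet")

# Plan for the built-in version.
print(lf.drop_nulls().describe_optimized_plan())

# Plan for the column-by-column version: a chain of filters, one per column.
print(lf.pipe(iterative_drop_nulls, ["col1", "col2"]).describe_optimized_plan())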
Unfortunately, I am unable to think of a similar trick to do the same thing as .unique(). Is there something I can do to make .unique() take less memory? I have tried:
df = df.lazy().unique(subset=cols).collect(streaming=True)  # subset must be passed by keyword: unique()'s first positional parameter is maintain_order in this version
and
def unique(df: pl.DataFrame, subset: list[str], n_rows: int = 100_000) -> pl.DataFrame:
    # Deduplicate slice by slice so each hash table only ever sees n_rows rows.
    parts = []
    for chunk in df.iter_slices(n_rows=n_rows):
        parts.append(chunk.unique(subset=subset))
    # Caveat: duplicates that fall in different slices survive this pass.
    return pl.concat(parts)
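Even with the slicing fixed, this only deduplicates within each slice. A variant that should at least be correct globally is to hash-partition the rows on the key columns, deduplicate each partition on its own, and concatenate: all duplicates of a row share a hash, so they land in the same partition and no final global pass is needed. A sketch I have not benchmarked (n_partitions is an arbitrary knob, and each partition costs one extra scan of the file):

def partitioned_unique(lf: pl.LazyFrame, subset: list[str], n_partitions: int = 16) -> pl.DataFrame:
    # Rows with equal key values hash identically, so duplicates always land
    # in the same partition; peak memory is roughly one partition at a time.
    key = pl.struct(subset).hash()
    parts = []
    for i in range(n_partitions):
        parts.append(
            lf.filter(key % n_partitions == i)
            .unique(subset=subset)
            .collect(streaming=True)
        )
    return pl.concat(parts)

df = partitioned_unique(pl.scan_parquet("data.parquet"), ["col1", "col2"])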
Edit:
I would love a better answer, but for now I am using
df = pl.from_pandas(
    df.collect()
    .to_pandas()
    .drop_duplicates(subset=["col1", "col2"])
)
In general I have found Polars to be more memory-efficient than Pandas, but maybe this is an area where Polars could improve? Curiously, if I use
df = pl.from_pandas(
    df.collect()
    .to_pandas(use_pyarrow_extension_array=True)
    .drop_duplicates(subset=["col1", "col2"])
)
I get the same memory error, so maybe this is a PyArrow thing.
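One more idea I have not verified: if the deduplicated result is itself large, sinking it straight to a new file instead of collecting it would avoid materializing the result in RAM at all. A sketch, assuming sink_parquet (which runs the plan on the streaming engine and, as far as I can tell, errors out if some operation is unsupported there rather than falling back to in-memory):

(
    pl.scan_parquet("data.parquet")
    .unique(subset=["col1", "col2"])
    .sink_parquet("deduped.parquet")  # hypothetical output path
)
df = pl.scan_parquet("deduped.parquet")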
Comment: "… unique if there is not enough memory. Can it be that the results don't fit into memory? Which Polars version do you use? And what is the schema of your data?" – Placia