Reduce polars memory consumption in unique()

I have a dataset that fits into RAM, but certain methods, such as df.unique(), cause an out-of-memory error. My laptop has 16 GB of RAM, and I am running WSL with 14 GB allocated. I am using Polars version 0.18.4. df.estimated_size() reports that my dataset is around 6 GB when I read it in. The schema of my data is:

index: Int64
first_name: Utf8
last_name: Utf8
race: Utf8
pct_1: Float64
pct_2: Float64
pct_3: Float64
pct_4: Float64
import polars as pl

size = pl.read_parquet("data.parquet").estimated_size()
df = pl.scan_parquet("data.parquet")  # use LazyFrames

However, I am unable to perform tasks such as .unique(), .drop_nulls(), and so on without getting SIGKILLed. I am using LazyFrames.

For example,

df = df.drop_nulls().collect(streaming=True)

results in an out of memory error. I am able to sidestep this by writing a custom function.

def iterative_drop_nulls(ldf: pl.LazyFrame, subset: list[str]) -> pl.LazyFrame:
    # Filter out nulls one column at a time instead of all at once.
    for col in subset:
        ldf = ldf.filter(~pl.col(col).is_null())

    return ldf

df = df.pipe(iterative_drop_nulls, ["col1", "col2"]).collect()

I am quite curious why the latter works but not the former, given that the largest version of the dataset (when I read it in initially) fits into RAM.

Unfortunately, I am unable to think of a similar trick to do the same thing as .unique(). Is there something I can do to make .unique() take less memory? I have tried:

df = df.lazy().unique(cols).collect(streaming=True)

and

def unique(df: pl.DataFrame, subset: list[str], n_rows: int = 100_000) -> pl.DataFrame:
    # Deduplicate each slice separately, then concatenate.
    # Note: duplicates that span two slices survive this approach.
    parts = []
    for frame in df.iter_slices(n_rows=n_rows):
        parts.append(frame.unique(subset=subset))

    return pl.concat(parts)
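
For reference, a minimal sketch of the "subdivide and recombine" idea: partition rows by a hash of one of the subset columns so that all duplicates of a given key land in the same partition, deduplicate each partition on its own, and concatenate. This is only a sketch under my own assumptions (the helper name, the partition count, and the use of the first subset column as the hash key are my choices, not anything from above), and each partition triggers a separate scan of the file:

import polars as pl

def partitioned_unique(
    ldf: pl.LazyFrame, subset: list[str], n_partitions: int = 8
) -> pl.DataFrame:
    # Rows that are duplicates on `subset` share the same value in subset[0],
    # so hashing that column routes all copies into the same partition.
    key = pl.col(subset[0]).hash()
    parts = [
        ldf.filter(key % n_partitions == i)
        .unique(subset=subset)
        .collect(streaming=True)
        for i in range(n_partitions)
    ]
    return pl.concat(parts)

# e.g. df = partitioned_unique(pl.scan_parquet("data.parquet"), ["col1", "col2"])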

Edit:

I would love a better answer, but for now I am using

df = pl.from_pandas(
    df.collect()
    .to_pandas()
    .drop_duplicates(subset=["col1", "col2"])
)

In general I have found Polars to be more memory efficient than Pandas, but maybe this is an area Polars could improve? Curiously, if I use

df = pl.from_pandas(
    df.collect()
    .to_pandas(use_pyarrow_extension_array=True)
    .drop_duplicates(subset=["col1", "col2"])
)

I get the same memory error, so maybe this is a PyArrow thing.

Georgeanngeorgeanna answered 25/6/2023 at 19:20. Comments (9):
Unique uses a sort, which takes some memory. If you're set on using unique, the best approach likely depends on the size of your data. If each element is large, use a hash. If it is small, subdivide the problem and join. github.com/numpy/numpy/blob/… – Quirites
Have you tried just eagerly reading it? There are different algorithms for streaming and eager. – Rockingham
I'm confused about the question title stressing "Reduce polars memory consumption in unique()". Are you asking us to 1. reduce the memory consumption of the function itself, 2. how you can do something similar without an out-of-memory error, or 3. how you can still use unique() and get around this? – Moazami
`df = df.lazy().unique(cols).collect(streaming=True)` should spill to disk for the unique if there is not enough memory. Can it be that the results don't fit into memory? Which Polars version do you use? And what is the schema of your data? – Placia
@Moazami I found a workaround, but I would love to know why I can't use unique() (or a way for me to use unique()). I have also tried eagerly reading it. – Georgeanngeorgeanna
@Placia The results fit into memory. The original dataset fits into memory, and unique at worst keeps the size the same (also, the Pandas version I posted in the edit works). I edited my post with the Polars version (0.18.4) and schema (Int64, 3 Utf8, 4 Float64). I will try to create a reproducible example when I have a bit more time. – Georgeanngeorgeanna
@Georgeanngeorgeanna might be worth opening another question for that! I honestly have no clue, sorry! – Moazami
It could be the case that Pandas made a different trade-off between memory use and speed; it could also be the case that the Polars one is poorly implemented! – Moazami
This seems to be a current issue; see #77523765. It may be specific to .unique() on long Utf8 columns. – Fulminous

I've found the best way to deal with this is writing to file and then lazily loading it back in:

(pl.LazyFrame({'a': [None, 'word', 'word', 'word'], 
               'b': [None, 'word2', 'word2', 'word3']})
 .drop_nulls()
 .unique()
 .sink_parquet('test.parquet')
)

pl.scan_parquet('test.parquet').fetch()
shape: (2, 2)
┌──────┬───────┐
│ a    ┆ b     │
│ ---  ┆ ---   │
│ str  ┆ str   │
╞══════╪═══════╡
│ word ┆ word3 │
│ word ┆ word2 │
└──────┴───────┘

I'd also resort to .collect(streaming=True).write_parquet('test.parquet') if any of the operations aren't supported by streaming.
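
Applied to the question above, that fallback might look like the following sketch; the file names and subset columns are placeholders, and whether unique actually runs in the streaming engine depends on your Polars version:

import polars as pl

# Persist the deduplicated result to disk instead of keeping it in RAM,
# then scan the (smaller) file lazily for further work.
(
    pl.scan_parquet("data.parquet")
    .unique(subset=["col1", "col2"])
    .collect(streaming=True)
    .write_parquet("deduped.parquet")
)

df = pl.scan_parquet("deduped.parquet")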

Brazilein answered 2/3/2024 at 12:00. Comments (0)
