Reduce polars memory consumption in unique()

I have a dataset that fits into RAM, but certain methods, such as df.unique(), cause an out-of-memory error. My laptop has 16 GB of RAM, and I am running WSL with 14 GB allocated. I am using Polars version 0.18.4. df.estimated_size() reports that my dataset is around 6 GB when I read it in. The schema of my data is:

index: Int64
first_name: Utf8
last_name: Utf8
race: Utf8
pct_1: Float64
pct_2: Float64
pct_3: Float64
pct_4: Float64
import polars as pl

size = pl.read_parquet("data.parquet").estimated_size()
df = pl.scan_parquet("data.parquet")  # use LazyFrames

However, I am unable to perform tasks such as .unique(), .drop_nulls(), and so on without getting SIGKILLed. I am using LazyFrames.

For example,

df = df.drop_nulls().collect(streaming=True)

results in an out of memory error. I am able to sidestep this by writing a custom function.

def iterative_drop_nulls(ldf: pl.LazyFrame, subset: list[str]) -> pl.LazyFrame:
    # Filter out nulls one column at a time instead of all at once.
    for col in subset:
        ldf = ldf.filter(~pl.col(col).is_null())

    return ldf

df = df.pipe(iterative_drop_nulls, ["col1", "col2"]).collect()

I am quite curious why the latter works but not the former, given that the largest version of the dataset (when I read it in initially) fits into RAM.

Unfortunately, I am unable to think of a similar trick to do the same thing as .unique(). Is there something I can do to make .unique() take less memory? I have tried:

df = df.lazy().unique(cols).collect(streaming=True)

and

def unique(df: pl.DataFrame, subset: list[str], n_rows: int = 100_000) -> pl.DataFrame:
    # Deduplicate each slice separately, then concatenate.
    # Note: duplicates that span two slices survive this approach.
    parts = []
    for frame in df.iter_slices(n_rows=n_rows):
        parts.append(frame.unique(subset=subset))

    return pl.concat(parts)
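
For reference, a minimal sketch of the "subdivide and recombine" idea: partition rows by a hash of one of the subset columns so that all duplicates of a given key land in the same partition, deduplicate each partition on its own, and concatenate. This is only a sketch under my own assumptions (the helper name, the partition count, and the use of the first subset column as the hash key are my choices, not anything from above), and each partition triggers a separate scan of the file:

import polars as pl

def partitioned_unique(
    ldf: pl.LazyFrame, subset: list[str], n_partitions: int = 8
) -> pl.DataFrame:
    # Rows that are duplicates on `subset` share the same value in subset[0],
    # so hashing that column routes all copies into the same partition.
    key = pl.col(subset[0]).hash()
    parts = [
        ldf.filter(key % n_partitions == i)
        .unique(subset=subset)
        .collect(streaming=True)
        for i in range(n_partitions)
    ]
    return pl.concat(parts)

# e.g. df = partitioned_unique(pl.scan_parquet("data.parquet"), ["col1", "col2"])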

Edit:

I would love a better answer, but for now I am using

df = pl.from_pandas(
    df.collect()
    .to_pandas()
    .drop_duplicates(subset=["col1", "col2"])
)

In general I have found Polars to be more memory efficient than Pandas, but maybe this is an area Polars could improve? Curiously, if I use

df = pl.from_pandas(
    df.collect()
    .to_pandas(use_pyarrow_extension_array=True)
    .drop_duplicates(subset=["col1", "col2"])
)

I get the same memory error, so maybe this is a PyArrow thing.

Georgeanngeorgeanna answered 25/6/2023 at 19:20. Comments (9):
Unique uses a sort, which takes some memory. If you're set on using unique, the best approach likely depends on the size of your data. If each element is large, use a hash. If it is small, subdivide the problem and join. github.com/numpy/numpy/blob/… – Quirites
Have you tried just eagerly reading it? There are different algorithms for streaming and eager. – Rockingham
I'm confused about the question title stressing "Reduce polars memory consumption in unique()". Are you asking us to 1. reduce the memory consumption of the function itself, 2. how you can do something similar without an out-of-memory error, or 3. how you can still use unique() and get around this? – Moazami
`df = df.lazy().unique(cols).collect(streaming=True)` should spill to disk for the unique if there is not enough memory. Can it be that the results don't fit into memory? Which Polars version do you use? And what is the schema of your data? – Placia
@Moazami I found a workaround, but I would love to know why I can't use unique() (or a way for me to use unique()). I have also tried eagerly reading it. – Georgeanngeorgeanna
@Placia The results fit into memory. The original dataset fits into memory, and unique at worst keeps the size the same (also, the Pandas version I posted in the edit works). I edited my post with the Polars version (0.18.4) and schema (Int64, 3 Utf8, 4 Float64). I will try to create a reproducible example when I have a bit more time. – Georgeanngeorgeanna
@Georgeanngeorgeanna might be worth opening another question for that! I honestly have no clue, sorry! – Moazami
It could be the case that Pandas made a different trade-off between memory use and speed; it could also be the case that the Polars one is poorly implemented! – Moazami
This seems to be a current issue; see #77523765. It may be specific to .unique() on long Utf8 columns. – Fulminous

I've found the best way to deal with this is writing to file and then lazily loading it back in:

(pl.LazyFrame({'a': [None, 'word', 'word', 'word'], 
               'b': [None, 'word2', 'word2', 'word3']})
 .drop_nulls()
 .unique()
 .sink_parquet('test.parquet')
)

pl.scan_parquet('test.parquet').fetch()
shape: (2, 2)
┌──────┬───────┐
│ a    ┆ b     │
│ ---  ┆ ---   │
│ str  ┆ str   │
╞══════╪═══════╡
│ word ┆ word3 │
│ word ┆ word2 │
└──────┴───────┘

I'd also resort to .collect(streaming=True).write_parquet('test.parquet') if any of the operations aren't supported by streaming.
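
Applied to the question above, that fallback might look like the following sketch; the file names and subset columns are placeholders, and whether unique actually runs in the streaming engine depends on your Polars version:

import polars as pl

# Persist the deduplicated result to disk instead of keeping it in RAM,
# then scan the (smaller) file lazily for further work.
(
    pl.scan_parquet("data.parquet")
    .unique(subset=["col1", "col2"])
    .collect(streaming=True)
    .write_parquet("deduped.parquet")
)

df = pl.scan_parquet("deduped.parquet")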

Brazilein answered 2/3/2024 at 12:00. Comments (0)
