Optimal way to get all duplicated rows in a polars dataframe

I want to filter all duplicated rows from a polars dataframe. What I've tried:

df = pl.DataFrame([['1', '1', '1', '1'], ['7', '7', '2', '7'], ['3', '9', '3', '9']])
df
shape: (4, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ str      ┆ str      │
╞══════════╪══════════╪══════════╡
│ 1        ┆ 7        ┆ 3        │
│ 1        ┆ 7        ┆ 9        │
│ 1        ┆ 2        ┆ 3        │
│ 1        ┆ 7        ┆ 9        │
└──────────┴──────────┴──────────┘
df.filter(pl.all().is_duplicated())
shape: (3, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ str      ┆ str      │
╞══════════╪══════════╪══════════╡
│ 1        ┆ 7        ┆ 3        │ # DO NOT WANT
│ 1        ┆ 7        ┆ 9        │
│ 1        ┆ 7        ┆ 9        │
└──────────┴──────────┴──────────┘

This also selects the first row: the expression is evaluated column by column, so a row is kept whenever each of its values has a duplicate somewhere within its own column - not whenever the entire row is duplicated, which is the intended outcome.

Boolean indexing does not work:

df[df.is_duplicated(), :]
# TypeError: selecting rows by passing a boolean mask to `__getitem__` is not supported
# Hint: Use the `filter` method instead.

This leaves me wondering:

  • whether there's a way to use .filter() and expressions to achieve the desired result
  • what the most efficient way to achieve it is
Velate answered 4/5, 2022 at 11:53 Comment(0)

In general, the is_duplicated method will likely perform best. Let's take a look at some alternative ways to accomplish this. And we'll do some (very) non-rigorous benchmarking - just to see which ones perform reasonably well.

Some alternatives

One alternative is a filter statement with an over (windowing) expression on all columns. One caution with windowed expressions - they are convenient, but can be costly performance-wise.

df.filter(pl.count("column_1").over(df.columns) > 1)
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ str      ┆ str      │
╞══════════╪══════════╪══════════╡
│ 1        ┆ 7        ┆ 9        │
│ 1        ┆ 7        ┆ 9        │
└──────────┴──────────┴──────────┘
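
On recent Polars versions, where pl.count has been deprecated in favor of pl.len, the same filter can be written with pl.len(), which counts the rows in each window regardless of nulls; a minimal sketch, assuming a reasonably current Polars release:

# Count the rows in each window defined by all columns and keep rows whose
# full combination of column values occurs more than once.
df.filter(pl.len().over(df.columns) > 1)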

Another alternative is a group_by, followed by a join. Basically, we'll count the number of times that combinations of columns occur. I'm using a semi join here, simply because I don't want to include the len column in my final results.

df.join(
    df.group_by(df.columns).len()
    .filter(pl.col("len") > 1),
    on=df.columns,
    how="semi",
)
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ str      ┆ str      │
╞══════════╪══════════╪══════════╡
│ 1        ┆ 7        ┆ 9        │
│ 1        ┆ 7        ┆ 9        │
└──────────┴──────────┴──────────┘
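
If this runs as part of a larger query, the same count-and-semi-join logic can be kept in the lazy API so the whole plan is optimized together; a minimal sketch of that lazy variant (not part of the benchmarks below):

# Lazy version: count occurrences of each full-row combination,
# keep combinations with len > 1, and semi-join back to the original rows.
duplicated_rows = (
    df.lazy()
    .join(
        df.lazy().group_by(df.columns).len().filter(pl.col("len") > 1),
        on=df.columns,
        how="semi",
    )
    .collect()
)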

Some (very) non-rigorous benchmarking

One way to see which alternatives perform reasonably well is to time the performance on a test dataset that might resemble the datasets that you will use. For lack of something better, I'll stick to something that looks close to the dataset in your question.

Set nbr_rows to something that will challenge your machine. (My machine is a 32-core system, so I'm going to choose a reasonably high number of rows.)

import string
import time

import numpy as np
import polars as pl

nbr_rows = 100_000_000
df = pl.DataFrame(
    {
        "col1": np.random.choice(1_000, nbr_rows,),
        "col2": np.random.choice(1_000, nbr_rows,),
        "col3": np.random.choice(list(string.ascii_letters), nbr_rows,),
        "col4": np.random.choice(1_000, nbr_rows,),
    }
)
print(df)
shape: (100000000, 4)
┌──────┬──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col4 │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ str  ┆ i64  │
╞══════╪══════╪══════╪══════╡
│ 955  ┆ 186  ┆ j    ┆ 851  │
│ 530  ┆ 199  ┆ d    ┆ 376  │
│ 109  ┆ 609  ┆ G    ┆ 115  │
│ 886  ┆ 487  ┆ d    ┆ 479  │
│ ...  ┆ ...  ┆ ...  ┆ ...  │
│ 837  ┆ 406  ┆ Y    ┆ 60   │
│ 467  ┆ 769  ┆ P    ┆ 344  │
│ 548  ┆ 372  ┆ F    ┆ 410  │
│ 379  ┆ 578  ┆ t    ┆ 287  │
└──────┴──────┴──────┴──────┘

Now let's benchmark some alternatives. Since these may or may not resemble your datasets (or your computing platform), I won't run the benchmarks multiple times. For our purposes, we're just trying to weed out alternatives that might perform very poorly.

Alternative: filter using an over (windowing) expression

start = time.perf_counter()
df.filter(pl.count("col1").over(df.columns) > 1)
end = time.perf_counter()
print(end - start)
18.136289041000055

As expected, the over (windowing) expression is rather costly.

Alternative: group_by followed by a join

start = time.perf_counter()
df.join(
    df.group_by(df.columns).len()
    .filter(pl.col("len") > 1),
    on=df.columns,
    how="semi",
)
end = time.perf_counter()
print(end - start)
9.419006452999383

Somewhat better ... but not as good as using the is_duplicated method provided by the Polars API.

Alternative: concat_str

Let's also look at an alternative suggested in another answer. To be fair, @FBruzzesi did say "I am not sure this is optimal by any means". But let's look at how it performs.

start = time.perf_counter()
df.filter(pl.concat_str(df.columns, separator='|').is_duplicated())
end = time.perf_counter()
print(end - start)
37.238660977998734

Edit

Additional Alternative: filter and is_duplicated

We can also use filter with is_duplicated. Since df.is_duplicated() is not a column in the DataFrame when the filter is run, we'll need to wrap it in a polars.lit Expression.

start = time.perf_counter()
df.filter(pl.lit(df.is_duplicated()))
end = time.perf_counter()
print(end - start)
8.115436136999051

Did this help? If nothing else, this shows some different ways to use the Polars API.

Conto answered 4/5, 2022 at 19:52 Comment(4)
Thank you for the multiple alternatives along with benchmarks. The only "concern" is that the implementations using the filter method seem to be the slowest ones. I believe that the OP, like myself, would like to use method chaining most of the time. Using the join method may be the compromise between performance and method chaining.Kamacite
You can use a filter with the is_duplicated as well. (I actually tested that, but thought that the post was getting way too long as it was.) I'll edit my answer to include that option.Conto
Added (at the bottom). Using a filter will run just as fast as using boolean indexing. And to your point, the syntax is preferable.Conto
When you want to check for duplicates over a subset of columns within a chain of expressions, using is_duplicated() is currently not possible. In this case, the filter()+over() option enables this (see the sketch below) and is convenient, clean and concise. I don't know how the groupBy()+join() solution and the filter()+over() solution compare performance-wise in your (the reader's) particular use case. But unless this is the most critical part and bottleneck of your app, and you have evidence of a very significant performance difference, I'd suggest favoring the cleaner solution.Spender
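
For illustration, restricting the duplicate check to a subset of columns with filter()+over() might look like this on the question's frame (a sketch; pl.len() assumes a recent Polars release):

# Keep rows whose (column_0, column_1) pair occurs more than once,
# while still returning every column of the frame.
df.filter(pl.len().over("column_0", "column_1") > 1)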

I think the optimal way really is:

df.filter(df.is_duplicated())

We have to be aware that Polars has an is_duplicated() method in both the expression API and the DataFrame API, but to surface whole duplicated rows we need the columns to be evaluated together, with a single verdict per row on whether that row is duplicated, rather than each column being checked on its own.

df.is_duplicated() returns a boolean Series. It looks like this:

In []: df.is_duplicated()
Out[]: 
shape: (4,)
Series: '' [bool]
[
    false
    true
    false
    true
]


Then, passing that mask to filter, we can do this:

In []: df = pl.DataFrame([['1', '1', '1', '1'], ['7', '7', '2', '7'], ['3', '9', '3', '9']])
    ...: df
Out[]: 
shape: (4, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ str      ┆ str      │
╞══════════╪══════════╪══════════╡
│ 1        ┆ 7        ┆ 3        │
│ 1        ┆ 7        ┆ 9        │
│ 1        ┆ 2        ┆ 3        │
│ 1        ┆ 7        ┆ 9        │
└──────────┴──────────┴──────────┘
In []: df.filter(df.is_duplicated())
Out[]: 
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ str      ┆ str      │
╞══════════╪══════════╪══════════╡
│ 1        ┆ 7        ┆ 9        │
│ 1        ┆ 7        ┆ 9        │
└──────────┴──────────┴──────────┘

And then for subsetting:

In []: df.filter(df.select(["column_0", "column_1"]).is_duplicated())
Out[]: 
shape: (3, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ str      ┆ str      │
╞══════════╪══════════╪══════════╡
│ 1        ┆ 7        ┆ 3        │
│ 1        ┆ 7        ┆ 9        │
│ 1        ┆ 7        ┆ 9        │
└──────────┴──────────┴──────────┘

Any other way seems to me like trying to hack the API. I wish we had a way to "fold" the boolean columns that we get from the expressions df.filter(pl.col("*").is_duplicated()) or df.filter(pl.all().is_duplicated()).

It is funny, because look at this:

In []: df.select(pl.all().is_duplicated())
Out[]: 
shape: (4, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ bool     ┆ bool     ┆ bool     │
╞══════════╪══════════╪══════════╡
│ true     ┆ true     ┆ true     │
│ true     ┆ true     ┆ true     │
│ true     ┆ false    ┆ true     │
│ true     ┆ true     ┆ true     │
└──────────┴──────────┴──────────┘

This is what happens when we use is_duplicated() as an expression.

Capillarity answered 10/2, 2023 at 19:22 Comment(2)
I'm not sure what exactly you mean by this: "I wish we had a way to "fold" the boolean columns we get by the expressions", but I believe there are options to do what you meant: (1): df.select(pl.all(pl.col('*').is_duplicated())) works horizontally, but its usefulness is debatable, since the duplicates are still evaluated vertically per column before the horizontal pl.all(). (2): df.select(pl.struct(pl.col('*')).is_duplicated()) is probably what you want, since the rows are first combined horizontally in the struct before .is_duplicated() is called.Velate
I've never used pl.struct before; I'll try it out, thank you. And I had the same issue you mention when using pl.all().Capillarity

I am not sure this is optimal by any means, but you could concatenate all the columns of each row into a single string and check that for duplicates, namely:

import polars as pl

df = pl.DataFrame(
    [["1", "1", None, "1"], ["7", "7", None, "7"], ["3", None, None, None]],
    schema=["a", "b", "c"],
)

df.filter(
    pl.concat_str(pl.col(["a", "b", "c"]).fill_null("null")).is_duplicated()
)

shape: (2, 3)
┌─────┬─────┬──────┐
│ a   ┆ b   ┆ c    │
│ --- ┆ --- ┆ ---  │
│ str ┆ str ┆ str  │
╞═════╪═════╪══════╡
│ 1   ┆ 7   ┆ null │
│ 1   ┆ 7   ┆ null │
└─────┴─────┴──────┘

Caveat: fill_null("null") is needed because concat_str yields null for any row that contains a null, so without it every row containing a null would be marked as a duplicate of every other such row. It also means, however, that if the literal string "null" appears in your data, it will be treated identically to an actual null value.
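
A related caveat: with no separator, distinct rows such as ("12", "3") and ("1", "23") both concatenate to "123" and would be counted as duplicates. Passing a separator that cannot occur in the data (as the benchmark in the answer above does with "|") avoids this; a sketch:

# Join the columns with a separator so different column splits cannot collide,
# then keep rows whose concatenated key is duplicated.
df.filter(
    pl.concat_str(pl.col(["a", "b", "c"]).fill_null("null"), separator="|")
    .is_duplicated()
)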

Kamacite answered 4/5, 2022 at 12:32 Comment(1)
Thanks for your suggestion. It is much more expensive than other options, but it is good to have it here for completeness' sake. I submitted an edit to account for nulls.Velate

You can use pl.struct():

df.filter(pl.struct(pl.all()).is_duplicated())
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ str      ┆ str      │
╞══════════╪══════════╪══════════╡
│ 1        ┆ 7        ┆ 9        │
│ 1        ┆ 7        ┆ 9        │
└──────────┴──────────┴──────────┘
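
The struct approach also extends to a subset of columns, since the struct can be built from just the columns of interest; a minimal sketch on the question's frame:

# Mark rows whose (column_0, column_1) pair occurs more than once.
df.filter(pl.struct("column_0", "column_1").is_duplicated())
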
Phreno answered 15/9, 2024 at 3:43 Comment(0)