Polars join on array items without explode/groupby
Asked Answered
C

1

6

a follow up from Polars lazyframe - add fields from other lazyframe as struct without a `collect`.

I now want to join on array items. Currently the only way i know of doing this would be to first explode the array, perform the join, do a groupby, then collect back as a list. I'm hoping there is a more concise alternative.

companies = pl.DataFrame({
    "id": [1],
    "name": ["google"],
    "industry": [1001]
}).lazy()

industries = pl.DataFrame({
    "id": [1001],
    "name": ["tech"],
    "sectors": [[10011, 10012]]
}).lazy()

sectors = pl.DataFrame({
    "id": [10011, 10012],
    "name": ["software", "hardware"],
}).lazy()

expected = pl.DataFrame({
    "id": [1],
    "name": ["polars"],
    "industry": [{
        "name": "tech",
        "sectors": [[{"name": "software"}, {"name": "hardware"}]]
    }]
})
Cimbura answered 10/10, 2022 at 15:28 Comment(1)
I also have this question.Karrikarrie
D
0

One possible way of doing that would be to use duckdb, via integration with polars:

import duckdb
df = duckdb.sql("""
    select
        c.id,
        c.name,
        {'name': i.name, 'sectors': (select list({'name': name}) from sectors as s where list_contains(i.sectors, s.id))} as industry
    from companies as c
        inner join industries as i on
            i.id = c.industry
""").pl()

┌─────┬────────┬───────────────────────────────────┐
│ id  ┆ name   ┆ industry                          │
│ --- ┆ ---    ┆ ---                               │
│ i64 ┆ str    ┆ struct[2]                         │
╞═════╪════════╪═══════════════════════════════════╡
│ 1   ┆ google ┆ {"tech",[{"software"}, {"hardwar… │
└─────┴────────┴───────────────────────────────────┘

unfortunately, as far as I know, duckdb doesn't fully support lazy dataframes yet. Another way of doing this would be to pre-join industries and sectors via cross join and filtering, using list.contains()

industries2 = industries.join(sectors, how='cross').filter(
    pl.col('sectors').list.contains(pl.col('id_right'))
).group_by(['id','name']).agg(
    sectors=pl.struct(name='name_right')
).select(
    industry_id='id',
    industry=pl.struct(pl.exclude('id'))
)

┌─────────────┬───────────────────────────────────┐
│ industry_id ┆ industry                          │
│ ---         ┆ ---                               │
│ i64         ┆ struct[2]                         │
╞═════════════╪═══════════════════════════════════╡
│ 1001        ┆ {"tech",[{"software"}, {"hardwar… │
└─────────────┴───────────────────────────────────┘

and then join companies with the result:

companies.rename(
    {'industry':'industry_id'}
).join(
    industries2,
    on='industry_id',
    how='inner'
).drop('industry_id')

┌─────┬────────┬───────────────────────────────────┐
│ id  ┆ name   ┆ industry                          │
│ --- ┆ ---    ┆ ---                               │
│ i64 ┆ str    ┆ struct[2]                         │
╞═════╪════════╪═══════════════════════════════════╡
│ 1   ┆ google ┆ {"tech",[{"software"}, {"hardwar… │
└─────┴────────┴───────────────────────────────────┘
Daria answered 28/1, 2024 at 21:1 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.