One possible way of doing that would be to use duckdb
, via integration with polars
:
import duckdb
df = duckdb.sql("""
select
c.id,
c.name,
{'name': i.name, 'sectors': (select list({'name': name}) from sectors as s where list_contains(i.sectors, s.id))} as industry
from companies as c
inner join industries as i on
i.id = c.industry
""").pl()
┌─────┬────────┬───────────────────────────────────┐
│ id ┆ name ┆ industry │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ struct[2] │
╞═════╪════════╪═══════════════════════════════════╡
│ 1 ┆ google ┆ {"tech",[{"software"}, {"hardwar… │
└─────┴────────┴───────────────────────────────────┘
unfortunately, as far as I know, duckdb doesn't fully support lazy dataframes yet.
Another way of doing this would be to pre-join industries
and sectors
via cross join and filtering, using list.contains()
industries2 = industries.join(sectors, how='cross').filter(
pl.col('sectors').list.contains(pl.col('id_right'))
).group_by(['id','name']).agg(
sectors=pl.struct(name='name_right')
).select(
industry_id='id',
industry=pl.struct(pl.exclude('id'))
)
┌─────────────┬───────────────────────────────────┐
│ industry_id ┆ industry │
│ --- ┆ --- │
│ i64 ┆ struct[2] │
╞═════════════╪═══════════════════════════════════╡
│ 1001 ┆ {"tech",[{"software"}, {"hardwar… │
└─────────────┴───────────────────────────────────┘
and then join companies
with the result:
companies.rename(
{'industry':'industry_id'}
).join(
industries2,
on='industry_id',
how='inner'
).drop('industry_id')
┌─────┬────────┬───────────────────────────────────┐
│ id ┆ name ┆ industry │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ struct[2] │
╞═════╪════════╪═══════════════════════════════════╡
│ 1 ┆ google ┆ {"tech",[{"software"}, {"hardwar… │
└─────┴────────┴───────────────────────────────────┘