Access newly created column in .with_columns() when using polars
Asked Answered
C

3

7

I am new to Polars and I am not sure whether I am using .with_columns() correctly.

Here's a situation I encounter frequently: There's a dataframe and in .with_columns(), I apply some operation to a column. For example, I convert some dates from str to date type and then want to compute the duration between start and end date. I'd implement this as follows.

import polars as pl 

pl.DataFrame(
    {
        "start": ["01.01.2019", "01.01.2020"],
        "end": ["11.01.2019", "01.05.2020"],
    }
).with_columns(
    pl.col("start").str.to_date(),
    pl.col("end").str.to_date(),
).with_columns(
    (pl.col("end") - pl.col("start")).alias("duration"),
)

First, I convert the two columns, next I call .with_columns() again.

Something shorter like this does not work:

pl.DataFrame(
    {
        "start": ["01.01.2019", "01.01.2020"],
        "end": ["11.01.2019", "01.05.2020"],
    }
).with_columns(
    pl.col("start").str.to_date(),
    pl.col("end").str.to_date(),
    (pl.col("end") - pl.col("start")).alias("duration"),
)
# InvalidOperationError: sub operation not supported for dtypes `str` and `str`

Is there a way to avoid calling .with_columns() twice and to write this in a more compact way?

Costplus answered 1/3, 2023 at 9:8 Comment(0)
M
12

The second .with_columns() is needed.

From the GitHub Issues

I don't want this extra complexity in polars. If you want to use an updated column, you need two with_columns. This makes it much more readable, simple, and explainable.

In the given example, passing multiple names to col() could simplify it slightly.

(df.with_columns(pl.col("start", "end").str.to_date())
   .with_columns(duration = pl.col("end") - pl.col("start"))
)
shape: (2, 3)
┌────────────┬────────────┬──────────────┐
│ start      ┆ end        ┆ duration     │
│ ---        ┆ ---        ┆ ---          │
│ date       ┆ date       ┆ duration[ms] │
╞════════════╪════════════╪══════════════╡
│ 2019-01-01 ┆ 2019-01-11 ┆ 10d          │
│ 2020-01-01 ┆ 2020-05-01 ┆ 121d         │
└────────────┴────────────┴──────────────┘
Mosesmosey answered 1/3, 2023 at 9:30 Comment(2)
Thanks! The fmt argument is slightly different for start and end in the actual data I use, but I'll keep the suggestions in mind :)Costplus
@Costplus To elaborate, everything in a context (with_columns in this case) only knows about what's in the dataframe before the context was called. Each expression in a context is unaware of every other expression in the context. This is by design because all the expressions run in parallel. If you need one expression to know the output of another expression, you need another context.Virga
V
5

There is actually a trick to make it seem like you're referencing a new column in an existing context.

You can use the walrus operator.

pl.DataFrame(
    {
        "start": ["01.01.2019", "01.01.2020"],
        "end": ["11.01.2019", "01.05.2020"],
    }
).with_columns(
    start := pl.col("start").str.to_date(),
    end := pl.col("end").str.to_date(),
    (end - start).alias("duration"),
)

Note that I said this makes it seem like you're referencing the new column. What you're really doing is defining the expression for start to a global variable, the same for end. In the last expression when you're creating duration, it will compute each of start and end an extra time, (unless you're in lazy mode).

If you're in lazy mode then it will create what's called a CSER which is a cached version of each of start and end so it doesn't do redundant computations.

You can see that like this:

print(pl.DataFrame(
    {
        "start": ["01.01.2019", "01.01.2020"],
        "end": ["11.01.2019", "01.05.2020"],
    }
).lazy().with_columns(
    start:=pl.col("start").str.to_date(),
    end:=pl.col("end").str.to_date(),
    (end - start).alias("duration"),
).explain())


WITH_COLUMNS:
 [col("__POLARS_CSER_14693840185675249975").alias("start"), col("__POLARS_CSER_9386772493530534005").alias("end"), [(col("__POLARS_CSER_9386772493530534005")) - (col("__POLARS_CSER_14693840185675249975"))].alias("duration"), col("start").str.strptime([Utf8(raise)]).alias("__POLARS_CSER_14693840185675249975"), col("end").str.strptime([Utf8(raise)]).alias("__POLARS_CSER_9386772493530534005")]
  DF ["start", "end"]; PROJECT */2 COLUMNS; SELECTION: "None"
Virga answered 5/12, 2023 at 22:7 Comment(1)
This solution is very helpful if you need to access columns in the agg function, where you can't simply run agg twice.Ce
S
2

Another solution similar to the walrus operator is to store your transformations in a variable. Storing expression does not do any computation until they are evaluated in a context. I personally find this approach expresses intent well and is fairly compact.

df = pl.DataFrame({
    "start": ["01.01.2019", "01.01.2020"],
    "end": ["11.01.2019", "01.05.2020"],
})

start = pl.col("start").str.to_date()
end = pl.col("end").str.to_date()

df.with_columns(start, end, duration=end - start)
Seritaserjeant answered 19/7 at 12:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.