How does Polars auto-cache mechanism work on LazyFrames?
Asked Answered
B

0

7

As stated here, Polars introduced an auto-cache mechanism for LazyFrames that occures multiple times in the logical plan, so the user will not have to actively perform the cache.
However, while trying to examine their new mechanism, I encountred scenarios that the auto-cache is not performed optimally:

Without explicit cache:

import polars as pl

df1 = pl.DataFrame({'id': [0,5,6]}).lazy()
df2 = pl.DataFrame({'id': [0,8,6]}).lazy()
df3 = pl.DataFrame({'id': [7,8,6]}).lazy()

df4 = df1.join(df2, on='id')
print(pl.concat([df4.join(df3, on='id'), df1,
                 df4]).explain())

We get the logical plan:

UNION
  PLAN 0:
    INNER JOIN:
    LEFT PLAN ON: [col("id")]
      INNER JOIN:
      LEFT PLAN ON: [col("id")]
        CACHE[id: a4bcf9591fefc837, count: 3]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
      RIGHT PLAN ON: [col("id")]
        CACHE[id: 8cee8e3a6f454983, count: 1]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
      END INNER JOIN
    RIGHT PLAN ON: [col("id")]
      DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
    END INNER JOIN
  PLAN 1:
    CACHE[id: a4bcf9591fefc837, count: 3]
      DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
  PLAN 2:
    INNER JOIN:
    LEFT PLAN ON: [col("id")]
      CACHE[id: a4bcf9591fefc837, count: 3]
        DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
    RIGHT PLAN ON: [col("id")]
      CACHE[id: 8cee8e3a6f454983, count: 1]
        DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
    END INNER JOIN
END UNION

With explicit cache:

import polars as pl

df1 = pl.DataFrame({'id': [0,5,6]}).lazy()
df2 = pl.DataFrame({'id': [0,8,6]}).lazy()
df3 = pl.DataFrame({'id': [7,8,6]}).lazy()

df4 = df1.join(df2, on='id').cache()
print(pl.concat([df4.join(df3, on='id'), df1,
                 df4]).explain())

We get the logical plan:

UNION
  PLAN 0:
    INNER JOIN:
    LEFT PLAN ON: [col("id")]
      CACHE[id: 290661b0780, count: 18446744073709551615]
        FAST_PROJECT: [id]
          INNER JOIN:
          LEFT PLAN ON: [col("id")]
            DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
          RIGHT PLAN ON: [col("id")]
            DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
          END INNER JOIN
    RIGHT PLAN ON: [col("id")]
      DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
    END INNER JOIN
  PLAN 1:
    DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
  PLAN 2:
    CACHE[id: 290661b0780, count: 18446744073709551615]
      FAST_PROJECT: [id]
        INNER JOIN:
        LEFT PLAN ON: [col("id")]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
        RIGHT PLAN ON: [col("id")]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
        END INNER JOIN
END UNION

You can see, that with the explicit cache, we get more optimal plan because the join of df1 and df2 is performed only once.

Why doesn't Polars auto-cache mechanism detect the repeated usage of join, and apply the cache by itself? What am I missing?

Thanks.

Bedtime answered 28/11, 2023 at 6:34 Comment(6)
I'm guessing that a sub-optimal logical plan would be classified as a "bug" by the Polars team. If you don't get any feedback here, you may need to raise this on Github/Discord.Dimorphism
@Dimorphism thanks, if I will not get an answer here, I will certainly open a ticket on Github.Bedtime
What about if you add a few data columns to each of your dfs? As it is the join you want it to cache is really just a filter. In that vein, can you show an example where the manual cache has a significant time improvement?Wellgrounded
I appreciate that you're going for a MWE but the optimiser doesn't know and it may be doing the seemingly suboptimal thing because it recognizes that's it's computationally trivial.Wellgrounded
@DeanMacGregor Seems fair, I will check it tommorow and will update accordinglyBedtime
Just added another value column, same result.Bedtime

© 2022 - 2025 — McMap. All rights reserved.