As stated here, Polars introduced an auto-cache mechanism for LazyFrames that occures multiple times in the logical plan, so the user will not have to actively perform the cache.
However, while trying to examine their new mechanism, I encountred scenarios that the auto-cache is not performed optimally:
Without explicit cache:
import polars as pl
df1 = pl.DataFrame({'id': [0,5,6]}).lazy()
df2 = pl.DataFrame({'id': [0,8,6]}).lazy()
df3 = pl.DataFrame({'id': [7,8,6]}).lazy()
df4 = df1.join(df2, on='id')
print(pl.concat([df4.join(df3, on='id'), df1,
df4]).explain())
We get the logical plan:
UNION
PLAN 0:
INNER JOIN:
LEFT PLAN ON: [col("id")]
INNER JOIN:
LEFT PLAN ON: [col("id")]
CACHE[id: a4bcf9591fefc837, count: 3]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
RIGHT PLAN ON: [col("id")]
CACHE[id: 8cee8e3a6f454983, count: 1]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
END INNER JOIN
RIGHT PLAN ON: [col("id")]
DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
END INNER JOIN
PLAN 1:
CACHE[id: a4bcf9591fefc837, count: 3]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
PLAN 2:
INNER JOIN:
LEFT PLAN ON: [col("id")]
CACHE[id: a4bcf9591fefc837, count: 3]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
RIGHT PLAN ON: [col("id")]
CACHE[id: 8cee8e3a6f454983, count: 1]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
END INNER JOIN
END UNION
With explicit cache:
import polars as pl
df1 = pl.DataFrame({'id': [0,5,6]}).lazy()
df2 = pl.DataFrame({'id': [0,8,6]}).lazy()
df3 = pl.DataFrame({'id': [7,8,6]}).lazy()
df4 = df1.join(df2, on='id').cache()
print(pl.concat([df4.join(df3, on='id'), df1,
df4]).explain())
We get the logical plan:
UNION
PLAN 0:
INNER JOIN:
LEFT PLAN ON: [col("id")]
CACHE[id: 290661b0780, count: 18446744073709551615]
FAST_PROJECT: [id]
INNER JOIN:
LEFT PLAN ON: [col("id")]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
RIGHT PLAN ON: [col("id")]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
END INNER JOIN
RIGHT PLAN ON: [col("id")]
DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
END INNER JOIN
PLAN 1:
DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
PLAN 2:
CACHE[id: 290661b0780, count: 18446744073709551615]
FAST_PROJECT: [id]
INNER JOIN:
LEFT PLAN ON: [col("id")]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
RIGHT PLAN ON: [col("id")]
DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
END INNER JOIN
END UNION
You can see, that with the explicit cache, we get more optimal plan because the join of df1
and df2
is performed only once.
Why doesn't Polars auto-cache mechanism detect the repeated usage of join, and apply the cache by itself? What am I missing?
Thanks.