Issue while using py-polars sink_parquet method on a LazyFrame

I am getting the error below when using sink_parquet on a LazyFrame. Earlier I was using .collect() on the output of scan_parquet() to convert the result into a DataFrame, but unfortunately that does not work with larger-than-RAM datasets. Here is the error I received:

PanicException: sink_parquet not yet supported in standard engine. Use 'collect().write_parquet()'

I am trying to write the LazyFrame (the output from scan_parquet) to a local file after adding some filter and join conditions to it. The error seems to come from these locations:

https://github.com/pola-rs/polars/blob/master/py-polars/polars/internals/lazyframe/frame.py#L1235 (In Python)

https://github.com/pola-rs/polars/blob/master/polars/polars-lazy/src/physical_plan/planner/lp.rs#L154 (In Rust)

I have tried updating from 0.15.16 to the latest version (0.16.1), but the issue still exists.

Sample code:

(
    pl.scan_parquet("path/to/file1.parquet")
    .select([
        pl.col("col2"),
        pl.col("col2").apply(lambda x: ...).alias("splited_levels"),
        # ...followed by more columns and .alias()
    ])
    .join(<another lazyframe>, on="some key", how="inner")
    .filter(...)
    .filter(...)
    # ...followed by some more filters
    .sink_parquet("path/to/result2.parquet")
)

The Parquet file should be written to the local filesystem. Instead, I get the error below:

PanicException: sink_parquet not yet supported in standard engine. Use 'collect().write_parquet()'

Here is the output of polars.show_versions():

---Version info---
Polars: 0.15.16
Index type: UInt32
Platform: Linux-4.15.0-191-generic-x86_64-with-glibc2.28
Python: 3.9.16
[GCC 8.3.0]
---Optional dependencies---
pyarrow: 11.0.0
pandas: not installed
numpy: 1.24.1
fsspec: 2023.1.0
connectorx: not installed
xlsx2csv: not installed
deltalake: not installed
matplotlib: not installed

Update: I have raised a GitHub issue for this (https://github.com/pola-rs/polars/issues/6603), and it seems that not all types of queries are supported for streaming at the moment. So I am looking for a workaround or an alternative way of doing this with Polars.
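
As a sanity check before calling sink_parquet, explain(streaming=True) shows which parts of a plan the streaming engine can actually run. The sketch below uses the placeholder names from above, and str.split is just a stand-in for whatever my .apply lambda does; Python UDFs like .apply cannot run in the streaming engine, so replacing them with native expressions where possible helps:

import polars as pl

lf = (
    pl.scan_parquet("path/to/file1.parquet")
    .select([
        pl.col("col2"),
        # native expression instead of .apply(lambda x: ...);
        # Python UDFs cannot run in the streaming engine
        pl.col("col2").str.split("/").alias("splited_levels"),
    ])
)

# anything outside the STREAMING section of this plan falls back
# to the default in-memory engine
print(lf.explain(streaming=True))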

Calling answered 31/1, 2023 at 17:7 Comment(3)
The latest version isn't 0.15.16. It's (at least) 0.16.1. – Sedgewick
@DeanMacGregor Yes, I tried with that one as well. I raised a GitHub issue here: github.com/pola-rs/polars/issues/6603. According to them, this type of query is not yet supported for streaming. – Calling
I only run into the problem when I read from a Hadoop filesystem; if I do the reads/writes on a local disk, it seems to work. – Buttery

I don't know what's going on under the hood, but one thing I've found to work for me has been using .collect(streaming=True), and also setting pl.Config.set_streaming_chunk_size() if it still blows up memory.

We can see that pl.concat_list() isn't currently supported by streaming, which means we won't be able to .sink_parquet():

(pl.LazyFrame({'a': 'word', 'b': 'word2'})
 .with_columns(joined = pl.concat_list(pl.col('a'),
                                       pl.col('b'))
              )
 .explain(streaming=True))

This prints the plan below; note that the WITH_COLUMNS step holding concat_list sits outside the STREAMING section, so it will run in the default in-memory engine:

WITH_COLUMNS:
 [col("a").list.concat([col("b")]).alias("joined")]
  --- STREAMING
DF ["a", "b"]; PROJECT */2 COLUMNS; SELECTION: "None"  --- END STREAMING

    DF []; PROJECT */0 COLUMNS; SELECTION: "None"
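
For contrast (a small sketch of my own, not part of the original query): a plan built only from operations the streaming engine supports, such as a plain filter, shows the whole pipeline inside the STREAMING section:

(pl.LazyFrame({'a': ['word'], 'b': ['word2']})
 .filter(pl.col('a') == 'word')   # simple filters are streamable
 .explain(streaming=True))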

So instead, I'll use the following:

pl.Config.set_streaming_chunk_size(1000)

(pl.LazyFrame({'a':'word', 'b': 'word2'})
 .with_columns(joined = pl.concat_list(pl.col('a'), 
                                       pl.col('b'))
              )
 .collect(streaming=True)
 .write_parquet('test.parquet')
)

In practice, this seems to stream as much as possible and resort to collecting in memory once it gets to the operations that require it. (I can watch the memory spike as it reaches an expensive row/set, then settle down until the next large portion.)

On particularly large data, such as a list or string almost too big to fit into memory, setting a small chunk size sometimes allows me to write to file.
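
If even the chunked streaming collect runs out of memory, a cruder fallback (my own sketch; the chunk size and paths are placeholders) is to collect the query in row slices and append each slice to a single Parquet file with pyarrow:

import polars as pl
import pyarrow.parquet as pq

CHUNK = 1_000_000  # placeholder; tune to available memory
lf = pl.scan_parquet("path/to/file1.parquet")  # plus your filters/joins

# note: counting rows runs the query once before the chunked pass
n_rows = lf.select(pl.count()).collect().item()

writer = None
for offset in range(0, n_rows, CHUNK):
    tbl = lf.slice(offset, CHUNK).collect().to_arrow()
    if writer is None:
        # open the output once, with the schema of the first chunk
        writer = pq.ParquetWriter("path/to/result2.parquet", tbl.schema)
    writer.write_table(tbl)
if writer is not None:
    writer.close()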

Hornmad answered 2/3 at 11:40 Comment(0)
