Sources and sinks
Scan
When using the LazyFrame API, it is important to favor scan_* (scan_parquet, scan_csv, etc.) over read_*. A Polars scan is lazy and delays execution until the query is collected. The benefit of this is that the Polars optimizer can push optimizations into the readers, which can then skip reading columns and rows that aren't required. Another benefit is that, during streaming execution, the engine can already process batches before the file has been completely read.
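As a minimal sketch of what this enables (the file name and column names are assumptions for illustration), the query below only reads the selected columns and can skip data that cannot match the predicate:
import polars as pl

# Lazily scan a Parquet file; nothing is read yet.
lf = pl.scan_parquet("my_dataset/data.parquet")

# The optimizer pushes the column selection and the filter into the reader,
# so only `id` and `value` are read and non-matching rows can be skipped.
result = (
    lf.select("id", "value")
    .filter(pl.col("value") > 100)
    .collect()
)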
Sink
Sinks can execute a query and stream the results to storage (either disk or cloud). The benefit of sinking data to storage is that you don't necessarily have to hold all data in RAM, but can process it in batches.
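A minimal sketch of a plain sink (the file names and column names are assumptions for illustration): the query runs in a streaming fashion and batches are written to disk as they are produced, without materializing the full result in memory.
import polars as pl

# Stream the query result straight to a Parquet file.
lf = pl.scan_csv("my_dataset/*.csv").select("id", "value")
lf.sink_parquet("my_table.parquet")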
If we want to convert many CSV files to Parquet while dropping rows with missing data, we could write a query like the one below. We use a partitioning strategy that defines how many rows a single Parquet file may contain before a new file is started.
import polars as pl

lf = pl.scan_csv("my_dataset/*.csv").filter(pl.all().is_not_null())

lf.sink_parquet(
    pl.PartitionMaxSize(
        "my_table_{part}.parquet",
        max_size=512_000,
    )
)
This will create the following files on disk:
my_table_0.parquet
my_table_1.parquet
...
my_table_n.parquet
Multiplexing sinks
Sinks can also be multiplexed, meaning that we write to different sinks in a single query. In the code snippet below, we take a LazyFrame and sink it into two sinks at the same time.
# Some expensive computation
lf: pl.LazyFrame

q1 = lf.sink_parquet(..., lazy=True)
q2 = lf.sink_ipc(..., lazy=True)

pl.collect_all([q1, q2])
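As a more concrete sketch (the paths and the query itself are assumptions for illustration): with lazy=True the sinks are not executed immediately, and passing them to pl.collect_all runs the shared computation once while streaming into both outputs.
import polars as pl

# Some expensive computation (illustrative)
lf = pl.scan_csv("my_dataset/*.csv").filter(pl.all().is_not_null())

# With lazy=True the sinks become part of the plan instead of executing now.
q1 = lf.sink_parquet("my_table.parquet", lazy=True)
q2 = lf.sink_ipc("my_table.arrow", lazy=True)

# Executes the query once and writes both files.
pl.collect_all([q1, q2])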