
Multiple

Dealing with multiple files.

Polars can handle multiple files in different ways, depending on your needs and memory constraints.

Let's create some files to give us some context:

write_csv

import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "ham", "spam"]})

# Write five identical copies; the docs/assets/data directory must already exist.
for i in range(5):
    df.write_csv(f"docs/assets/data/my_many_files_{i}.csv")

Reading into a single DataFrame

To read multiple files into a single DataFrame, we can use globbing patterns:

read_csv

df = pl.read_csv("docs/assets/data/my_many_files_*.csv")
print(df)

shape: (15, 2)
┌─────┬──────┐
│ foo ┆ bar  │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ null │
│ 2   ┆ ham  │
│ 3   ┆ spam │
│ 1   ┆ null │
│ 2   ┆ ham  │
│ …   ┆ …    │
│ 2   ┆ ham  │
│ 3   ┆ spam │
│ 1   ┆ null │
│ 2   ┆ ham  │
│ 3   ┆ spam │
└─────┴──────┘

To see how this works, we can look at the query plan. The plan shows that all files are read separately and concatenated into a single DataFrame. Polars will try to parallelize the reading.

show_graph

pl.scan_csv("docs/assets/data/my_many_files_*.csv").show_graph()

Reading and processing in parallel

If your files don't have to be in a single table, you can also build a query plan for each file and execute them in parallel on the Polars thread pool.

Because each query plan is independent of the others, their execution is embarrassingly parallel and doesn't require any communication between queries.

scan_csv

import glob

import polars as pl

queries = []
for file in glob.glob("docs/assets/data/my_many_files_*.csv"):
    q = pl.scan_csv(file).group_by("bar").agg(pl.len(), pl.sum("foo"))
    queries.append(q)

dataframes = pl.collect_all(queries)
print(dataframes)

[shape: (3, 3)
┌──────┬─────┬─────┐
│ bar  ┆ len ┆ foo │
│ ---  ┆ --- ┆ --- │
│ str  ┆ u32 ┆ i64 │
╞══════╪═════╪═════╡
│ spam ┆ 1   ┆ 3   │
│ null ┆ 1   ┆ 1   │
│ ham  ┆ 1   ┆ 2   │
└──────┴─────┴─────┘, shape: (3, 3)
┌──────┬─────┬─────┐
│ bar  ┆ len ┆ foo │
│ ---  ┆ --- ┆ --- │
│ str  ┆ u32 ┆ i64 │
╞══════╪═════╪═════╡
│ spam ┆ 1   ┆ 3   │
│ ham  ┆ 1   ┆ 2   │
│ null ┆ 1   ┆ 1   │
└──────┴─────┴─────┘, shape: (3, 3)
┌──────┬─────┬─────┐
│ bar  ┆ len ┆ foo │
│ ---  ┆ --- ┆ --- │
│ str  ┆ u32 ┆ i64 │
╞══════╪═════╪═════╡
│ spam ┆ 1   ┆ 3   │
│ null ┆ 1   ┆ 1   │
│ ham  ┆ 1   ┆ 2   │
└──────┴─────┴─────┘, shape: (3, 3)
┌──────┬─────┬─────┐
│ bar  ┆ len ┆ foo │
│ ---  ┆ --- ┆ --- │
│ str  ┆ u32 ┆ i64 │
╞══════╪═════╪═════╡
│ spam ┆ 1   ┆ 3   │
│ ham  ┆ 1   ┆ 2   │
│ null ┆ 1   ┆ 1   │
└──────┴─────┴─────┘, shape: (3, 3)
┌──────┬─────┬─────┐
│ bar  ┆ len ┆ foo │
│ ---  ┆ --- ┆ --- │
│ str  ┆ u32 ┆ i64 │
╞══════╪═════╪═════╡
│ null ┆ 1   ┆ 1   │
│ ham  ┆ 1   ┆ 2   │
│ spam ┆ 1   ┆ 3   │
└──────┴─────┴─────┘]