Skip to content

Expression expansion

As you've seen in the section about expressions and contexts, expression expansion is a feature that enables you to write a single expression that can expand to multiple different expressions, possibly depending on the schema of the context in which the expression is used.

This feature isn't just decorative or syntactic sugar. It allows for a very powerful application of DRY principles in your code: a single expression that specifies multiple columns expands into a list of expressions, which means you can write one single expression and reuse the computation that it represents.

In this section we will show several forms of expression expansion and we will be using the dataframe that you can see below for that effect:

import polars as pl

df = pl.DataFrame(
    {  # As of 14th October 2024, ~3pm UTC
        "ticker": ["AAPL", "NVDA", "MSFT", "GOOG", "AMZN"],
        "company_name": ["Apple", "NVIDIA", "Microsoft", "Alphabet (Google)", "Amazon"],
        "price": [229.9, 138.93, 420.56, 166.41, 188.4],
        "day_high": [231.31, 139.6, 424.04, 167.62, 189.83],
        "day_low": [228.6, 136.3, 417.52, 164.78, 188.44],
        "year_high": [237.23, 140.76, 468.35, 193.31, 201.2],
        "year_low": [164.08, 39.23, 324.39, 121.46, 118.35],
    }
)

print(df)
use polars::prelude::*;

// Data as of 14th October 2024, ~3pm UTC
let df = df!(
    "ticker" => ["AAPL", "NVDA", "MSFT", "GOOG", "AMZN"],
    "company_name" => ["Apple", "NVIDIA", "Microsoft", "Alphabet (Google)", "Amazon"],
    "price" => [229.9, 138.93, 420.56, 166.41, 188.4],
    "day_high" => [231.31, 139.6, 424.04, 167.62, 189.83],
    "day_low" => [228.6, 136.3, 417.52, 164.78, 188.44],
    "year_high" => [237.23, 140.76, 468.35, 193.31, 201.2],
    "year_low" => [164.08, 39.23, 324.39, 121.46, 118.35],
)?;

println!("{}", df);
shape: (5, 7)
┌────────┬───────────────────┬────────┬──────────┬─────────┬───────────┬──────────┐
│ ticker ┆ company_name      ┆ price  ┆ day_high ┆ day_low ┆ year_high ┆ year_low │
│ ---    ┆ ---               ┆ ---    ┆ ---      ┆ ---     ┆ ---       ┆ ---      │
│ str    ┆ str               ┆ f64    ┆ f64      ┆ f64     ┆ f64       ┆ f64      │
╞════════╪═══════════════════╪════════╪══════════╪═════════╪═══════════╪══════════╡
│ AAPL   ┆ Apple             ┆ 229.9  ┆ 231.31   ┆ 228.6   ┆ 237.23    ┆ 164.08   │
│ NVDA   ┆ NVIDIA            ┆ 138.93 ┆ 139.6    ┆ 136.3   ┆ 140.76    ┆ 39.23    │
│ MSFT   ┆ Microsoft         ┆ 420.56 ┆ 424.04   ┆ 417.52  ┆ 468.35    ┆ 324.39   │
│ GOOG   ┆ Alphabet (Google) ┆ 166.41 ┆ 167.62   ┆ 164.78  ┆ 193.31    ┆ 121.46   │
│ AMZN   ┆ Amazon            ┆ 188.4  ┆ 189.83   ┆ 188.44  ┆ 201.2     ┆ 118.35   │
└────────┴───────────────────┴────────┴──────────┴─────────┴───────────┴──────────┘

Function col

The function col is the most common way of making use of expression expansion features in Polars. Typically used to refer to one column of a dataframe, in this section we explore other ways in which you can use col (or its variants, when in Rust).

Explicit expansion by column name

The simplest form of expression expansion happens when you provide multiple column names to the function col.

The example below uses a single function col with multiple column names to convert the values in USD to EUR:

col

eur_usd_rate = 1.09  # As of 14th October 2024

result = df.with_columns(
    (
        pl.col(
            "price",
            "day_high",
            "day_low",
            "year_high",
            "year_low",
        )
        / eur_usd_rate
    ).round(2)
)
print(result)

col

let eur_usd_rate = 1.09; // As of 14th October 2024

let result = df
    .clone()
    .lazy()
    .with_column(
        (cols(["price", "day_high", "day_low", "year_high", "year_low"]) / lit(eur_usd_rate))
            .round(2),
    )
    .collect()?;
println!("{}", result);

shape: (5, 7)
┌────────┬───────────────────┬────────┬──────────┬─────────┬───────────┬──────────┐
│ ticker ┆ company_name      ┆ price  ┆ day_high ┆ day_low ┆ year_high ┆ year_low │
│ ---    ┆ ---               ┆ ---    ┆ ---      ┆ ---     ┆ ---       ┆ ---      │
│ str    ┆ str               ┆ f64    ┆ f64      ┆ f64     ┆ f64       ┆ f64      │
╞════════╪═══════════════════╪════════╪══════════╪═════════╪═══════════╪══════════╡
│ AAPL   ┆ Apple             ┆ 210.92 ┆ 212.21   ┆ 209.72  ┆ 217.64    ┆ 150.53   │
│ NVDA   ┆ NVIDIA            ┆ 127.46 ┆ 128.07   ┆ 125.05  ┆ 129.14    ┆ 35.99    │
│ MSFT   ┆ Microsoft         ┆ 385.83 ┆ 389.03   ┆ 383.05  ┆ 429.68    ┆ 297.61   │
│ GOOG   ┆ Alphabet (Google) ┆ 152.67 ┆ 153.78   ┆ 151.17  ┆ 177.35    ┆ 111.43   │
│ AMZN   ┆ Amazon            ┆ 172.84 ┆ 174.16   ┆ 172.88  ┆ 184.59    ┆ 108.58   │
└────────┴───────────────────┴────────┴──────────┴─────────┴───────────┴──────────┘

When you list the column names you want the expression to expand to, you can predict what the expression will expand to. In this case, the expression that does the currency conversion is expanded to a list of five expressions:

col

exprs = [
    (pl.col("price") / eur_usd_rate).round(2),
    (pl.col("day_high") / eur_usd_rate).round(2),
    (pl.col("day_low") / eur_usd_rate).round(2),
    (pl.col("year_high") / eur_usd_rate).round(2),
    (pl.col("year_low") / eur_usd_rate).round(2),
]

result2 = df.with_columns(exprs)
print(result.equals(result2))

col

let exprs = [
    (col("price") / lit(eur_usd_rate)).round(2),
    (col("day_high") / lit(eur_usd_rate)).round(2),
    (col("day_low") / lit(eur_usd_rate)).round(2),
    (col("year_high") / lit(eur_usd_rate)).round(2),
    (col("year_low") / lit(eur_usd_rate)).round(2),
];

let result2 = df.clone().lazy().with_columns(exprs).collect()?;
println!("{}", result.equals(&result2));

True

Expansion by data type

We had to type five column names in the previous example but the function col can also conveniently accept one or more data types. If you provide data types instead of column names, the expression is expanded to all columns that match one of the data types provided.

The example below performs the exact same computation as before:

col

result = df.with_columns((pl.col(pl.Float64) / eur_usd_rate).round(2))
print(result)

dtype_col

let result = df
    .clone()
    .lazy()
    .with_column((dtype_col(&DataType::Float64) / lit(eur_usd_rate)).round(2))
    .collect()?;
println!("{}", result);

shape: (5, 7)
┌────────┬───────────────────┬────────┬──────────┬─────────┬───────────┬──────────┐
│ ticker ┆ company_name      ┆ price  ┆ day_high ┆ day_low ┆ year_high ┆ year_low │
│ ---    ┆ ---               ┆ ---    ┆ ---      ┆ ---     ┆ ---       ┆ ---      │
│ str    ┆ str               ┆ f64    ┆ f64      ┆ f64     ┆ f64       ┆ f64      │
╞════════╪═══════════════════╪════════╪══════════╪═════════╪═══════════╪══════════╡
│ AAPL   ┆ Apple             ┆ 210.92 ┆ 212.21   ┆ 209.72  ┆ 217.64    ┆ 150.53   │
│ NVDA   ┆ NVIDIA            ┆ 127.46 ┆ 128.07   ┆ 125.05  ┆ 129.14    ┆ 35.99    │
│ MSFT   ┆ Microsoft         ┆ 385.83 ┆ 389.03   ┆ 383.05  ┆ 429.68    ┆ 297.61   │
│ GOOG   ┆ Alphabet (Google) ┆ 152.67 ┆ 153.78   ┆ 151.17  ┆ 177.35    ┆ 111.43   │
│ AMZN   ┆ Amazon            ┆ 172.84 ┆ 174.16   ┆ 172.88  ┆ 184.59    ┆ 108.58   │
└────────┴───────────────────┴────────┴──────────┴─────────┴───────────┴──────────┘

When we use a data type with expression expansion we cannot know, beforehand, how many columns a single expression will expand to. We need the schema of the input dataframe if we want to determine what is the final list of expressions that is to be applied.

If we weren't sure about whether the price columns where of the type Float64 or Float32, we could specify both data types:

col

result2 = df.with_columns(
    (
        pl.col(
            pl.Float32,
            pl.Float64,
        )
        / eur_usd_rate
    ).round(2)
)
print(result.equals(result2))

dtype_cols

let result2 = df
    .clone()
    .lazy()
    .with_column(
        (dtype_cols([DataType::Float32, DataType::Float64]) / lit(eur_usd_rate)).round(2),
    )
    .collect()?;
println!("{}", result.equals(&result2));

True

Expansion by pattern matching

You can also use regular expressions to specify patterns that are used to match the column names. To distinguish between a regular column name and expansion by pattern matching, regular expressions start and end with ^ and $, respectively. This also means that the pattern must match against the whole column name string.

Regular expressions can be mixed with regular column names:

col

result = df.select(pl.col("ticker", "^.*_high$", "^.*_low$"))
print(result)

col

// NOTE: Using regex inside `col`/`cols` requires the feature flag `regex`.
let result = df
    .clone()
    .lazy()
    .select([cols(["ticker", "^.*_high$", "^.*_low$"])])
    .collect()?;
println!("{}", result);

shape: (5, 5)
┌────────┬──────────┬───────────┬─────────┬──────────┐
│ ticker ┆ day_high ┆ year_high ┆ day_low ┆ year_low │
│ ---    ┆ ---      ┆ ---       ┆ ---     ┆ ---      │
│ str    ┆ f64      ┆ f64       ┆ f64     ┆ f64      │
╞════════╪══════════╪═══════════╪═════════╪══════════╡
│ AAPL   ┆ 231.31   ┆ 237.23    ┆ 228.6   ┆ 164.08   │
│ NVDA   ┆ 139.6    ┆ 140.76    ┆ 136.3   ┆ 39.23    │
│ MSFT   ┆ 424.04   ┆ 468.35    ┆ 417.52  ┆ 324.39   │
│ GOOG   ┆ 167.62   ┆ 193.31    ┆ 164.78  ┆ 121.46   │
│ AMZN   ┆ 189.83   ┆ 201.2     ┆ 188.44  ┆ 118.35   │
└────────┴──────────┴───────────┴─────────┴──────────┘

Arguments cannot be of mixed types

In Python, the function col accepts an arbitrary number of strings (as column names or as regular expressions) or an arbitrary number of data types, but you cannot mix both in the same function call:

try:
    df.select(pl.col("ticker", pl.Float64))
except TypeError as err:
    print("TypeError:", err)
TypeError: argument 'names': 'DataTypeClass' object cannot be converted to 'PyString'

Selecting all columns

Polars provides the function all as shorthand notation to refer to all columns of a dataframe:

all

result = df.select(pl.all())
print(result.equals(df))

all

let result = df.clone().lazy().select([all()]).collect()?;
println!("{}", result.equals(&df));

True

Note

The function all is syntactic sugar for col("*"), but since the argument "*" is a special case and all reads more like English, the usage of all is preferred.

Excluding columns

Polars also provides a mechanism to exclude certain columns from expression expansion. For that, you use the function exclude, which accepts exactly the same types of arguments as col:

exclude

result = df.select(pl.all().exclude("^day_.*$"))
print(result)

exclude

let result = df
    .clone()
    .lazy()
    .select([all().exclude(["^day_.*$"])])
    .collect()?;
println!("{}", result);

shape: (5, 5)
┌────────┬───────────────────┬────────┬───────────┬──────────┐
│ ticker ┆ company_name      ┆ price  ┆ year_high ┆ year_low │
│ ---    ┆ ---               ┆ ---    ┆ ---       ┆ ---      │
│ str    ┆ str               ┆ f64    ┆ f64       ┆ f64      │
╞════════╪═══════════════════╪════════╪═══════════╪══════════╡
│ AAPL   ┆ Apple             ┆ 229.9  ┆ 237.23    ┆ 164.08   │
│ NVDA   ┆ NVIDIA            ┆ 138.93 ┆ 140.76    ┆ 39.23    │
│ MSFT   ┆ Microsoft         ┆ 420.56 ┆ 468.35    ┆ 324.39   │
│ GOOG   ┆ Alphabet (Google) ┆ 166.41 ┆ 193.31    ┆ 121.46   │
│ AMZN   ┆ Amazon            ┆ 188.4  ┆ 201.2     ┆ 118.35   │
└────────┴───────────────────┴────────┴───────────┴──────────┘

Naturally, the function exclude can also be used after the function col:

exclude

result = df.select(pl.col(pl.Float64).exclude("^day_.*$"))
print(result)

exclude

let result = df
    .clone()
    .lazy()
    .select([dtype_col(&DataType::Float64).exclude(["^day_.*$"])])
    .collect()?;
println!("{}", result);

shape: (5, 3)
┌────────┬───────────┬──────────┐
│ price  ┆ year_high ┆ year_low │
│ ---    ┆ ---       ┆ ---      │
│ f64    ┆ f64       ┆ f64      │
╞════════╪═══════════╪══════════╡
│ 229.9  ┆ 237.23    ┆ 164.08   │
│ 138.93 ┆ 140.76    ┆ 39.23    │
│ 420.56 ┆ 468.35    ┆ 324.39   │
│ 166.41 ┆ 193.31    ┆ 121.46   │
│ 188.4  ┆ 201.2     ┆ 118.35   │
└────────┴───────────┴──────────┘

Column renaming

By default, when you apply an expression to a column, the result keeps the same name as the original column.

Preserving the column name can be semantically wrong and in certain cases Polars may even raise an error if duplicate names occur:

from polars.exceptions import DuplicateError

gbp_usd_rate = 1.31  # As of 14th October 2024

try:
    df.select(
        pl.col("price") / gbp_usd_rate,  # This would be named "price"...
        pl.col("price") / eur_usd_rate,  # And so would this.
    )
except DuplicateError as err:
    print("DuplicateError:", err)
let gbp_usd_rate = 1.31; // As of 14th October 2024

let result = df
    .clone()
    .lazy()
    .select([
        col("price") / lit(gbp_usd_rate),
        col("price") / lit(eur_usd_rate),
    ])
    .collect();
match result {
    Ok(df) => println!("{}", df),
    Err(e) => println!("{}", e),
};
DuplicateError: the name 'price' is duplicate

It's possible that multiple expressions are returning the same default column name. If this is the case, try renaming the columns with `.alias("new_name")` to avoid duplicate column names.

To prevent errors like this, and to allow users to rename their columns when appropriate, Polars provides a series of functions that let you change the name of a column or a group of columns.

Renaming a single column with alias

The function alias has been used thoroughly in the documentation already and it lets you rename a single column:

alias

result = df.select(
    (pl.col("price") / gbp_usd_rate).alias("price (GBP)"),
    (pl.col("price") / eur_usd_rate).alias("price (EUR)"),
)

alias

let _result = df
    .clone()
    .lazy()
    .select([
        (col("price") / lit(gbp_usd_rate)).alias("price (GBP)"),
        (col("price") / lit(eur_usd_rate)).alias("price (EUR)"),
    ])
    .collect()?;

Prefixing and suffixing column names

When using expression expansion you cannot use the function alias because the function alias is designed specifically to rename a single column.

When it suffices to add a static prefix or a static suffix to the existing names, we can use the functions prefix and suffix from the namespace name:

name namespace · prefix · suffix

result = df.select(
    (pl.col("^year_.*$") / eur_usd_rate).name.prefix("in_eur_"),
    (pl.col("day_high", "day_low") / gbp_usd_rate).name.suffix("_gbp"),
)
print(result)

name namespace · prefix · suffix · Available on feature lazy

let result = df
    .clone()
    .lazy()
    .select([
        (col("^year_.*$") / lit(eur_usd_rate))
            .name()
            .prefix("in_eur_"),
        (cols(["day_high", "day_low"]) / lit(gbp_usd_rate))
            .name()
            .suffix("_gbp"),
    ])
    .collect()?;
println!("{}", result);

shape: (5, 4)
┌──────────────────┬─────────────────┬──────────────┬─────────────┐
│ in_eur_year_high ┆ in_eur_year_low ┆ day_high_gbp ┆ day_low_gbp │
│ ---              ┆ ---             ┆ ---          ┆ ---         │
│ f64              ┆ f64             ┆ f64          ┆ f64         │
╞══════════════════╪═════════════════╪══════════════╪═════════════╡
│ 217.642202       ┆ 150.53211       ┆ 176.572519   ┆ 174.503817  │
│ 129.137615       ┆ 35.990826       ┆ 106.564885   ┆ 104.045802  │
│ 429.678899       ┆ 297.605505      ┆ 323.694656   ┆ 318.717557  │
│ 177.348624       ┆ 111.431193      ┆ 127.954198   ┆ 125.78626   │
│ 184.587156       ┆ 108.577982      ┆ 144.908397   ┆ 143.847328  │
└──────────────────┴─────────────────┴──────────────┴─────────────┘

Dynamic name replacement

If a static prefix/suffix is not enough, the namespace name also provides the function map that accepts a callable that accepts the old column names and produces the new ones:

name namespace · map

# There is also `.name.to_uppercase`, so this usage of `.map` is moot.
result = df.select(pl.all().name.map(str.upper))
print(result)

name namespace · map · Available on feature lazy

// There is also `name().to_uppercase()`, so this usage of `map` is moot.
let result = df
    .clone()
    .lazy()
    .select([all()
        .name()
        .map(|name| Ok(PlSmallStr::from_string(name.to_ascii_uppercase())))])
    .collect()?;
println!("{}", result);

shape: (5, 7)
┌────────┬───────────────────┬────────┬──────────┬─────────┬───────────┬──────────┐
│ TICKER ┆ COMPANY_NAME      ┆ PRICE  ┆ DAY_HIGH ┆ DAY_LOW ┆ YEAR_HIGH ┆ YEAR_LOW │
│ ---    ┆ ---               ┆ ---    ┆ ---      ┆ ---     ┆ ---       ┆ ---      │
│ str    ┆ str               ┆ f64    ┆ f64      ┆ f64     ┆ f64       ┆ f64      │
╞════════╪═══════════════════╪════════╪══════════╪═════════╪═══════════╪══════════╡
│ AAPL   ┆ Apple             ┆ 229.9  ┆ 231.31   ┆ 228.6   ┆ 237.23    ┆ 164.08   │
│ NVDA   ┆ NVIDIA            ┆ 138.93 ┆ 139.6    ┆ 136.3   ┆ 140.76    ┆ 39.23    │
│ MSFT   ┆ Microsoft         ┆ 420.56 ┆ 424.04   ┆ 417.52  ┆ 468.35    ┆ 324.39   │
│ GOOG   ┆ Alphabet (Google) ┆ 166.41 ┆ 167.62   ┆ 164.78  ┆ 193.31    ┆ 121.46   │
│ AMZN   ┆ Amazon            ┆ 188.4  ┆ 189.83   ┆ 188.44  ┆ 201.2     ┆ 118.35   │
└────────┴───────────────────┴────────┴──────────┴─────────┴───────────┴──────────┘

See the API reference for the full contents of the namespace name.

Programmatically generating expressions

Expression expansion is a very useful feature but it does not solve all of your problems. For example, if we want to compute the day and year amplitude of the prices of the stocks in our dataframe, expression expansion won't help us.

At first, you may think about using a for loop:

result = df
for tp in ["day", "year"]:
    result = result.with_columns(
        (pl.col(f"{tp}_high") - pl.col(f"{tp}_low")).alias(f"{tp}_amplitude")
    )
print(result)
let mut result = df.clone().lazy();
for tp in ["day", "year"] {
    let high = format!("{}_high", tp);
    let low = format!("{}_low", tp);
    let aliased = format!("{}_amplitude", tp);
    result = result.with_column((col(high) - col(low)).alias(aliased))
}
let result = result.collect()?;
println!("{}", result);
shape: (5, 9)
┌────────┬──────────────┬────────┬──────────┬───┬───────────┬──────────┬─────────────┬─────────────┐
│ ticker ┆ company_name ┆ price  ┆ day_high ┆ … ┆ year_high ┆ year_low ┆ day_amplitu ┆ year_amplit │
│ ---    ┆ ---          ┆ ---    ┆ ---      ┆   ┆ ---       ┆ ---      ┆ de          ┆ ude         │
│ str    ┆ str          ┆ f64    ┆ f64      ┆   ┆ f64       ┆ f64      ┆ ---         ┆ ---         │
│        ┆              ┆        ┆          ┆   ┆           ┆          ┆ f64         ┆ f64         │
╞════════╪══════════════╪════════╪══════════╪═══╪═══════════╪══════════╪═════════════╪═════════════╡
│ AAPL   ┆ Apple        ┆ 229.9  ┆ 231.31   ┆ … ┆ 237.23    ┆ 164.08   ┆ 2.71        ┆ 73.15       │
│ NVDA   ┆ NVIDIA       ┆ 138.93 ┆ 139.6    ┆ … ┆ 140.76    ┆ 39.23    ┆ 3.3         ┆ 101.53      │
│ MSFT   ┆ Microsoft    ┆ 420.56 ┆ 424.04   ┆ … ┆ 468.35    ┆ 324.39   ┆ 6.52        ┆ 143.96      │
│ GOOG   ┆ Alphabet     ┆ 166.41 ┆ 167.62   ┆ … ┆ 193.31    ┆ 121.46   ┆ 2.84        ┆ 71.85       │
│        ┆ (Google)     ┆        ┆          ┆   ┆           ┆          ┆             ┆             │
│ AMZN   ┆ Amazon       ┆ 188.4  ┆ 189.83   ┆ … ┆ 201.2     ┆ 118.35   ┆ 1.39        ┆ 82.85       │
└────────┴──────────────┴────────┴──────────┴───┴───────────┴──────────┴─────────────┴─────────────┘

Do not do this. Instead, generate all of the expressions you want to compute programmatically and use them only once in a context. Loosely speaking, you want to swap the for loop with the context with_columns. In practice, you could do something like the following:

def amplitude_expressions(time_periods):
    for tp in time_periods:
        yield (pl.col(f"{tp}_high") - pl.col(f"{tp}_low")).alias(f"{tp}_amplitude")


result = df.with_columns(amplitude_expressions(["day", "year"]))
print(result)
let mut exprs: Vec<Expr> = vec![];
for tp in ["day", "year"] {
    let high = format!("{}_high", tp);
    let low = format!("{}_low", tp);
    let aliased = format!("{}_amplitude", tp);
    exprs.push((col(high) - col(low)).alias(aliased))
}
let result = df.clone().lazy().with_columns(exprs).collect()?;
println!("{}", result);
shape: (5, 9)
┌────────┬──────────────┬────────┬──────────┬───┬───────────┬──────────┬─────────────┬─────────────┐
│ ticker ┆ company_name ┆ price  ┆ day_high ┆ … ┆ year_high ┆ year_low ┆ day_amplitu ┆ year_amplit │
│ ---    ┆ ---          ┆ ---    ┆ ---      ┆   ┆ ---       ┆ ---      ┆ de          ┆ ude         │
│ str    ┆ str          ┆ f64    ┆ f64      ┆   ┆ f64       ┆ f64      ┆ ---         ┆ ---         │
│        ┆              ┆        ┆          ┆   ┆           ┆          ┆ f64         ┆ f64         │
╞════════╪══════════════╪════════╪══════════╪═══╪═══════════╪══════════╪═════════════╪═════════════╡
│ AAPL   ┆ Apple        ┆ 229.9  ┆ 231.31   ┆ … ┆ 237.23    ┆ 164.08   ┆ 2.71        ┆ 73.15       │
│ NVDA   ┆ NVIDIA       ┆ 138.93 ┆ 139.6    ┆ … ┆ 140.76    ┆ 39.23    ┆ 3.3         ┆ 101.53      │
│ MSFT   ┆ Microsoft    ┆ 420.56 ┆ 424.04   ┆ … ┆ 468.35    ┆ 324.39   ┆ 6.52        ┆ 143.96      │
│ GOOG   ┆ Alphabet     ┆ 166.41 ┆ 167.62   ┆ … ┆ 193.31    ┆ 121.46   ┆ 2.84        ┆ 71.85       │
│        ┆ (Google)     ┆        ┆          ┆   ┆           ┆          ┆             ┆             │
│ AMZN   ┆ Amazon       ┆ 188.4  ┆ 189.83   ┆ … ┆ 201.2     ┆ 118.35   ┆ 1.39        ┆ 82.85       │
└────────┴──────────────┴────────┴──────────┴───┴───────────┴──────────┴─────────────┴─────────────┘

This produces the same final result and by specifying all of the expressions in one go we give Polars the opportunity to:

  1. do a better job at optimising the query; and
  2. parallelise the execution of the actual computations.

More flexible column selections

Polars comes with the submodule selectors that provides a number of functions that allow you to write more flexible column selections for expression expansion.

Warning

This functionality is not available in Rust yet. Refer to Polars issue #10594.

As a first example, here is how we can use the functions string and ends_with, and the set operations that the functions from selectors support, to select all string columns and the columns whose names end with "_high":

selectors

import polars.selectors as cs

result = df.select(cs.string() | cs.ends_with("_high"))
print(result)

// Selectors are not available in Rust yet.
// Refer to https://github.com/pola-rs/polars/issues/10594
shape: (5, 4)
┌────────┬───────────────────┬──────────┬───────────┐
│ ticker ┆ company_name      ┆ day_high ┆ year_high │
│ ---    ┆ ---               ┆ ---      ┆ ---       │
│ str    ┆ str               ┆ f64      ┆ f64       │
╞════════╪═══════════════════╪══════════╪═══════════╡
│ AAPL   ┆ Apple             ┆ 231.31   ┆ 237.23    │
│ NVDA   ┆ NVIDIA            ┆ 139.6    ┆ 140.76    │
│ MSFT   ┆ Microsoft         ┆ 424.04   ┆ 468.35    │
│ GOOG   ┆ Alphabet (Google) ┆ 167.62   ┆ 193.31    │
│ AMZN   ┆ Amazon            ┆ 189.83   ┆ 201.2     │
└────────┴───────────────────┴──────────┴───────────┘

The submodule selectors provides a number of selectors that match based on the data type of the columns, of which the most useful are the functions that match a whole category of types, like cs.numeric for all numeric data types or cs.temporal for all temporal data types.

The submodule selectors also provides a number of selectors that match based on patterns in the column names which make it more convenient to specify common patterns you may want to check for, like the function cs.ends_with that was shown above.

Combining selectors with set operations

We can combine multiple selectors using set operations and the usual Python operators:

Operator Operation
A | B Union
A & B Intersection
A - B Difference
A ^ B Symmetric difference
~A Complement

The next example matches all non-string columns that contain an underscore in the name:

selectors

result = df.select(cs.contains("_") - cs.string())
print(result)

// Selectors are not available in Rust yet.
// Refer to https://github.com/pola-rs/polars/issues/10594
shape: (5, 4)
┌──────────┬─────────┬───────────┬──────────┐
│ day_high ┆ day_low ┆ year_high ┆ year_low │
│ ---      ┆ ---     ┆ ---       ┆ ---      │
│ f64      ┆ f64     ┆ f64       ┆ f64      │
╞══════════╪═════════╪═══════════╪══════════╡
│ 231.31   ┆ 228.6   ┆ 237.23    ┆ 164.08   │
│ 139.6    ┆ 136.3   ┆ 140.76    ┆ 39.23    │
│ 424.04   ┆ 417.52  ┆ 468.35    ┆ 324.39   │
│ 167.62   ┆ 164.78  ┆ 193.31    ┆ 121.46   │
│ 189.83   ┆ 188.44  ┆ 201.2     ┆ 118.35   │
└──────────┴─────────┴───────────┴──────────┘

Resolving operator ambiguity

Expression functions can be chained on top of selectors:

selectors

result = df.select((cs.contains("_") - cs.string()) / eur_usd_rate)
print(result)

// Selectors are not available in Rust yet.
// Refer to https://github.com/pola-rs/polars/issues/10594
shape: (5, 4)
┌────────────┬────────────┬────────────┬────────────┐
│ day_high   ┆ day_low    ┆ year_high  ┆ year_low   │
│ ---        ┆ ---        ┆ ---        ┆ ---        │
│ f64        ┆ f64        ┆ f64        ┆ f64        │
╞════════════╪════════════╪════════════╪════════════╡
│ 212.211009 ┆ 209.724771 ┆ 217.642202 ┆ 150.53211  │
│ 128.073394 ┆ 125.045872 ┆ 129.137615 ┆ 35.990826  │
│ 389.027523 ┆ 383.045872 ┆ 429.678899 ┆ 297.605505 │
│ 153.779817 ┆ 151.174312 ┆ 177.348624 ┆ 111.431193 │
│ 174.155963 ┆ 172.880734 ┆ 184.587156 ┆ 108.577982 │
└────────────┴────────────┴────────────┴────────────┘

However, some operators have been overloaded to operate both on Polars selectors and on expressions. For example, the operator ~ on a selector represents the set operation “complement” and on an expression represents the Boolean operation of negation.

When you use a selector and then want to use, in the context of an expression, one of the operators that act as set operators for selectors, you can use the function as_expr.

Below, we want to negate the Boolean values in the columns “has_partner”, “has_kids”, and “has_tattoos”. If we are not careful, the combination of the operator ~ and the selector cs.starts_with("has_") will actually select the columns that we do not care about:

people = pl.DataFrame(
    {
        "name": ["Anna", "Bob"],
        "has_partner": [True, False],
        "has_kids": [False, False],
        "has_tattoos": [True, False],
        "is_alive": [True, True],
    }
)

wrong_result = people.select((~cs.starts_with("has_")).name.prefix("not_"))
print(wrong_result)
// Selectors are not available in Rust yet.
// Refer to https://github.com/pola-rs/polars/issues/10594
shape: (2, 2)
┌──────────┬──────────────┐
│ not_name ┆ not_is_alive │
│ ---      ┆ ---          │
│ str      ┆ bool         │
╞══════════╪══════════════╡
│ Anna     ┆ true         │
│ Bob      ┆ true         │
└──────────┴──────────────┘

The correct solution uses as_expr:

result = people.select((~cs.starts_with("has_").as_expr()).name.prefix("not_"))
print(result)
// Selectors are not available in Rust yet.
// Refer to https://github.com/pola-rs/polars/issues/10594
shape: (2, 3)
┌─────────────────┬──────────────┬─────────────────┐
│ not_has_partner ┆ not_has_kids ┆ not_has_tattoos │
│ ---             ┆ ---          ┆ ---             │
│ bool            ┆ bool         ┆ bool            │
╞═════════════════╪══════════════╪═════════════════╡
│ false           ┆ true         ┆ false           │
│ true            ┆ true         ┆ true            │
└─────────────────┴──────────────┴─────────────────┘

Debugging selectors

When you are not sure whether you have a Polars selector at hand or not, you can use the function cs.is_selector to check:

is_selector

print(cs.is_selector(~cs.starts_with("has_").as_expr()))

// Selectors are not available in Rust yet.
// Refer to https://github.com/pola-rs/polars/issues/10594
False

This should help you avoid any ambiguous situations where you think you are operating with expressions but are in fact operating with selectors.

Another helpful debugging utility is the function expand_selector. Given a target frame or schema, you can check what columns a given selector will expand to:

expand_selector

print(
    cs.expand_selector(
        people,
        cs.starts_with("has_"),
    )
)

// Selectors are not available in Rust yet.
// Refer to https://github.com/pola-rs/polars/issues/10594
('has_partner', 'has_kids', 'has_tattoos')

Complete reference

The tables below group the functions available in the submodule selectors by their type of behaviour.

Selectors for data types

Selectors that match based on the data type of the column:

Selector function Data type(s) matched
binary Binary
boolean Boolean
by_dtype Data types specified as arguments
categorical Categorical
date Date
datetime Datetime, optionally filtering by time unit/zone
decimal Decimal
duration Duration, optionally filtering by time unit
float All float types, regardless of precision
integer All integer types, signed and unsigned, regardless of precision
numeric All numeric types, namely integers, floats, and Decimal
signed_integer All signed integer types, regardless of precision
string String
temporal All temporal data types, namely Date, Datetime, and Duration
time Time
unsigned_integer All unsigned integer types, regardless of precision

Selectors for column name patterns

Selectors that match based on column name patterns:

Selector function Columns selected
alpha Columns with alphabetical names
alphanumeric Columns with alphanumeric names (letters and the digits 0-9)
by_name Columns with the names specified as arguments
contains Columns whose names contain the substring specified
digit Columns with numeric names (only the digits 0-9)
ends_with Columns whose names end with the given substring
matches Columns whose names match the given regex pattern
starts_with Columns whose names start with the given substring

Positional selectors

Selectors that match based on the position of the columns:

Selector function Columns selected
all All columns
by_index The columns at the specified indices
first The first column in the context
last The last column in the context

Miscellaneous functions

The submodule selectors also provides the following functions:

Function Behaviour
as_expr* Convert a selector to an expression
exclude Selects all columns except those matching the given names, data types, or selectors
expand_selector Expand selector to matching columns with respect to a specific frame or target schema
is_selector Check whether the given object/expression is a selector

*as_expr isn't a function defined on the submodule selectors, but rather a method defined on selectors.