Expression expansion
As you've seen in the section about expressions and contexts, expression expansion is a feature that enables you to write a single expression that can expand to multiple different expressions, possibly depending on the schema of the context in which the expression is used.
This feature isn't just decorative or syntactic sugar. It allows for a very powerful application of DRY principles in your code: a single expression that specifies multiple columns expands into a list of expressions, which means you can write one single expression and reuse the computation that it represents.
In this section we will show several forms of expression expansion and we will be using the dataframe that you can see below for that effect:
import polars as pl
df = pl.DataFrame(
{ # As of 14th October 2024, ~3pm UTC
"ticker": ["AAPL", "NVDA", "MSFT", "GOOG", "AMZN"],
"company_name": ["Apple", "NVIDIA", "Microsoft", "Alphabet (Google)", "Amazon"],
"price": [229.9, 138.93, 420.56, 166.41, 188.4],
"day_high": [231.31, 139.6, 424.04, 167.62, 189.83],
"day_low": [228.6, 136.3, 417.52, 164.78, 188.44],
"year_high": [237.23, 140.76, 468.35, 193.31, 201.2],
"year_low": [164.08, 39.23, 324.39, 121.46, 118.35],
}
)
print(df)
use polars::prelude::*;
// Data as of 14th October 2024, ~3pm UTC
let df = df!(
"ticker" => ["AAPL", "NVDA", "MSFT", "GOOG", "AMZN"],
"company_name" => ["Apple", "NVIDIA", "Microsoft", "Alphabet (Google)", "Amazon"],
"price" => [229.9, 138.93, 420.56, 166.41, 188.4],
"day_high" => [231.31, 139.6, 424.04, 167.62, 189.83],
"day_low" => [228.6, 136.3, 417.52, 164.78, 188.44],
"year_high" => [237.23, 140.76, 468.35, 193.31, 201.2],
"year_low" => [164.08, 39.23, 324.39, 121.46, 118.35],
)?;
println!("{}", df);
shape: (5, 7)
┌────────┬───────────────────┬────────┬──────────┬─────────┬───────────┬──────────┐
│ ticker ┆ company_name ┆ price ┆ day_high ┆ day_low ┆ year_high ┆ year_low │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════╪═══════════════════╪════════╪══════════╪═════════╪═══════════╪══════════╡
│ AAPL ┆ Apple ┆ 229.9 ┆ 231.31 ┆ 228.6 ┆ 237.23 ┆ 164.08 │
│ NVDA ┆ NVIDIA ┆ 138.93 ┆ 139.6 ┆ 136.3 ┆ 140.76 ┆ 39.23 │
│ MSFT ┆ Microsoft ┆ 420.56 ┆ 424.04 ┆ 417.52 ┆ 468.35 ┆ 324.39 │
│ GOOG ┆ Alphabet (Google) ┆ 166.41 ┆ 167.62 ┆ 164.78 ┆ 193.31 ┆ 121.46 │
│ AMZN ┆ Amazon ┆ 188.4 ┆ 189.83 ┆ 188.44 ┆ 201.2 ┆ 118.35 │
└────────┴───────────────────┴────────┴──────────┴─────────┴───────────┴──────────┘
Function col
The function col
is the most common way of making use of expression expansion features in Polars.
Typically used to refer to one column of a dataframe, in this section we explore other ways in which
you can use col
(or its variants, when in Rust).
Explicit expansion by column name
The simplest form of expression expansion happens when you provide multiple column names to the
function col
.
The example below uses a single function col
with multiple column names to convert the values in
USD to EUR:
eur_usd_rate = 1.09 # As of 14th October 2024
result = df.with_columns(
(
pl.col(
"price",
"day_high",
"day_low",
"year_high",
"year_low",
)
/ eur_usd_rate
).round(2)
)
print(result)
let eur_usd_rate = 1.09; // As of 14th October 2024
let result = df
.clone()
.lazy()
.with_column(
(cols(["price", "day_high", "day_low", "year_high", "year_low"]) / lit(eur_usd_rate))
.round(2),
)
.collect()?;
println!("{}", result);
shape: (5, 7)
┌────────┬───────────────────┬────────┬──────────┬─────────┬───────────┬──────────┐
│ ticker ┆ company_name ┆ price ┆ day_high ┆ day_low ┆ year_high ┆ year_low │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════╪═══════════════════╪════════╪══════════╪═════════╪═══════════╪══════════╡
│ AAPL ┆ Apple ┆ 210.92 ┆ 212.21 ┆ 209.72 ┆ 217.64 ┆ 150.53 │
│ NVDA ┆ NVIDIA ┆ 127.46 ┆ 128.07 ┆ 125.05 ┆ 129.14 ┆ 35.99 │
│ MSFT ┆ Microsoft ┆ 385.83 ┆ 389.03 ┆ 383.05 ┆ 429.68 ┆ 297.61 │
│ GOOG ┆ Alphabet (Google) ┆ 152.67 ┆ 153.78 ┆ 151.17 ┆ 177.35 ┆ 111.43 │
│ AMZN ┆ Amazon ┆ 172.84 ┆ 174.16 ┆ 172.88 ┆ 184.59 ┆ 108.58 │
└────────┴───────────────────┴────────┴──────────┴─────────┴───────────┴──────────┘
When you list the column names you want the expression to expand to, you can predict what the expression will expand to. In this case, the expression that does the currency conversion is expanded to a list of five expressions:
exprs = [
(pl.col("price") / eur_usd_rate).round(2),
(pl.col("day_high") / eur_usd_rate).round(2),
(pl.col("day_low") / eur_usd_rate).round(2),
(pl.col("year_high") / eur_usd_rate).round(2),
(pl.col("year_low") / eur_usd_rate).round(2),
]
result2 = df.with_columns(exprs)
print(result.equals(result2))
let exprs = [
(col("price") / lit(eur_usd_rate)).round(2),
(col("day_high") / lit(eur_usd_rate)).round(2),
(col("day_low") / lit(eur_usd_rate)).round(2),
(col("year_high") / lit(eur_usd_rate)).round(2),
(col("year_low") / lit(eur_usd_rate)).round(2),
];
let result2 = df.clone().lazy().with_columns(exprs).collect()?;
println!("{}", result.equals(&result2));
True
Expansion by data type
We had to type five column names in the previous example but the function col
can also
conveniently accept one or more data types. If you provide data types instead of column names, the
expression is expanded to all columns that match one of the data types provided.
The example below performs the exact same computation as before:
shape: (5, 7)
┌────────┬───────────────────┬────────┬──────────┬─────────┬───────────┬──────────┐
│ ticker ┆ company_name ┆ price ┆ day_high ┆ day_low ┆ year_high ┆ year_low │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════╪═══════════════════╪════════╪══════════╪═════════╪═══════════╪══════════╡
│ AAPL ┆ Apple ┆ 210.92 ┆ 212.21 ┆ 209.72 ┆ 217.64 ┆ 150.53 │
│ NVDA ┆ NVIDIA ┆ 127.46 ┆ 128.07 ┆ 125.05 ┆ 129.14 ┆ 35.99 │
│ MSFT ┆ Microsoft ┆ 385.83 ┆ 389.03 ┆ 383.05 ┆ 429.68 ┆ 297.61 │
│ GOOG ┆ Alphabet (Google) ┆ 152.67 ┆ 153.78 ┆ 151.17 ┆ 177.35 ┆ 111.43 │
│ AMZN ┆ Amazon ┆ 172.84 ┆ 174.16 ┆ 172.88 ┆ 184.59 ┆ 108.58 │
└────────┴───────────────────┴────────┴──────────┴─────────┴───────────┴──────────┘
When we use a data type with expression expansion we cannot know, beforehand, how many columns a single expression will expand to. We need the schema of the input dataframe if we want to determine what is the final list of expressions that is to be applied.
If we weren't sure about whether the price columns where of the type Float64
or Float32
, we
could specify both data types:
result2 = df.with_columns(
(
pl.col(
pl.Float32,
pl.Float64,
)
/ eur_usd_rate
).round(2)
)
print(result.equals(result2))
let result2 = df
.clone()
.lazy()
.with_column(
(dtype_cols([DataType::Float32, DataType::Float64]) / lit(eur_usd_rate)).round(2),
)
.collect()?;
println!("{}", result.equals(&result2));
True
Expansion by pattern matching
You can also use regular expressions to specify patterns that are used to match the column names. To
distinguish between a regular column name and expansion by pattern matching, regular expressions
start and end with ^
and $
, respectively. This also means that the pattern must match against
the whole column name string.
Regular expressions can be mixed with regular column names:
shape: (5, 5)
┌────────┬──────────┬───────────┬─────────┬──────────┐
│ ticker ┆ day_high ┆ year_high ┆ day_low ┆ year_low │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════╪══════════╪═══════════╪═════════╪══════════╡
│ AAPL ┆ 231.31 ┆ 237.23 ┆ 228.6 ┆ 164.08 │
│ NVDA ┆ 139.6 ┆ 140.76 ┆ 136.3 ┆ 39.23 │
│ MSFT ┆ 424.04 ┆ 468.35 ┆ 417.52 ┆ 324.39 │
│ GOOG ┆ 167.62 ┆ 193.31 ┆ 164.78 ┆ 121.46 │
│ AMZN ┆ 189.83 ┆ 201.2 ┆ 188.44 ┆ 118.35 │
└────────┴──────────┴───────────┴─────────┴──────────┘
Arguments cannot be of mixed types
In Python, the function col
accepts an arbitrary number of strings (as
column names or as
regular expressions) or an arbitrary number of data types, but you
cannot mix both in the same function call:
try:
df.select(pl.col("ticker", pl.Float64))
except TypeError as err:
print("TypeError:", err)
TypeError: argument 'names': 'DataTypeClass' object cannot be converted to 'PyString'
Selecting all columns
Polars provides the function all
as shorthand notation to refer to all columns of a dataframe:
True
Note
The function all
is syntactic sugar for col("*")
, but since the argument "*"
is a special case and all
reads more like English, the usage of all
is preferred.
Excluding columns
Polars also provides a mechanism to exclude certain columns from expression expansion. For that, you
use the function exclude
, which accepts exactly the same types of arguments as col
:
shape: (5, 5)
┌────────┬───────────────────┬────────┬───────────┬──────────┐
│ ticker ┆ company_name ┆ price ┆ year_high ┆ year_low │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ f64 │
╞════════╪═══════════════════╪════════╪═══════════╪══════════╡
│ AAPL ┆ Apple ┆ 229.9 ┆ 237.23 ┆ 164.08 │
│ NVDA ┆ NVIDIA ┆ 138.93 ┆ 140.76 ┆ 39.23 │
│ MSFT ┆ Microsoft ┆ 420.56 ┆ 468.35 ┆ 324.39 │
│ GOOG ┆ Alphabet (Google) ┆ 166.41 ┆ 193.31 ┆ 121.46 │
│ AMZN ┆ Amazon ┆ 188.4 ┆ 201.2 ┆ 118.35 │
└────────┴───────────────────┴────────┴───────────┴──────────┘
Naturally, the function exclude
can also be used after the function col
:
shape: (5, 3)
┌────────┬───────────┬──────────┐
│ price ┆ year_high ┆ year_low │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞════════╪═══════════╪══════════╡
│ 229.9 ┆ 237.23 ┆ 164.08 │
│ 138.93 ┆ 140.76 ┆ 39.23 │
│ 420.56 ┆ 468.35 ┆ 324.39 │
│ 166.41 ┆ 193.31 ┆ 121.46 │
│ 188.4 ┆ 201.2 ┆ 118.35 │
└────────┴───────────┴──────────┘
Column renaming
By default, when you apply an expression to a column, the result keeps the same name as the original column.
Preserving the column name can be semantically wrong and in certain cases Polars may even raise an error if duplicate names occur:
from polars.exceptions import DuplicateError
gbp_usd_rate = 1.31 # As of 14th October 2024
try:
df.select(
pl.col("price") / gbp_usd_rate, # This would be named "price"...
pl.col("price") / eur_usd_rate, # And so would this.
)
except DuplicateError as err:
print("DuplicateError:", err)
let gbp_usd_rate = 1.31; // As of 14th October 2024
let result = df
.clone()
.lazy()
.select([
col("price") / lit(gbp_usd_rate),
col("price") / lit(eur_usd_rate),
])
.collect();
match result {
Ok(df) => println!("{}", df),
Err(e) => println!("{}", e),
};
DuplicateError: the name 'price' is duplicate
It's possible that multiple expressions are returning the same default column name. If this is the case, try renaming the columns with `.alias("new_name")` to avoid duplicate column names.
To prevent errors like this, and to allow users to rename their columns when appropriate, Polars provides a series of functions that let you change the name of a column or a group of columns.
Renaming a single column with alias
The function alias
has been used thoroughly in the documentation already and it lets you rename a
single column:
result = df.select(
(pl.col("price") / gbp_usd_rate).alias("price (GBP)"),
(pl.col("price") / eur_usd_rate).alias("price (EUR)"),
)
let _result = df
.clone()
.lazy()
.select([
(col("price") / lit(gbp_usd_rate)).alias("price (GBP)"),
(col("price") / lit(eur_usd_rate)).alias("price (EUR)"),
])
.collect()?;
Prefixing and suffixing column names
When using expression expansion you cannot use the function alias
because the function alias
is
designed specifically to rename a single column.
When it suffices to add a static prefix or a static suffix to the existing names, we can use the
functions prefix
and suffix
from the namespace name
:
name namespace
· prefix
· suffix
result = df.select(
(pl.col("^year_.*$") / eur_usd_rate).name.prefix("in_eur_"),
(pl.col("day_high", "day_low") / gbp_usd_rate).name.suffix("_gbp"),
)
print(result)
name namespace
· prefix
· suffix
· Available on feature lazy
let result = df
.clone()
.lazy()
.select([
(col("^year_.*$") / lit(eur_usd_rate))
.name()
.prefix("in_eur_"),
(cols(["day_high", "day_low"]) / lit(gbp_usd_rate))
.name()
.suffix("_gbp"),
])
.collect()?;
println!("{}", result);
shape: (5, 4)
┌──────────────────┬─────────────────┬──────────────┬─────────────┐
│ in_eur_year_high ┆ in_eur_year_low ┆ day_high_gbp ┆ day_low_gbp │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════════════════╪═════════════════╪══════════════╪═════════════╡
│ 217.642202 ┆ 150.53211 ┆ 176.572519 ┆ 174.503817 │
│ 129.137615 ┆ 35.990826 ┆ 106.564885 ┆ 104.045802 │
│ 429.678899 ┆ 297.605505 ┆ 323.694656 ┆ 318.717557 │
│ 177.348624 ┆ 111.431193 ┆ 127.954198 ┆ 125.78626 │
│ 184.587156 ┆ 108.577982 ┆ 144.908397 ┆ 143.847328 │
└──────────────────┴─────────────────┴──────────────┴─────────────┘
Dynamic name replacement
If a static prefix/suffix is not enough, the namespace name
also provides the function map
that
accepts a callable that accepts the old column names and produces the new ones:
# There is also `.name.to_uppercase`, so this usage of `.map` is moot.
result = df.select(pl.all().name.map(str.upper))
print(result)
name namespace
· map
· Available on feature lazy
// There is also `name().to_uppercase()`, so this usage of `map` is moot.
let result = df
.clone()
.lazy()
.select([all()
.name()
.map(|name| Ok(PlSmallStr::from_string(name.to_ascii_uppercase())))])
.collect()?;
println!("{}", result);
shape: (5, 7)
┌────────┬───────────────────┬────────┬──────────┬─────────┬───────────┬──────────┐
│ TICKER ┆ COMPANY_NAME ┆ PRICE ┆ DAY_HIGH ┆ DAY_LOW ┆ YEAR_HIGH ┆ YEAR_LOW │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════╪═══════════════════╪════════╪══════════╪═════════╪═══════════╪══════════╡
│ AAPL ┆ Apple ┆ 229.9 ┆ 231.31 ┆ 228.6 ┆ 237.23 ┆ 164.08 │
│ NVDA ┆ NVIDIA ┆ 138.93 ┆ 139.6 ┆ 136.3 ┆ 140.76 ┆ 39.23 │
│ MSFT ┆ Microsoft ┆ 420.56 ┆ 424.04 ┆ 417.52 ┆ 468.35 ┆ 324.39 │
│ GOOG ┆ Alphabet (Google) ┆ 166.41 ┆ 167.62 ┆ 164.78 ┆ 193.31 ┆ 121.46 │
│ AMZN ┆ Amazon ┆ 188.4 ┆ 189.83 ┆ 188.44 ┆ 201.2 ┆ 118.35 │
└────────┴───────────────────┴────────┴──────────┴─────────┴───────────┴──────────┘
See the API reference for the full contents of the namespace name
.
Programmatically generating expressions
Expression expansion is a very useful feature but it does not solve all of your problems. For example, if we want to compute the day and year amplitude of the prices of the stocks in our dataframe, expression expansion won't help us.
At first, you may think about using a for
loop:
result = df
for tp in ["day", "year"]:
result = result.with_columns(
(pl.col(f"{tp}_high") - pl.col(f"{tp}_low")).alias(f"{tp}_amplitude")
)
print(result)
let mut result = df.clone().lazy();
for tp in ["day", "year"] {
let high = format!("{}_high", tp);
let low = format!("{}_low", tp);
let aliased = format!("{}_amplitude", tp);
result = result.with_column((col(high) - col(low)).alias(aliased))
}
let result = result.collect()?;
println!("{}", result);
shape: (5, 9)
┌────────┬──────────────┬────────┬──────────┬───┬───────────┬──────────┬─────────────┬─────────────┐
│ ticker ┆ company_name ┆ price ┆ day_high ┆ … ┆ year_high ┆ year_low ┆ day_amplitu ┆ year_amplit │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ de ┆ ude │
│ str ┆ str ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ --- ┆ --- │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ f64 ┆ f64 │
╞════════╪══════════════╪════════╪══════════╪═══╪═══════════╪══════════╪═════════════╪═════════════╡
│ AAPL ┆ Apple ┆ 229.9 ┆ 231.31 ┆ … ┆ 237.23 ┆ 164.08 ┆ 2.71 ┆ 73.15 │
│ NVDA ┆ NVIDIA ┆ 138.93 ┆ 139.6 ┆ … ┆ 140.76 ┆ 39.23 ┆ 3.3 ┆ 101.53 │
│ MSFT ┆ Microsoft ┆ 420.56 ┆ 424.04 ┆ … ┆ 468.35 ┆ 324.39 ┆ 6.52 ┆ 143.96 │
│ GOOG ┆ Alphabet ┆ 166.41 ┆ 167.62 ┆ … ┆ 193.31 ┆ 121.46 ┆ 2.84 ┆ 71.85 │
│ ┆ (Google) ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ AMZN ┆ Amazon ┆ 188.4 ┆ 189.83 ┆ … ┆ 201.2 ┆ 118.35 ┆ 1.39 ┆ 82.85 │
└────────┴──────────────┴────────┴──────────┴───┴───────────┴──────────┴─────────────┴─────────────┘
Do not do this. Instead, generate all of the expressions you want to compute programmatically and
use them only once in a context. Loosely speaking, you want to swap the for
loop with the context
with_columns
. In practice, you could do something like the following:
def amplitude_expressions(time_periods):
for tp in time_periods:
yield (pl.col(f"{tp}_high") - pl.col(f"{tp}_low")).alias(f"{tp}_amplitude")
result = df.with_columns(amplitude_expressions(["day", "year"]))
print(result)
let mut exprs: Vec<Expr> = vec![];
for tp in ["day", "year"] {
let high = format!("{}_high", tp);
let low = format!("{}_low", tp);
let aliased = format!("{}_amplitude", tp);
exprs.push((col(high) - col(low)).alias(aliased))
}
let result = df.clone().lazy().with_columns(exprs).collect()?;
println!("{}", result);
shape: (5, 9)
┌────────┬──────────────┬────────┬──────────┬───┬───────────┬──────────┬─────────────┬─────────────┐
│ ticker ┆ company_name ┆ price ┆ day_high ┆ … ┆ year_high ┆ year_low ┆ day_amplitu ┆ year_amplit │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ de ┆ ude │
│ str ┆ str ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ --- ┆ --- │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ f64 ┆ f64 │
╞════════╪══════════════╪════════╪══════════╪═══╪═══════════╪══════════╪═════════════╪═════════════╡
│ AAPL ┆ Apple ┆ 229.9 ┆ 231.31 ┆ … ┆ 237.23 ┆ 164.08 ┆ 2.71 ┆ 73.15 │
│ NVDA ┆ NVIDIA ┆ 138.93 ┆ 139.6 ┆ … ┆ 140.76 ┆ 39.23 ┆ 3.3 ┆ 101.53 │
│ MSFT ┆ Microsoft ┆ 420.56 ┆ 424.04 ┆ … ┆ 468.35 ┆ 324.39 ┆ 6.52 ┆ 143.96 │
│ GOOG ┆ Alphabet ┆ 166.41 ┆ 167.62 ┆ … ┆ 193.31 ┆ 121.46 ┆ 2.84 ┆ 71.85 │
│ ┆ (Google) ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ AMZN ┆ Amazon ┆ 188.4 ┆ 189.83 ┆ … ┆ 201.2 ┆ 118.35 ┆ 1.39 ┆ 82.85 │
└────────┴──────────────┴────────┴──────────┴───┴───────────┴──────────┴─────────────┴─────────────┘
This produces the same final result and by specifying all of the expressions in one go we give Polars the opportunity to:
- do a better job at optimising the query; and
- parallelise the execution of the actual computations.
More flexible column selections
Polars comes with the submodule selectors
that provides a number of functions that allow you to
write more flexible column selections for expression expansion.
Warning
This functionality is not available in Rust yet. Refer to Polars issue #10594.
As a first example, here is how we can use the functions string
and ends_with
, and the set
operations that the functions from selectors
support, to select all string columns and the columns
whose names end with "_high"
:
import polars.selectors as cs
result = df.select(cs.string() | cs.ends_with("_high"))
print(result)
// Selectors are not available in Rust yet.
// Refer to https://github.com/pola-rs/polars/issues/10594
shape: (5, 4)
┌────────┬───────────────────┬──────────┬───────────┐
│ ticker ┆ company_name ┆ day_high ┆ year_high │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 │
╞════════╪═══════════════════╪══════════╪═══════════╡
│ AAPL ┆ Apple ┆ 231.31 ┆ 237.23 │
│ NVDA ┆ NVIDIA ┆ 139.6 ┆ 140.76 │
│ MSFT ┆ Microsoft ┆ 424.04 ┆ 468.35 │
│ GOOG ┆ Alphabet (Google) ┆ 167.62 ┆ 193.31 │
│ AMZN ┆ Amazon ┆ 189.83 ┆ 201.2 │
└────────┴───────────────────┴──────────┴───────────┘
The submodule selectors
provides
a number of selectors that match based on the data type of the columns,
of which the most useful are the functions that match a whole category of types, like cs.numeric
for all numeric data types or cs.temporal
for all temporal data types.
The submodule selectors
also provides
a number of selectors that match based on patterns in the column names
which make it more convenient to specify common patterns you may want to check for, like the
function cs.ends_with
that was shown above.
Combining selectors with set operations
We can combine multiple selectors using set operations and the usual Python operators:
Operator | Operation |
---|---|
A | B |
Union |
A & B |
Intersection |
A - B |
Difference |
A ^ B |
Symmetric difference |
~A |
Complement |
The next example matches all non-string columns that contain an underscore in the name:
result = df.select(cs.contains("_") - cs.string())
print(result)
// Selectors are not available in Rust yet.
// Refer to https://github.com/pola-rs/polars/issues/10594
shape: (5, 4)
┌──────────┬─────────┬───────────┬──────────┐
│ day_high ┆ day_low ┆ year_high ┆ year_low │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════════╪═════════╪═══════════╪══════════╡
│ 231.31 ┆ 228.6 ┆ 237.23 ┆ 164.08 │
│ 139.6 ┆ 136.3 ┆ 140.76 ┆ 39.23 │
│ 424.04 ┆ 417.52 ┆ 468.35 ┆ 324.39 │
│ 167.62 ┆ 164.78 ┆ 193.31 ┆ 121.46 │
│ 189.83 ┆ 188.44 ┆ 201.2 ┆ 118.35 │
└──────────┴─────────┴───────────┴──────────┘
Resolving operator ambiguity
Expression functions can be chained on top of selectors:
result = df.select((cs.contains("_") - cs.string()) / eur_usd_rate)
print(result)
// Selectors are not available in Rust yet.
// Refer to https://github.com/pola-rs/polars/issues/10594
shape: (5, 4)
┌────────────┬────────────┬────────────┬────────────┐
│ day_high ┆ day_low ┆ year_high ┆ year_low │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════╪════════════╪════════════╪════════════╡
│ 212.211009 ┆ 209.724771 ┆ 217.642202 ┆ 150.53211 │
│ 128.073394 ┆ 125.045872 ┆ 129.137615 ┆ 35.990826 │
│ 389.027523 ┆ 383.045872 ┆ 429.678899 ┆ 297.605505 │
│ 153.779817 ┆ 151.174312 ┆ 177.348624 ┆ 111.431193 │
│ 174.155963 ┆ 172.880734 ┆ 184.587156 ┆ 108.577982 │
└────────────┴────────────┴────────────┴────────────┘
However, some operators have been overloaded to operate both on Polars selectors and on expressions.
For example, the operator ~
on a selector represents
the set operation “complement” and on an expression
represents the Boolean operation of negation.
When you use a selector and then want to use, in the context of an expression, one of the
operators that act as set operators for selectors, you
can use the function as_expr
.
Below, we want to negate the Boolean values in the columns “has_partner”, “has_kids”, and
“has_tattoos”. If we are not careful, the combination of the operator ~
and the selector
cs.starts_with("has_")
will actually select the columns that we do not care about:
people = pl.DataFrame(
{
"name": ["Anna", "Bob"],
"has_partner": [True, False],
"has_kids": [False, False],
"has_tattoos": [True, False],
"is_alive": [True, True],
}
)
wrong_result = people.select((~cs.starts_with("has_")).name.prefix("not_"))
print(wrong_result)
// Selectors are not available in Rust yet.
// Refer to https://github.com/pola-rs/polars/issues/10594
shape: (2, 2)
┌──────────┬──────────────┐
│ not_name ┆ not_is_alive │
│ --- ┆ --- │
│ str ┆ bool │
╞══════════╪══════════════╡
│ Anna ┆ true │
│ Bob ┆ true │
└──────────┴──────────────┘
The correct solution uses as_expr
:
result = people.select((~cs.starts_with("has_").as_expr()).name.prefix("not_"))
print(result)
// Selectors are not available in Rust yet.
// Refer to https://github.com/pola-rs/polars/issues/10594
shape: (2, 3)
┌─────────────────┬──────────────┬─────────────────┐
│ not_has_partner ┆ not_has_kids ┆ not_has_tattoos │
│ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool │
╞═════════════════╪══════════════╪═════════════════╡
│ false ┆ true ┆ false │
│ true ┆ true ┆ true │
└─────────────────┴──────────────┴─────────────────┘
Debugging selectors
When you are not sure whether you have a Polars selector at hand or not, you can use the function
cs.is_selector
to check:
print(cs.is_selector(~cs.starts_with("has_").as_expr()))
// Selectors are not available in Rust yet.
// Refer to https://github.com/pola-rs/polars/issues/10594
False
This should help you avoid any ambiguous situations where you think you are operating with expressions but are in fact operating with selectors.
Another helpful debugging utility is the function expand_selector
. Given a target frame or schema,
you can check what columns a given selector will expand to:
print(
cs.expand_selector(
people,
cs.starts_with("has_"),
)
)
// Selectors are not available in Rust yet.
// Refer to https://github.com/pola-rs/polars/issues/10594
('has_partner', 'has_kids', 'has_tattoos')
Complete reference
The tables below group the functions available in the submodule selectors
by their type of
behaviour.
Selectors for data types
Selectors that match based on the data type of the column:
Selector function | Data type(s) matched |
---|---|
binary |
Binary |
boolean |
Boolean |
by_dtype |
Data types specified as arguments |
categorical |
Categorical |
date |
Date |
datetime |
Datetime , optionally filtering by time unit/zone |
decimal |
Decimal |
duration |
Duration , optionally filtering by time unit |
float |
All float types, regardless of precision |
integer |
All integer types, signed and unsigned, regardless of precision |
numeric |
All numeric types, namely integers, floats, and Decimal |
signed_integer |
All signed integer types, regardless of precision |
string |
String |
temporal |
All temporal data types, namely Date , Datetime , and Duration |
time |
Time |
unsigned_integer |
All unsigned integer types, regardless of precision |
Selectors for column name patterns
Selectors that match based on column name patterns:
Selector function | Columns selected |
---|---|
alpha |
Columns with alphabetical names |
alphanumeric |
Columns with alphanumeric names (letters and the digits 0-9) |
by_name |
Columns with the names specified as arguments |
contains |
Columns whose names contain the substring specified |
digit |
Columns with numeric names (only the digits 0-9) |
ends_with |
Columns whose names end with the given substring |
matches |
Columns whose names match the given regex pattern |
starts_with |
Columns whose names start with the given substring |
Positional selectors
Selectors that match based on the position of the columns:
Selector function | Columns selected |
---|---|
all |
All columns |
by_index |
The columns at the specified indices |
first |
The first column in the context |
last |
The last column in the context |
Miscellaneous functions
The submodule selectors
also provides the following functions:
Function | Behaviour |
---|---|
as_expr * |
Convert a selector to an expression |
exclude |
Selects all columns except those matching the given names, data types, or selectors |
expand_selector |
Expand selector to matching columns with respect to a specific frame or target schema |
is_selector |
Check whether the given object/expression is a selector |
*as_expr
isn't a function defined on the submodule selectors
, but rather a method defined on
selectors.