
DataType Expressions

In lazy queries, you may want to reason about the datatypes of the columns or expressions you use. DataType expressions let you inspect and manipulate those datatypes. They are resolved during query planning and behave the same as static datatypes at runtime.

DataType expressions are especially useful when you don't have full control over the input data. This can happen when you compartmentalize code, write utility functions, or load data from heterogeneous data sources. DataType expressions also let you express relations between the datatypes of expressions or columns.

Basic Usage

DataType expressions often start with pl.dtype_of. This allows inspecting the datatype of a column or expression.


import polars as pl

dtype_expr = pl.dtype_of("UserID")

# For debugging you can collect the output datatype in a specific context.
schema = pl.Schema({ 'UserID': pl.UInt64, 'Name': pl.String })
dtype_expr.collect_dtype(schema)

These expressions can be manipulated in various ways to transform them into the datatype that you need.

dtype_expr.wrap_in_list().collect_dtype(schema)

dtype_expr.to_signed_integer().collect_dtype(schema)

You can also inspect information about the datatype to use at runtime.

import polars.selectors as cs

df = schema.to_frame()
df.select(
    userid_dtype_name = pl.dtype_of('UserID').display(),
    userid_is_signed  = pl.dtype_of('UserID').matches(cs.signed_integer()),
)

Expressing relations between datatypes

DataType expressions help when writing utility functions because they let you express how the datatypes of two expressions relate. The following example states that the output of map_batches has the same datatype as its input.


def inspect(expr: pl.Expr) -> pl.Expr:
    def print_and_return(s: pl.Series) -> pl.Series:
        print(s)
        return s

    return expr.map_batches(
        print_and_return,

        # Clarify that the expression returns the same datatype as the input
        # datatype.
        return_dtype=pl.dtype_of(expr),
    )

df = pl.DataFrame({
    'UserID': [1, 2, 3, 4, 5],
    'Name': ["Alice", "Bob", "Charlie", "Diana", "Ethan"],
})
df.select(inspect(pl.col('Name')))

shape: (5,)
Series: 'Name' [str]
[
    "Alice"
    "Bob"
    "Charlie"
    "Diana"
    "Ethan"
]

Similarly, you may want to express that one column should be cast to the datatype of another column.


df = pl.DataFrame({
    'UserID': [1, 2, 3, 4, 5],
    'Name': ["Alice", "Bob", "Charlie", "Diana", "Ethan"],
}).with_columns(
    pl.col('UserID').cast(pl.dtype_of('Name'))
)