§Polars: DataFrames in Rust
Polars is a DataFrame library for Rust. It is based on Apache Arrow’s memory model. Apache Arrow provides very cache-efficient columnar data structures and is becoming the de facto standard for columnar data.
§Quickstart
We recommend building your queries directly with polars-lazy. This allows you to combine expressions into powerful aggregations and column selections. All expressions are evaluated in parallel and your queries are optimized just in time.
use polars::prelude::*;

let lf1 = LazyFrame::scan_parquet("myfile_1.parquet", Default::default())?
    .group_by([col("ham")])
    .agg([
        // expressions can be combined into powerful aggregations
        col("foo")
            .sort_by([col("ham").rank(Default::default(), None)], SortMultipleOptions::default())
            .last()
            .alias("last_foo_ranked_by_ham"),
        // every expression runs in parallel
        col("foo").cum_min(false).alias("cumulative_min_per_group"),
        // every expression runs in parallel
        col("foo").reverse().implode().alias("reverse_group"),
    ]);

let lf2 = LazyFrame::scan_parquet("myfile_2.parquet", Default::default())?
    .select([col("ham"), col("spam")]);

let df = lf1
    // join on "ham", the key column that exists in both frames
    .join(lf2, [col("ham")], [col("ham")], JoinArgs::new(JoinType::Left))
    // now we finally materialize the result.
    .collect()?;
Because Polars is built on Arrow’s memory model, its data structures can be shared zero-copy with processes in many different languages.
§Tree Of Contents
§Cookbooks
See examples in the cookbooks:
§Data Structures
The base data structures provided by Polars are DataFrame, Series, and ChunkedArray<T>.
We will provide a short, top-down view of these data structures.
§DataFrame
A DataFrame is a two-dimensional data structure that is backed by Series, and it can be seen as an abstraction on Vec<Series>. Operations that can be executed on a DataFrame are very similar to what is done in a SQL-like query. You can GROUP, JOIN, PIVOT, etc.
§Series
Series are the type-agnostic columnar data representation of Polars. They provide many operations out of the box, many via the Series struct and the SeriesTrait trait. Whether or not an operation is provided by a Series is determined by the operation. If the operation can be done without knowing the underlying columnar type, it is probably provided by the Series. If not, you must downcast to the typed data structure that is wrapped by the Series, which is the ChunkedArray<T>.
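The split between type-agnostic and typed operations can be pictured with a toy model. The sketch below uses only the standard library and is not how Polars implements Series; the names ToySeries, as_i64, and sum_i64 are hypothetical and exist only for illustration.

```rust
/// Toy stand-in for a type-erased column (hypothetical, not the Polars `Series`).
enum ToySeries {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
}

impl ToySeries {
    /// Type-agnostic operation: works without knowing the element type.
    fn len(&self) -> usize {
        match self {
            ToySeries::Int64(values) => values.len(),
            ToySeries::Utf8(values) => values.len(),
        }
    }

    /// "Downcast" to the typed data; returns `None` if the column has another type.
    fn as_i64(&self) -> Option<&[i64]> {
        match self {
            ToySeries::Int64(values) => Some(values.as_slice()),
            ToySeries::Utf8(_) => None,
        }
    }
}

/// Typed operation: summing only makes sense once the element type is known.
fn sum_i64(column: &ToySeries) -> Option<i64> {
    column.as_i64().map(|values| values.iter().sum())
}

fn main() {
    let numbers = ToySeries::Int64(vec![1, 2, 3]);
    let words = ToySeries::Utf8(vec!["a".to_string(), "b".to_string()]);

    // `len` is available on every column, whatever its type.
    println!("{} {}", numbers.len(), words.len());
    // `sum_i64` succeeds only for the integer column.
    println!("{:?} {:?}", sum_i64(&numbers), sum_i64(&words));
}
```

In this model, asking for the length never requires a downcast, while the sum has to go through the typed view first and fails gracefully when the types do not match.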
§ChunkedArray
ChunkedArray<T> are wrappers around an Arrow array that can contain multiple chunks, e.g. Vec<dyn ArrowArray>. These are the root data structures of Polars, and implement many operations. Most operations are implemented by traits defined in chunked_array::ops, or on the ChunkedArray struct.
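To make the idea of "one logical column, several chunks" concrete, here is a minimal standard-library sketch. It is a toy model under stated assumptions, not the Polars implementation: ToyChunked and its methods are hypothetical names, and plain Vec<T> stands in for Arrow arrays.

```rust
/// Toy model of a chunked column (hypothetical, not the Polars `ChunkedArray<T>`).
struct ToyChunked<T> {
    chunks: Vec<Vec<T>>,
}

impl<T> ToyChunked<T> {
    fn new() -> Self {
        ToyChunked { chunks: Vec::new() }
    }

    /// Appending moves the new chunk into the list; existing chunks are untouched.
    fn append_chunk(&mut self, chunk: Vec<T>) {
        if !chunk.is_empty() {
            self.chunks.push(chunk);
        }
    }

    /// Total number of values across all chunks.
    fn len(&self) -> usize {
        self.chunks.iter().map(Vec::len).sum()
    }

    /// Iterate over the values as if they were one contiguous array.
    fn iter(&self) -> impl Iterator<Item = &T> + '_ {
        self.chunks.iter().flatten()
    }

    /// Merge all chunks into a single contiguous one.
    fn rechunk(&mut self) {
        let merged: Vec<T> = self.chunks.drain(..).flatten().collect();
        self.chunks = vec![merged];
    }
}

fn main() {
    let mut column = ToyChunked::new();
    column.append_chunk(vec![1, 2]);
    column.append_chunk(vec![3]);
    println!("{} values in {} chunks", column.len(), column.chunks.len());

    column.rechunk();
    let values: Vec<&i32> = column.iter().collect();
    println!("{:?} in {} chunk", values, column.chunks.len());
}
```

Appending is cheap because no existing data is copied; the cost is that reads must walk several buffers until the chunks are merged.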
§SIMD
Polars / Arrow uses packed_simd to speed up kernels with SIMD operations. SIMD is an optional feature = "nightly", and requires a nightly compiler. If you don’t need SIMD, Polars runs on stable!
§API
Polars supports an eager and a lazy API. The eager API directly yields results, but is overall more verbose and less capable of building elegant composite queries. We recommend using the lazy API whenever you can.
As neither API is async, they should be wrapped in spawn_blocking when used in an async context to avoid blocking the async thread pool of the runtime.
§Expressions
Polars has a powerful concept called expressions.
Polars expressions can be used in various contexts and are a functional mapping of Fn(Series) -> Series, meaning that they have Series as input and Series as output. By looking at this functional definition, we can see that the output of an Expr can also serve as the input of an Expr.
That may sound a bit strange, so let’s give an example. The following is an expression:
col("foo").sort(Default::default()).head(Some(2))
The snippet above says: select column "foo", then sort this column, and then take the first 2 values of the sorted output.
The power of expressions is that every expression produces a new expression and that they can
be piped together.
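The "column in, column out" shape is what makes piping possible. As a rough standard-library sketch (a toy model, not the Polars Expr type, which is lazily evaluated and optimized), the hypothetical ToyExpr below is just a boxed function from Vec<i64> to Vec<i64>; sort, head, and pipe are illustrative names.

```rust
/// Toy expression: a boxed function from a column to a column.
/// (Hypothetical model; the real Polars `Expr` is a richer type.)
type ToyExpr = Box<dyn Fn(Vec<i64>) -> Vec<i64>>;

/// Expression that sorts its input in ascending order.
fn sort() -> ToyExpr {
    Box::new(|mut values: Vec<i64>| {
        values.sort();
        values
    })
}

/// Expression that keeps the first `n` values of its input.
fn head(n: usize) -> ToyExpr {
    Box::new(move |mut values: Vec<i64>| {
        values.truncate(n);
        values
    })
}

/// Pipe two expressions together: the output of `first` feeds `second`.
/// The result is itself an expression, so it can be piped again.
fn pipe(first: ToyExpr, second: ToyExpr) -> ToyExpr {
    Box::new(move |values: Vec<i64>| second(first(values)))
}

fn main() {
    let sort_then_head = pipe(sort(), head(2));
    let result = sort_then_head(vec![5, 1, 4, 2]);
    println!("{:?}", result);
}
```

Because pipe returns the same type it accepts, any chain of such steps is again a single expression.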
You can run expressions by passing them to one of Polars’ execution contexts.
Here we run two expressions in the select context:
df.lazy()
    .select([
        col("foo").sort(Default::default()).head(None),
        col("bar").filter(col("foo").eq(lit(1))).sum(),
    ])
    .collect()?;
All expressions are run in parallel, meaning that separate Polars expressions are embarrassingly parallel. (Note that within an expression there may be more parallelization going on.)
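Independent expressions only read their input, so they can be evaluated side by side without coordination. The following standard-library sketch illustrates that idea with scoped threads; it is an assumption-laden toy (Polars schedules work on its own thread pool), and min_and_sum is a hypothetical helper.

```rust
use std::thread;

/// Evaluate two independent "expressions" over the same column on separate threads.
/// (Toy illustration with std scoped threads, not the Polars scheduler.)
fn min_and_sum(values: &[i64]) -> (Option<i64>, i64) {
    thread::scope(|scope| {
        // Each closure only reads `values`, so no locking is needed.
        let min_handle = scope.spawn(|| values.iter().copied().min());
        let sum_handle = scope.spawn(|| values.iter().sum::<i64>());
        (min_handle.join().unwrap(), sum_handle.join().unwrap())
    })
}

fn main() {
    let column = vec![3, 1, 2];
    let (min, sum) = min_and_sum(&column);
    println!("min = {:?}, sum = {}", min, sum);
}
```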
Understanding polars expressions is most important when starting with the polars library. Read more about them in the user guide.
§Eager
Read more in the pages of the following data structures/traits.
§Lazy
Unlock full potential with lazy computation. This allows query optimizations and provides Polars the full query context so that the fastest algorithm can be chosen.
§Compile times
A DataFrame library typically consists of
- Tons of features
- A lot of datatypes
Both of these put real strain on compile times. To keep Polars lean, we make both opt-in, meaning that you only pay the compilation cost if you need it.
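As a sketch of what opting in looks like in practice, a Cargo.toml entry might enable just a handful of features. The selection here is an arbitrary illustration (lazy, parquet, and dtype-date are feature names listed on this page), and the "*" version is a placeholder in the same spirit as the allocator snippets on this page; pin a real version in your own project.

```toml
[dependencies]
# Enable only what the project needs; every extra feature adds compile time.
polars = { version = "*", features = ["lazy", "parquet", "dtype-date"] }
```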
§Compile times and opt-in features
The opt-in features are (not including dtype features):
- performant - Longer compile times, more fast paths.
- lazy - Lazy API.
    - regex - Use regexes in column selection.
    - dot_diagram - Create dot diagrams from lazy logical plans.
- sql - Pass SQL queries to Polars.
- streaming - Be able to process datasets that are larger than RAM.
- random - Generate arrays with randomly sampled values.
- ndarray - Convert from DataFrame to ndarray.
- temporal - Conversions between Chrono and Polars for temporal data types.
- timezones - Activate timezone support.
- strings - Extra string utilities for StringChunked.
    - string_pad - zfill, ljust, rjust
    - string_to_integer - parse_int
- object - Support for generic ChunkedArrays called ObjectChunked<T> (generic over T). These are downcastable from Series through the Any trait.
- Performance related:
    - nightly - Several nightly-only features such as SIMD and specialization.
    - performant - More fast paths, slower compile times.
    - bigidx - Activate this feature if you expect >> 2^32 rows. This has not been needed by anyone. This allows Polars to scale up way beyond that by using u64 as an index. Polars will be a bit slower with this feature activated as many data structures are less cache efficient.
    - cse - Activate common subplan elimination optimization.
- IO related:
    - serde - Support for serde serialization and deserialization. Can be used for JSON and more serde-supported serialization formats.
    - serde-lazy - Support for serde serialization and deserialization. Can be used for JSON and more serde-supported serialization formats.
    - parquet - Read Apache Parquet format.
    - json - JSON serialization.
    - ipc - Arrow’s IPC format serialization.
    - decompress - Automatically infer compression of CSVs and decompress them. Supported compressions: zip, gzip.
- DataFrame operations:
    - dynamic_group_by - Group by based on a time window instead of predefined keys. Also activates rolling window group by operations.
    - sort_multiple - Allow sorting a DataFrame on multiple columns.
    - rows - Create DataFrame from rows and extract rows from DataFrames. Also activates pivot and transpose operations.
    - asof_join - Join ASOF, to join on nearest keys instead of exact equality match.
    - cross_join - Create the Cartesian product of two DataFrames.
    - semi_anti_join - SEMI and ANTI joins.
    - row_hash - Utility to hash DataFrame rows to UInt64Chunked.
    - diagonal_concat - Concat diagonally, thereby combining different schemas.
    - dataframe_arithmetic - Arithmetic on (DataFrame and DataFrames) and (DataFrame on Series).
    - partition_by - Split into multiple DataFrames partitioned by groups.
- Series/Expr operations:
    - is_in - Check for membership in Series.
    - zip_with - Zip two Series/ChunkedArrays.
    - round_series - Round underlying float types of Series.
    - repeat_by - Repeat element in an Array N times, where N is given by another array.
    - is_first_distinct - Check if element is first unique value.
    - is_last_distinct - Check if element is last unique value.
    - is_between - Check if this expression is between the given lower and upper bounds.
    - checked_arithmetic - Checked arithmetic, returning None on invalid operations.
    - dot_product - Dot/inner product on Series and Expr.
    - concat_str - Concat string data in linear time.
    - reinterpret - Utility to reinterpret bits to signed/unsigned.
    - take_opt_iter - Take from a Series with Iterator<Item=Option<usize>>.
    - mode - Return the most occurring value(s).
    - cum_agg - cum_sum, cum_min, cum_max aggregation.
    - rolling_window - Rolling window functions, like rolling_mean.
    - interpolate - Interpolate None values.
    - extract_jsonpath - Run jsonpath queries on StringChunked.
    - list - List utils.
        - list_gather - Take sublist by multiple indices.
    - rank - Ranking algorithms.
    - moment - Kurtosis and skew statistics.
    - ewma - Exponential moving average windows.
    - abs - Get absolute values of Series.
    - arange - Range operation on Series.
    - product - Compute the product of a Series.
    - diff - diff operation.
    - pct_change - Compute change percentages.
    - unique_counts - Count unique values in expressions.
    - log - Logarithms for Series.
    - list_to_struct - Convert List to Struct dtypes.
    - list_count - Count elements in lists.
    - list_eval - Apply expressions over list elements.
    - list_sets - Compute UNION, INTERSECTION, and DIFFERENCE on list types.
    - cumulative_eval - Apply expressions over cumulatively increasing windows.
    - arg_where - Get indices where condition holds.
    - search_sorted - Find indices where elements should be inserted to maintain order.
    - offset_by - Add an offset to dates that takes months and leap years into account.
    - trigonometry - Trigonometric functions.
    - sign - Compute the element-wise sign of a Series.
    - propagate_nans - NaN-propagating min/max aggregations.
    - extract_groups - Extract multiple regex groups from strings.
    - cov - Covariance and correlation functions.
    - find_many - Find/replace multiple string patterns at once.
- DataFrame pretty printing:
    - fmt - Activate DataFrame formatting.
§Compile times and opt-in data types
As mentioned above, Polars Series are wrappers around ChunkedArray<T> without the generic parameter T. To get rid of the generic parameter, all possible values of T are compiled for Series. This gets more expensive the more types you want for a Series. In order to reduce compile times, we have decided to default to a minimal set of types and make more Series types opt-in.
Note that if you get strange compile-time errors, you probably need to opt in to that Series dtype.
The opt-in dtypes are:
data type | feature flag |
---|---|
Date | dtype-date |
Datetime | dtype-datetime |
Time | dtype-time |
Duration | dtype-duration |
Int8 | dtype-i8 |
Int16 | dtype-i16 |
UInt8 | dtype-u8 |
UInt16 | dtype-u16 |
Categorical | dtype-categorical |
Struct | dtype-struct |
Or you can choose one of the preconfigured presets.
- dtype-full - all opt-in dtypes.
- dtype-slim - slim preset of opt-in dtypes.
§Performance
To get the most performance out of Polars, we recommend compiling on a nightly compiler with the features simd and performant activated. The activated CPU features also influence the amount of SIMD acceleration we can use.
See the features we activate for our Python builds, or if you just run locally and want to use all available features on your CPU, set RUSTFLAGS='-C target-cpu=native'.
§Custom allocator
An OLAP query engine does a lot of heap allocations. It is recommended to use a custom allocator (we have found this to have up to ~25% runtime influence). JeMalloc and Mimalloc, for instance, show a significant performance gain in runtime as well as memory usage.
§Jemalloc Usage
use jemallocator::Jemalloc;
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;
§Cargo.toml
[dependencies]
jemallocator = { version = "*" }
§Mimalloc Usage
use mimalloc::MiMalloc;
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
§Cargo.toml
[dependencies]
mimalloc = { version = "*", default-features = false }
§Notes
Benchmarks have shown that on Linux and macOS JeMalloc outperforms Mimalloc on all tasks and is therefore the default allocator used for the Python bindings on Unix platforms.
§Config with ENV vars
- POLARS_FMT_TABLE_FORMATTING -> define styling of tables using any of the following options (default = UTF8_FULL_CONDENSED). These options are defined by comfy-table, which provides examples for each at https://github.com/Nukesor/comfy-table/blob/main/src/style/presets.rs
    - ASCII_FULL
    - ASCII_FULL_CONDENSED
    - ASCII_NO_BORDERS
    - ASCII_BORDERS_ONLY
    - ASCII_BORDERS_ONLY_CONDENSED
    - ASCII_HORIZONTAL_ONLY
    - ASCII_MARKDOWN
    - UTF8_FULL
    - UTF8_FULL_CONDENSED
    - UTF8_NO_BORDERS
    - UTF8_BORDERS_ONLY
    - UTF8_HORIZONTAL_ONLY
    - NOTHING
- POLARS_FMT_TABLE_CELL_ALIGNMENT -> define cell alignment using any of the following options (default = LEFT):
    - LEFT
    - CENTER
    - RIGHT
- POLARS_FMT_TABLE_DATAFRAME_SHAPE_BELOW -> print shape information below the table.
- POLARS_FMT_TABLE_HIDE_COLUMN_NAMES -> hide table column names.
- POLARS_FMT_TABLE_HIDE_COLUMN_DATA_TYPES -> hide data types for columns.
- POLARS_FMT_TABLE_HIDE_COLUMN_SEPARATOR -> hide separator that separates column names from rows.
- POLARS_FMT_TABLE_HIDE_DATAFRAME_SHAPE_INFORMATION -> omit table shape information.
- POLARS_FMT_TABLE_INLINE_COLUMN_DATA_TYPE -> put column data type on the same line as the column name.
- POLARS_FMT_TABLE_ROUNDED_CORNERS -> apply rounded corners to UTF8-styled tables.
- POLARS_FMT_MAX_COLS -> maximum number of columns shown when formatting DataFrames.
- POLARS_FMT_MAX_ROWS -> maximum number of rows shown when formatting DataFrames, -1 to show all.
- POLARS_FMT_STR_LEN -> maximum number of characters printed per string value.
- POLARS_TABLE_WIDTH -> width of the tables used during DataFrame formatting.
- POLARS_MAX_THREADS -> maximum number of threads used to initialize the thread pool (on startup).
- POLARS_VERBOSE -> print logging info to stderr.
- POLARS_NO_PARTITION -> Polars may choose to partition the group_by operation, based on data cardinality. Setting this env var will turn partitioned group_bys off.
- POLARS_PARTITION_UNIQUE_COUNT -> the (estimated) key count at which a partitioned group_by should run. Defaults to 1000; any higher cardinality will run the default group_by.
- POLARS_FORCE_PARTITION -> force partitioned group_by if the keys and aggregations allow it.
- POLARS_ALLOW_EXTENSION -> allows ObjectChunked<T> to be used in Arrow, opening up possibilities like using T in complex lazy expressions. However, this requires unsafe code to allow it.
- POLARS_NO_PARQUET_STATISTICS -> if set, statistics in Parquet files are ignored.
- POLARS_PANIC_ON_ERR -> panic instead of returning an Error.
- POLARS_BACKTRACE_IN_ERR -> include a Rust backtrace in Error messages.
- POLARS_NO_CHUNKED_JOIN -> force rechunk before joins.
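One way to apply these settings without touching your shell profile is to set them on the child process that runs your program. The sketch below uses only std::process::Command; the binary name my-polars-app is a hypothetical placeholder, and the value "1" for POLARS_VERBOSE is an assumption about how that flag is switched on.

```rust
use std::process::Command;

/// Build a command that launches a Polars-based program with formatting
/// options set through environment variables.
/// `program` is a placeholder for your own binary.
fn polars_command(program: &str) -> Command {
    let mut command = Command::new(program);
    command
        .env("POLARS_FMT_MAX_ROWS", "-1") // show all rows
        .env("POLARS_FMT_TABLE_FORMATTING", "ASCII_MARKDOWN")
        .env("POLARS_VERBOSE", "1"); // assumed value: enable logging to stderr
    command
}

fn main() {
    // Hypothetical binary name; the command is only built here, not spawned.
    let command = polars_command("my-polars-app");
    for (name, value) in command.get_envs() {
        println!("{:?}={:?}", name, value);
    }
}
```

Calling status() or spawn() on the returned command would start the program with these variables added to the inherited environment.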
§User guide
If you want to read more, check the user guide.
Re-exports§
- pub use polars_io as io; (crate feature: polars-io)
- pub use polars_lazy as lazy; (crate feature: lazy)
- pub use polars_time as time; (crate feature: temporal)
Modules§
- The typed heart of every Series column.
- Data types supported by Polars.
- DataFrame module.
- Functions
- Type agnostic columnar data structure.
- Testing utilities.
Macros§
Constants§
- Polars crate version
Functions§
- enable_string_cache (crate feature: dtype-categorical) - Enable the global string cache.
- using_string_cache (crate feature: dtype-categorical) - Check whether the global string cache is enabled.