//! # Polars: *<small>DataFrames in Rust</small>*
//!
//! Polars is a DataFrame library for Rust. It is based on [Apache Arrow](https://arrow.apache.org/)'s memory model.
//! Apache Arrow provides very cache-efficient columnar data structures and is becoming the de facto
//! standard for columnar data.
//!
//! ## Quickstart
//! We recommend building queries directly with [polars-lazy]. This allows you to combine
//! expressions into powerful aggregations and column selections. All expressions are evaluated
//! in parallel and queries are optimized just in time.
//!
//! [polars-lazy]: polars_lazy
//!
//! ```no_run
//! use polars::prelude::*;
//! # fn example() -> PolarsResult<()> {
//!
//! let lf1 = LazyFrame::scan_parquet("myfile_1.parquet", Default::default())?
//!     .group_by([col("ham")])
//!     .agg([
//!         // expressions can be combined into powerful aggregations
//!         col("foo")
//!             .sort_by([col("ham").rank(Default::default(), None)], SortMultipleOptions::default())
//!             .last()
//!             .alias("last_foo_ranked_by_ham"),
//!         // every expression runs in parallel
//!         col("foo").cum_min(false).alias("cumulative_min_per_group"),
//!         // every expression runs in parallel
//!         col("foo").reverse().implode().alias("reverse_group"),
//!     ]);
//!
//! let lf2 = LazyFrame::scan_parquet("myfile_2.parquet", Default::default())?
//!     .select([col("ham"), col("spam")]);
//!
//! let df = lf1
//!     // join on "ham", the only column present in both frames
//!     .join(lf2, [col("ham")], [col("ham")], JoinArgs::new(JoinType::Left))
//!     // now we finally materialize the result.
//!     .collect()?;
//! # Ok(())
//! # }
//! ```
//!
//! Because Polars is built on the Arrow memory model, its data structures can be shared zero-copy
//! with processes in many different languages.
//!
//! ## Table of contents
//!
//! * [Cookbooks](#cookbooks)
//! * [Data structures](#data-structures)
//!     - [DataFrame](#dataframe)
//!     - [Series](#series)
//!     - [ChunkedArray](#chunkedarray)
//! * [SIMD](#simd)
//! * [API](#api)
//! * [Expressions](#expressions)
//! * [Compile times](#compile-times)
//! * [Performance](#performance)
//!     - [Custom allocator](#custom-allocator)
//! * [Config](#config-with-env-vars)
//! * [User guide](#user-guide)
//!
//! ## Cookbooks
//! See examples in the cookbooks:
//!
//! * [Eager](crate::docs::eager)
//! * [Lazy](crate::docs::lazy)
//!
//! ## Data Structures
//! The base data structures provided by Polars are [`DataFrame`], [`Series`], and [`ChunkedArray<T>`].
//! We will provide a short, top-down view of these data structures.
//!
//! [`DataFrame`]: crate::frame::DataFrame
//! [`Series`]: crate::series::Series
//! [`ChunkedArray<T>`]: crate::chunked_array::ChunkedArray
//!
//! ### DataFrame
//! A [`DataFrame`] is a two-dimensional data structure backed by [`Series`] columns and can be
//! seen as an abstraction over [`Vec<Series>`]. Operations that can be executed on a [`DataFrame`] are
//! similar to those in a SQL-like query. You can `GROUP`, `JOIN`, `PIVOT`, etc.
//!
//! [`Vec<Series>`]: std::vec::Vec
//!
//! ### Series
//! [`Series`] are the type-agnostic columnar data representation of Polars. The [`Series`] struct and
//! [`SeriesTrait`] trait provide many operations out of the box. Most type-agnostic operations are provided
//! by [`Series`]. Type-aware operations require downcasting to the typed data structure that is wrapped
//! by the [`Series`]. The underlying typed data structure is a [`ChunkedArray<T>`].
//!
//! [`SeriesTrait`]: crate::series::SeriesTrait
//!
//! ### ChunkedArray
//! [`ChunkedArray<T>`] are wrappers around Arrow arrays that can contain multiple chunks, e.g.
//! [`Vec<dyn ArrowArray>`]. These are the root data structures of Polars, and implement many operations.
//! Most operations are implemented by traits defined in [chunked_array::ops],
//! or on the [`ChunkedArray`] struct.
//!
//! [`ChunkedArray`]: crate::chunked_array::ChunkedArray
//!
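//! The following sketch walks down the three layers described above. The column names and
//! values are made up for illustration, and `typed_sum` is a hypothetical helper name.
//!
//! ```rust
//! use polars::prelude::*;
//!
//! fn typed_sum() -> PolarsResult<Option<i32>> {
//!     // A DataFrame holds equally long, named columns.
//!     let df = df!("foo" => &[1i32, 2, 3], "bar" => &["a", "b", "c"])?;
//!     // Selecting a column gives type-agnostic access to the data.
//!     let foo = df.column("foo")?;
//!     // Downcasting exposes the typed `ChunkedArray<Int32Type>` underneath.
//!     let ca: &Int32Chunked = foo.i32()?;
//!     // Type-aware operations, such as a typed sum, live on the ChunkedArray.
//!     Ok(ca.sum())
//! }
//! ```
//!
//! Downcasting to the wrong type (for example calling `i32()` on the `"bar"` column) returns an
//! error instead of panicking.
//!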
//! ## SIMD
//! Polars / Arrow uses packed_simd to speed up kernels with SIMD operations. SIMD is an optional
//! `feature = "nightly"`, and requires a nightly compiler. If you don't need SIMD, **Polars runs on stable!**
//!
//! ## API
//! Polars supports an eager and a lazy API. The eager API directly yields results, but is overall
//! more verbose and less capable of building elegant composite queries. We recommend using the lazy API
//! whenever you can.
//!
//! As neither API is async, they should be wrapped in _spawn_blocking_ when used in an async context
//! to avoid blocking the async thread pool of the runtime.
//!
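//! As a minimal sketch of the advice above, assuming a [Tokio](https://docs.rs/tokio/) runtime
//! (Tokio is not a Polars dependency, and `collect_off_runtime` is a hypothetical helper name),
//! the blocking `collect` call can be moved to Tokio's blocking thread pool:
//!
//! ```ignore
//! use polars::prelude::*;
//!
//! // Hypothetical helper: runs the query on a blocking thread so that the
//! // async worker threads stay free to make progress on other tasks.
//! async fn collect_off_runtime(lf: LazyFrame) -> PolarsResult<DataFrame> {
//!     tokio::task::spawn_blocking(move || lf.collect())
//!         .await
//!         // The outer `Result` only fails if the blocking task panicked or was cancelled.
//!         .expect("blocking task panicked")
//! }
//! ```
//!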
//! ## Expressions
//! Polars has a powerful concept called expressions.
//! Polars expressions can be used in various contexts and are a functional mapping of
//! `Fn(Series) -> Series`, meaning that they have [`Series`] as input and [`Series`] as output.
//! By looking at this functional definition, we can see that the output of an [`Expr`] also can serve
//! as the input of an [`Expr`].
//!
//! [`Expr`]: polars_lazy::dsl::Expr
//!
//! That may sound a bit strange, so let's give an example. The following is an expression:
//!
//! `col("foo").sort(Default::default()).head(Some(2))`
//!
//! The snippet above says: select column `"foo"`, then sort this column, and then take the first 2 values
//! of the sorted output.
//! The power of expressions is that every expression produces a new expression and that they can
//! be piped together.
//! You can run expressions by passing them to one of Polars' execution contexts.
//! Here we run two expressions in the **select** context:
//!
//! ```no_run
//! # use polars::prelude::*;
//! # fn example() -> PolarsResult<()> {
//! # let df = DataFrame::default();
//!   df.lazy()
//!    .select([
//!        // sort "foo" and keep the first 2 values
//!        col("foo").sort(Default::default()).head(Some(2)),
//!        // sum "bar" over the rows where "foo" equals 1
//!        col("bar").filter(col("foo").eq(lit(1))).sum(),
//!    ])
//!    .collect()?;
//! # Ok(())
//! # }
//! ```
//! All expressions are run in parallel, meaning that separate Polars expressions are embarrassingly parallel.
//! (Note that within an expression there may be more parallelization going on).
//!
//! Understanding Polars expressions is most important when starting with the Polars library. Read more
//! about them in the [user guide](https://docs.pola.rs/user-guide/expressions).
//!
//! ### Eager
//! Read more in the pages of the following data structures/traits.
//!
//! * [DataFrame struct](crate::frame::DataFrame)
//! * [Series struct](crate::series::Series)
//! * [Series trait](crate::series::SeriesTrait)
//! * [ChunkedArray struct](crate::chunked_array::ChunkedArray)
//! * [ChunkedArray operations traits](crate::chunked_array::ops)
//!
//! ### Lazy
//! Unlock full potential with lazy computation. This allows query optimizations and gives Polars
//! the full query context so that the fastest algorithm can be chosen.
//!
//! **[Read more in the lazy module.](polars_lazy)**
//!
//! ## Compile times
//! A DataFrame library typically consists of
//!
//! * Tons of features
//! * A lot of datatypes
//!
//! Both of these really put strain on compile times. To keep Polars lean, we make both **opt-in**,
//! meaning that you only pay the compilation cost if you need it.
//!
//! ## Compile times and opt-in features
//! The opt-in features are (not including dtype features):
//!
//! * `lazy` - Lazy API
//!     - `regex` - Use regexes in [column selection]
//!     - `dot_diagram` - Create dot diagrams from lazy logical plans.
//! * `sql` - Pass SQL queries to Polars.
//! * `streaming` - Process datasets larger than RAM.
//! * `random` - Generate arrays with randomly sampled values
//! * `ndarray` - Convert from [`DataFrame`] to [ndarray](https://docs.rs/ndarray/)
//! * `temporal` - Conversions between [Chrono](https://docs.rs/chrono/) and Polars for temporal data types
//! * `timezones` - Activate timezone support.
//! * `strings` - Extra string utilities for [`StringChunked`]
//!     - `string_pad` - `zfill`, `ljust`, `rjust`
//!     - `string_to_integer` - `parse_int`
//! * `object` - Support for generic ChunkedArrays called [`ObjectChunked<T>`] (generic over `T`).
//!   These are downcastable from Series through the [Any](https://doc.rust-lang.org/std/any/index.html) trait.
//! * Performance related:
//!     - `nightly` - Several nightly only features such as SIMD and specialization.
//!     - `performant` - More fast paths, slower compile times.
//!     - `bigidx` - Activate this feature if you expect >> 2^32 rows. This is rarely needed.
//!       This allows Polars to scale up beyond 2^32 rows by using an index with a `u64` data type.
//!       Polars will be a bit slower with this feature activated as many data structures
//!       are less cache efficient.
//!     - `cse` - Activate common subplan elimination optimization
//! * IO related:
//!     - `serde` - Support for [serde](https://crates.io/crates/serde) serialization and deserialization.
//!       Can be used for JSON and more serde supported serialization formats.
//!     - `serde-lazy` - Support for [serde](https://crates.io/crates/serde) serialization and deserialization.
//!       Can be used for JSON and more serde supported serialization formats.
//!     - `parquet` - Read Apache Parquet format
//!     - `json` - JSON serialization
//!     - `ipc` - Arrow's IPC format serialization
//!     - `decompress` - Automatically infer compression of CSV files and decompress them.
//!       Supported compressions:
//!          - gzip
//!          - zlib
//!          - zstd
//!
//! [`StringChunked`]: crate::datatypes::StringChunked
//! [column selection]: polars_lazy::dsl::col
//! [`ObjectChunked<T>`]: polars_core::datatypes::ObjectChunked
//!
//! * [`DataFrame`] operations:
//!     - `dynamic_group_by` - Group by based on a time window instead of predefined keys.
//!       Also activates rolling window group by operations.
//!     - `sort_multiple` - Allow sorting a [`DataFrame`] on multiple columns
//!     - `rows` - Create [`DataFrame`] from rows and extract rows from [`DataFrame`]s.
//!       Also activates `pivot` and `transpose` operations
//!     - `asof_join` - Join ASOF, to join on nearest keys instead of exact equality match.
//!     - `cross_join` - Create the Cartesian product of two [`DataFrame`]s.
//!     - `semi_anti_join` - SEMI and ANTI joins.
//!     - `row_hash` - Utility to hash [`DataFrame`] rows to [`UInt64Chunked`]
//!     - `diagonal_concat` - Concat diagonally thereby combining different schemas.
//!     - `dataframe_arithmetic` - Arithmetic between two [`DataFrame`]s and between a [`DataFrame`] and a [`Series`].
//!     - `partition_by` - Split into multiple [`DataFrame`]s partitioned by groups.
//! * [`Series`]/[`Expr`] operations:
//!     - `is_in` - Check for membership in [`Series`].
//!     - `zip_with` - [Zip two Series/ChunkedArrays](crate::chunked_array::ops::ChunkZip).
//!     - `round_series` - Round underlying float types of [`Series`].
//!     - `repeat_by` - Repeat element in an Array N times, where N is given by another array.
//!     - `is_first_distinct` - Check if element is first unique value.
//!     - `is_last_distinct` - Check if element is last unique value.
//!     - `is_between` - Check if this expression is between the given lower and upper bounds.
//!     - `checked_arithmetic` - Checked arithmetic, returning [`None`] on invalid operations.
//!     - `dot_product` - Dot/inner product on [`Series`] and [`Expr`].
//!     - `concat_str` - Concat string data in linear time.
//!     - `reinterpret` - Utility to reinterpret bits to signed/unsigned
//!     - `take_opt_iter` - Take from a [`Series`] with [`Iterator<Item=Option<usize>>`](std::iter::Iterator).
//!     - `mode` - [Return the most occurring value(s)](polars_ops::chunked_array::mode)
//!     - `cum_agg` - [`cum_sum`], [`cum_min`], [`cum_max`] aggregation.
//!     - `rolling_window` - Rolling window functions, like [`rolling_mean`]
//!     - `interpolate` - [Interpolate None values](polars_ops::series::interpolate())
//!     - `extract_jsonpath` - [Run jsonpath queries on StringChunked](https://goessner.net/articles/JsonPath/)
//!     - `list` - List utils.
//!         - `list_gather` - Take sublist by multiple indices.
//!     - `rank` - Ranking algorithms.
//!     - `moment` - Kurtosis and skew statistics
//!     - `ewma` - Exponential moving average windows
//!     - `abs` - Get absolute values of [`Series`].
//!     - `arange` - Range operation on [`Series`].
//!     - `product` - Compute the product of a [`Series`].
//!     - `diff` - [`diff`] operation.
//!     - `pct_change` - Compute change percentages.
//!     - `unique_counts` - Count unique values in expressions.
//!     - `log` - Logarithms for [`Series`].
//!     - `list_to_struct` - Convert [`List`] to [`Struct`] dtypes.
//!     - `list_count` - Count elements in lists.
//!     - `list_eval` - Apply expressions over list elements.
//!     - `list_sets` - Compute UNION, INTERSECTION, and DIFFERENCE on list types.
//!     - `cumulative_eval` - Apply expressions over cumulatively increasing windows.
//!     - `arg_where` - Get indices where condition holds.
//!     - `search_sorted` - Find indices where elements should be inserted to maintain order.
//!     - `offset_by` - Add an offset to dates that takes months and leap years into account.
//!     - `trigonometry` - Trigonometric functions.
//!     - `sign` - Compute the element-wise sign of a [`Series`].
//!     - `propagate_nans` - NaN propagating min/max aggregations.
//!     - `extract_groups` - Extract multiple regex groups from strings.
//!     - `cov` - Covariance and correlation functions.
//!     - `find_many` - Find/replace multiple string patterns at once.
//! * [`DataFrame`] pretty printing
//!     - `fmt` - Activate [`DataFrame`] formatting
//!
//! [`UInt64Chunked`]: crate::datatypes::UInt64Chunked
//! [`cum_sum`]: polars_ops::prelude::cum_sum
//! [`cum_min`]: polars_ops::prelude::cum_min
//! [`cum_max`]: polars_ops::prelude::cum_max
//! [`rolling_mean`]: crate::series::Series#method.rolling_mean
//! [`diff`]: polars_ops::prelude::diff
//! [`List`]: crate::datatypes::DataType::List
//! [`Struct`]: crate::datatypes::DataType::Struct
//!
//! ## Compile times and opt-in data types
//! As mentioned above, Polars [`Series`] are wrappers around
//! [`ChunkedArray<T>`] without the generic parameter `T`.
//! To get rid of the generic parameter, all the possible values of `T` are compiled
//! for [`Series`]. This gets more expensive the more types you want for a [`Series`]. In order to reduce
//! the compile times, we have decided to default to a minimal set of types and make more [`Series`] types
//! opt-in.
//!
//! Note that if you get strange compile time errors, you probably need to opt in to that [`Series`] dtype.
//! The opt-in dtypes are:
//!
//! | data type               | feature flag      |
//! |-------------------------|-------------------|
//! | Date                    | dtype-date        |
//! | Datetime                | dtype-datetime    |
//! | Time                    | dtype-time        |
//! | Duration                | dtype-duration    |
//! | Int8                    | dtype-i8          |
//! | Int16                   | dtype-i16         |
//! | UInt8                   | dtype-u8          |
//! | UInt16                  | dtype-u16         |
//! | Categorical             | dtype-categorical |
//! | Struct                  | dtype-struct      |
//!
//! Or you can choose one of the preconfigured presets.
//!
//! * `dtype-full` - all opt-in dtypes.
//! * `dtype-slim` - slim preset of opt-in dtypes.
//!
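//! As a sketch, a `Cargo.toml` entry that opts in to the lazy API, the Parquet reader, and two
//! extra dtypes could look as follows. The feature selection is only an illustration, and `"*"`
//! mirrors the allocator snippets in these docs; pin a concrete version in practice.
//!
//! ```toml
//! [dependencies]
//! polars = { version = "*", features = ["lazy", "parquet", "dtype-date", "dtype-categorical"] }
//! ```
//!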
//! ## Performance
//! To get the best performance out of Polars we recommend compiling on a nightly compiler
//! with the features `simd` and `performant` activated. The activated CPU features also influence
//! the amount of SIMD acceleration we can use.
//!
//! See the features we activate for our Python builds, or if you just run locally and want to
//! use all available features on your CPU, set `RUSTFLAGS='-C target-cpu=native'`.
//!
//! ### Custom allocator
//! An OLAP query engine does a lot of heap allocations. It is recommended to use a custom
//! allocator (we have found this to have up to ~25% runtime influence).
//! [JeMalloc](https://crates.io/crates/tikv-jemallocator) and
//! [Mimalloc](https://crates.io/crates/mimalloc), for instance, show a significant
//! performance gain in runtime as well as memory usage.
//!
//! #### Jemalloc Usage
//! ```ignore
//! use tikv_jemallocator::Jemalloc;
//!
//! #[global_allocator]
//! static GLOBAL: Jemalloc = Jemalloc;
//! ```
//!
//! #### Cargo.toml
//! ```toml
//! [dependencies]
//! tikv-jemallocator = { version = "*" }
//! ```
//!
//! #### Mimalloc Usage
//!
//! ```ignore
//! use mimalloc::MiMalloc;
//!
//! #[global_allocator]
//! static GLOBAL: MiMalloc = MiMalloc;
//! ```
//!
//! #### Cargo.toml
//! ```toml
//! [dependencies]
//! mimalloc = { version = "*", default-features = false }
//! ```
//!
//! #### Notes
//! [Benchmarks](https://github.com/pola-rs/polars/pull/3108) have shown that on Linux and macOS JeMalloc
//! outperforms Mimalloc on all tasks and is therefore the default allocator used for the Python bindings on Unix platforms.
//!
//! ## Config with ENV vars
//!
//! * `POLARS_FMT_TABLE_FORMATTING` -> define styling of tables using any of the following options (default = UTF8_FULL_CONDENSED). These options are defined by comfy-table which provides examples for each at <https://github.com/Nukesor/comfy-table/blob/main/src/style/presets.rs>
//!   * `ASCII_FULL`
//!   * `ASCII_FULL_CONDENSED`
//!   * `ASCII_NO_BORDERS`
//!   * `ASCII_BORDERS_ONLY`
//!   * `ASCII_BORDERS_ONLY_CONDENSED`
//!   * `ASCII_HORIZONTAL_ONLY`
//!   * `ASCII_MARKDOWN`
//!   * `MARKDOWN`
//!   * `UTF8_FULL`
//!   * `UTF8_FULL_CONDENSED`
//!   * `UTF8_NO_BORDERS`
//!   * `UTF8_BORDERS_ONLY`
//!   * `UTF8_HORIZONTAL_ONLY`
//!   * `NOTHING`
//! * `POLARS_FMT_TABLE_CELL_ALIGNMENT` -> define cell alignment using any of the following options (default = LEFT):
//!   * `LEFT`
//!   * `CENTER`
//!   * `RIGHT`
//! * `POLARS_FMT_TABLE_DATAFRAME_SHAPE_BELOW` -> print shape information below the table.
//! * `POLARS_FMT_TABLE_HIDE_COLUMN_NAMES` -> hide table column names.
//! * `POLARS_FMT_TABLE_HIDE_COLUMN_DATA_TYPES` -> hide data types for columns.
//! * `POLARS_FMT_TABLE_HIDE_COLUMN_SEPARATOR` -> hide separator that separates column names from rows.
//! * `POLARS_FMT_TABLE_HIDE_DATAFRAME_SHAPE_INFORMATION` -> omit table shape information.
//! * `POLARS_FMT_TABLE_INLINE_COLUMN_DATA_TYPE` -> put column data type on the same line as the column name.
//! * `POLARS_FMT_TABLE_ROUNDED_CORNERS` -> apply rounded corners to UTF8-styled tables.
//! * `POLARS_FMT_MAX_COLS` -> maximum number of columns shown when formatting DataFrames.
//! * `POLARS_FMT_MAX_ROWS` -> maximum number of rows shown when formatting DataFrames, `-1` to show all.
//! * `POLARS_FMT_STR_LEN` -> maximum number of characters printed per string value.
//! * `POLARS_TABLE_WIDTH` -> width of the tables used during DataFrame formatting.
//! * `POLARS_MAX_THREADS` -> maximum number of threads used to initialize thread pool (on startup).
//! * `POLARS_VERBOSE` -> print logging info to stderr.
//! * `POLARS_NO_PARTITION` -> Polars may choose to partition the group_by operation, based on data
//!   cardinality. Setting this env var will turn partitioned group_by's off.
//! * `POLARS_PARTITION_UNIQUE_COUNT` -> at which (estimated) key count a partitioned group_by should run.
//!   Defaults to `1000`; any higher cardinality will run default group_by.
//! * `POLARS_FORCE_PARTITION` -> force partitioned group_by if the keys and aggregations allow it.
//! * `POLARS_ALLOW_EXTENSION` -> allows for [`ObjectChunked<T>`] to be used in arrow, opening up possibilities like using
//!   `T` in complex lazy expressions. However, this requires `unsafe` code.
//! * `POLARS_NO_PARQUET_STATISTICS` -> if set, statistics in parquet files are ignored.
//! * `POLARS_PANIC_ON_ERR` -> panic instead of returning an Error.
//! * `POLARS_BACKTRACE_IN_ERR` -> include a Rust backtrace in Error messages.
//! * `POLARS_NO_CHUNKED_JOIN` -> force rechunk before joins.
//!
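//! As a sketch, the variables above can also be set from Rust before Polars reads them. The
//! chosen values are illustrative and `configure_polars_fmt` is a hypothetical helper name.
//!
//! ```rust
//! use std::env;
//!
//! // Hypothetical helper: call this early in `main`, before any other thread is
//! // spawned and before Polars initializes its thread pool or prints a DataFrame.
//! fn configure_polars_fmt() {
//!     // `set_var` is `unsafe` in the 2024 edition because mutating the environment
//!     // is not thread-safe; older editions accept the call without `unsafe` as well.
//!     unsafe {
//!         env::set_var("POLARS_FMT_MAX_ROWS", "20");
//!         env::set_var("POLARS_FMT_TABLE_FORMATTING", "ASCII_MARKDOWN");
//!         env::set_var("POLARS_FMT_TABLE_CELL_ALIGNMENT", "RIGHT");
//!     }
//! }
//! ```
//!
//! Exporting the same variables in the shell that launches the program works equally well.
//!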
//! ## User guide
//!
//! If you want to read more, check the [user guide](https://docs.pola.rs/).
#![cfg_attr(docsrs, feature(doc_auto_cfg))]
#![allow(ambiguous_glob_reexports)]
pub mod docs;
pub mod prelude;
#[cfg(feature = "sql")]
pub mod sql;

pub use polars_core::{
    apply_method_all_arrow_series, chunked_array, datatypes, df, error, frame, functions, series,
    testing,
};
#[cfg(feature = "dtype-categorical")]
pub use polars_core::{enable_string_cache, using_string_cache};
#[cfg(feature = "polars-io")]
pub use polars_io as io;
#[cfg(feature = "lazy")]
pub use polars_lazy as lazy;
#[cfg(feature = "temporal")]
pub use polars_time as time;

/// Polars crate version
pub const VERSION: &str = env!("CARGO_PKG_VERSION");