polars.read_database

polars.read_database(
query: str | Selectable,
connection: ConnectionOrCursor | str,
*,
iter_batches: bool = False,
batch_size: int | None = None,
schema_overrides: SchemaDict | None = None,
execute_options: dict[str, Any] | None = None,
**kwargs: Any,
) → DataFrame | Iterable[DataFrame]

Read the results of a SQL query into a DataFrame, given a connection object.

Parameters:
query

SQL query to execute (if using a SQLAlchemy connection object this can be a suitable “Selectable”, otherwise it is expected to be a string).

connection

An instantiated connection (or cursor/client object) that the query can be executed against. Can also pass a valid ODBC connection string, identified as such if it contains the string “Driver=”, in which case the arrow-odbc package will be used to establish the connection and return Arrow-native data to Polars.

iter_batches

Return an iterator of DataFrames, where each DataFrame represents a batch of data returned by the query; this can be useful for processing large resultsets in a memory-efficient manner. If supported by the backend, the batch_size value is passed to the underlying query execution method (note that very low values will typically result in poor performance, as many round-trips to the database are required to return the data). If the backend does not support changing the batch size then a single DataFrame is yielded from the iterator.

batch_size

Indicate the size of each batch when iter_batches is True (you can also set this when iter_batches is False, in which case the resulting DataFrame is constructed internally using batched returns before being handed back to you). Note that some backends may support batched operation but not allow for an explicit size; in this case you will still receive batches, but their exact size will be determined by the backend (so it may not equal the value set here).

schema_overrides

A dictionary mapping column names to dtypes, used to override the schema inferred from the query cursor or given by the incoming Arrow data (depending on driver/backend). This can be useful if the given types can be more precisely defined (for example, if you know that a given column can be declared as u32 instead of i64).

execute_options

These options will be passed through into the underlying query execution method as kwargs. In the case of connections made using an ODBC string (which use arrow-odbc) these options are passed to the read_arrow_batches_from_odbc method.
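As a hedged illustration of how execute_options reaches the driver, the following sketch (using only the stdlib sqlite3 module; the table name and values are made up) shows the DBAPI2-level call that a dict like execute_options={"parameters": [0]} ultimately maps onto — the driver's own execute method receiving the query and its parameter sequence:

```python
import sqlite3

# Minimal sketch (stdlib sqlite3 only) of what a DBAPI2 backend does with
# execute_options: the options are forwarded to the driver's query execution
# method, so execute_options={"parameters": [0]} ends up as the parameter
# sequence in cursor.execute(sql, [0]).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (metric INTEGER)")
conn.executemany("INSERT INTO test_data VALUES (?)", [(-1,), (2,), (5,)])

cursor = conn.cursor()
# 'qmark' parameterisation at the driver level; sqlite3 takes the
# parameter sequence positionally.
cursor.execute("SELECT metric FROM test_data WHERE metric > ?", [0])
rows = cursor.fetchall()
print(rows)  # [(2,), (5,)]
conn.close()
```

The exact keyword the options are passed under varies by driver, which is why execute_options is an opaque pass-through dict rather than named parameters.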

See also

read_database_uri

Create a DataFrame from a SQL query using a URI string.

Notes

  • This function supports a wide range of native database drivers (ranging from local databases such as SQLite to large cloud databases such as Snowflake), as well as generic libraries such as ADBC, SQLAlchemy and various flavours of ODBC. If the backend supports returning Arrow data directly then this facility will be used to efficiently instantiate the DataFrame; otherwise, the DataFrame is initialised from row-wise data.

  • Support for Arrow Flight SQL data is available via the adbc-driver-flightsql package; see https://arrow.apache.org/adbc/current/driver/flight_sql.html for more details about using this driver (notable databases implementing Flight SQL include Dremio and InfluxDB).

  • The read_database_uri function is likely to be noticeably faster than read_database if you are using a SQLAlchemy or DBAPI2 connection, as connectorx will optimise translation of the result set into Arrow format in Rust, whereas these libraries will return row-wise data to Python before we can load into Arrow. Note that you can easily determine the connection’s URI from a SQLAlchemy engine object by calling conn.engine.url.render_as_string(hide_password=False).

  • If polars has to create a cursor from your connection in order to execute the query then that cursor will be automatically closed when the query completes; however, polars will never close any other connection or cursor.
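To make the "row-wise data" point above concrete, here is a small sketch (stdlib sqlite3 only; the table and columns are illustrative) of what a DBAPI2 cursor actually hands back: column metadata via cursor.description and each record as a Python tuple, all of which must be transposed into columnar (Arrow) form — unlike backends that return Arrow data directly:

```python
import sqlite3

# A DBAPI2 cursor (stdlib sqlite3 here) exposes column metadata via
# .description and yields each row as a Python tuple; this row-wise data
# must then be converted to columnar form, which is the overhead that
# Arrow-native paths such as connectorx avoid.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (id INTEGER, score REAL)")
conn.execute("INSERT INTO test_data VALUES (1, 0.5)")

cursor = conn.execute("SELECT id, score FROM test_data")
column_names = [d[0] for d in cursor.description]  # per DBAPI2, name is field 0
first_row = cursor.fetchone()
print(column_names, first_row)  # ['id', 'score'] (1, 0.5)
conn.close()
```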

Examples

Instantiate a DataFrame from a SQL query against a user-supplied connection:

>>> df = pl.read_database(
...     query="SELECT * FROM test_data",
...     connection=user_conn,
...     schema_overrides={"normalised_score": pl.UInt8},
... )  

Use a parameterised SQLAlchemy query, passing named values via execute_options:

>>> df = pl.read_database(
...     query="SELECT * FROM test_data WHERE metric > :value",
...     connection=alchemy_conn,
...     execute_options={"parameters": {"value": 0}},
... )  

Use ‘qmark’ style parameterisation; values are still passed via execute_options, but in this case the “parameters” value is a sequence of literals, not a dict:

>>> df = pl.read_database(
...     query="SELECT * FROM test_data WHERE metric > ?",
...     connection=alchemy_conn,
...     execute_options={"parameters": [0]},
... )  

Instantiate a DataFrame using an ODBC connection string (requires the arrow-odbc package), setting upper limits on the buffer size of variadic text/binary columns and returning the result as an iterator over DataFrames, each containing a batch of 1000 rows:

>>> for df in pl.read_database(
...     query="SELECT * FROM test_data",
...     connection="Driver={PostgreSQL};Server=localhost;Port=5432;Database=test;Uid=usr;Pwd=",
...     execute_options={"max_text_size": 512, "max_binary_size": 1024},
...     iter_batches=True,
...     batch_size=1000,
... ):
...     do_something(df)
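For a sense of the batched fetch that iter_batches builds on for DBAPI2-style connections, the following sketch (stdlib sqlite3 only; table name and sizes are illustrative) pulls rows with fetchmany until the cursor is exhausted — each chunk corresponding to one yielded DataFrame:

```python
import sqlite3

# Sketch of the batched fetch pattern underlying iter_batches for DBAPI2
# connections: rows are drawn in chunks via fetchmany(batch_size), and each
# non-empty chunk would become one DataFrame. Note that the final chunk may
# be smaller than batch_size.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (n INTEGER)")
conn.executemany("INSERT INTO test_data VALUES (?)", [(i,) for i in range(7)])

cursor = conn.execute("SELECT n FROM test_data")
batch_sizes = []
while True:
    batch = cursor.fetchmany(3)  # stand-in for batch_size=3
    if not batch:
        break
    batch_sizes.append(len(batch))

print(batch_sizes)  # [3, 3, 1]
conn.close()
```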