polars.read_database
polars.read_database(
    query: str | TextClause | Selectable,
    connection: ConnectionOrCursor | str,
    *,
    iter_batches: bool = False,
    batch_size: int | None = None,
    schema_overrides: SchemaDict | None = None,
    infer_schema_length: int | None = 100,
    execute_options: dict[str, Any] | None = None,
) → DataFrame | Iterator[DataFrame]
Read the results of a SQL query into a DataFrame, given a connection object.
Parameters:

query
    SQL query to execute (if using a SQLAlchemy connection object this can be a suitable “Selectable”, otherwise it is expected to be a string).
connection
    An instantiated connection (or cursor/client object) that the query can be executed against. Can also pass a valid ODBC connection string (identified as such if it contains the string “Driver={…}”), in which case the arrow-odbc package will be used to establish the connection and return Arrow-native data to Polars. Async driver connections are also supported, though this is currently considered unstable. If using SQLAlchemy, you can configure the connection’s execution_options before passing to read_database to refine its behaviour (see the iter_batches parameter for an example where this can be useful).

    Warning
    Use of asynchronous connections is currently considered unstable, and unexpected issues may arise; if this happens, please report them.
iter_batches
    Return an iterator of DataFrames, where each DataFrame represents a batch of data returned by the query; this can be useful for processing large resultsets in a more memory-efficient manner. If supported by the backend, this value is passed to the underlying query execution method (note that lower values will typically result in poor performance as they will cause many round-trips to the database). If the backend does not support changing the batch size then a single DataFrame is yielded from the iterator.

    Note
    If using SQLAlchemy, you may also want to pass stream_results=True to the connection’s execution_options method when setting this parameter, which will establish a server-side cursor; without this option some drivers (such as “psycopg2”) will still materialise the entire result set client-side before batching the result locally.

batch_size
    Indicate the size of each batch when iter_batches is True (note that you can still set this when iter_batches is False, in which case the resulting DataFrame is constructed internally using batched return before being returned to you). Note that some backends (such as Snowflake) may support batch operation but not allow for an explicit size to be set; in this case you will still receive batches but their size is determined by the backend (in which case any value set here will be ignored).

schema_overrides
    A dictionary mapping column names to dtypes, used to override the schema inferred from the query cursor or given by the incoming Arrow data (depending on driver/backend). This can be useful if the given types can be more precisely defined (for example, if you know that a given column can be declared as u32 instead of i64).

infer_schema_length
    The maximum number of rows to scan for schema inference. If set to None, the full data may be scanned (this can be slow). This parameter only applies if the data is read as a sequence of rows and the schema_overrides parameter is not set for the given column; Arrow-aware drivers also ignore this value. (A short sketch illustrating this parameter follows the parameter list below.)

execute_options
    These options will be passed through into the underlying query execution method as kwargs. In the case of connections made using an ODBC string (which use arrow-odbc) these options are passed to the read_arrow_batches_from_odbc method.
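As referenced under infer_schema_length above, a minimal sketch (assuming a row-returning DBAPI2 connection named user_conn, as in the examples further below) of disabling the row-scan limit so that all returned rows are used for schema inference:

>>> df = pl.read_database(
...     query="SELECT * FROM test_data",
...     connection=user_conn,
...     infer_schema_length=None,  # scan all returned rows when inferring dtypes
... )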
See also
read_database_uri
Create a DataFrame from a SQL query using a URI string.
Notes
This function supports a wide range of native database drivers (ranging from local databases such as SQLite to large cloud databases such as Snowflake), as well as generic libraries such as ADBC, SQLAlchemy and various flavours of ODBC. If the backend supports returning Arrow data directly then this facility will be used to efficiently instantiate the DataFrame; otherwise, the DataFrame is initialised from row-wise data.
Support for Arrow Flight SQL data is available via the adbc-driver-flightsql package; see https://arrow.apache.org/adbc/current/driver/flight_sql.html for more details about using this driver (notable databases implementing Flight SQL include Dremio and InfluxDB).
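As an illustrative sketch only (the endpoint URI below is hypothetical, and it assumes the adbc-driver-flightsql package is installed and a Flight SQL service is reachable there), a connection opened through the driver’s DBAPI interface can be passed straight to read_database:

>>> from adbc_driver_flightsql import dbapi as flightsql
>>> conn = flightsql.connect("grpc://localhost:32010")
>>> df = pl.read_database(
...     query="SELECT * FROM test_data",
...     connection=conn,
... )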
The read_database_uri function can be noticeably faster than read_database if you are using a SQLAlchemy or DBAPI2 connection, as connectorx and adbc optimise translation of the result set into Arrow format. Note that you can determine a connection’s URI from a SQLAlchemy engine object by calling conn.engine.url.render_as_string(hide_password=False).
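For example, a minimal sketch (assuming a SQLAlchemy connection named alchemy_conn, as in the examples below, and that a read_database_uri backend such as connectorx is installed) of deriving the URI from an existing connection and reusing it with read_database_uri:

>>> uri = alchemy_conn.engine.url.render_as_string(hide_password=False)
>>> df = pl.read_database_uri(
...     query="SELECT * FROM test_data",
...     uri=uri,
... )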
If Polars has to create a cursor from your connection in order to execute the query then that cursor will be automatically closed when the query completes; however, Polars will never close any other open connection or cursor.
Polars is able to support more than just relational databases and SQL queries through this function. For example, you can load local graph database results from a KùzuDB connection in conjunction with a Cypher query, or use SurrealQL with SurrealDB.
Examples
Instantiate a DataFrame from a SQL query against a user-supplied connection:
>>> df = pl.read_database(
...     query="SELECT * FROM test_data",
...     connection=user_conn,
...     schema_overrides={"normalised_score": pl.UInt8},
... )
Use a parameterised SQLAlchemy query, passing named values via execute_options:

>>> df = pl.read_database(
...     query="SELECT * FROM test_data WHERE metric > :value",
...     connection=alchemy_conn,
...     execute_options={"parameters": {"value": 0}},
... )
Use ‘qmark’ style parameterisation; values are still passed via execute_options, but in this case the “parameters” value is a sequence of literals, not a dict:

>>> df = pl.read_database(
...     query="SELECT * FROM test_data WHERE metric > ?",
...     connection=alchemy_conn,
...     execute_options={"parameters": [0]},
... )
Batch the results of a large SQLAlchemy query into DataFrames, each containing 100,000 rows; explicitly establish a server-side cursor using the connection’s “execution_options” method to avoid loading the entire result locally before batching (this is not required for all drivers, so check your driver’s documentation for more details):
>>> for df in pl.read_database(
...     query="SELECT * FROM test_data",
...     connection=alchemy_conn.execution_options(stream_results=True),
...     iter_batches=True,
...     batch_size=100_000,
... ):
...     do_something(df)
Instantiate a DataFrame using an ODBC connection string (requires the arrow-odbc package), setting upper limits on the buffer size of variadic text/binary columns:

>>> df = pl.read_database(
...     query="SELECT * FROM test_data",
...     connection="Driver={PostgreSQL};Server=localhost;Port=5432;Database=test;Uid=usr;Pwd=",
...     execute_options={"max_text_size": 512, "max_binary_size": 1024},
... )
Load graph database results from a KùzuDB connection and a Cypher query:

>>> df = pl.read_database(
...     query="MATCH (a:User)-[f:Follows]->(b:User) RETURN a.name, f.since, b.name",
...     connection=kuzu_db_conn,
... )
Load data from an asynchronous SQLAlchemy driver/engine; note that asynchronous connections and sessions are also supported here:
>>> from sqlalchemy.ext.asyncio import create_async_engine
>>> async_engine = create_async_engine("sqlite+aiosqlite:///test.db")
>>> df = pl.read_database(
...     query="SELECT * FROM test_data",
...     connection=async_engine,
... )
Load data from an AsyncSurrealDB client connection object; note that both the “ws” and “http” protocols are supported, as is the synchronous SurrealDB client. The async loop can be run with standard asyncio or with uvloop:

>>> import asyncio  # (or uvloop)
>>> async def surreal_query_to_frame(query: str, url: str):
...     async with AsyncSurrealDB(url) as client:
...         await client.use(namespace="test", database="test")
...         return pl.read_database(query=query, connection=client)
>>> df = asyncio.run(
...     surreal_query_to_frame(
...         query="SELECT * FROM test",
...         url="http://localhost:8000",
...     )
... )