Query Clauses#
Function |
Description |
|---|---|
Retrieves specific column data from one or more tables. |
|
Returns unique values from a query. |
|
Returns the first row for each unique combination of the specified columns. |
|
Specify the table(s) from which to retrieve or delete data. Can also be used as the leading clause. |
|
Combine rows from two or more tables based on a related column. |
|
Filter rows returned from the query based on the given conditions. |
|
Aggregate row values based based on one or more key columns. |
|
Automatically group by all non-aggregate columns in the projection. |
|
Filter groups in a |
|
Define named window specifications for window functions. |
|
Filter rows in a query based on window function results. |
|
Sort the query result based on one or more specified columns. |
|
Sort the query result by all selected columns. |
|
Skip a specified number of rows. |
|
Specify the number of rows returned. |
|
Limit the number of rows returned (alternative to LIMIT). |
SELECT#
Select the columns to be returned by the query.
Example:
df = pl.DataFrame(
{
"a": [1, 2, 3],
"b": ["zz", "yy", "xx"],
}
)
df.sql("""
SELECT a, b FROM self
""")
# shape: (3, 2)
# ┌─────┬─────┐
# │ a ┆ b │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1 ┆ zz │
# │ 2 ┆ yy │
# │ 3 ┆ xx │
# └─────┴─────┘
Note
Use of bare FROM tbl is also supported, as shorthand for SELECT * FROM tbl;
see the FROM clause for more detail.
DISTINCT#
Returns unique values from a query.
Example:
df = pl.DataFrame(
{
"a": [1, 2, 2, 1],
"b": ["xx", "yy", "yy", "xx"],
}
)
df.sql("""
SELECT DISTINCT * FROM self
""")
# shape: (2, 2)
# ┌─────┬─────┐
# │ a ┆ b │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1 ┆ xx │
# │ 2 ┆ yy │
# └─────┴─────┘
DISTINCT ON#
Returns the first row for each unique combination of the specified columns. When used
with ORDER BY, this keeps the first row per group according to the given ordering.
Note
DISTINCT ON only supports column names (not arbitrary expressions).
Example:
df = pl.DataFrame(
{
"category": ["A", "A", "A", "B", "B", "B"],
"value": [30, 10, 20, 50, 40, 60],
"label": ["x", "y", "z", "p", "q", "r"],
}
)
df.sql("""
SELECT DISTINCT ON (category)
category,
value,
label
FROM self
ORDER BY category, value DESC
""")
# shape: (2, 3)
# ┌──────────┬───────┬───────┐
# │ category ┆ value ┆ label │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞══════════╪═══════╪═══════╡
# │ A ┆ 30 ┆ x │
# │ B ┆ 60 ┆ r │
# └──────────┴───────┴───────┘
FROM#
Specifies the table(s) from which to retrieve or delete data.
In addition to the usual SELECT ... FROM tbl syntax, the FROM clause can
also be used as the leading clause in a query, supporting the following variations:
FROM tbl- equivalent toSELECT * FROM tbl.FROM tbl SELECT ...- a reorderedSELECTwith explicit projections.
Example:
df = pl.DataFrame(
{
"a": [1, 2, 3],
"b": ["zz", "yy", "xx"],
}
)
for query in (
"SELECT * FROM self",
"FROM self SELECT *",
"FROM self",
):
df.sql(query)
# shape: (3, 2)
# ┌─────┬─────┐
# │ a ┆ b │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1 ┆ zz │
# │ 2 ┆ yy │
# │ 3 ┆ xx │
# └─────┴─────┘
Using FROM as the leading clause, with SELECT:
df.sql("""
FROM self SELECT b, a
""")
# shape: (3, 2)
# ┌─────┬─────┐
# │ b ┆ a │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ zz ┆ 1 │
# │ yy ┆ 2 │
# │ xx ┆ 3 │
# └─────┴─────┘
JOIN#
Combines rows from two or more tables based on a related column.
Join Types
CROSS JOIN[NATURAL] FULL [OUTER] JOIN[NATURAL] INNER [OUTER] JOIN[NATURAL] LEFT [OUTER] JOIN[NATURAL] RIGHT [OUTER] JOIN[LEFT | RIGHT] ANTI JOIN[LEFT | RIGHT] SEMI JOIN
Example:
df1 = pl.DataFrame(
{
"foo": [1, 2, 3],
"ham": ["a", "b", "c"],
}
)
df2 = pl.DataFrame(
{
"apple": ["x", "y", "z"],
"ham": ["a", "b", "d"],
}
)
pl.sql("""
SELECT foo, apple, COALESCE(df1.ham, df2.ham) AS ham
FROM df1 FULL JOIN df2
USING (ham)
""").collect()
# shape: (4, 3)
# ┌──────┬───────┬─────┐
# │ foo ┆ apple ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ str │
# ╞══════╪═══════╪═════╡
# │ 1 ┆ x ┆ a │
# │ 2 ┆ y ┆ b │
# │ null ┆ z ┆ d │
# │ 3 ┆ null ┆ c │
# └──────┴───────┴─────┘
pl.sql("""
SELECT * FROM df1 NATURAL INNER JOIN df2
""").collect()
# shape: (2, 3)
# ┌─────┬───────┬─────┐
# │ foo ┆ apple ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ str │
# ╞═════╪═══════╪═════╡
# │ 1 ┆ x ┆ a │
# │ 2 ┆ y ┆ b │
# └─────┴───────┴─────┘
WHERE#
Filter rows returned from the query based on the given conditions.
df = pl.DataFrame(
{
"foo": [30, 40, 50],
"ham": ["a", "b", "c"],
}
)
df.sql("""
SELECT * FROM self WHERE foo > 42
""")
# shape: (1, 2)
# ┌─────┬─────┐
# │ foo ┆ ham │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 50 ┆ c │
# └─────┴─────┘
GROUP BY#
Group rows that have the same values in specified columns into summary rows.
Example:
df = pl.DataFrame(
{
"foo": ["a", "b", "b"],
"bar": [10, 20, 30],
}
)
df.sql("""
SELECT foo, SUM(bar) FROM self GROUP BY foo
""")
# shape: (2, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ b ┆ 50 │
# │ a ┆ 10 │
# └─────┴─────┘
GROUP BY ALL#
Automatically groups by all columns in the SELECT projection that are not wrapped in
an aggregate function, a window expression, or a literal value. This is a convenience
shorthand that avoids having to manually repeat column names in the GROUP BY clause.
Example:
df = pl.DataFrame(
{
"category": ["A", "A", "B", "B"],
"sub": ["x", "y", "x", "y"],
"value": [10, 20, 30, 40],
}
)
df.sql("""
SELECT category, sub, SUM(value) AS total
FROM self
GROUP BY ALL
ORDER BY category, sub
""")
# shape: (4, 3)
# ┌──────────┬─────┬───────┐
# │ category ┆ sub ┆ total │
# │ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ i64 │
# ╞══════════╪═════╪═══════╡
# │ A ┆ x ┆ 10 │
# │ A ┆ y ┆ 20 │
# │ B ┆ x ┆ 30 │
# │ B ┆ y ┆ 40 │
# └──────────┴─────┴───────┘
HAVING#
Filter groups in a GROUP BY based on the given conditions.
df = pl.DataFrame(
{
"foo": ["a", "b", "b", "c"],
"bar": [10, 20, 30, 40],
}
)
df.sql("""
SELECT foo, SUM(bar) FROM self GROUP BY foo HAVING bar >= 40
""")
# shape: (2, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ c ┆ 40 │
# │ b ┆ 50 │
# └─────┴─────┘
WINDOW#
Define named window specifications that can be referenced by window functions.
Example:
One window, multiple expressions:
df = pl.DataFrame({
"id": [1, 2, 3, 4, 5, 6, 7],
"category": ["A", "A", "A", "B", "B", "B", "C"],
"value": [20, 10, 30, 15, 50, 30, 35],
})
df.sql("""
SELECT
category,
value,
SUM(value) OVER w AS "w:sum",
MIN(value) OVER w AS "w:min",
AVG(value) OVER w AS "w:avg",
FROM self
WINDOW w AS (PARTITION BY category ORDER BY value)
ORDER BY category, value
""")
# shape: (7, 5)
# ┌──────────┬───────┬───────┬───────┬───────────┐
# │ category ┆ value ┆ w:sum ┆ w:min ┆ w:avg │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 ┆ i64 ┆ f64 │
# ╞══════════╪═══════╪═══════╪═══════╪═══════════╡
# │ A ┆ 10 ┆ 10 ┆ 10 ┆ 20.0 │
# │ A ┆ 20 ┆ 30 ┆ 10 ┆ 20.0 │
# │ A ┆ 30 ┆ 60 ┆ 10 ┆ 20.0 │
# │ B ┆ 15 ┆ 15 ┆ 15 ┆ 31.666667 │
# │ B ┆ 30 ┆ 45 ┆ 15 ┆ 31.666667 │
# │ B ┆ 50 ┆ 95 ┆ 15 ┆ 31.666667 │
# │ C ┆ 35 ┆ 35 ┆ 35 ┆ 35.0 │
# └──────────┴───────┴───────┴───────┴───────────┘
Multiple windows, multiple expressions:
df.sql("""
SELECT
category,
value,
AVG(value) OVER w1 AS category_avg,
SUM(value) OVER w2 AS running_value,
COUNT(*) OVER w3 AS total_count
FROM self
WINDOW
w1 AS (PARTITION BY category),
w2 AS (ORDER BY value),
w3 AS ()
ORDER BY category, value
""")
# shape: (7, 5)
# ┌──────────┬───────┬──────────────┬───────────────┬─────────────┐
# │ category ┆ value ┆ category_avg ┆ running_value ┆ total_count │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ f64 ┆ i64 ┆ u32 │
# ╞══════════╪═══════╪══════════════╪═══════════════╪═════════════╡
# │ A ┆ 10 ┆ 20.0 ┆ 10 ┆ 7 │
# │ A ┆ 20 ┆ 20.0 ┆ 45 ┆ 7 │
# │ A ┆ 30 ┆ 20.0 ┆ 75 ┆ 7 │
# │ B ┆ 15 ┆ 31.666667 ┆ 25 ┆ 7 │
# │ B ┆ 30 ┆ 31.666667 ┆ 105 ┆ 7 │
# │ B ┆ 50 ┆ 31.666667 ┆ 190 ┆ 7 │
# │ C ┆ 35 ┆ 35.0 ┆ 140 ┆ 7 │
# └──────────┴───────┴──────────────┴───────────────┴─────────────┘
QUALIFY#
Filter rows in a query based on window function results.
Example:
Constrain the result to the top (largest) two values per category:
df = pl.DataFrame({
"id": [100, 200, 300, 400, 500, 600, 700, 800],
"category": ["A", "A", "A", "B", "B", "B", "B", "A"],
"value": [20, 15, 30, 25, 15, 50, 35, 45],
})
df.sql("""
SELECT
id,
category,
value
FROM self
WINDOW w AS (PARTITION BY category ORDER BY value DESC)
QUALIFY ROW_NUMBER() OVER w <= 2
ORDER BY category, value DESC
""")
# shape: (4, 3)
# ┌─────┬──────────┬───────┐
# │ id ┆ category ┆ value │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ i64 │
# ╞═════╪══════════╪═══════╡
# │ 800 ┆ A ┆ 45 │
# │ 300 ┆ A ┆ 30 │
# │ 600 ┆ B ┆ 50 │
# │ 700 ┆ B ┆ 35 │
# └─────┴──────────┴───────┘
ORDER BY#
Sort the query result based on one or more specified columns.
Example:
df = pl.DataFrame(
{
"foo": ["b", "a", "c", "b"],
"bar": [20, 10, 40, 30],
}
)
df.sql("""
SELECT foo, bar FROM self ORDER BY bar DESC
""")
# shape: (4, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ c ┆ 40 │
# │ b ┆ 30 │
# │ b ┆ 20 │
# │ a ┆ 10 │
# └─────┴─────┘
ORDER BY ALL#
Sort the query result by all selected columns. This is a convenience shorthand that
avoids repeating column names. The ASC/DESC and NULLS FIRST/NULLS LAST
modifiers apply to every column.
Example:
df = pl.DataFrame(
{
"a": ["x", "y", "x", "y"],
"b": [30, 10, 20, 40],
}
)
df.sql("""
SELECT a, b FROM self ORDER BY ALL
""")
# shape: (4, 2)
# ┌─────┬─────┐
# │ a ┆ b │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ x ┆ 20 │
# │ x ┆ 30 │
# │ y ┆ 10 │
# │ y ┆ 40 │
# └─────┴─────┘
df.sql("""
SELECT a, b FROM self ORDER BY ALL DESC
""")
# shape: (4, 2)
# ┌─────┬─────┐
# │ a ┆ b │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ y ┆ 40 │
# │ y ┆ 10 │
# │ x ┆ 30 │
# │ x ┆ 20 │
# └─────┴─────┘
OFFSET#
Skip a number of rows before starting to return rows from the query.
Example:
df = pl.DataFrame(
{
"foo": ["b", "a", "c", "b"],
"bar": [20, 10, 40, 30],
}
)
df.sql("""
SELECT foo, bar FROM self LIMIT 2 OFFSET 2
""")
# shape: (2, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ c ┆ 40 │
# │ b ┆ 30 │
# └─────┴─────┘
LIMIT#
Limit the number of rows returned by the query.
Example:
df = pl.DataFrame(
{
"foo": ["b", "a", "c", "b"],
"bar": [20, 10, 40, 30],
}
)
df.sql("""
SELECT foo, bar FROM self LIMIT 2
""")
# shape: (2, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ b ┆ 20 │
# │ a ┆ 10 │
# └─────┴─────┘
FETCH#
Limit the number of rows returned by the query; this is the ANSI SQL standard
alternative to the LIMIT clause, and can be combined with OFFSET. The
WITH TIES and PERCENT modifiers are not currently supported.
Example:
df = pl.DataFrame(
{
"foo": ["b", "a", "c", "b"],
"bar": [20, 10, 40, 30],
}
)
df.sql("""
SELECT foo, bar
FROM self
ORDER BY bar
OFFSET 1 FETCH NEXT 2 ROWS ONLY
""")
# shape: (2, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ b ┆ 20 │
# │ b ┆ 30 │
# └─────┴─────┘