Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
pola-rs
GitHub Repository: pola-rs/polars
Path: blob/main/py-polars/docs/source/reference/sql/clauses.rst
8353 views
SQL Clauses
===========

.. list-table::
   :header-rows: 1
   :widths: 20 60

   * - Function
     - Description
   * - :ref:`SELECT <select>`
     - Retrieves specific column data from one or more tables.
   * - :ref:`DISTINCT <distinct>`
     - Returns unique values from a query.
   * - :ref:`FROM <from>`
     - Specify the table(s) from which to retrieve or delete data.
   * - :ref:`JOIN <join>`
     - Combine rows from two or more tables based on a related column.
   * - :ref:`WHERE <where>`
     - Filter rows returned from the query based on the given conditions.
   * - :ref:`GROUP BY <group_by>`
     - Aggregate row values based based on one or more key columns.
   * - :ref:`HAVING <having>`
     - Filter groups in a `GROUP BY` based on the given conditions.
   * - :ref:`WINDOW <window>`
     - Define named window specifications for window functions.
   * - :ref:`QUALIFY <qualify>`
     - Filter rows in a query based on window function results.
   * - :ref:`ORDER BY <order_by>`
     - Sort the query result based on one or more specified columns.
   * - :ref:`OFFSET <offset>`
     - Skip a specified number of rows.
   * - :ref:`LIMIT <limit>`
     - Specify the number of rows returned.
   * - :ref:`FETCH <fetch>`
     - Limit the number of rows returned (alternative to LIMIT).


.. _select:

SELECT
------
Select the columns to be returned by the query.

**Example:**

.. code-block:: python

    df = pl.DataFrame(
      {
        "a": [1, 2, 3],
        "b": ["zz", "yy", "xx"],
      }
    )
    df.sql("""
      SELECT a, b FROM self
    """)
    # shape: (3, 2)
    # ┌─────┬─────┐
    # │ a   ┆ b   │
    # │ --- ┆ --- │
    # │ i64 ┆ str │
    # ╞═════╪═════╡
    # │ 1   ┆ zz  │
    # │ 2   ┆ yy  │
    # │ 3   ┆ xx  │
    # └─────┴─────┘

.. _distinct:

DISTINCT
--------
Returns unique values from a query.

**Example:**

.. code-block:: python

    df = pl.DataFrame(
      {
        "a": [1, 2, 2, 1],
        "b": ["xx", "yy", "yy", "xx"],
      }
    )
    df.sql("""
      SELECT DISTINCT * FROM self
    """)
    # shape: (2, 2)
    # ┌─────┬─────┐
    # │ a   ┆ b   │
    # │ --- ┆ --- │
    # │ i64 ┆ str │
    # ╞═════╪═════╡
    # │ 1   ┆ xx  │
    # │ 2   ┆ yy  │
    # └─────┴─────┘

.. _from:

FROM
----
Specifies the table(s) from which to retrieve or delete data.

**Example:**

.. code-block:: python

    df = pl.DataFrame(
      {
        "a": [1, 2, 3],
        "b": ["zz", "yy", "xx"],
      }
    )
    df.sql("""
      SELECT * FROM self
    """)
    # shape: (3, 2)
    # ┌─────┬─────┐
    # │ a   ┆ b   │
    # │ --- ┆ --- │
    # │ i64 ┆ str │
    # ╞═════╪═════╡
    # │ 1   ┆ zz  │
    # │ 2   ┆ yy  │
    # │ 3   ┆ xx  │
    # └─────┴─────┘

.. _join:

JOIN
----
Combines rows from two or more tables based on a related column.

**Join Types**

* `CROSS JOIN`
* `[NATURAL] FULL JOIN`
* `[NATURAL] INNER JOIN`
* `[NATURAL] LEFT JOIN`
* `[LEFT | RIGHT] ANTI JOIN`
* `[LEFT | RIGHT] SEMI JOIN`

**Example:**

.. code-block:: python

    df1 = pl.DataFrame(
      {
        "foo": [1, 2, 3],
        "ham": ["a", "b", "c"],
      }
    )
    df2 = pl.DataFrame(
      {
        "apple": ["x", "y", "z"],
        "ham": ["a", "b", "d"],
      }
    )
    pl.sql("""
      SELECT foo, apple, COALESCE(df1.ham, df2.ham) AS ham
      FROM df1 FULL JOIN df2
      USING (ham)
    """).collect()
    # shape: (4, 3)
    # ┌──────┬───────┬─────┐
    # │ foo  ┆ apple ┆ ham │
    # │ ---  ┆ ---   ┆ --- │
    # │ i64  ┆ str   ┆ str │
    # ╞══════╪═══════╪═════╡
    # │ 1    ┆ x     ┆ a   │
    # │ 2    ┆ y     ┆ b   │
    # │ null ┆ z     ┆ d   │
    # │ 3    ┆ null  ┆ c   │
    # └──────┴───────┴─────┘

    pl.sql("""
      SELECT * FROM df1 NATURAL INNER JOIN df2
    """).collect()
    # shape: (2, 3)
    # ┌─────┬───────┬─────┐
    # │ foo ┆ apple ┆ ham │
    # │ --- ┆ ---   ┆ --- │
    # │ i64 ┆ str   ┆ str │
    # ╞═════╪═══════╪═════╡
    # │ 1   ┆ x     ┆ a   │
    # │ 2   ┆ y     ┆ b   │
    # └─────┴───────┴─────┘

.. _where:

WHERE
-----

Filter rows returned from the query based on the given conditions.

.. code-block:: python

    df = pl.DataFrame(
      {
        "foo": [30, 40, 50],
        "ham": ["a", "b", "c"],
      }
    )
    df.sql("""
      SELECT * FROM self WHERE foo > 42
    """)
    # shape: (1, 2)
    # ┌─────┬─────┐
    # │ foo ┆ ham │
    # │ --- ┆ --- │
    # │ i64 ┆ str │
    # ╞═════╪═════╡
    # │ 50  ┆ c   │
    # └─────┴─────┘

.. _group_by:

GROUP BY
--------
Group rows that have the same values in specified columns into summary rows.

**Example:**

.. code-block:: python

    df = pl.DataFrame(
        {
          "foo": ["a", "b", "b"],
          "bar": [10, 20, 30],
        }
      )
    df.sql("""
      SELECT foo, SUM(bar) FROM self GROUP BY foo
    """)
    # shape: (2, 2)
    # ┌─────┬─────┐
    # │ foo ┆ bar │
    # │ --- ┆ --- │
    # │ str ┆ i64 │
    # ╞═════╪═════╡
    # │ b   ┆ 50  │
    # │ a   ┆ 10  │
    # └─────┴─────┘

.. _having:

HAVING
------
Filter groups in a `GROUP BY` based on the given conditions.

.. code-block:: python

    df = pl.DataFrame(
          {
          "foo": ["a", "b", "b", "c"],
          "bar": [10, 20, 30, 40],
        }
      )
    df.sql("""
      SELECT foo, SUM(bar) FROM self GROUP BY foo HAVING bar >= 40
    """)
    # shape: (2, 2)
    # ┌─────┬─────┐
    # │ foo ┆ bar │
    # │ --- ┆ --- │
    # │ str ┆ i64 │
    # ╞═════╪═════╡
    # │ c   ┆ 40  │
    # │ b   ┆ 50  │
    # └─────┴─────┘

.. _window:

WINDOW
------
Define named window specifications that can be referenced by window functions.

**Example:**

One window, multiple expressions:

.. code-block:: python

    df = pl.DataFrame({
        "id": [1, 2, 3, 4, 5, 6, 7],
        "category": ["A", "A", "A", "B", "B", "B", "C"],
        "value": [20, 10, 30, 15, 50, 30, 35],
    })
    df.sql("""
      SELECT
        category,
        value,
        SUM(value) OVER w AS "w:sum",
        MIN(value) OVER w AS "w:min",
        AVG(value) OVER w AS "w:avg",
      FROM self
      WINDOW w AS (PARTITION BY category ORDER BY value)
      ORDER BY category, value
    """)
    # shape: (7, 5)
    # ┌──────────┬───────┬───────┬───────┬───────────┐
    # │ category ┆ value ┆ w:sum ┆ w:min ┆ w:avg     │
    # │ ---      ┆ ---   ┆ ---   ┆ ---   ┆ ---       │
    # │ str      ┆ i64   ┆ i64   ┆ i64   ┆ f64       │
    # ╞══════════╪═══════╪═══════╪═══════╪═══════════╡
    # │ A        ┆ 10    ┆ 10    ┆ 10    ┆ 20.0      │
    # │ A        ┆ 20    ┆ 30    ┆ 10    ┆ 20.0      │
    # │ A        ┆ 30    ┆ 60    ┆ 10    ┆ 20.0      │
    # │ B        ┆ 15    ┆ 15    ┆ 15    ┆ 31.666667 │
    # │ B        ┆ 30    ┆ 45    ┆ 15    ┆ 31.666667 │
    # │ B        ┆ 50    ┆ 95    ┆ 15    ┆ 31.666667 │
    # │ C        ┆ 35    ┆ 35    ┆ 35    ┆ 35.0      │
    # └──────────┴───────┴───────┴───────┴───────────┘

Multiple windows, multiple expressions:

.. code-block:: python

    df.sql("""
      SELECT
        category,
        value,
        AVG(value) OVER w1 AS category_avg,
        SUM(value) OVER w2 AS running_value,
        COUNT(*) OVER w3 AS total_count
      FROM self
      WINDOW
        w1 AS (PARTITION BY category),
        w2 AS (ORDER BY value),
        w3 AS ()
      ORDER BY category, value
    """)
    # shape: (7, 5)
    # ┌──────────┬───────┬──────────────┬───────────────┬─────────────┐
    # │ category ┆ value ┆ category_avg ┆ running_value ┆ total_count │
    # │ ---      ┆ ---   ┆ ---          ┆ ---           ┆ ---         │
    # │ str      ┆ i64   ┆ f64          ┆ i64           ┆ u32         │
    # ╞══════════╪═══════╪══════════════╪═══════════════╪═════════════╡
    # │ A        ┆ 10    ┆ 20.0         ┆ 10            ┆ 7           │
    # │ A        ┆ 20    ┆ 20.0         ┆ 45            ┆ 7           │
    # │ A        ┆ 30    ┆ 20.0         ┆ 75            ┆ 7           │
    # │ B        ┆ 15    ┆ 31.666667    ┆ 25            ┆ 7           │
    # │ B        ┆ 30    ┆ 31.666667    ┆ 105           ┆ 7           │
    # │ B        ┆ 50    ┆ 31.666667    ┆ 190           ┆ 7           │
    # │ C        ┆ 35    ┆ 35.0         ┆ 140           ┆ 7           │
    # └──────────┴───────┴──────────────┴───────────────┴─────────────┘

.. _qualify:

QUALIFY
-------
Filter rows in a query based on window function results.

**Example:**

Constrain the result to the top (largest) two values per category:

.. code-block:: python

    df = pl.DataFrame({
        "id": [100, 200, 300, 400, 500, 600, 700, 800],
        "category": ["A", "A", "A", "B", "B", "B", "B", "A"],
        "value": [20, 15, 30, 25, 15, 50, 35, 45],
    })
    df.sql("""
      SELECT
        id,
        category,
        value
      FROM self
      WINDOW w AS (PARTITION BY category ORDER BY value DESC)
      QUALIFY ROW_NUMBER() OVER w <= 2
      ORDER BY category, value DESC
    """)
    # shape: (4, 3)
    # ┌─────┬──────────┬───────┐
    # │ id  ┆ category ┆ value │
    # │ --- ┆ ---      ┆ ---   │
    # │ i64 ┆ str      ┆ i64   │
    # ╞═════╪══════════╪═══════╡
    # │ 800 ┆ A        ┆ 45    │
    # │ 300 ┆ A        ┆ 30    │
    # │ 600 ┆ B        ┆ 50    │
    # │ 700 ┆ B        ┆ 35    │
    # └─────┴──────────┴───────┘

.. _order_by:

ORDER BY
--------
Sort the query result based on one or more specified columns.

**Example:**

.. code-block:: python

    df = pl.DataFrame(
      {
        "foo": ["b", "a", "c", "b"],
        "bar": [20, 10, 40, 30],
      }
    )
    df.sql("""
      SELECT foo, bar FROM self ORDER BY bar DESC
    """)
    # shape: (4, 2)
    # ┌─────┬─────┐
    # │ foo ┆ bar │
    # │ --- ┆ --- │
    # │ str ┆ i64 │
    # ╞═════╪═════╡
    # │ c   ┆ 40  │
    # │ b   ┆ 30  │
    # │ b   ┆ 20  │
    # │ a   ┆ 10  │
    # └─────┴─────┘

.. _offset:

OFFSET
------
Skip a number of rows before starting to return rows from the query.

**Example:**

.. code-block:: python

    df = pl.DataFrame(
      {
        "foo": ["b", "a", "c", "b"],
        "bar": [20, 10, 40, 30],
      }
    )
    df.sql("""
      SELECT foo, bar FROM self LIMIT 2 OFFSET 2
    """)
    # shape: (2, 2)
    # ┌─────┬─────┐
    # │ foo ┆ bar │
    # │ --- ┆ --- │
    # │ str ┆ i64 │
    # ╞═════╪═════╡
    # │ c   ┆ 40  │
    # │ b   ┆ 30  │
    # └─────┴─────┘

.. _limit:

LIMIT
-----
Limit the number of rows returned by the query.

**Example:**

.. code-block:: python

    df = pl.DataFrame(
      {
        "foo": ["b", "a", "c", "b"],
        "bar": [20, 10, 40, 30],
      }
    )
    df.sql("""
      SELECT foo, bar FROM self LIMIT 2
    """)
    # shape: (2, 2)
    # ┌─────┬─────┐
    # │ foo ┆ bar │
    # │ --- ┆ --- │
    # │ str ┆ i64 │
    # ╞═════╪═════╡
    # │ b   ┆ 20  │
    # │ a   ┆ 10  │
    # └─────┴─────┘

.. _fetch:

FETCH
-----
Limit the number of rows returned by the query; this is the ANSI SQL standard
alternative to the ``LIMIT`` clause, and can be combined with ``OFFSET``. The
`WITH TIES` and `PERCENT` modifiers are not currently supported.

**Example:**

.. code-block:: python

    df = pl.DataFrame(
      {
        "foo": ["b", "a", "c", "b"],
        "bar": [20, 10, 40, 30],
      }
    )
    df.sql("""
      SELECT foo, bar
      FROM self
      ORDER BY bar
      OFFSET 1 FETCH NEXT 2 ROWS ONLY
    """)
    # shape: (2, 2)
    # ┌─────┬─────┐
    # │ foo ┆ bar │
    # │ --- ┆ --- │
    # │ str ┆ i64 │
    # ╞═════╪═════╡
    # │ b   ┆ 20  │
    # │ b   ┆ 30  │
    # └─────┴─────┘