Path: blob/main/py-polars/docs/source/reference/selectors.rst
8384 views
=========
Selectors
=========
.. currentmodule:: polars
Selectors allow for more intuitive selection of columns from :class:`DataFrame`
or :class:`LazyFrame` objects based on their name, dtype or other properties.
They unify and build on the related functionality that is available through
the :meth:`col` expression and can also broadcast expressions over the selected
columns.
Importing
---------
* Selectors are available as functions imported from ``polars.selectors``
* Typical/recommended usage is to import the module as ``cs`` and employ selectors from there.
.. code-block:: python
import polars.selectors as cs
import polars as pl
df = pl.DataFrame(
{
"w": ["xx", "yy", "xx", "yy", "xx"],
"x": [1, 2, 1, 4, -2],
"y": [3.0, 4.5, 1.0, 2.5, -2.0],
"z": ["a", "b", "a", "b", "b"],
},
)
df.group_by(cs.string()).agg(cs.numeric().sum())
Set operations
--------------
Selectors support the following ``set`` operations:
.. table::
:widths: 20 60
+------------------------+------------+
| Operation | Expression |
+========================+============+
| `UNION` | ``A | B`` |
+------------------------+------------+
| `INTERSECTION` | ``A & B`` |
+------------------------+------------+
| `DIFFERENCE` | ``A - B`` |
+------------------------+------------+
| `SYMMETRIC DIFFERENCE` | ``A ^ B`` |
+------------------------+------------+
| `COMPLEMENT` | ``~A`` |
+------------------------+------------+
Note that both individual selector results and selector set operations will always return
matching columns in the same order as the underlying frame schema.
Examples
========
.. code-block:: python
import polars.selectors as cs
import polars as pl
# set up an empty dataframe with plenty of columns of various dtypes
df = pl.DataFrame(
schema={
"abc": pl.UInt16,
"bbb": pl.UInt32,
"cde": pl.Float64,
"def": pl.Float32,
"eee": pl.Boolean,
"fgg": pl.Boolean,
"ghi": pl.Time,
"JJK": pl.Date,
"Lmn": pl.Duration,
"opp": pl.Datetime("ms"),
"qqR": pl.String,
},
)
# Select the UNION of temporal, strings and columns that start with "e"
assert df.select(cs.temporal() | cs.string() | cs.starts_with("e")).schema == {
"eee": pl.Boolean,
"ghi": pl.Time,
"JJK": pl.Date,
"Lmn": pl.Duration,
"opp": pl.Datetime("ms"),
"qqR": pl.String,
}
# Select the INTERSECTION of temporal and column names that match "opp" OR "JJK"
assert df.select(cs.temporal() & cs.matches("opp|JJK")).schema == {
"JJK": pl.Date,
"opp": pl.Datetime("ms"),
}
# Select the DIFFERENCE of temporal columns and columns that contain the name "opp" OR "JJK"
assert df.select(cs.temporal() - cs.matches("opp|JJK")).schema == {
"ghi": pl.Time,
"Lmn": pl.Duration,
}
# Select the SYMMETRIC DIFFERENCE of numeric columns and columns that contain an "e"
assert df.select(cs.contains("e") ^ cs.numeric()).schema == {
"abc": UInt16,
"bbb": UInt32,
"eee": Boolean,
}
# Select the COMPLEMENT of all columns of dtypes Duration and Time
assert df.select(~cs.by_dtype([pl.Duration, pl.Time])).schema == {
"abc": pl.UInt16,
"bbb": pl.UInt32,
"cde": pl.Float64,
"def": pl.Float32,
"eee": pl.Boolean,
"fgg": pl.Boolean,
"JJK": pl.Date,
"opp": pl.Datetime("ms"),
"qqR": pl.String,
}
.. note::
If you don't want to use the set operations on the selectors, you can materialize them as ``expressions``
by calling ``as_expr``. This ensures the operations ``OR, AND, etc`` are dispatched to the underlying
expressions instead.
Functions
---------
Available selector functions:
.. automodule:: polars.selectors
:members:
:autosummary:
:autosummary-no-titles: