Path: blob/main/docs/source/user-guide/lazy/multiplexing.md
# Multiplexing queries
In the Sources and Sinks page, we already discussed multiplexing as a way to split a query into multiple sinks. This page goes a bit deeper into that concept, as it is important to understand when combining `LazyFrame`s with procedural programming constructs.
When dealing with eager dataframes, it is very common to keep state in a temporary variable. Let's look at the following example. Below we create a `DataFrame` with 10 unique elements in a random order (so that Polars doesn't hit any fast paths for sorted keys).
{{code_block('user-guide/lazy/multiplexing','dataframe',[])}}
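The referenced snippet is not reproduced on this page. As a rough, hypothetical sketch (the column name `n` and the use of `random.shuffle` are assumptions, not the documented code), the setup could look like this:

```python
import random

import polars as pl

# Ten unique keys in a shuffled order, so Polars cannot use a sorted-key fast path.
values = list(range(10))
random.shuffle(values)

df = pl.DataFrame({"n": values})
print(df)
```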
## Eager
If you use the Polars eager API, making a variable and iterating over that temporary `DataFrame` gives the result you expect, as the result of the group-by is stored in `df1`. Even though the output order is unstable, it doesn't matter because it is eagerly evaluated. The following snippet therefore doesn't raise and the assert passes.

{{code_block('user-guide/lazy/multiplexing','eager',[])}}
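As a hedged sketch rather than the documented snippet (the filter-based branching below is an assumption), the eager pattern amounts to materializing the group-by once and then reusing it freely:

```python
import polars as pl

# Hypothetical setup mirroring the sketch above.
df = pl.DataFrame({"n": [3, 7, 1, 9, 0, 5, 8, 2, 6, 4]})

# The group-by is evaluated eagerly, exactly once, and the result is stored in df1.
df1 = df.group_by("n").len()

# Branching off the materialized result is safe: both derived frames read the
# same, already-computed table.
out1 = df1.filter(pl.col("n") < 5)
out2 = df1.filter(pl.col("n") >= 5)

assert out1.height + out2.height == df1.height
```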
## Lazy
Now if we tried this naively with `LazyFrame`s, this would fail.
{{code_block('user-guide/lazy/multiplexing','lazy',[])}}
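A hypothetical sketch of the failure mode (again not the documented snippet): because `lf1` is only a plan, every `collect` re-runs the group-by, and the unstable output order means two collects need not agree.

```python
import polars as pl

df = pl.DataFrame({"n": [3, 7, 1, 9, 0, 5, 8, 2, 6, 4]})

# lf1 holds a query plan, not data.
lf1 = df.lazy().group_by("n").len()

# Each collect() triggers an independent evaluation of that plan.
first = lf1.collect()
second = lf1.collect()

# Group-by output order is not stable, so this equality check may fail.
# assert first.equals(second)
```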
The reason this fails is that `lf1` doesn't contain the materialized result of `df.lazy().group_by("n").len()`; instead it holds the query plan. This means that every time we branch off this `LazyFrame` and call `collect`, we re-evaluate the group-by. Besides being expensive, this also leads to unexpected results if you assume that the output is stable (which isn't the case here).
In the example above you are actually evaluating 2 query plans:
**Plan 1**

**Plan 2**
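You can inspect the two plans yourself with `explain`; a minimal sketch, assuming the two branches filter the shared group-by:

```python
import polars as pl

df = pl.DataFrame({"n": [3, 7, 1, 9, 0, 5, 8, 2, 6, 4]})
lf1 = df.lazy().group_by("n").len()

# Hypothetical branches: each plan printed below contains the full group-by,
# so the work is duplicated rather than shared.
print(lf1.filter(pl.col("n") < 5).explain())   # plan 1
print(lf1.filter(pl.col("n") >= 5).explain())  # plan 2
```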
## Combine the query plans
To circumvent this, we must give Polars the opportunity to look at all the query plans in a single optimization and execution pass. This can be done by passing the diverging `LazyFrame`s to the `collect_all` function.
{{code_block('user-guide/lazy/multiplexing','collect_all',[])}}
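As a hedged sketch of the same idea (the branch shapes are assumptions), handing every diverging branch to `pl.collect_all` lets the optimizer plan and execute them together:

```python
import polars as pl

df = pl.DataFrame({"n": [3, 7, 1, 9, 0, 5, 8, 2, 6, 4]})
lf1 = df.lazy().group_by("n").len()

branches = [
    lf1.filter(pl.col("n") < 5),
    lf1.filter(pl.col("n") >= 5),
]

# Both branches are optimized and executed in a single pass; the shared
# group-by subplan is computed only once.
out1, out2 = pl.collect_all(branches)
assert out1.height + out2.height == 10
```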
If we explain the combined queries with `pl.explain_all`, we can also observe that they are shared under a single "SINK_MULTIPLE" evaluation and that the optimizer has recognized that parts of the query come from the same subplan, indicated by the inserted "CACHE" nodes.
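A minimal sketch of that inspection, assuming a Polars version that ships `pl.explain_all` and the hypothetical branches used above:

```python
import polars as pl

df = pl.DataFrame({"n": [3, 7, 1, 9, 0, 5, 8, 2, 6, 4]})
lf1 = df.lazy().group_by("n").len()

branches = [lf1.filter(pl.col("n") < 5), lf1.filter(pl.col("n") >= 5)]

# The combined, optimized plan should show a single SINK_MULTIPLE root and
# CACHE nodes where the shared group-by subplan was recognized.
print(pl.explain_all(branches))
```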
Combining related subplans in a single execution unit with `pl.collect_all` can thus lead to large performance increases, and it allows for diverging query plans, storing temporary tables, and a more procedural programming style.