GitHub Repository: pola-rs/polars
Path: blob/main/crates/polars-lazy/src/lib.rs
//! Lazy API of Polars
//!
//! The lazy API of Polars supports a subset of the eager API. Apart from distributed compute,
//! it is very similar to [Apache Spark](https://spark.apache.org/). You write queries in a
//! domain-specific language. These queries translate to a logical plan, which represents your query steps.
//! Before execution, this logical plan is optimized: the optimizer may reorder operations if that improves performance,
//! and it may insert implicit type casts so that the query does not fail with a type error (if the error can be resolved).
//!
//! # Lazy DSL
//!
//! The lazy API of Polars replaces the eager [`DataFrame`] with the [`LazyFrame`], through which
//! the lazy API is exposed.
//! The [`LazyFrame`] represents a logical execution plan: a sequence of operations to perform on a concrete data source.
//! These operations are not executed until we call [`collect`].
//! This allows Polars to optimize and reorder the query, which may lead to faster queries or fewer type errors.
//!
//! [`DataFrame`]: polars_core::frame::DataFrame
//! [`LazyFrame`]: crate::frame::LazyFrame
//! [`collect`]: crate::frame::LazyFrame::collect
//!
//! In general, a [`LazyFrame`] requires a concrete data source (a [`DataFrame`], a file on disk, etc.) to which
//! polars-lazy then applies the user-specified sequence of operations.
//! To obtain a [`LazyFrame`] from an existing [`DataFrame`], we call the [`lazy`](crate::frame::IntoLazy::lazy) method on
//! the [`DataFrame`].
//! A [`LazyFrame`] can also be obtained through the lazy versions of file readers, such as [`LazyCsvReader`](crate::frame::LazyCsvReader).
//!
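//! As a minimal sketch of the `DataFrame` route (the function name `round_trip` and the column
//! name `x` are made up for illustration), converting to a [`LazyFrame`] and collecting straight
//! away should give back the same data:
//!
//! ```rust
//! use polars_core::prelude::*;
//! use polars_core::df;
//! use polars_lazy::prelude::*;
//!
//! // Hypothetical helper: returns `true` if the round trip preserved the data.
//! fn round_trip() -> PolarsResult<bool> {
//!     let df = df!("x" => [1, 2, 3])?;
//!
//!     // `lazy` consumes the DataFrame, so clone it to compare afterwards.
//!     let lf: LazyFrame = df.clone().lazy();
//!
//!     // Nothing has been executed so far; `collect` runs the (empty) plan.
//!     let out = lf.collect()?;
//!     Ok(out.equals(&df))
//! }
//! ```
//!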
//! The other major component of the Polars lazy API is [`Expr`](crate::dsl::Expr), which represents an operation to be
//! performed on a [`LazyFrame`], such as mapping over a column, filtering, or a group-by aggregation.
//! [`Expr`]s and the functions that produce them can be found in the [dsl module](crate::dsl).
//!
//! [`Expr`]: crate::dsl::Expr
//!
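//! Because an [`Expr`] is an ordinary value, it can be built first and attached to a plan later.
//! A minimal sketch, in which the function name `times_ten` and the column names are made up
//! for illustration:
//!
//! ```rust
//! use polars_core::prelude::*;
//! use polars_core::df;
//! use polars_lazy::prelude::*;
//!
//! // Hypothetical helper: multiplies column "a" by ten through a pre-built expression.
//! fn times_ten() -> PolarsResult<DataFrame> {
//!     // Building the expression does not touch any data yet.
//!     let expr: Expr = (col("a") * lit(10)).alias("a_times_ten");
//!
//!     df!("a" => [1, 2, 3])?
//!         .lazy()
//!         .select([expr])
//!         .collect()
//! }
//! ```
//!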
//! Most operations on a [`LazyFrame`] consume the [`LazyFrame`] and return a new [`LazyFrame`] with the updated plan.
//! If you need to use the same [`LazyFrame`] multiple times, you should [`clone`](crate::frame::LazyFrame::clone) it, and optionally
//! [`cache`](crate::frame::LazyFrame::cache) it beforehand.
//!
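//! The following sketch shows one way to reuse a plan for two queries; the function name
//! `reuse_plan` and the column name `a` are made up for illustration:
//!
//! ```rust
//! use polars_core::prelude::*;
//! use polars_core::df;
//! use polars_lazy::prelude::*;
//!
//! // Hypothetical helper: runs two different filters on the same source plan.
//! fn reuse_plan() -> PolarsResult<(DataFrame, DataFrame)> {
//!     let lf = df!("a" => [1, 2, 3, 4, 5])?.lazy();
//!
//!     // `filter` consumes its receiver, so clone the plan to keep `lf` usable.
//!     let small = lf.clone().filter(col("a").lt(lit(3))).collect()?;
//!
//!     // The original plan is consumed by this second query.
//!     let large = lf.filter(col("a").gt_eq(lit(3))).collect()?;
//!
//!     Ok((small, large))
//! }
//! ```
//!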
//! ## Examples
//!
//! #### Adding a new column to a lazy DataFrame
//!
//! ```rust
//! #[macro_use] extern crate polars_core;
//! use polars_core::prelude::*;
//! use polars_lazy::prelude::*;
//!
//! let df = df! {
//!     "column_a" => &[1, 2, 3, 4, 5],
//!     "column_b" => &["a", "b", "c", "d", "e"]
//! }.unwrap();
//!
//! let new = df.lazy()
//!     // Note the reverse here!!
//!     .reverse()
//!     .with_column(
//!         // always rename a new column
//!         (col("column_a") * lit(10)).alias("new_column")
//!     )
//!     .collect()
//!     .unwrap();
//!
//! assert!(new.column("new_column")
//!     .unwrap()
//!     .equals(
//!         &Column::new("new_column".into(), &[50, 40, 30, 20, 10])
//!     )
//! );
//! ```
//! #### Modifying a column based on some predicate
//!
//! ```rust
//! #[macro_use] extern crate polars_core;
//! use polars_core::prelude::*;
//! use polars_lazy::prelude::*;
//!
//! let df = df! {
//!     "column_a" => &[1, 2, 3, 4, 5],
//!     "column_b" => &["a", "b", "c", "d", "e"]
//! }.unwrap();
//!
//! let new = df.lazy()
//!     .with_column(
//!         // value = 100 if x < 3 else x
//!         when(
//!             col("column_a").lt(lit(3))
//!         ).then(
//!             lit(100)
//!         ).otherwise(
//!             col("column_a")
//!         ).alias("new_column")
//!     )
//!     .collect()
//!     .unwrap();
//!
//! assert!(new.column("new_column")
//!     .unwrap()
//!     .equals(
//!         &Column::new("new_column".into(), &[100, 100, 3, 4, 5])
//!     )
//! );
//! ```
//! #### Groupby + Aggregations
//!
//! ```rust
//! use polars_core::prelude::*;
//! use polars_core::df;
//! use polars_lazy::prelude::*;
//!
//! fn example() -> PolarsResult<DataFrame> {
//!     let df = df!(
//!         "date" => ["2020-08-21", "2020-08-21", "2020-08-22", "2020-08-23", "2020-08-22"],
//!         "temp" => [20, 10, 7, 9, 1],
//!         "rain" => [0.2, 0.1, 0.3, 0.1, 0.01]
//!     )?;
//!
//!     df.lazy()
//!         .group_by([col("date")])
//!         .agg([
//!             col("rain").min().alias("min_rain"),
//!             col("rain").sum().alias("sum_rain"),
//!             col("rain").quantile(lit(0.5), QuantileMethod::Nearest).alias("median_rain"),
//!         ])
//!         .sort(["date"], Default::default())
//!         .collect()
//! }
//! ```
//!
//! #### Calling any function
//!
//! Below we lazily call a custom closure of type `Column => PolarsResult<Column>`. Because the closure
//! changes the type/variant of the column, we also define the return type. This is important because,
//! due to the laziness, the types must be known beforehand. Note that by applying these custom
//! functions you have access to the whole **eager API** of the Series/ChunkedArrays.
//!
//! ```rust
//! #[macro_use] extern crate polars_core;
//! use polars_core::prelude::*;
//! use polars_lazy::prelude::*;
//!
//! let df = df! {
//!     "column_a" => &[1, 2, 3, 4, 5],
//!     "column_b" => &["a", "b", "c", "d", "e"]
//! }.unwrap();
//!
//! let new = df.lazy()
//!     .with_column(
//!         col("column_a")
//!             // apply a custom closure Column => PolarsResult<Column>
//!             .map(
//!                 // the values are f64 so that they match the declared Float64 return type
//!                 |_s| Ok(Column::new("".into(), &[6.0f64, 6.0, 6.0, 6.0, 6.0])),
//!                 // return type of the closure
//!                 |_, f| Ok(Field::new(f.name().clone(), DataType::Float64))
//!             ).alias("new_column"),
//!     )
//!     .collect()
//!     .unwrap();
//! ```
//!
//! #### Joins, filters and projections
//!
//! In the query below we do a lazy join and afterwards filter rows based on the predicate `a < 2`.
//! We then aggregate per group and finally select the columns `"b"` and `"c_first"`. In an eager API this query would be very
//! suboptimal because we join on DataFrames with more columns and rows than needed. In this case
//! the query optimizer will do the selection of the columns (projection) and the filtering of the
//! rows (selection) before the join, thereby reducing the amount of work done by the query.
//!
//! ```rust
//! # use polars_core::prelude::*;
//! # use polars_lazy::prelude::*;
//!
//! fn example(df_a: DataFrame, df_b: DataFrame) -> LazyFrame {
//!     df_a.lazy()
//!         .left_join(df_b.lazy(), col("b_left"), col("b_right"))
//!         .filter(
//!             col("a").lt(lit(2))
//!         )
//!         .group_by([col("b")])
//!         .agg(
//!             // the aliases must match the column names selected below
//!             vec![col("b").first().alias("b_first"), col("c").first().alias("c_first")]
//!         )
//!         .select(&[col("b"), col("c_first")])
//! }
//! ```
//!
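//! To see what the optimizer did to a query, the plan can be rendered as text before and after
//! optimization. The sketch below assumes that `LazyFrame::explain(optimized: bool)` is available
//! in the Polars version in use; the function name `plans` and the column names are made up for
//! illustration.
//!
//! ```rust
//! use polars_core::prelude::*;
//! use polars_core::df;
//! use polars_lazy::prelude::*;
//!
//! // Hypothetical helper: returns the (unoptimized, optimized) plans as text.
//! fn plans() -> PolarsResult<(String, String)> {
//!     let lf = df!(
//!         "a" => [1, 2, 3],
//!         "b" => ["x", "y", "z"],
//!         "c" => [0.1, 0.2, 0.3]
//!     )?
//!     .lazy()
//!     .filter(col("a").lt(lit(2)))
//!     .select([col("b")]);
//!
//!     // `explain` only renders the plan; the query itself is not executed.
//!     Ok((lf.explain(false)?, lf.explain(true)?))
//! }
//! ```
//!
//! Comparing the two strings shows where the filter and the column selection end up in the plan.
//!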
//! If we want to do an aggregation on all columns we can use the wildcard operator `*` to achieve this.
//!
//! ```rust
//! # use polars_core::prelude::*;
//! # use polars_lazy::prelude::*;
//!
//! fn aggregate_all_columns(df_a: DataFrame) -> LazyFrame {
//!     df_a.lazy()
//!         .group_by([col("b")])
//!         .agg(
//!             vec![col("*").first()]
//!         )
//! }
//! ```
#![allow(ambiguous_glob_reexports)]
#![cfg_attr(docsrs, feature(doc_auto_cfg))]
#![cfg_attr(
    feature = "allow_unused",
    allow(unused, dead_code, irrefutable_let_patterns)
)] // May be caused by some feature
extern crate core;

#[cfg(feature = "dot_diagram")]
mod dot;
pub mod dsl;
pub mod frame;
pub mod physical_plan;
pub mod prelude;

mod scan;
#[cfg(test)]
mod tests;