Path: blob/main/docs/source/user-guide/migration/spark.md
# Coming from Apache Spark
## Column-based API vs. Row-based API
Whereas the Spark `DataFrame` is analogous to a collection of rows, a Polars `DataFrame` is closer to a collection of columns. This means that you can combine columns in Polars in ways that are not possible in Spark, because Spark preserves the relationship of the data in each row.
Consider this sample dataset:
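For illustration, assume a small frame with the columns `foo` and `bar` used throughout this page (the specific values are an assumption, chosen so that the results discussed below, such as the sum of 9, work out):

```python
import polars as pl

df = pl.DataFrame(
    {
        "foo": ["a", "b", "c", "d", "d"],
        "bar": [1, 2, 3, 4, 5],
    }
)
```

and the equivalent Spark DataFrame, assuming an active `SparkSession` named `spark`:

```python
dfs = spark.createDataFrame(
    [
        ("a", 1),
        ("b", 2),
        ("c", 3),
        ("d", 4),
        ("d", 5),
    ],
    schema=["foo", "bar"],
)
```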
### Example 1: Combining `head` and `sum`
In Polars you can write something like this:
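A minimal sketch, assuming the sample data above; the `filter` on the `"d"` rows is one way to arrive at the single sum shown in the output below:

```python
df.select(
    pl.col("foo").head(2),
    pl.col("bar").filter(pl.col("foo") == "d").sum(),
)
```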
Output:
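```
shape: (2, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 9   │
│ b   ┆ 9   │
└─────┴─────┘
```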
The expressions on columns `foo` and `bar` are completely independent. Since the expression on `bar` returns a single value, that value is repeated for each value output by the expression on `foo`. But `a` and `b` have no relation to the data that produced the sum of 9.
To do something similar in Spark, you'd need to compute the sum separately and provide it as a literal:
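A minimal sketch, assuming the sample data above (the name `bar_sum` is illustrative):

```python
from pyspark.sql import functions as F

# Compute the filtered sum separately and collect it to the driver.
bar_sum = (
    dfs.where(F.col("foo") == "d")
    .agg(F.sum("bar").alias("bar_sum"))
    .collect()[0]["bar_sum"]
)

# Provide the sum as a literal column next to the first two values of foo.
dfs.orderBy("foo").limit(2).select("foo", F.lit(bar_sum).alias("bar")).show()
```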
Output:
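```
+---+---+
|foo|bar|
+---+---+
|  a|  9|
|  b|  9|
+---+---+
```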
### Example 2: Combining Two `head`s
In Polars you can combine two different `head` expressions on the same DataFrame, provided that they return the same number of values.
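A minimal sketch, assuming the sample data above, that pairs the first two values of `foo` (ascending) with the two largest values of `bar`:

```python
df.select(
    pl.col("foo").sort().head(2),
    pl.col("bar").sort(descending=True).head(2),
)
```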
Output:
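```
shape: (2, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 5   │
│ b   ┆ 4   │
└─────┴─────┘
```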
Again, the two `head` expressions here are completely independent, and the pairing of `a` to `5` and `b` to `4` results purely from the juxtaposition of the two columns output by the expressions.
To accomplish something similar in Spark, you would need to generate an artificial key that enables you to join the values in this way.
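A minimal sketch of that approach, assuming the sample data above and using `row_number` over two orderings as the artificial key (the names `foo_dfs`, `bar_dfs` and `rownum` are illustrative; windows without a `partitionBy` are fine for a tiny example like this):

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank foo ascending and bar descending, then join on the artificial rank.
foo_dfs = dfs.select(
    "foo", F.row_number().over(Window.orderBy("foo")).alias("rownum")
)
bar_dfs = dfs.select(
    "bar", F.row_number().over(Window.orderBy(F.col("bar").desc())).alias("rownum")
)

(
    foo_dfs.join(bar_dfs, on="rownum")
    .filter(F.col("rownum") <= 2)
    .orderBy("rownum")
    .select("foo", "bar")
    .show()
)
```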
Output:
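```
+---+---+
|foo|bar|
+---+---+
|  a|  5|
|  b|  4|
+---+---+
```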
### Example 3: Composing expressions
Polars allows you to compose expressions quite liberally. For example, if you want to find the rolling mean of a lagged variable, you can compose `shift` and `rolling_mean` and evaluate them in a single `over` expression:
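A minimal sketch, assuming a frame `prices` with columns `store` and `price` that is already sorted by date within each store (the frame and column names other than `price` are assumptions, as is the window size of 7):

```python
feature_df = prices.with_columns(
    feature=pl.col("price").shift(1).rolling_mean(window_size=7).over("store")
)
```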
In PySpark, however, this is not allowed. Spark allows composing expressions such as `F.mean(F.abs("price")).over(window)` because `F.abs` is an elementwise function, but not `F.mean(F.lag("price", 1)).over(window)` because `F.lag` is a window function. To produce the same result, both `F.lag` and `F.mean` need their own window.
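A minimal sketch of the two-window version, under the same assumed `store`/`price` schema ordered by a `date` column (the frame name `prices_dfs` and the 7-row frame are illustrative):

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

window = Window.partitionBy("store").orderBy("date")

# F.lag cannot be nested inside F.mean(...).over(...); give it its own window,
# then take the rolling mean of the lagged column over a second, bounded window.
feature_dfs = prices_dfs.withColumn(
    "lagged_price", F.lag("price", 1).over(window)
).withColumn(
    "feature", F.mean("lagged_price").over(window.rowsBetween(-6, 0))
)
```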