GitHub Repository: pola-rs/polars
Path: blob/main/docs/source/releases/upgrade/1.md

Version 1

Breaking changes

Properly apply `strict` parameter in Series constructor

The behavior of the Series constructor has been updated. Generally, it will be more strict, unless the user passes strict=False.

Strict construction is more efficient than non-strict construction, so make sure to pass values of the same data type to the constructor for the best performance.

Example

Before:

>>> s = pl.Series([1, 2, 3.5])
shape: (3,)
Series: '' [f64]
[
        1.0
        2.0
        3.5
]
>>> s = pl.Series([1, 2, 3.5], strict=False)
shape: (3,)
Series: '' [i64]
[
        1
        2
        null
]
>>> s = pl.Series([1, 2, 3.5], strict=False, dtype=pl.Int8)
shape: (3,)
Series: '' [i8]
[
        1
        2
        null
]

After:

>>> s = pl.Series([1, 2, 3.5])
Traceback (most recent call last):
...
TypeError: unexpected value while building Series of type Int64; found value of type Float64: 3.5

Hint: Try setting `strict=False` to allow passing data with mixed types.
>>> s = pl.Series([1, 2, 3.5], strict=False)
shape: (3,)
Series: '' [f64]
[
        1.0
        2.0
        3.5
]
>>> s = pl.Series([1, 2, 3.5], strict=False, dtype=pl.Int8)
shape: (3,)
Series: '' [i8]
[
        1
        2
        3
]

Change data orientation inference logic for DataFrame construction

Polars no longer inspects data types to infer the orientation of the data passed to the DataFrame constructor. Data orientation is inferred based on the data and schema dimensions.

Additionally, a warning is raised whenever row orientation is inferred. Because of some confusing edge cases, users should pass orient="row" to make explicit that their input is row-based.

Example

Before:

>>> data = [[1, "a"], [2, "b"]]
>>> pl.DataFrame(data)
shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ str      │
╞══════════╪══════════╡
│ 1        ┆ a        │
│ 2        ┆ b        │
└──────────┴──────────┘

After:

>>> pl.DataFrame(data)
Traceback (most recent call last):
...
TypeError: unexpected value while building Series of type Int64; found value of type String: "a"

Hint: Try setting `strict=False` to allow passing data with mixed types.

Use instead:

>>> pl.DataFrame(data, orient="row")
shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ str      │
╞══════════╪══════════╡
│ 1        ┆ a        │
│ 2        ┆ b        │
└──────────┴──────────┘

Consistently convert to given time zone in Series constructor

!!! danger

This change may silently impact the results of your pipelines.
If you work with time zones, please make sure to account for this change.

Handling of time zone information in the Series and DataFrame constructors was inconsistent. Row-wise construction would convert to the given time zone, while column-wise construction would replace the time zone. The inconsistency has been fixed by always converting to the time zone specified in the data type.

Example

Before:

>>> from datetime import datetime
>>> pl.Series([datetime(2020, 1, 1)], dtype=pl.Datetime('us', 'Europe/Amsterdam'))
shape: (1,)
Series: '' [datetime[μs, Europe/Amsterdam]]
[
        2020-01-01 00:00:00 CET
]

After:

>>> from datetime import datetime
>>> pl.Series([datetime(2020, 1, 1)], dtype=pl.Datetime('us', 'Europe/Amsterdam'))
shape: (1,)
Series: '' [datetime[μs, Europe/Amsterdam]]
[
        2020-01-01 01:00:00 CET
]

Update some error types to more appropriate variants

We have updated a lot of error types to more accurately represent the problem. Most commonly, ComputeError types were changed to InvalidOperationError or SchemaError.

Example

Before:

>>> s = pl.Series("a", [100, 200, 300])
>>> s.cast(pl.UInt8)
Traceback (most recent call last):
...
polars.exceptions.ComputeError: conversion from `i64` to `u8` failed in column 'a' for 1 out of 3 values: [300]

After:

>>> s.cast(pl.UInt8)
Traceback (most recent call last):
...
polars.exceptions.InvalidOperationError: conversion from `i64` to `u8` failed in column 'a' for 1 out of 3 values: [300]

Update `read/scan_parquet` to disable Hive partitioning by default for file inputs

Parquet reading functions now also support directory inputs. Hive partitioning is enabled by default for directories, but is now disabled by default for file inputs. File inputs include single files, globs, and lists of files. Explicitly pass hive_partitioning=True to restore previous behavior.

Example

Before:

>>> pl.read_parquet("dataset/a=1/foo.parquet")
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ x   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 1.0 │
│ 1   ┆ 2.0 │
└─────┴─────┘

After:

>>> pl.read_parquet("dataset/a=1/foo.parquet")
shape: (2, 1)
┌─────┐
│ x   │
│ --- │
│ f64 │
╞═════╡
│ 1.0 │
│ 2.0 │
└─────┘
>>> pl.read_parquet("dataset/a=1/foo.parquet", hive_partitioning=True)
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ x   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 1.0 │
│ 1   ┆ 2.0 │
└─────┴─────┘

Update `reshape` to return Array types instead of List types

reshape now returns an Array type instead of a List type.

Users can restore the old functionality by calling .arr.to_list() on the output. Note that this is not more expensive than it would be to create a List type directly, because reshaping into an array is basically free.

Example

Before:

>>> s = pl.Series([1, 2, 3, 4, 5, 6])
>>> s.reshape((2, 3))
shape: (2,)
Series: '' [list[i64]]
[
        [1, 2, 3]
        [4, 5, 6]
]

After:

>>> s.reshape((2, 3))
shape: (2,)
Series: '' [array[i64, 3]]
[
        [1, 2, 3]
        [4, 5, 6]
]

Read 2D NumPy arrays as `Array` type instead of `List`

The Series constructor now parses 2D NumPy arrays as an Array type rather than a List type.

Example

Before:

>>> import numpy as np
>>> arr = np.array([[1, 2], [3, 4]])
>>> pl.Series(arr)
shape: (2,)
Series: '' [list[i64]]
[
        [1, 2]
        [3, 4]
]

After:

>>> import numpy as np
>>> arr = np.array([[1, 2], [3, 4]])
>>> pl.Series(arr)
shape: (2,)
Series: '' [array[i64, 2]]
[
        [1, 2]
        [3, 4]
]

Split `replace` functionality into two separate methods

The API for replace has proven to be confusing to many users, particularly with regard to the default argument and the resulting data type.

It has been split up into two methods: replace and replace_strict. replace now always keeps the existing data type (breaking, see example below) and is meant for replacing some values in your existing column. Its parameters default and return_dtype have been deprecated.

The new method replace_strict is meant for creating a new column, mapping some or all of the values of the original column, and optionally specifying a default value. If no default is provided, it raises an error if any non-null values are not mapped.

Example

Before:

>>> s = pl.Series([1, 2, 3])
>>> s.replace(1, "a")
shape: (3,)
Series: '' [str]
[
        "a"
        "2"
        "3"
]

After:

>>> s.replace(1, "a")
Traceback (most recent call last):
...
polars.exceptions.InvalidOperationError: conversion from `str` to `i64` failed in column 'literal' for 1 out of 1 values: ["a"]
>>> s.replace_strict(1, "a", default=s)
shape: (3,)
Series: '' [str]
[
        "a"
        "2"
        "3"
]

Preserve nulls in `ewm_mean`, `ewm_std`, and `ewm_var`

Polars will no longer forward-fill null values in ewm methods. The user can call .forward_fill() on the output to achieve the same result.

Example

Before:

>>> s = pl.Series([1, 4, None, 3])
>>> s.ewm_mean(alpha=.9, ignore_nulls=False)
shape: (4,)
Series: '' [f64]
[
        1.0
        3.727273
        3.727273
        3.007913
]

After:

>>> s.ewm_mean(alpha=.9, ignore_nulls=False)
shape: (4,)
Series: '' [f64]
[
        1.0
        3.727273
        null
        3.007913
]

Update `clip` to no longer propagate nulls in the given bounds

Null values in the bounds no longer set the value to null; instead, the original value is retained.

Before:

>>> df = pl.DataFrame({"a": [0, 1, 2], "min": [1, None, 1]})
>>> df.select(pl.col("a").clip("min"))
shape: (3, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ null │
│ 2    │
└──────┘

After:

>>> df.select(pl.col("a").clip("min"))
shape: (3, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 1    │
│ 2    │
└──────┘

Change `str.to_datetime` to default to microsecond precision for format specifiers `"%f"` and `"%.f"`

In .str.to_datetime, when specifying %.f as the format, the default was to set the resulting datatype to nanosecond precision. This has been changed to microsecond precision.

Example

Before:

>>> s = pl.Series(["2022-08-31 00:00:00.123456789"])
>>> s.str.to_datetime(format="%Y-%m-%d %H:%M:%S%.f")
shape: (1,)
Series: '' [datetime[ns]]
[
        2022-08-31 00:00:00.123456789
]

After:

>>> s.str.to_datetime(format="%Y-%m-%d %H:%M:%S%.f")
shape: (1,)
Series: '' [datetime[μs]]
[
        2022-08-31 00:00:00.123456
]

Update resulting column names in `pivot` when pivoting by multiple values

In DataFrame.pivot, when specifying multiple values columns, the result would redundantly include the name of the columns column (subject in the example below) in the resulting column names. This has been addressed.

Example

Before:

>>> df = pl.DataFrame(
...     {
...         "name": ["Cady", "Cady", "Karen", "Karen"],
...         "subject": ["maths", "physics", "maths", "physics"],
...         "test_1": [98, 99, 61, 58],
...         "test_2": [100, 100, 60, 60],
...     }
... )
>>> df.pivot(index='name', columns='subject', values=['test_1', 'test_2'])
shape: (2, 5)
┌───────┬──────────────────────┬────────────────────────┬──────────────────────┬────────────────────────┐
│ name  ┆ test_1_subject_maths ┆ test_1_subject_physics ┆ test_2_subject_maths ┆ test_2_subject_physics │
│ ---   ┆ ---                  ┆ ---                    ┆ ---                  ┆ ---                    │
│ str   ┆ i64                  ┆ i64                    ┆ i64                  ┆ i64                    │
╞═══════╪══════════════════════╪════════════════════════╪══════════════════════╪════════════════════════╡
│ Cady  ┆ 98                   ┆ 99                     ┆ 100                  ┆ 100                    │
│ Karen ┆ 61                   ┆ 58                     ┆ 60                   ┆ 60                     │
└───────┴──────────────────────┴────────────────────────┴──────────────────────┴────────────────────────┘

After:

>>> df = pl.DataFrame(
...     {
...         "name": ["Cady", "Cady", "Karen", "Karen"],
...         "subject": ["maths", "physics", "maths", "physics"],
...         "test_1": [98, 99, 61, 58],
...         "test_2": [100, 100, 60, 60],
...     }
... )
>>> df.pivot('subject', index='name')
shape: (2, 5)
┌───────┬──────────────┬────────────────┬──────────────┬────────────────┐
│ name  ┆ test_1_maths ┆ test_1_physics ┆ test_2_maths ┆ test_2_physics │
│ ---   ┆ ---          ┆ ---            ┆ ---          ┆ ---            │
│ str   ┆ i64          ┆ i64            ┆ i64          ┆ i64            │
╞═══════╪══════════════╪════════════════╪══════════════╪════════════════╡
│ Cady  ┆ 98           ┆ 99             ┆ 100          ┆ 100            │
│ Karen ┆ 61           ┆ 58             ┆ 60           ┆ 60             │
└───────┴──────────────┴────────────────┴──────────────┴────────────────┘

Note that the function signature has also changed:

columns has been renamed to on, and is now the first positional argument.
index and values are both optional. If index is not specified, then it will use all columns not specified in on and values. If values is not specified, it will use all columns not specified in on and index.

Support Decimal types by default when converting from Arrow

Conversion from Arrow now always converts Decimals into Polars Decimals, rather than casting them to Float64. Config.activate_decimals has been removed.

Example

Before:

>>> from decimal import Decimal as D
>>> import pyarrow as pa
>>> arr = pa.array([D("1.01"), D("2.25")])
>>> pl.from_arrow(arr)
shape: (2,)
Series: '' [f64]
[
        1.01
        2.25
]

After:

>>> pl.from_arrow(arr)
shape: (2,)
Series: '' [decimal[3,2]]
[
        1.01
        2.25
]

Remove serde functionality from `pl.read_json` and `DataFrame.write_json`

pl.read_json no longer supports reading JSON files produced by DataFrame.serialize. Users should use pl.DataFrame.deserialize instead.

DataFrame.write_json now only writes row-oriented JSON. The parameters row_oriented and pretty have been removed. Users should use DataFrame.serialize to serialize a DataFrame.

Example - write_json

Before:

>>> df = pl.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
>>> df.write_json()
'{"columns":[{"name":"a","datatype":"Int64","bit_settings":"","values":[1,2]},{"name":"b","datatype":"Float64","bit_settings":"","values":[3.0,4.0]}]}'

After:

>>> df.write_json()  # Same behavior as previously `df.write_json(row_oriented=True)`
'[{"a":1,"b":3.0},{"a":2,"b":4.0}]'

Example - read_json

Before:

>>> import io
>>> df_ser = '{"columns":[{"name":"a","datatype":"Int64","bit_settings":"","values":[1,2]},{"name":"b","datatype":"Float64","bit_settings":"","values":[3.0,4.0]}]}'
>>> pl.read_json(io.StringIO(df_ser))
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 3.0 │
│ 2   ┆ 4.0 │
└─────┴─────┘

After:

>>> pl.read_json(io.StringIO(df_ser))  # Format no longer supported: data is treated as a single row
shape: (1, 1)
┌─────────────────────────────────┐
│ columns                         │
│ ---                             │
│ list[struct[4]]                 │
╞═════════════════════════════════╡
│ [{"a","Int64","",[1.0, 2.0]}, … │
└─────────────────────────────────┘

Use instead:

>>> pl.DataFrame.deserialize(io.StringIO(df_ser))
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 3.0 │
│ 2   ┆ 4.0 │
└─────┴─────┘

`Series.equals` no longer checks names by default

Previously, Series.equals would return False if the Series names didn't match. The method no longer checks the names by default. The previous behavior can be retained by setting check_names=True.

Example

Before:

>>> s1 = pl.Series("foo", [1, 2, 3])
>>> s2 = pl.Series("bar", [1, 2, 3])
>>> s1.equals(s2)
False

After:

>>> s1.equals(s2)
True
>>> s1.equals(s2, check_names=True)
False

Remove `columns` parameter from `nth` expression function

The columns parameter was removed in favor of treating positional inputs as additional indices. Use Expr.get instead to get the same functionality.

Example

Before:

>>> df = pl.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
>>> df.select(pl.nth(1, "a"))
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 2   │
└─────┘

After:

>>> df.select(pl.nth(1, "a"))
Traceback (most recent call last):
...
TypeError: argument 'indices': 'str' object cannot be interpreted as an integer

Use instead:

>>> df.select(pl.col("a").get(1))
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 2   │
└─────┘

Rename struct fields of `rle` output

The struct fields of the rle method have been renamed from lengths/values to len/value. The data type of the len field has also been updated to match the index type (was previously Int32, now UInt32).

Before:

>>> s = pl.Series(["a", "a", "b", "c", "c", "c"])
>>> s.rle().struct.unnest()
shape: (3, 2)
┌─────────┬────────┐
│ lengths ┆ values │
│ ---     ┆ ---    │
│ i32     ┆ str    │
╞═════════╪════════╡
│ 2       ┆ a      │
│ 1       ┆ b      │
│ 3       ┆ c      │
└─────────┴────────┘

After:

>>> s.rle().struct.unnest()
shape: (3, 2)
┌─────┬───────┐
│ len ┆ value │
│ --- ┆ ---   │
│ u32 ┆ str   │
╞═════╪═══════╡
│ 2   ┆ a     │
│ 1   ┆ b     │
│ 3   ┆ c     │
└─────┴───────┘

Update `set_sorted` to only accept a single column

Calling set_sorted indicates that a column is sorted individually. Passing multiple columns indicates that each of those columns is also sorted individually. However, many users assumed this meant that the columns were sorted as a group, which led to incorrect results.

To help users avoid this pitfall, we removed the ability to specify multiple columns in set_sorted. To set multiple columns as sorted, simply call set_sorted multiple times.

Example

Before:

>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0], "c": [9, 7, 8]})
>>> df.set_sorted("a", "b")

After:

>>> df.set_sorted("a", "b")
Traceback (most recent call last):
...
TypeError: DataFrame.set_sorted() takes 2 positional arguments but 3 were given

Use instead:

>>> df.set_sorted("a").set_sorted("b")

Default to raising on out-of-bounds indices in all `get`/`gather` operations

The default behavior was inconsistent between get and gather operations in various places. Now all such operations will raise by default. Pass null_on_oob=True to restore previous behavior.

Example

Before:

>>> s = pl.Series([[0, 1, 2], [0]])
>>> s.list.get(1)
shape: (2,)
Series: '' [i64]
[
        1
        null
]

After:

>>> s.list.get(1)
Traceback (most recent call last):
...
polars.exceptions.ComputeError: get index is out of bounds

Use instead:

>>> s.list.get(1, null_on_oob=True)
shape: (2,)
Series: '' [i64]
[
        1
        null
]

Change default engine for `read_excel` to `"calamine"`

The calamine engine (available through the fastexcel package) has been added to Polars relatively recently. It's much faster than the other engines, and was already the default for xlsb and xls files. We have now made it the default for all Excel files.

There may be subtle differences between this engine and the previous default (xlsx2csv). One clear difference is that the calamine engine does not support the engine_options parameter. If you cannot get your desired behavior with the calamine engine, specify engine="xlsx2csv" to restore previous behavior.

Example

Before:

>>> pl.read_excel("data.xlsx", engine_options={"skip_empty_lines": True})

After:

>>> pl.read_excel("data.xlsx", engine_options={"skip_empty_lines": True})
Traceback (most recent call last):
...
TypeError: read_excel() got an unexpected keyword argument 'skip_empty_lines'

Instead, explicitly specify the xlsx2csv engine or omit the engine_options:

>>> pl.read_excel("data.xlsx", engine="xlsx2csv", engine_options={"skip_empty_lines": True})

Remove class variables from some DataTypes

Some DataType classes had class variables. The Datetime class, for example, had time_unit and time_zone as class variables. This was unintended: these should have been instance variables. This has now been corrected.

Example

Before:

>>> dtype = pl.Datetime
>>> dtype.time_unit is None
True

After:

>>> dtype.time_unit is None
Traceback (most recent call last):
...
AttributeError: type object 'Datetime' has no attribute 'time_unit'

Use instead:

>>> getattr(dtype, "time_unit", None) is None
True

Change default `offset` in `group_by_dynamic` from 'negative `every`' to 'zero'

This affects the start of the first window in group_by_dynamic. The new behavior should align more with user expectations.

Example

Before:

>>> from datetime import date
>>> df = pl.DataFrame({
...     "ts": [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 3)],
...     "value": [1, 2, 3],
... })
>>> df.group_by_dynamic("ts", every="1d", period="2d").agg("value")
shape: (4, 2)
┌────────────┬───────────┐
│ ts         ┆ value     │
│ ---        ┆ ---       │
│ date       ┆ list[i64] │
╞════════════╪═══════════╡
│ 2019-12-31 ┆ [1]       │
│ 2020-01-01 ┆ [1, 2]    │
│ 2020-01-02 ┆ [2, 3]    │
│ 2020-01-03 ┆ [3]       │
└────────────┴───────────┘

After:

>>> df.group_by_dynamic("ts", every="1d", period="2d").agg("value")
shape: (3, 2)
┌────────────┬───────────┐
│ ts         ┆ value     │
│ ---        ┆ ---       │
│ date       ┆ list[i64] │
╞════════════╪═══════════╡
│ 2020-01-01 ┆ [1, 2]    │
│ 2020-01-02 ┆ [2, 3]    │
│ 2020-01-03 ┆ [3]       │
└────────────┴───────────┘

Change default serialization format of `LazyFrame/DataFrame/Expr`

Previously, the only serialization format available for the serialize/deserialize methods on Polars objects was JSON. We added a more optimized binary format and made this the default. JSON serialization is still available by passing format="json".

Example

Before:

>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> serialized = lf.serialize()
>>> serialized
'{"MapFunction":{"input":{"DataFrameScan":{"df":{"columns":[{"name":...'
>>> from io import StringIO
>>> pl.LazyFrame.deserialize(StringIO(serialized)).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘

After:

>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> serialized = lf.serialize()
>>> serialized
b'\xa1kMapFunction\xa2einput\xa1mDataFrameScan\xa4bdf...'
>>> from io import BytesIO  # Note: using BytesIO instead of StringIO
>>> pl.LazyFrame.deserialize(BytesIO(serialized)).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘

Constrain access to globals from `DataFrame.sql` in favor of `pl.sql`

The sql methods on DataFrame and LazyFrame can no longer access global variables. These methods should be used for operating on the frame itself. For global access, there is now the top-level sql function.

Example

Before:

>>> df1 = pl.DataFrame({"id1": [1, 2]})
>>> df2 = pl.DataFrame({"id2": [3, 4]})
>>> df1.sql("SELECT * FROM df1 CROSS JOIN df2")
shape: (4, 2)
┌─────┬─────┐
│ id1 ┆ id2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 1   ┆ 4   │
│ 2   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘

After:

>>> df1.sql("SELECT * FROM df1 CROSS JOIN df2")
Traceback (most recent call last):
...
polars.exceptions.SQLInterfaceError: relation 'df1' was not found

Use instead:

>>> pl.sql("SELECT * FROM df1 CROSS JOIN df2", eager=True)
shape: (4, 2)
┌─────┬─────┐
│ id1 ┆ id2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 1   ┆ 4   │
│ 2   ┆ 3   │
│ 2   ┆ 4   │
└─────┴─────┘

Remove re-export of type aliases

We have a lot of type aliases defined in the polars.type_aliases module. Some of these were re-exported at the top-level and in the polars.datatypes module. These re-exports have been removed.

We plan on adding a public polars.typing module in the future with a number of curated type aliases. Until then, please define your own type aliases, or import from our polars.type_aliases module. Note that the type_aliases module is not technically public, so use at your own risk.

Example

Before:

def foo(dtype: pl.PolarsDataType) -> None: ...

After:

PolarsDataType = pl.DataType | type[pl.DataType]

def foo(dtype: PolarsDataType) -> None: ...

Streamline optional dependency definitions in `pyproject.toml`

We revisited the optional dependency definitions and made some minor changes. If you were using the extras fastexcel, gevent, matplotlib, or async, this is a breaking change. Please update your Polars installation to use the new extras.

Example

Before:

pip install 'polars[fastexcel,gevent,matplotlib]'

After:

pip install 'polars[calamine,async,graph]'

Deprecations

Issue `PerformanceWarning` when LazyFrame properties `schema/dtypes/columns/width` are used

Recent improvements to the correctness of schema resolution in the lazy engine have significantly increased the cost of resolving the schema. It is no longer 'free'. In fact, in complex pipelines with lazy file reading, resolving the schema can be relatively expensive.

Because of this, the schema-related properties on LazyFrame were no longer good API design. Properties represent information that is already available and just needs to be retrieved. For the LazyFrame properties, however, accessing them may have a significant performance cost.

To solve this, we added the LazyFrame.collect_schema method, which retrieves the schema and returns a Schema object. The properties raise a PerformanceWarning and tell the user to use collect_schema instead. We chose not to deprecate the properties for now to facilitate writing code that is generic for both DataFrames and LazyFrames.
