Missing data
This section of the user guide teaches how to work with missing data in Polars.
null and NaN values
In Polars, missing data is represented by the value null, which is used for all data types, including numerical types.
Polars also supports the value NaN (“Not a Number”) for columns with floating point numbers. The value NaN is considered to be a valid floating point value, which is different from missing data. We discuss the value NaN separately below.
When creating a series or a dataframe, you can set a value to null by using the appropriate construct for your language:
{{code_block('user-guide/expressions/missing-data','dataframe',['DataFrame'])}}
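!!! info "Difference from pandas"
    In pandas, the value used to represent missing data depends on the data type of the column. In Polars, missing data is always represented by the value null.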
Missing data metadata
Polars keeps track of some metadata regarding the missing data of each series. This metadata allows Polars to answer some basic queries about missing values in a very efficient way, namely how many values are missing and which ones are missing.
To determine how many values are missing from a column you can use the function null_count:
{{code_block('user-guide/expressions/missing-data','count',['null_count'])}}
The function null_count can be called on a dataframe, a column from a dataframe, or on a series directly. The function null_count is a cheap operation because the result is already known.
Polars uses something called a “validity bitmap” to know which values are missing in a series. The validity bitmap is memory efficient as it is bit encoded. If a series has length $n$, then its validity bitmap will cost $n / 8$ bytes. The function is_null uses the validity bitmap to efficiently report which values are null and which are not:
{{code_block('user-guide/expressions/missing-data','isnull',['is_null'])}}
The function is_null can be used on a column of a dataframe or on a series directly. Again, this is a cheap operation because the result is already known by Polars.
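To make the idea of a bit-encoded validity bitmap more concrete, here is a plain-Python sketch. It is not how Polars implements it internally; the helper names `build_validity_bitmap` and `is_null_at` are hypothetical, and the bit order within each byte is an assumption made for illustration.

```python
def build_validity_bitmap(values):
    """Pack one validity bit per value: 1 = present, 0 = missing (None)."""
    bitmap = bytearray((len(values) + 7) // 8)  # n / 8 bytes, rounded up
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bitmap


def is_null_at(bitmap, i):
    """Answer "is position i missing?" by reading a single bit."""
    return ((bitmap[i // 8] >> (i % 8)) & 1) == 0


values = [10, None, 30, None, 50, 60, 70, 80, None, 100]
bitmap = build_validity_bitmap(values)

bitmap_size = len(bitmap)  # 2 bytes cover 10 values
null_mask = [is_null_at(bitmap, i) for i in range(len(values))]
null_total = sum(null_mask)

print(bitmap_size, null_mask, null_total)
```

The point of the sketch is that both "which values are missing" and "how many are missing" can be answered from the bitmap alone, without looking at the stored values.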
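??? info "Why does Polars waste memory on a validity bitmap?"
    It comes down to a trade-off. By spending one extra bit per value, Polars knows which positions hold valid values without having to inspect the values themselves, which is what makes queries about missing data so cheap.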
Filling missing data
Missing data in a series can be filled with the function fill_null. You can specify how the missing data is filled in several ways:
a literal of the correct data type;
a Polars expression, such as replacing with values computed from another column;
a strategy based on neighbouring values, such as filling forwards or backwards; and
interpolation.
To illustrate how each of these methods works, we start by defining a simple dataframe with two missing values in the second column:
{{code_block('user-guide/expressions/missing-data','dataframe2',['DataFrame'])}}
Fill with a specified literal value
You can fill the missing data with a specified literal value. This literal value will replace all of the occurrences of the value null:
{{code_block('user-guide/expressions/missing-data','fill',['fill_null'])}}
However, this is a special case of the more general form, in which the function fill_null replaces missing values with the corresponding values from the result of a Polars expression, as seen next.
Fill with an expression
In the general case, the missing data can be filled by extracting the corresponding values from the result of a general Polars expression. For example, we can fill the second column with values taken from the double of the first column:
{{code_block('user-guide/expressions/missing-data','fillexpr',['fill_null'])}}
Fill with a strategy based on neighbouring values
You can also fill the missing data by following a fill strategy based on the neighbouring values. The two simplest strategies, forward and backward, use the nearest non-null value that comes before or after the null being filled:
{{code_block('user-guide/expressions/missing-data','fillstrategy',['fill_null'])}}
You can find other fill strategies in the API docs.
Fill with interpolation
Additionally, you can fill intermediate missing data with interpolation by using the function interpolate instead of the function fill_null:
{{code_block('user-guide/expressions/missing-data','fillinterpolate',['interpolate'])}}
Note: With interpolate, nulls at the beginning and end of the series remain null.
Not a Number, or NaN values
Missing data in a series is only ever represented by the value null, regardless of the data type of the series. Columns with a floating point data type can sometimes have the value NaN, which might be confused with null.
The special value NaN can be created directly:
{{code_block('user-guide/expressions/missing-data','nan',['DataFrame'])}}
And it might also arise as the result of a computation:
{{code_block('user-guide/expressions/missing-data','nan-computed',[])}}
!!! info
NaN values are considered to be a type of floating point data and are not considered to be missing data in Polars. This means:
NaN values are not counted with the function null_count; and
NaN values are filled when you use the specialised function fill_nan but are not filled with the function fill_null.
Polars has the functions is_nan and fill_nan, which work in a similar way to the functions is_null and fill_null. Unlike with missing data, Polars does not hold any metadata regarding the NaN values, so the function is_nan entails actual computation.
One further difference between the values null and NaN is that numerical aggregating functions, like mean and sum, skip missing values when computing the result, whereas the value NaN takes part in the computation and typically propagates into the result. If desired, you can avoid this behavior by replacing the occurrences of the value NaN with the value null:
{{code_block('user-guide/expressions/missing-data','nanfill',['fill_nan'])}}
You can learn more about the value NaN in the section about floating point number data types.