This notebook is primarily taken from SettingWithCopyWarning in Pandas: Views vs Copies, by Mirko Stojiljković.
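The cell that builds the example DataFrame is not shown here. A minimal sketch that reproduces it, assuming the name `df` used in the rest of the notebook and taking the values and row labels from the table that follows:

```python
import pandas as pd

# Values copied from the table below; the row labels a-e become the index.
data = {
    "x": [1, 2, 4, 8, 16],
    "y": [1, 3, 9, 27, 81],
    "z": [45, 98, 24, 11, 64],
}
df = pd.DataFrame(data, index=["a", "b", "c", "d", "e"])
print(df)
```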
| | x | y | z |
|---|---|---|---|
| a | 1 | 1 | 45 |
| b | 2 | 3 | 98 |
| c | 4 | 9 | 24 |
| d | 8 | 27 | 11 |
| e | 16 | 81 | 64 |
Create a boolean array named `mask` that has a `True` value for all rows in `df` for which the value in column `'z'` is less than 50.
And look at the rows in `df` for which `mask` contains a value of `True`.
| | x | y | z |
|---|---|---|---|
| a | 1 | 1 | 45 |
| c | 4 | 9 | 24 |
| d | 8 | 27 | 11 |
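The mask and the filtered view of the rows described above can be sketched as follows. The DataFrame is rebuilt here so the snippet runs on its own; the comparison `df["z"] < 50` is an assumption about how the original cell created `mask`.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1, 2, 4, 8, 16], "y": [1, 3, 9, 27, 81], "z": [45, 98, 24, 11, 64]},
    index=["a", "b", "c", "d", "e"],
)

# Element-wise comparison: one boolean per row of df.
mask = df["z"] < 50
print(mask.tolist())  # [True, False, True, True, False]

# Boolean indexing keeps only the rows where mask is True.
selected = df[mask]
print(selected)
```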
Now we want to set the values in column `z` of `df` to 0 for those rows where `mask` contains `True`.
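A first attempt is likely to be the chained form `df[mask]["z"] = 0`. The sketch below assumes that is the statement the original cell ran. It records whatever warnings pandas emits (the category depends on the pandas version, so none is named here) and shows that `df` itself is left untouched.

```python
import warnings

import pandas as pd

df = pd.DataFrame(
    {"x": [1, 2, 4, 8, 16], "y": [1, 3, 9, 27, 81], "z": [45, 98, 24, 11, 64]},
    index=["a", "b", "c", "d", "e"],
)
mask = df["z"] < 50

# Record warnings instead of printing them, so they can be inspected afterwards.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df[mask]["z"] = 0  # assigns into the temporary copy returned by df[mask]

# Names of the warning categories raised (version dependent).
print([w.category.__name__ for w in caught])

# The original DataFrame still holds its old values.
print(df["z"].tolist())  # [45, 98, 24, 11, 64]
```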
Here's an illustration that shows what's happening.
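Why does the application of `mask` return a copy of `df` rather than a slice? NumPy, and therefore Pandas, wants to keep all the elements of a `Series` in a contiguous block of memory. The rows picked out by a boolean mask are generally not evenly spaced in memory, so they cannot be exposed as a window onto the original data and are copied into a new block instead. From High-performance embedded computing by Cardoso, Coutinho, and Diniz: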
To exploit SIMD units, it is very important to be able to combine multiple load or store accesses in a single SIMD instruction. This can be achieved when using contiguous memory accesses, e.g., in the presence of unit stride accesses, and when array elements are aligned.
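The view-versus-copy distinction can be seen directly in NumPy. This is a NumPy-only sketch (the array name `z` is chosen for illustration); how pandas lays out a DataFrame internally is more involved, but the same rule applies to the underlying arrays: basic slicing returns a view, boolean indexing returns a copy.

```python
import numpy as np

z = np.array([45, 98, 24, 11, 64])

sliced = z[1:4]     # basic slicing: a strided window onto z's own buffer
masked = z[z < 50]  # boolean indexing: selected elements gathered into a new array

print(np.shares_memory(z, sliced))  # True
print(np.shares_memory(z, masked))  # False

masked[:] = 0   # changes only the copy
sliced[:] = -1  # writes through to z
print(z.tolist())  # [45, -1, -1, -1, 64]
```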
Perhaps surprisingly, you don't get the warning message if you execute the following cell.
| | x | y | z |
|---|---|---|---|
| a | 1 | 1 | 0 |
| b | 2 | 3 | 98 |
| c | 4 | 9 | 0 |
| d | 8 | 27 | 0 |
| e | 16 | 81 | 64 |
The picture below illustrates why.
Pandas knows that `df['z']` is not a copy of the data in `df`, and implements the assignment statement directly on that portion of memory.
Implementation of the [] operator in Python
When creating an object that supports indexing via the `[ ]` notation, Python requires classes like `DataFrame` and `Series` to implement two methods: `__getitem__` and `__setitem__` (creatively referred to as dunder methods).
So this means that statements like
and
involve 3 method calls:
- 2 calls to `__getitem__`
- 1 call to `__setitem__`
These methods can make use of an internal `DataFrame` / `Series` attribute named `_is_view` in their implementations. If something is a view, it's not a copy.
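To see how bracket syntax maps onto these dunder methods, here is a toy class that logs each call. `Recorder` is a hypothetical name invented for this sketch and has nothing to do with how pandas is implemented; it only shows which method Python invokes for each pair of brackets.

```python
class Recorder:
    """Toy container (hypothetical, not part of pandas) that logs indexing calls."""

    def __init__(self, log=None):
        # All Recorder objects created from one another share the same log list.
        self.log = [] if log is None else log

    def __getitem__(self, key):
        self.log.append(("__getitem__", key))
        return Recorder(self.log)  # hand back a new object, as indexing often does

    def __setitem__(self, key, value):
        self.log.append(("__setitem__", key, value))


r = Recorder()
r["mask"]["z"] = 0  # chained: __getitem__ on r, then __setitem__ on the returned object
r["mask", "z"] = 0  # single step: one __setitem__ on r itself, with a tuple key

for entry in r.log:
    print(entry)
```

In the chained form, the assignment lands on whatever object `__getitem__` returned, which is why it matters whether that object is a view or a copy.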
Using accessor methods for assignment
Stojiljković advises that you:
- Avoid chained assignments that combine 2 or more indexing operations, like `df[mask]['z'] = 0` and `df.loc[mask]["z"] = 0`.
- Apply single assignments with just one indexing operation, like `df.loc[mask, "z"] = 0`.
Other accessors that are generally safer include `iloc`, `at`, and `iat`. `at` and `iat` allow you to access a single element of a `DataFrame` / `Series` by providing a row/column combination: `at` takes labels, `iat` takes integer positions.
| | x | y | z |
|---|---|---|---|
| a | 1 | 1 | 0 |
| b | 2 | 3 | 98 |
| c | 4 | 9 | 0 |
| d | 8 | 27 | 0 |
| e | 16 | 81 | 64 |
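The single-step assignment recommended above can be sketched as follows, with a brief look at `at` and `iat`. The DataFrame is rebuilt so the snippet is self-contained; the variable `z_after_loc` and the value 99 are chosen purely for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1, 2, 4, 8, 16], "y": [1, 3, 9, 27, 81], "z": [45, 98, 24, 11, 64]},
    index=["a", "b", "c", "d", "e"],
)
mask = df["z"] < 50

# One indexing operation: rows chosen by mask, column chosen by label.
df.loc[mask, "z"] = 0
z_after_loc = df["z"].tolist()
print(z_after_loc)  # [0, 98, 0, 0, 64]

# at uses labels, iat uses integer positions; both address a single cell.
print(df.at["b", "z"])  # 98
print(df.iat[4, 2])     # 64

# Single-cell assignment through at is also one indexing operation.
df.at["b", "z"] = 99
```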
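You can control whether Pandas issues the `SettingWithCopyWarning` by calling `pd.set_option` like this:
- `pd.set_option("mode.chained_assignment", "raise")` raises an exception instead of warning
- `pd.set_option("mode.chained_assignment", "warn")` issues the warning (the default)
- `pd.set_option("mode.chained_assignment", None)` takes no action, so neither a warning nor an exception is produced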
For more information, continue reading Stojiljković's article.