Project: CSCI 195
import pandas as pd
from pandas import DataFrame
import numpy as np

data = {
    "x": 2**np.arange(5),
    "y": 3**np.arange(5),
    "z": np.array([45, 98, 24, 11, 64]),
}
index = ["a", "b", "c", "d", "e"]
df = DataFrame(data=data, index=index)
df.head()
x y z
a 1 1 45
b 2 3 98
c 4 9 24
d 8 27 11
e 16 81 64

Create a boolean array named mask that has a True value for all rows in df for which the value in column 'z' is less than 50.

mask = df['z'] < 50
mask
a     True
b    False
c     True
d     True
e    False
Name: z, dtype: bool

And look at the rows in df for which mask contains a value of True

df[mask]
x y z
a 1 1 45
c 4 9 24
d 8 27 11

Now we want to set the values in column z of df to 0 for those rows where mask contains True.

df[mask]['z'] = 0
/tmp/ipykernel_2151/2435874596.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[mask]['z'] = 0

Here's an illustration that shows what's happening.
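
The same effect can be reproduced step by step. This is a minimal sketch, assuming a fresh copy of the example data (named demo here so that df itself is left alone) and a hypothetical variable tmp that holds the intermediate object the chained statement would otherwise discard.

```python
import numpy as np
import pandas as pd

# Rebuild the example data under a separate name (an assumption of this sketch).
demo = pd.DataFrame(
    {"x": 2**np.arange(5), "y": 3**np.arange(5), "z": np.array([45, 98, 24, 11, 64])},
    index=["a", "b", "c", "d", "e"],
)
mask = demo["z"] < 50

# Step 1: boolean indexing builds a new DataFrame from the selected rows.
tmp = demo[mask]  # 'tmp' is a hypothetical name for the intermediate object

# Step 2: the assignment targets that intermediate object, not demo.
# Depending on the pandas version, this line may emit a SettingWithCopyWarning.
tmp["z"] = 0

print(tmp["z"].tolist())   # column z of the intermediate object
print(demo["z"].tolist())  # column z of the original frame
```

Comparing the two printed lists shows that the zeros land in tmp, while demo keeps its original values.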

Why does applying mask return a copy of df rather than a view? NumPy, and therefore Pandas, wants to keep all the elements of a Series in a contiguous block of memory, and the rows picked out by a boolean mask are generally scattered rather than adjacent. From Embedded Computing for High Performance by Cardoso, Coutinho, and Diniz:

To exploit SIMD units, it is very important to be able to combine multiple load or store accesses in a single SIMD instruction. This can be achieved when using contiguous memory accesses, e.g., in the presence of unit stride accesses, and when array elements are aligned.
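
The difference between a view and a copy can be seen directly in NumPy. The sketch below is illustrative only (the array and variable names are not from the original notebook) and uses np.shares_memory to check whether a result still refers to the original buffer.

```python
import numpy as np

arr = np.arange(10)

# A slice with a regular stride can be described as "start offset + step",
# so NumPy returns a view onto the same buffer.
sliced = arr[2:8:2]

# The positions selected by a boolean mask follow no fixed stride,
# so NumPy gathers them into a new contiguous array (a copy).
masked = arr[arr % 3 == 0]

print(np.shares_memory(arr, sliced))   # does the slice share arr's buffer?
print(np.shares_memory(arr, masked))   # does the masked result share it?
print(masked.flags["C_CONTIGUOUS"])    # is the masked result contiguous?
```

The slice shares memory with arr, while the masked selection does not and is stored as its own contiguous block.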

Perhaps surprisingly, you don't get the warning message if you execute the following cell.

df["z"][mask] = 0
df.head()
x y z
a 1 1 0
b 2 3 98
c 4 9 0
d 8 27 0
e 16 81 64

The picture below illustrates why.

Pandas knows that df['z'] is a view of the data in df rather than a copy, so it carries out the assignment directly on that portion of df's memory.

Implementation of the [] operator in Python

When creating an object that supports indexing via the [ ] notation, Python requires classes like DataFrame and Series to implement two methods: __getitem__, which handles reads such as obj[key], and __setitem__, which handles assignments such as obj[key] = value (creatively referred to as dunder methods).

So this means that statements like

df[mask]["z"] = 0

and

df["z"][mask] = 0

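each involve 2 method calls:

  • 1 call to __getitem__, which evaluates the first pair of brackets and returns an intermediate object

  • 1 call to __setitem__ on that intermediate object, which performs the assignment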
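
One way to check which dunder methods a chained assignment triggers is to log them. The sketch below uses a toy class, Traced, which is hypothetical and not part of Pandas; it records every __getitem__ and __setitem__ call made on its instances.

```python
class Traced:
    """Toy container (hypothetical, not part of pandas) that logs indexing calls."""

    calls = []  # shared log of (method name, key) tuples

    def __getitem__(self, key):
        Traced.calls.append(("__getitem__", key))
        # Return a new intermediate object, much as df[mask] returns a new DataFrame.
        return Traced()

    def __setitem__(self, key, value):
        Traced.calls.append(("__setitem__", key))


obj = Traced()
obj["mask"]["z"] = 0  # same shape as df[mask]["z"] = 0

for name, key in Traced.calls:
    print(name, key)
```

The log shows the order in which Python dispatches the two pairs of brackets: the assignment is always applied to whatever object the first lookup returned, which is why it matters whether that object is a view or a copy.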

These methods can make use of an internal DataFrame / Series attribute named _is_view in their implementations. If something is a view, it's not a copy.

print(f"df is copy: {not df._is_view}")
print(f"df[mask] is copy: {not df[mask]._is_view}")
print(f"df['z'] is copy: {not df['z']._is_view}")
df is copy: True
df[mask] is copy: True
df['z'] is copy: False

Using accessor methods for assignment

Stojiljković advises that you:

  • Avoid chained assignments that combine 2 or more indexing operations like df[mask]['z'] = 0 and df.loc[mask]["z"] = 0

  • Apply single assignments with just one indexing operation, like df.loc[mask, "z"] = 0

Other accessors that are generally safer include iloc, at, and iat. iloc selects by integer position, while at and iat access a single element of a DataFrame / Series by a row/column combination: at takes labels and iat takes integer positions.
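
As a rough sketch of how these accessors look in an assignment, the example below works on a separate copy of the example data (named demo here, an assumption of this sketch, so that df is left alone). The values 0, -1, and -2 are arbitrary and chosen only to make each assignment visible.

```python
import numpy as np
import pandas as pd

# Separate copy of the example data; 'demo' is a name introduced for this sketch.
demo = pd.DataFrame(
    {"x": 2**np.arange(5), "y": 3**np.arange(5), "z": np.array([45, 98, 24, 11, 64])},
    index=["a", "b", "c", "d", "e"],
)

# iloc: select by integer position (rows 0 and 1, column 2, which is "z").
demo.iloc[0:2, 2] = 0

# at: a single element addressed by row label and column label.
demo.at["c", "z"] = -1

# iat: a single element addressed by integer row and column position.
demo.iat[3, 2] = -2

print(demo["z"].tolist())
```

Each statement uses a single indexing operation, so the assignment is applied to demo itself rather than to an intermediate object.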

df.loc[mask, "z"] = 0
df
x y z
a 1 1 0
b 2 3 98
c 4 9 0
d 8 27 0
e 16 81 64

You can control whether Pandas issues the SettingWithCopyWarning by calling pd.set_option like this:

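  • pd.set_option("mode.chained_assignment", "raise"): raises a SettingWithCopyError instead of issuing a warning

  • pd.set_option("mode.chained_assignment", "warn"): issues the SettingWithCopyWarning (the default)

  • pd.set_option("mode.chained_assignment", None): suppresses the warning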