{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"This notebook is primarily taken from SettingWithCopyWarning in Pandas: Views vs Copies, by Mirko Stojiljković."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
],
"source": [
"import pandas as pd\n",
"from pandas import DataFrame\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" x y z\n",
"a 1 1 45\n",
"b 2 3 98\n",
"c 4 9 24\n",
"d 8 27 11\n",
"e 16 81 64"
]
},
"execution_count": 36,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"data = {\"x\": 2**np.arange(5),\n",
" \"y\": 3**np.arange(5),\n",
" \"z\": np.array([45, 98, 24, 11, 64])\n",
"}\n",
"\n",
"index = [\"a\", \"b\", \"c\", \"d\", \"e\"]\n",
"\n",
"df = DataFrame(data=data, index=index)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Create a boolean `Series` named `mask` that is `True` for every row of `df` whose value in column `z` is less than 50."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"a True\n",
"b False\n",
"c True\n",
"d True\n",
"e False\n",
"Name: z, dtype: bool"
]
},
"execution_count": 37,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"mask = df['z'] < 50\n",
"mask"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Now look at the rows of `df` for which `mask` is **True**."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" x y z\n",
"a 1 1 45\n",
"c 4 9 24\n",
"d 8 27 11"
]
},
"execution_count": 38,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"df[mask]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Now we want to set the values in column `z` of `df` to 0 for those rows where `mask` is True."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_2151/2435874596.py:1: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" df[mask]['z'] = 0\n"
]
}
],
"source": [
"df[mask]['z'] = 0"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
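"Here's what is happening: `df[mask]` returns a new `DataFrame` that is a **copy** of the selected rows, so the assignment `['z'] = 0` modifies that temporary copy and leaves `df` itself unchanged."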
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Why does the application of `mask` return a **copy** of `df` rather than a **view**? NumPy, and therefore Pandas, wants to keep all the elements of a `Series` in a contiguous block of memory. The rows selected by an arbitrary boolean mask generally cannot be described as a regularly spaced region of the original memory, so they are copied into a new block. From **High-performance embedded computing** by Cardoso, Coutinho, and Diniz:\n",
"\n",
"> To exploit SIMD units, it is very important to be able to combine multiple load or store accesses in a single SIMD instruction. This can be achieved when using contiguous memory accesses, e.g., in the presence of unit stride accesses, and when array elements are aligned."
]
},
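{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"The view-versus-copy distinction can be seen directly in NumPy. The following is a minimal sketch, not taken from the original article: the names `values`, `sliced`, and `masked` are hypothetical and chosen only for illustration. It relies on NumPy's documented behavior that basic slicing returns a view while boolean-mask indexing returns a copy.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical array mirroring column z of df\n",
"values = np.array([45, 98, 24, 11, 64])\n",
"\n",
"# Basic slicing can be described by an offset and a stride, so NumPy returns a view\n",
"sliced = values[1:4]\n",
"\n",
"# A boolean mask selects arbitrary positions, so NumPy copies them into a new array\n",
"masked = values[values < 50]\n",
"\n",
"print(np.shares_memory(values, sliced))  # True: the slice reuses the original buffer\n",
"print(np.shares_memory(values, masked))  # False: the masked result has its own buffer\n",
"\n",
"# Writing to the copy leaves the original array untouched\n",
"masked[:] = 0\n",
"print(values)\n",
"```\n",
"\n",
"Because `masked` owns its own memory, zeroing it has no effect on `values`, which is the same reason that assigning through `df[mask]` does not change `df`."
]
},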
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Perhaps surprisingly, you **don't** get the warning message if you execute the following cell."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" x y z\n",
"a 1 1 0\n",
"b 2 3 98\n",
"c 4 9 0\n",
"d 8 27 0\n",
"e 16 81 64"
]
},
"execution_count": 41,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"df[\"z\"][mask] = 0\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
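"The reason is the order of the two indexing operations: here the column `z` is selected first, and `mask` is applied only in the assignment itself."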
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
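"Pandas knows that `df['z']` is **not a copy** of the data in `df`, so the assignment is applied directly to that portion of memory. This relies on `df['z']` being a view, which is not guaranteed in every situation or Pandas version, so this form is still best avoided."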
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Implementation of the \\[\\] operator in Python\n",
"\n",
"When creating an object that supports indexing via the `[ ]` notation, Python requires classes like `DataFrame` and `Series` to implement two **methods**: `__getitem__` and `__setitem__` (creatively referred to as **dunder** methods).\n",
"\n",
"So this means that statements like\n",
"```\n",
"df[mask][\"z\"] = 0\n",
"```\n",
"and \n",
"```\n",
"df[\"z\"][mask] = 0\n",
"```\n",
"\n",
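"each involve 2 method calls:\n",
"\n",
"* 1 call to `__getitem__`, which evaluates the first pair of brackets and returns an intermediate object\n",
"* 1 call to `__setitem__` on that intermediate object, which performs the assignment\n",
"\n",
"These methods can make use of an **internal** `DataFrame` / `Series` property named `_is_view` in their implementations. If something is a **view**, it's not a **copy**."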
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"df is copy: True\n",
"df[mask] is copy: True\n",
"df['z'] is copy: False\n"
]
}
],
"source": [
"print(f\"df is copy: {not df._is_view}\")\n",
"print(f\"df[mask] is copy: {not df[mask]._is_view}\")\n",
"print(f\"df['z'] is copy: {not df['z']._is_view}\")"
]
},
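{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"The calls behind a chained assignment can be made visible with a small stand-in class. This is only a sketch: `Traced` is a hypothetical class written for this illustration, not part of Pandas, and it does nothing except record each `[ ]` call made on it. It suggests that Python evaluates a statement of the form `obj[a][b] = value` as one `__getitem__` call followed by one `__setitem__` call on whatever the first call returned.\n",
"\n",
"```python\n",
"class Traced:\n",
"    # Hypothetical stand-in for a DataFrame that logs every [] call\n",
"    log = []  # shared by all instances so that the call order is preserved\n",
"\n",
"    def __init__(self, name):\n",
"        self.name = name\n",
"\n",
"    def __getitem__(self, key):\n",
"        Traced.log.append(f'{self.name}.__getitem__({key!r})')\n",
"        # Return a new object, just as df[mask] returns a new DataFrame\n",
"        return Traced(f'{self.name}[{key!r}]')\n",
"\n",
"    def __setitem__(self, key, value):\n",
"        Traced.log.append(f'{self.name}.__setitem__({key!r}, {value!r})')\n",
"\n",
"\n",
"df_like = Traced('df')\n",
"df_like['mask']['z'] = 0  # same shape as df[mask]['z'] = 0\n",
"\n",
"for entry in Traced.log:\n",
"    print(entry)\n",
"```\n",
"\n",
"The log shows that `__setitem__` is called on the intermediate object, not on `df_like`. If that intermediate object is a copy, as with `df[mask]`, the assignment changes only the copy."
]
},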
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## Using accessor methods for assignment\n",
"\n",
"Stojiljković advises that you:\n",
"\n",
"* **Avoid chained assignments** that combine 2 or more indexing operations like `df[mask]['z'] = 0` and `df.loc[mask][\"z\"] = 0`\n",
"* **Apply single assignments** with just one indexing operation, like `df.loc[mask, \"z\"] = 0`\n",
"\n",
"Other accessors that are generally safer include `iloc`, `at`, and `iat`. `at` and `iat` give access to a **single** element of a `DataFrame` / `Series` through a row/column combination: `at` by label and `iat` by integer position."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" x y z\n",
"a 1 1 0\n",
"b 2 3 98\n",
"c 4 9 0\n",
"d 8 27 0\n",
"e 16 81 64"
]
},
"execution_count": 34,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"df.loc[mask, \"z\"] = 0\n",
"df"
]
},
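{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"The other accessors mentioned above follow the same single-assignment pattern. The cell below is a minimal sketch rather than part of the original article: it rebuilds the example frame so that it runs on its own, and the values `-1`, `99`, and `65` are arbitrary choices made for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Rebuild the example frame so the sketch is self-contained\n",
"df = pd.DataFrame(\n",
"    {'x': 2 ** np.arange(5), 'y': 3 ** np.arange(5), 'z': [45, 98, 24, 11, 64]},\n",
"    index=['a', 'b', 'c', 'd', 'e'],\n",
")\n",
"\n",
"# .loc: label-based, here a boolean mask for the rows and a label for the column\n",
"df.loc[df['z'] < 50, 'z'] = 0\n",
"\n",
"# .iloc: position-based, first row and third column\n",
"df.iloc[0, 2] = -1\n",
"\n",
"# .at: a single element by row label and column label\n",
"df.at['b', 'z'] = 99\n",
"\n",
"# .iat: a single element by row position and column position\n",
"df.iat[4, 2] = 65\n",
"\n",
"print(df['z'].tolist())  # [-1, 99, 0, 0, 65]\n",
"```\n",
"\n",
"Each statement is a single indexing operation on `df`, so the assignment goes through one `__setitem__` call on the accessor and there is no intermediate object that could be a copy."
]
},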
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"You can control how Pandas reacts to chained assignment by calling `pd.set_option` with one of the following values:\n",
"\n",
"* `pd.set_option(\"mode.chained_assignment\", \"raise\")` raises an exception instead of issuing the warning\n",
"* `pd.set_option(\"mode.chained_assignment\", \"warn\")` issues the `SettingWithCopyWarning` (**default**)\n",
"* `pd.set_option(\"mode.chained_assignment\", None)` suppresses the warning"
]
},
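{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"If the setting should apply only to part of a notebook, `pd.option_context` can be used instead of `pd.set_option`. The following is a hedged sketch, not taken from the original article: it assumes a Pandas release in which the `mode.chained_assignment` option is available, and the names `before`, `inside`, and `after` are hypothetical.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Remember the current setting ('warn' unless it was changed earlier)\n",
"before = pd.get_option('mode.chained_assignment')\n",
"\n",
"# option_context applies the setting only inside the with block\n",
"with pd.option_context('mode.chained_assignment', 'raise'):\n",
"    inside = pd.get_option('mode.chained_assignment')\n",
"\n",
"# The previous setting is restored when the block ends\n",
"after = pd.get_option('mode.chained_assignment')\n",
"print(before, inside, after)\n",
"```\n",
"\n",
"Scoping the stricter `raise` mode to one block makes it possible to check a suspicious piece of code without changing the behavior of the rest of the session."
]
},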
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"For more information, continue reading Stojiljković's article."
]
}
],
"metadata": {
"kernelspec": {
"argv": [
"/usr/bin/python3",
"-m",
"ipykernel",
"--HistoryManager.enabled=False",
"--matplotlib=inline",
"-c",
"%config InlineBackend.figure_formats = set(['retina'])\nimport matplotlib; matplotlib.rcParams['figure.figsize'] = (12, 7)",
"-f",
"{connection_file}"
],
"display_name": "Python 3 (system-wide)",
"env": {
},
"language": "python",
"metadata": {
"cocalc": {
"description": "Python 3 programming language",
"priority": 100,
"url": "https://www.python.org/"
}
},
"name": "python3",
"resource_dir": "/ext/jupyter/kernels/python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}