{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Getting Started\n", "\n", "Execute the 2 cells below to import our usual data science libraries, and then load the final version of the access log `DataFrame` from our access log processing exercise." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs = pd.read_pickle('homework5.pkl')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Part 1 - Decoding the User Agent Column" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Later in this notebook you'll be doing some analysis of requests by **user agent**, which identifies the browser or other client software that made each request. To prepare for this, you'll standardize the values in the `User Agent` column to a restricted set of values." ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "First, use the `fillna` method to replace all missing values in the `User Agent` column with the string **Unknown**." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "# Assign the result back to the column rather than calling fillna with\n", "# inplace=True on a column, which triggers warnings in recent versions of pandas.\n", "logs['User Agent'] = logs['User Agent'].fillna(\"Unknown\")\n", "logs['User Agent'].head()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now, create a column in `logs` named `User Agent Lower` which is the result of converting the values in the `User Agent` column to lower case. See pages 220-222 (end of Section 7.3) to see how to do this efficiently. **You should not write a for loop** to accomplish this task.\n", "\n", "Display the first 10 values in the `User Agent Lower` column only to verify your code worked correctly."
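] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "As a quick illustration of these vectorized string methods (on a tiny made-up `Series`, not the log data), the `.str` accessor applies a string operation to every element at once:\n", "```\n", "import pandas as pd\n", "\n", "ua = pd.Series([\"Mozilla/5.0 (Windows NT 10.0)\", \"SomeBot/2.1\"])\n", "ua.str.lower()\n", "# 0    mozilla/5.0 (windows nt 10.0)\n", "# 1    somebot/2.1\n", "```"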
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs[\"User Agent Lower\"] = logs[\"User Agent\"].str.lower()\n", "logs[\"User Agent Lower\"].head(n=10)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Next, write one or more statements in the cell below that add a column named `Platform` to `logs`. The contents of `Platform` will initially be the value **Other**.\n", "\n", "After creating `Platform`, display the contents of the `User Agent` and `Platform` columns for the first 10 rows of `logs` to verify it worked correctly." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs['Platform'] = \"Other\"\n", "logs[['User Agent', 'Platform']].head(n=10)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "In the cell below, write code that looks for the following strings in each of the values in `User Agent Lower`, setting the value in the `Platform` column to the associated value.\n", "\n", "| `User Agent Lower` contains | Set `Platform` to |\n", "|---|---|\n", "| android | Android |\n", "| iphone | iPhone |\n", "| ipad | iPad |\n", "| macintosh | Mac |\n", "| windows | Windows |\n", "| bot | Web crawler |\n", "| spider | Web crawler |\n", "\n", "You can again reference pages 220-222 (end of section 7.3) from the course textbook for more information on how to efficiently process string values in a Pandas column. Do not write a `for` loop over the rows in `logs` to accomplish this task!\n", "\n", "Use the `loc` indexer when setting the values in the `Platform` column to avoid an error message from Pandas about setting values on a copy of a slice.\n", "\n", "Display the `User Agent Lower` and `Platform` columns for the first 10 rows in `logs` when you are done to verify your work." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ ], "source": [ "user_agent_mappings = {\n", "    \"android\": \"Android\",\n", "    \"iphone\": \"iPhone\",\n", "    \"ipad\": \"iPad\",\n", "    \"macintosh\": \"Mac\",\n", "    \"windows\": \"Windows\",\n", "    \"bot\": \"Web crawler\",\n", "    \"spider\": \"Web crawler\"\n", "}\n", "\n", "# This loops over the 7 search strings, not over the rows of logs;\n", "# each str.contains call processes the entire column at once.\n", "for search_string, platform in user_agent_mappings.items():\n", "    mask = logs[\"User Agent Lower\"].str.contains(search_string)\n", "    logs.loc[mask, \"Platform\"] = platform\n", "\n", "logs[[\"User Agent Lower\", \"Platform\"]].head(n=10)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now we want to set the `Platform` column to **Linux** for those values of `User Agent Lower` that contain *linux* but do not contain *android*.\n", "\n", "First, print out the number of rows in `logs` where the `Platform` column has the value *Android*. We will want to verify this number doesn't change when we set values for *linux*."
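] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Counting rows that satisfy a condition can be done by summing a boolean mask, since each `True` counts as 1. A minimal sketch with made-up values (not the log data):\n", "```\n", "import pandas as pd\n", "\n", "platform = pd.Series([\"Android\", \"Windows\", \"Android\", \"Mac\"])\n", "(platform == \"Android\").sum()\n", "# 2\n", "```"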
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "print((logs['Platform'] == \"Android\").sum())" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now write code in the cell below to change the `Platform` value to **Linux** when the `User Agent Lower` column contains *linux* and does not contain *android*. I used `np.logical_not` to help me with this.\n", "\n", "Again, use the `loc` method when setting values." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "# Search the lower-cased column so the tests match the lower-case strings above.\n", "mask = logs[\"User Agent Lower\"].str.contains(\"linux\") & np.logical_not(logs[\"User Agent Lower\"].str.contains(\"android\"))\n", "logs.loc[mask, \"Platform\"] = \"Linux\"" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now copy/paste your code that printed out the number of rows whose `Platform` column is *Android* into the cell below. Verify it is the same as it was previously. If it's not, check your statement above." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "print((logs['Platform'] == \"Android\").sum())" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Use the technique for using a dictionary to map values from one column into a new column at the top of page 203 (section 7.2, under the heading *Transforming Data Using a Function or Mapping*) to create a new column named `Device type` from the `Platform` column according to the following rules:\n", "* *Android*, *iPhone*, and *iPad* should be given the value **Mobile**.\n", "* *Windows*, *Mac*, and *Linux* should be given the value **Desktop**.\n", "* *Unknown*, *Other*, and *Web crawler* should be given the value **Unknown**."
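] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "A toy sketch of the dictionary-mapping technique (made-up values, not the assignment data): every element of the `Series` is looked up in the dictionary and replaced by the corresponding value.\n", "```\n", "import pandas as pd\n", "\n", "colors = pd.Series([\"red\", \"crimson\", \"navy\"])\n", "colors.map({\"red\": \"warm\", \"crimson\": \"warm\", \"navy\": \"cool\"})\n", "# 0    warm\n", "# 1    warm\n", "# 2    cool\n", "```"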
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "mapping = {\n", "    \"Android\": \"Mobile\",\n", "    \"iPhone\": \"Mobile\",\n", "    \"iPad\": \"Mobile\",\n", "    \"Windows\": \"Desktop\",\n", "    \"Mac\": \"Desktop\",\n", "    \"Linux\": \"Desktop\",\n", "    \"Unknown\": \"Unknown\",\n", "    \"Other\": \"Unknown\",\n", "    \"Web crawler\": \"Unknown\"\n", "}\n", "\n", "logs[\"Device type\"] = logs[\"Platform\"].map(mapping)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "To help verify the correctness of the above cell, write code in the cell below that displays the `Platform` and `Device type` columns for the first **20** rows whose `Platform` value is not *Other*." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs[logs[\"Platform\"] != \"Other\"][[\"Platform\", \"Device type\"]].head(n=20)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Part 2: Generating a column for the top level folder for the request\n", "\n", "Next we will want to do some analysis by the \"top-level\" folder for each request. This top-level folder will include values like **admissions**, **news**, and **academics**." ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "To get started, write the definition of a function named `parse_resource` in the cell below. `parse_resource` takes a single argument `r`, which is a string, and returns a value according to the following rules:\n", "* If `r` is the string \"/\", then `r` itself is returned.\n", "* Otherwise, `r` is split into `parts` based on the **/** character, and the **2nd component** of `parts` is examined. Call that 2nd component `base`:\n", "    * If `base` contains a **period**, then \"/\" is returned.\n", "    * If `base` contains a **question mark**, then the portion of `base` **before** the question mark is returned.\n", "    * Otherwise, `base` is returned.\n", "\n", "The `index` method on a string can be used to find the first index in a string where a particular character appears." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "def parse_resource(r):\n", "    if r == \"/\":\n", "        return r\n", "\n", "    # The 2nd component of the split is the text between the first two slashes.\n", "    base = r.split(\"/\")[1]\n", "\n", "    if \".\" in base:\n", "        return \"/\"\n", "    elif \"?\" in base:\n", "        return base[0:base.index(\"?\")]\n", "    else:\n", "        return base\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now use `map` to apply the `parse_resource` function to the values in the **Resource** column. Store the resulting `Series` object into `logs` with the name `Top-Level`." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs['Top-Level'] = logs['Resource'].map(parse_resource)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Write a statement in the cell below that prints out the number of **distinct** values in the `Top-Level` column. The answer should be **208**, but you (hopefully obviously) shouldn't just print that number literally." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "print(len(logs['Top-Level'].unique()))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Write one or more statements in the cell below that prints out how many requests were made for the top-level folder named **academics**."
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "print((logs['Top-Level'] == 'academics').sum())" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Write one or more statements in the cell below to get a `Series` object that summarizes how many requests occurred for each of the top 5 most frequently accessed values of `Top-Level`. **Exclude** the two values `_resources` and `graphics` when picking the top 5 most frequently accessed top-level folders.\n", "\n", "Hint: first create a subset of `logs` that excludes the rows where the `Top-Level` column is either `_resources` or `graphics`. Then use the `value_counts` method on that subset, followed by a slicing operation. `value_counts` always returns values sorted in descending order." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "not_excluded = (logs['Top-Level'] != '_resources') & (logs['Top-Level'] != 'graphics')\n", "top_level = logs['Top-Level']\n", "most_frequent_top_level_values = top_level[not_excluded].value_counts()[0:5]\n", "print(most_frequent_top_level_values)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Write one or more statements in the cell below that create a **tuple** named `date_range` which contains the **smallest** and **largest** values in the `Date` column for `logs`. Print out the value of `date_range` after creating the tuple." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "date_range = (logs['Date'].min(), logs['Date'].max())\n", "print(date_range)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "The values in `date_range` will be Pandas **Timestamp** objects. These objects have a `strftime` method that works like the standard `datetime.strftime` method we have used before.\n", "\n", "For example, we can print out just the year for the first element in `date_range` like this:\n", "```\n", "print(date_range[0].strftime(\"%Y\"))\n", "```\n", "\n", "Write one or more statements in the cell below that creates a string variable named `date_range_str`. The value of `date_range_str` should look like **Apr 01, 2021 - Apr 05, 2021**, where the two dates are replaced by the values in `date_range`. [Use the strftime documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) to determine the correct formatting codes.\n", "\n", "Print out the value of `date_range_str` once you've created it." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "format_with_year = \"%b %d, %Y\"\n", "date_range_str = f\"{date_range[0].strftime(format_with_year)} - {date_range[1].strftime(format_with_year)}\"\n", "print(date_range_str)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Write code in the cell below so that if the year components of the two values in `date_range` **are the same**, `date_range_str` only includes the year at the end.\n", "\n", "For example, **Apr 01 - Apr 05, 2021** would be the output for the example above. Otherwise, leave the value of `date_range_str` as it was set in the cell above. Print the value of `date_range_str` at the end of the cell.\n", "\n", "If `ts` is a `Timestamp` object, you can get the year associated with `ts` like this:\n", "```\n", "ts.year\n", "```" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "format_without_year = \"%b %d\"\n", "\n", "if date_range[0].year == date_range[1].year:\n", "    date_range_str = f\"{date_range[0].strftime(format_without_year)} - {date_range[1].strftime(format_with_year)}\"\n", "\n", "print(date_range_str)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Use the \"recipe\" described in the textbook on page 279 in the cell below to produce a **horizontal bar chart** showing the number of requests per top-level value for those top 5 most frequent top-level folders.\n", "* The **chart title** should be *Most frequent top-level folders date_range*, where **date_range** is replaced with the contents of the `date_range_str` variable you created above.\n", "* The **color of the bars** should be the string **#036** which is a Hope College blue color.\n", "* The **x-axis label** should be *Number of requests*, since the bars extend horizontally.\n", "* The **y-axis label** should be *Top-level folder*.\n", "\n", "I set the chart title and color using arguments to the Pandas plotting function. For the axis labels, I saved the **axes** object returned by the Pandas plotting function into a variable, and then called methods on that object to specify the x-axis and y-axis labels. There is example code to set the x-axis label on page 268 of the textbook; similar code can be used to set the y-axis label."
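] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "The chart title will embed `date_range_str`. As a standalone reminder (plain `datetime` values, not the log data), the format codes used to build that string behave like this:\n", "```\n", "from datetime import datetime\n", "\n", "datetime(2021, 4, 1).strftime(\"%b %d, %Y\")\n", "# 'Apr 01, 2021'\n", "```"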
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "axes = most_frequent_top_level_values.plot.barh(title=f\"Most frequent top-level folders {date_range_str}\", color=\"#036\")\n", "axes.set_xlabel(\"Number of requests\")\n", "axes.set_ylabel(\"Top-level folder\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Part 3 - Recreating `location_summary` using GroupBy" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "In our previous work with this data, we computed two columns named **On campus** and **Off campus** that have boolean values. Pandas calls these *dummy variables*. To simplify some of our computations in the following cells, write code in the cell below that creates a new column named **Request location**. The values in **Request location** should be *On campus* if the value of the `On campus` column is True, otherwise the value should be *Off campus*.\n", "\n", "I did this using `map`, but it can also be done using a series of assignment statements. If you choose to do it using `map`, you will find [conditional expressions](https://docs.python.org/3/reference/expressions.html#conditional-expressions) useful in implementing a lambda function. Or you can also define a function to be passed to `map` instead of using a lambda function." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs['Request location'] = logs[\"On campus\"].map(lambda r: \"On campus\" if r else \"Off campus\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Write a statement in the cell below that displays only the values of the `IP address`, `On campus`, `Off campus`, and `Request location` columns for the first 10 rows in `logs`."
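] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "A conditional expression inside a lambda, like the one used in the `map` call above, reads value-if-true, condition, value-if-false. A toy sketch of the pattern:\n", "```\n", "label = lambda on_campus: \"On campus\" if on_campus else \"Off campus\"\n", "\n", "label(True)   # 'On campus'\n", "label(False)  # 'Off campus'\n", "```"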
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs[[\"IP address\", \"On campus\", \"Off campus\", \"Request location\"]].head(n=10)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Write one or more statements in the cell below that use `groupby` and `count` to create a `Series` named **location_count**. `location_count` should contain the number of requests made from each type of location. Print out the value of **location_count** after you create it. \n", "\n", "You should see approximately 38K requests from off-campus, and 31K requests from on-campus." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "grouped_by_location = logs.groupby(\"Request location\")\n", "location_count = grouped_by_location['IP address'].count()\n", "print(location_count)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Repeat the above to create `Series` objects named `location_total`, `location_min`, `location_max`, and `location_mean` that compute the corresponding statistics for the values in the `Size` column using the `groupby` method. \n", "\n", "Print the value in `location_total` **in millions** to check your work. You should get approximately 2625M bytes from off campus, and 3376M bytes from on campus." 
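] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Each aggregation method on a grouped column produces one value per group; `sum`, `min`, `max`, and `mean` all follow the same pattern. A toy sketch with made-up sizes (not the log data):\n", "```\n", "import pandas as pd\n", "\n", "df = pd.DataFrame({\"Location\": [\"On\", \"Off\", \"On\"], \"Size\": [10, 5, 20]})\n", "df.groupby(\"Location\")[\"Size\"].sum()\n", "# Location\n", "# Off     5\n", "# On     30\n", "```"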
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "location_total = grouped_by_location['Size'].sum()\n", "location_min = grouped_by_location['Size'].min()\n", "location_max = grouped_by_location['Size'].max()\n", "location_mean = grouped_by_location['Size'].mean()\n", "\n", "print(location_total/1e6)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now create a new data frame named `location_summary` with the columns **Count**, **Min size**, **Max size**, **Mean size**, and **Total** containing the `Series` objects created above. \n", "\n", "Use the `head` method to display both rows." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "location_summary = pd.DataFrame({'Count': location_count, 'Min size': location_min, 'Max size': location_max, 'Mean size': location_mean, 'Total': location_total})\n", "location_summary.head(n=2)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Part 4 - Grouping by multiple columns\n", "\n", "The following code illustrates the process of grouping by multiple columns, and then generating a bar chart from that data.\n", "\n", "First, we generate a new `DataFrame` that contains only those rows whose Top-Level folder is one of the 5 most frequent folders." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "parts = [logs[logs['Top-Level'] == value] for value in most_frequent_top_level_values.index]\n", "top_folder_requests = pd.concat(parts, axis=0)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now we can \n", "* Group `top_folder_requests` by the `Top-Level` and `Request location` columns.\n", "* Compute a **count** for each combination of values in these columns.\n", "* Display the resulting `DataFrame`." 
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "grouped_by_level_and_request = top_folder_requests.groupby(['Top-Level', 'Request location'])\n", "counts = grouped_by_level_and_request.count()\n", "counts[['IP address', 'Date']]\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Note that the **index** for `counts` is a **multi-level** index (also called a **hierarchical index**)." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "counts.index" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "In order to generate a stacked bar chart, we need to take one level of the index and create a **column** for each value of that level. This is what the `unstack` method does. In this case, it will create new columns named `Off campus` and `On campus`, containing the counts associated with those values." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "unstacked = counts[\"IP address\"].unstack()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now that we have columns for **Off campus** and **On campus**, it's pretty straightforward for us to create a couple of bar charts from the `unstacked` data frame." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "fig, axes = plt.subplots(1, 2)\n", "unstacked.plot.bar(ax=axes[0], stacked=True)\n", "unstacked.plot.bar(ax=axes[1], stacked=False)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Your final task is to generate the chart shown below."
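] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "As with the example above, `unstack` turns the innermost index level into columns. A toy sketch with a two-level index (made-up values, not the log data):\n", "```\n", "import pandas as pd\n", "\n", "idx = pd.MultiIndex.from_product([[\"news\", \"sports\"], [\"Off\", \"On\"]])\n", "counts = pd.Series([1, 2, 3, 4], index=idx)\n", "counts.unstack()\n", "#         Off  On\n", "# news      1   2\n", "# sports    3   4\n", "```"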
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "grouped_by_device_type_and_request_location = logs.groupby([\"Device type\", \"Request location\"])\n", "counts = grouped_by_device_type_and_request_location[\"IP address\"].count()/1000\n", "unstacked = counts.unstack()\n", "unstacked.head()" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ ], "source": [ "plot = unstacked.transpose()[[\"Desktop\", \"Mobile\"]].plot.bar(stacked=False, title=\"Distribution of requests by device type and request location\", rot=0)\n", "plot.set_ylabel(\"Number of requests (thousands)\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "**The two cells below aren't part of the solution; they were just used to generate an image and data file for students to use while completing the assignment.**" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "plot.get_figure().savefig(\"Location_By_Device_Type.png\")" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs.to_pickle('final-logs.pkl')\n", "most_frequent_top_level_values.to_pickle('most_frequest_top_level_values.pkl')" ] } ], "metadata": { "kernelspec": { "argv": [ "/usr/bin/python3", "-m", "ipykernel", "--HistoryManager.enabled=False", "--matplotlib=inline", "-c", "%config InlineBackend.figure_formats = set(['retina'])\nimport matplotlib; matplotlib.rcParams['figure.figsize'] = (12, 7)", "-f", "{connection_file}" ], "display_name": "Python 3 (system-wide)", "env": { }, "language": "python", "metadata": { "cocalc": { "description": "Python 3 programming language", "priority": 100, "url": "https://www.python.org/" } }, "name": "python3", "resource_dir": "/ext/jupyter/kernels/python3" }, "language_info": { "codemirror_mode": { "name": "ipython", 
"version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }