{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Getting Started\n", "\n", "Execute the 2 cells below to import our usual data science libraries, and then load the final version of the access log `DataFrame` from our access log processing exercise." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs = pd.read_pickle('homework5.pkl')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Part 1 - Decoding the User Agent Column" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Later in this notebook you'll be doing some analysis of requests by **user agent**, which identifies the browser or other client software that made each request. To prepare for this, you'll standardize the values in the `User Agent` column to a restricted set of values." ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "First, use the `fillna` method to replace all missing values in the `User Agent` column with the string **Unknown**." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "# Assign the result back to the column rather than calling fillna with\n", "# inplace=True on a column, which triggers warnings in recent versions of pandas.\n", "logs['User Agent'] = logs['User Agent'].fillna(\"Unknown\")\n", "logs['User Agent'].head()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now, create a column in `logs` named `User Agent Lower` which is the result of converting the values in the `User Agent` column to lower case. See pages 220-222 (end of Section 7.3) to see how to do this efficiently. **You should not write a for loop** to accomplish this task.\n", "\n", "Display the first 10 values in the `User Agent Lower` column only to verify your code worked correctly."
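] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "As a quick illustration of these vectorized string methods (on a tiny made-up `Series`, not the log data), the `.str` accessor applies a string operation to every element at once:\n", "```\n", "import pandas as pd\n", "\n", "ua = pd.Series([\"Mozilla/5.0 (Windows NT 10.0)\", \"SomeBot/2.1\"])\n", "ua.str.lower()\n", "# 0    mozilla/5.0 (windows nt 10.0)\n", "# 1    somebot/2.1\n", "```"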
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs[\"User Agent Lower\"] = logs[\"User Agent\"].str.lower()\n", "logs[\"User Agent Lower\"].head(n=10)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Next, write one or more statements in the cell below that add a column named `Platform` to `logs`. The contents of `Platform` will initially be the value **Other**.\n", "\n", "After creating `Platform`, display the contents of the `User Agent` and `Platform` columns for the first 10 rows of `logs` to verify it worked correctly." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs['Platform'] = \"Other\"\n", "logs[['User Agent', 'Platform']].head(n=10)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "In the cell below, write code that looks for the following strings in each of the values in `User Agent Lower`, setting the value in the `Platform` column to the associated value.\n", "\n", "| `User Agent Lower` contains | Set `Platform` to |\n", "|---|---|\n", "| android | Android |\n", "| iphone | iPhone |\n", "| ipad | iPad |\n", "| macintosh | Mac |\n", "| windows | Windows |\n", "| bot | Web crawler |\n", "| spider | Web crawler |\n", "\n", "You can again reference pages 220-222 (end of section 7.3) from the course textbook for more information on how to efficiently process string values in a Pandas column. Do not write a `for` loop over the rows in `logs` to accomplish this task!\n", "\n", "Use the `loc` indexer when setting the values in the `Platform` column to avoid an error message from Pandas about setting values on a copy of a slice.\n", "\n", "Display the `User Agent Lower` and `Platform` columns for the first 10 rows in `logs` when you are done to verify your work." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ ], "source": [ "user_agent_mappings = {\n", "    \"android\": \"Android\",\n", "    \"iphone\": \"iPhone\",\n", "    \"ipad\": \"iPad\",\n", "    \"macintosh\": \"Mac\",\n", "    \"windows\": \"Windows\",\n", "    \"bot\": \"Web crawler\",\n", "    \"spider\": \"Web crawler\"\n", "}\n", "\n", "# This loops over the 7 search strings, not over the rows of logs;\n", "# each str.contains call processes the entire column at once.\n", "for search_string, platform in user_agent_mappings.items():\n", "    mask = logs[\"User Agent Lower\"].str.contains(search_string)\n", "    logs.loc[mask, \"Platform\"] = platform\n", "\n", "logs[[\"User Agent Lower\", \"Platform\"]].head(n=10)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now we want to set the `Platform` column to **Linux** for those values of `User Agent Lower` that contain *linux* but do not contain *android*.\n", "\n", "First, print out the number of rows in `logs` where the `Platform` column has the value *Android*. We will want to verify this number doesn't change when we set values for *linux*."
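] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Counting rows that satisfy a condition can be done by summing a boolean mask, since each `True` counts as 1. A minimal sketch with made-up values (not the log data):\n", "```\n", "import pandas as pd\n", "\n", "platform = pd.Series([\"Android\", \"Windows\", \"Android\", \"Mac\"])\n", "(platform == \"Android\").sum()\n", "# 2\n", "```"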
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "print((logs['Platform'] == \"Android\").sum())" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now write code in the cell below to change the `Platform` value to **Linux** when the `User Agent Lower` column contains *linux* and does not contain *android*. I used `np.logical_not` to help me with this.\n", "\n", "Again, use the `loc` method when setting values." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "# Search the lower-cased column so the tests match the lower-case strings above.\n", "mask = logs[\"User Agent Lower\"].str.contains(\"linux\") & np.logical_not(logs[\"User Agent Lower\"].str.contains(\"android\"))\n", "logs.loc[mask, \"Platform\"] = \"Linux\"" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now copy/paste your code that printed out the number of rows whose `Platform` column is *Android* into the cell below. Verify it is the same as it was previously. If it's not, check your statement above." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "print((logs['Platform'] == \"Android\").sum())" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Use the technique for using a dictionary to map values from one column into a new column at the top of page 203 (section 7.2, under the heading *Transforming Data Using a Function or Mapping*) to create a new column named `Device type` from the `Platform` column according to the following rules:\n", "* *Android*, *iPhone*, and *iPad* should be given the value **Mobile**.\n", "* *Windows*, *Mac*, and *Linux* should be given the value **Desktop**.\n", "* *Unknown*, *Other*, and *Web crawler* should be given the value **Unknown**."
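] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "A toy sketch of the dictionary-mapping technique (made-up values, not the assignment data): every element of the `Series` is looked up in the dictionary and replaced by the corresponding value.\n", "```\n", "import pandas as pd\n", "\n", "colors = pd.Series([\"red\", \"crimson\", \"navy\"])\n", "colors.map({\"red\": \"warm\", \"crimson\": \"warm\", \"navy\": \"cool\"})\n", "# 0    warm\n", "# 1    warm\n", "# 2    cool\n", "```"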
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "mapping = {\n", "    \"Android\": \"Mobile\",\n", "    \"iPhone\": \"Mobile\",\n", "    \"iPad\": \"Mobile\",\n", "    \"Windows\": \"Desktop\",\n", "    \"Mac\": \"Desktop\",\n", "    \"Linux\": \"Desktop\",\n", "    \"Unknown\": \"Unknown\",\n", "    \"Other\": \"Unknown\",\n", "    \"Web crawler\": \"Unknown\"\n", "}\n", "\n", "logs[\"Device type\"] = logs[\"Platform\"].map(mapping)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "To help verify the correctness of the above cell, write code in the cell below that displays the `Platform` and `Device type` columns for the first **20** rows whose `Platform` value is not *Other*." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs[logs[\"Platform\"] != \"Other\"][[\"Platform\", \"Device type\"]].head(n=20)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Part 2: Generating a column for the top level folder for the request\n", "\n", "Next we will want to do some analysis by the \"top-level\" folder for each request. This top-level folder will include values like **admissions**, **news**, and **academics**." ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "To get started, write the definition of a function named `parse_resource` in the cell below. `parse_resource` takes a single argument `r`, which is a string, and returns a value according to the following rules:\n", "* If `r` is the string \"/\", then `r` itself is returned.\n", "* Otherwise, `r` is split into `parts` based on the **/** character, and the **2nd component** of `parts` is examined. Call that 2nd component `base`:\n", "    * If `base` contains a **period**, then \"/\" is returned.\n", "    * If `base` contains a **question mark**, then the portion of `base` **before** the question mark is returned.\n", "    * Otherwise, `base` is returned.\n", "\n", "The `index` method on a string can be used to find the first index in a string where a particular character appears." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "def parse_resource(r):\n", "    if r == \"/\":\n", "        return r\n", "\n", "    # The 2nd component of the split is the text between the first two slashes.\n", "    base = r.split(\"/\")[1]\n", "\n", "    if \".\" in base:\n", "        return \"/\"\n", "    elif \"?\" in base:\n", "        return base[0:base.index(\"?\")]\n", "    else:\n", "        return base\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now use `map` to apply the `parse_resource` function to the values in the **Resource** column. Store the resulting `Series` object into `logs` with the name `Top-Level`." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs['Top-Level'] = logs['Resource'].map(parse_resource)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Write a statement in the cell below that prints out the number of **distinct** values in the `Top-Level` column. The answer should be **208**, but you (hopefully obviously) shouldn't just print that number literally." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "print(len(logs['Top-Level'].unique()))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Write one or more statements in the cell below that prints out how many requests were made for the top-level folder named **academics**."
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "print((logs['Top-Level'] == 'academics').sum())" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Write one or more statements in the cell below to get a `Series` object that summarizes how many requests occurred for each of the top 5 most frequently accessed values of `Top-Level`. **Exclude** the two values `_resources` and `graphics` when picking the top 5 most frequently accessed top-level folders.\n", "\n", "Hint: first create a subset of `logs` that excludes the rows where the `Top-Level` column is either `_resources` or `graphics`. Then use the `value_counts` method on that subset, followed by a slicing operation. `value_counts` always returns values sorted in descending order." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "not_excluded = (logs['Top-Level'] != '_resources') & (logs['Top-Level'] != 'graphics')\n", "top_level = logs['Top-Level']\n", "most_frequent_top_level_values = top_level[not_excluded].value_counts()[0:5]\n", "print(most_frequent_top_level_values)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Write one or more statements in the cell below that create a **tuple** named `date_range` which contains the **smallest** and **largest** values in the `Date` column for `logs`. Print out the value of `date_range` after creating the tuple." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "date_range = (logs['Date'].min(), logs['Date'].max())\n", "print(date_range)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "The values in `date_range` will be Pandas **Timestamp** objects. These objects have a `strftime` method that works like the standard `datetime.strftime` method we have used before.\n", "\n", "For example, we can print out just the year for the first element in `date_range` like this:\n", "```\n", "print(date_range[0].strftime(\"%Y\"))\n", "```\n", "\n", "Write one or more statements in the cell below that creates a string variable named `date_range_str`. The value of `date_range_str` should look like **Apr 01, 2021 - Apr 05, 2021**, where the two dates are replaced by the values in `date_range`. [Use the strftime documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) to determine the correct formatting codes.\n", "\n", "Print out the value of `date_range_str` once you've created it." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "format_with_year = \"%b %d, %Y\"\n", "date_range_str = f\"{date_range[0].strftime(format_with_year)} - {date_range[1].strftime(format_with_year)}\"\n", "print(date_range_str)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Write code in the cell below so that if the year components of the two values in `date_range` **are the same**, `date_range_str` only includes the year at the end.\n", "\n", "For example, **Apr 01 - Apr 05, 2021** would be the output for the example above. Otherwise, leave the value of `date_range_str` as it was set in the cell above. Print the value of `date_range_str` at the end of the cell.\n", "\n", "If `ts` is a `Timestamp` object, you can get the year associated with `ts` like this:\n", "```\n", "ts.year\n", "```" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "format_without_year = \"%b %d\"\n", "\n", "if date_range[0].year == date_range[1].year:\n", "    date_range_str = f\"{date_range[0].strftime(format_without_year)} - {date_range[1].strftime(format_with_year)}\"\n", "\n", "print(date_range_str)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Use the \"recipe\" described in the textbook on page 279 in the cell below to produce a **horizontal bar chart** showing the number of requests per top-level value for those top 5 most frequent top-level folders.\n", "* The **chart title** should be *Most frequent top-level folders date_range*, where **date_range** is replaced with the contents of the `date_range_str` variable you created above.\n", "* The **color of the bars** should be the string **#036** which is a Hope College blue color.\n", "* The **x-axis label** should be *Number of requests*, since the bars extend horizontally.\n", "* The **y-axis label** should be *Top-level folder*.\n", "\n", "I set the chart title and color using arguments to the Pandas plotting function. For the axis labels, I saved the **axes** object returned by the Pandas plotting function into a variable, and then called methods on that object to specify the x-axis and y-axis labels. There is example code to set the x-axis label on page 268 of the textbook; similar code can be used to set the y-axis label."
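] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "The chart title will embed `date_range_str`. As a standalone reminder (plain `datetime` values, not the log data), the format codes used to build that string behave like this:\n", "```\n", "from datetime import datetime\n", "\n", "datetime(2021, 4, 1).strftime(\"%b %d, %Y\")\n", "# 'Apr 01, 2021'\n", "```"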
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "axes = most_frequent_top_level_values.plot.barh(title=f\"Most frequent top-level folders {date_range_str}\", color=\"#036\")\n", "axes.set_xlabel(\"Number of requests\")\n", "axes.set_ylabel(\"Top-level folder\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Part 3 - Recreating `location_summary` using GroupBy" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "In our previous work with this data, we computed two columns named **On campus** and **Off campus** that have boolean values. Pandas calls these *dummy variables*. To simplify some of our computations in the following cells, write code in the cell below that creates a new column named **Request location**. The values in **Request location** should be *On campus* if the value of the `On campus` column is True, otherwise the value should be *Off campus*.\n", "\n", "I did this using `map`, but it can also be done using a series of assignment statements. If you choose to do it using `map`, you will find [conditional expressions](https://docs.python.org/3/reference/expressions.html#conditional-expressions) useful in implementing a lambda function. Or you can also define a function to be passed to `map` instead of using a lambda function." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs['Request location'] = logs[\"On campus\"].map(lambda r: \"On campus\" if r else \"Off campus\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Write a statement in the cell below that displays only the values of the `IP address`, `On campus`, `Off campus`, and `Request location` columns for the first 10 rows in `logs`."
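] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "A conditional expression inside a lambda, like the one used in the `map` call above, reads value-if-true, condition, value-if-false. A toy sketch of the pattern:\n", "```\n", "label = lambda on_campus: \"On campus\" if on_campus else \"Off campus\"\n", "\n", "label(True)   # 'On campus'\n", "label(False)  # 'Off campus'\n", "```"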
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs[[\"IP address\", \"On campus\", \"Off campus\", \"Request location\"]].head(n=10)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Write one or more statements in the cell below that use `groupby` and `count` to create a `Series` named **location_count**. `location_count` should contain the number of requests made from each type of location. Print out the value of **location_count** after you create it. \n", "\n", "You should see approximately 38K requests from off-campus, and 31K requests from on-campus." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "grouped_by_location = logs.groupby(\"Request location\")\n", "location_count = grouped_by_location['IP address'].count()\n", "print(location_count)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Repeat the above to create `Series` objects named `location_total`, `location_min`, `location_max`, and `location_mean` that compute the corresponding statistics for the values in the `Size` column using the `groupby` method. \n", "\n", "Print the value in `location_total` **in millions** to check your work. You should get approximately 2625M bytes from off campus, and 3376M bytes from on campus." 
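] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Each aggregation method on a grouped column produces one value per group; `sum`, `min`, `max`, and `mean` all follow the same pattern. A toy sketch with made-up sizes (not the log data):\n", "```\n", "import pandas as pd\n", "\n", "df = pd.DataFrame({\"Location\": [\"On\", \"Off\", \"On\"], \"Size\": [10, 5, 20]})\n", "df.groupby(\"Location\")[\"Size\"].sum()\n", "# Location\n", "# Off     5\n", "# On     30\n", "```"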
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "location_total = grouped_by_location['Size'].sum()\n", "location_min = grouped_by_location['Size'].min()\n", "location_max = grouped_by_location['Size'].max()\n", "location_mean = grouped_by_location['Size'].mean()\n", "\n", "print(location_total/1e6)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now create a new data frame named `location_summary` with the columns **Count**, **Min size**, **Max size**, **Mean size**, and **Total** containing the `Series` objects created above. \n", "\n", "Use the `head` method to display both rows." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "location_summary = pd.DataFrame({'Count': location_count, 'Min size': location_min, 'Max size': location_max, 'Mean size': location_mean, 'Total': location_total})\n", "location_summary.head(n=2)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Part 4 - Grouping by multiple columns\n", "\n", "The following code illustrates the process of grouping by multiple columns, and then generating a bar chart from that data.\n", "\n", "First, we generate a new `DataFrame` that contains only those rows whose Top-Level folder is one of the 5 most frequent folders." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "parts = [logs[logs['Top-Level'] == value] for value in most_frequent_top_level_values.index]\n", "top_folder_requests = pd.concat(parts, axis=0)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now we can \n", "* Group `top_folder_requests` by the `Top-Level` and `Request location` columns.\n", "* Compute a **count** for each combination of values in these columns.\n", "* Display the resulting `DataFrame`." 
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "grouped_by_level_and_request = top_folder_requests.groupby(['Top-Level', 'Request location'])\n", "counts = grouped_by_level_and_request.count()\n", "counts[['IP address', 'Date']]\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Note that the **index** for `counts` is a **multi-level** index (also called a **hierarchical index**)." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "counts.index" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "In order to generate a stacked bar chart, we need to take one level of the index and create a **column** for each value of that level. This is what the `unstack` method does. In this case, it will create new columns named `Off campus` and `On campus`, containing the counts associated with those values." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "unstacked = counts[\"IP address\"].unstack()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Now that we have columns for **Off campus** and **On campus**, it's pretty straightforward for us to create a couple of bar charts from the `unstacked` data frame." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "fig, axes = plt.subplots(1, 2)\n", "unstacked.plot.bar(ax=axes[0], stacked=True)\n", "unstacked.plot.bar(ax=axes[1], stacked=False)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Your final task is to generate the chart shown below."
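] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "As with the example above, `unstack` turns the innermost index level into columns. A toy sketch with a two-level index (made-up values, not the log data):\n", "```\n", "import pandas as pd\n", "\n", "idx = pd.MultiIndex.from_product([[\"news\", \"sports\"], [\"Off\", \"On\"]])\n", "counts = pd.Series([1, 2, 3, 4], index=idx)\n", "counts.unstack()\n", "#         Off  On\n", "# news      1   2\n", "# sports    3   4\n", "```"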
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "grouped_by_device_type_and_request_location = logs.groupby([\"Device type\", \"Request location\"])\n", "counts = grouped_by_device_type_and_request_location[\"IP address\"].count()/1000\n", "unstacked = counts.unstack()\n", "unstacked.head()" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ ], "source": [ "plot = unstacked.transpose()[[\"Desktop\", \"Mobile\"]].plot.bar(stacked=False, title=\"Distribution of requests by device type and request location\", rot=0)\n", "plot.set_ylabel(\"Number of requests (thousands)\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "**The two cells below aren't part of the solution; they were just used to generate an image and data file for students to use while completing the assignment.**" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "plot.get_figure().savefig(\"Location_By_Device_Type.png\")" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "collapsed": false }, "outputs": [ ], "source": [ "logs.to_pickle('final-logs.pkl')\n", "most_frequent_top_level_values.to_pickle('most_frequest_top_level_values.pkl')" ] } ], "metadata": { "kernelspec": { "argv": [ "/usr/bin/python3", "-m", "ipykernel", "--HistoryManager.enabled=False", "--matplotlib=inline", "-c", "%config InlineBackend.figure_formats = set(['retina'])\nimport matplotlib; matplotlib.rcParams['figure.figsize'] = (12, 7)", "-f", "{connection_file}" ], "display_name": "Python 3 (system-wide)", "env": { }, "language": "python", "metadata": { "cocalc": { "description": "Python 3 programming language", "priority": 100, "url": "https://www.python.org/" } }, "name": "python3", "resource_dir": "/ext/jupyter/kernels/python3" }, "language_info": { "codemirror_mode": { "name": "ipython", 
"version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }