
Getting Started

Execute the 2 cells below to import our usual data science libraries, and then load the final version of the access log DataFrame from our access log processing exercise.

import pandas as pd
import numpy as np
logs = pd.read_pickle('homework5.pkl')

Part 1 - Decoding the User Agent Column

Later in this notebook you'll be doing some analysis of requests by user agent, the technical term for the software (typically a browser or web crawler) that made each request. To prepare for this, you'll standardize the values in the User Agent column to a restricted set of values.

First, use the fillna method to replace all missing values in the User Agent column with the string Unknown.

logs['User Agent'].fillna("Unknown", inplace=True)
logs['User Agent'].head()

Now, create a column in logs named User Agent Lower which is the result of converting the values in the User Agent column to lower case. See pages 220-222 (end of Section 7.3) to see how to do this efficiently. You should not write a for loop to accomplish this task.

Display the first 10 values in the User Agent Lower column only to verify your code worked correctly.

logs["User Agent Lower"] = logs["User Agent"].str.lower() logs["User Agent Lower"].head(n=10)

Next, write one or more statements in the cell below that add a column named Platform to logs. The contents of Platform will initially be the value Other.

After creating Platform, display the contents of the User Agent and Platform columns for the first 10 rows of logs to verify it worked correctly.

logs['Platform'] = "Other" logs[['User Agent', 'Platform']].head(n=10)

In the cell below, write code that looks for the following strings in each of the values in User Agent Lower, setting the value in the Platform column to the associated value.

User Agent Lower contains     Set Platform to
android                       Android
iphone                        iPhone
ipad                          iPad
macintosh                     Mac
windows                       Windows
bot                           Web crawler
spider                        Web crawler

You can again reference pages 220-222 (end of section 7.3) from the course textbook for more information on how to efficiently process string values in a Pandas column. Do not write a for loop over the rows in logs to accomplish this task!

Use the loc indexer when setting the values in the Platform column to avoid an error message from Pandas about setting values on a copy of a slice.
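To see why this matters, here is a minimal sketch (using a small hypothetical DataFrame df, not the assignment data) contrasting chained indexing, which can write to a temporary copy, with loc, which selects and assigns in one step:

df = pd.DataFrame({"kind": ["bot", "browser"], "label": ["Other", "Other"]})
# Chained indexing like the line below may modify a temporary copy
# and triggers SettingWithCopyWarning:
#     df[df["kind"] == "bot"]["label"] = "Web crawler"
# loc performs the row selection and the assignment on df itself:
df.loc[df["kind"] == "bot", "label"] = "Web crawler"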

Display the User Agent Lower and Platform columns for the first 10 rows in logs when you are done to verify your work.

user_agent_mappings = {
    "android": "Android",
    "iphone": "iPhone",
    "ipad": "iPad",
    "macintosh": "Mac",
    "windows": "Windows",
    "bot": "Web crawler",
    "spider": "Web crawler"
}
# Loop over the seven search strings (not over the rows of logs), using a
# boolean mask with loc to set Platform for every matching row at once.
for search_string, platform in user_agent_mappings.items():
    mask = logs["User Agent Lower"].str.contains(search_string)
    logs.loc[mask, "Platform"] = platform
logs[["User Agent Lower", "Platform"]].head(n=10)

Now we want to set the Platform column to Linux for those values of User Agent Lower that contain linux but do not contain android.

First, print out the number of rows in logs where the Platform column has the value Android. We will want to verify this number doesn't change when we set values for linux.

print((logs['Platform'] == "Android").sum())

Now write code in the cell below to change the Platform value to Linux when the User Agent Lower column contains linux and does not contain android. I used np.logical_not to help me with this.

Again, use loc when setting values.

mask = logs["User Agent"].str.contains("Linux") & np.logical_not(logs["User Agent"].str.contains("Android")) logs.loc[mask, "Platform"] = "Linux"

Now copy/paste your code that printed out the number of rows whose Platform column is Android into the cell below. Verify it is the same as it was previously. If it's not, check your statement above.

print((logs['Platform'] == "Android").sum())

Use the dictionary-mapping technique described at the top of page 203 (Section 7.2, under the heading Transforming Data Using a Function or Mapping) to create a new column named Device type from the Platform column, according to the following rules:

  • Android, iPhone, and iPad should be given the value Mobile.

  • Windows, Mac, and Linux should be given the value Desktop.

  • Unknown, Other, and Web crawler should be given the value Unknown.

mapping = {
    "Android": "Mobile",
    "iPhone": "Mobile",
    "iPad": "Mobile",
    "Windows": "Desktop",
    "Mac": "Desktop",
    "Linux": "Desktop",
    "Unknown": "Unknown",
    "Other": "Unknown",
    "Web crawler": "Unknown"
}
logs["Device type"] = logs["Platform"].map(mapping)

To help verify the correctness of the above cell, write code in the cell below that displays the Platform and Device type columns for the first 20 rows whose Platform value is not Other.

logs[logs["Platform"] != "Other"][["Platform", "Device type"]].head(n=20)

Part 2: Generating a column for the top level folder for the request

Next we will want to do some analysis by the "top-level" folder for each request. This top-level folder will include values like admissions, news, and academics.

To get started, write the definition of a function named parse_resource in the cell below. parse_resource takes a single argument r, which is a string, and returns a value according to the following rules:

  • If r is the string "/", then r itself is returned.

  • Otherwise, r is split into parts based on the / character, and the 2nd component of parts is examined. Call that 2nd component base:

    • If base contains a period, then "/" is returned.

    • If base contains a question mark, then the portion of base before the question mark is returned.

    • Otherwise, base is returned.

The index method on a string can be used to find the first index in a string where a particular character appears.
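For example (a throwaway snippet, unrelated to the solution):

base = "apply?term=fall"
print(base.index("?"))           # prints 5
print(base[0:base.index("?")])   # prints apply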

def parse_resource(r):
    if r == "/":
        return "/"
    else:
        parts = r.split("/")
        base = parts[1]
        if "." in base:
            return "/"
        elif "?" in base:
            return base[0:base.index("?")]
        else:
            return base

Now use map to apply the parse_resource function to the values in the Resource column. Store the resulting Series object in logs under the name Top-Level.

logs['Top-Level'] = logs['Resource'].map(parse_resource)

Write a statement in the cell below that prints out the number of distinct values in the Top-Level column. The answer should be 208, but you (hopefully obviously) shouldn't just print that number literally.

print(len(logs['Top-Level'].unique()))

Write one or more statements in the cell below that prints out how many requests were made for the top-level folder named academics.

print((logs['Top-Level'] == 'academics').sum())

Write one or more statements in the cell below to get a Series object that summarizes how many requests occurred for each of the top 5 most frequently accessed values of Top-Level. Exclude the two values _resources and graphics when picking the top 5 most frequently accessed top-level folders.

Hint: first create a subset of logs that excludes the rows where the Top-Level column is either _resources or graphics. Then use the value_counts method on that subset, followed by a slicing operation. value_counts always returns values sorted in descending order.
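As a quick illustration of that behavior (on a made-up Series, not the logs data):

s = pd.Series(["news", "news", "admissions", "news", "admissions", "academics"])
print(s.value_counts())        # news 3, admissions 2, academics 1, in that order
print(s.value_counts()[0:2])   # slicing keeps just the top 2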

not_excluded = (logs['Top-Level'] != '_resources') & (logs['Top-Level'] != 'graphics')
top_level = logs['Top-Level']
most_frequent_top_level_values = top_level[not_excluded].value_counts()[0:5]
print(most_frequent_top_level_values)

Write one or more statements in the cell below that create a tuple named date_range which contains the smallest and largest values in the Date column for logs. Print out the value of date_range after creating the tuple.

date_range = (logs['Date'].min(), logs['Date'].max())
print(date_range)

The values in date_range will be Pandas Timestamp objects. These objects have a strftime method that works like the standard Python datetime.strftime function we have used before.

For example, we can print out just the year for the first element in date_range like this:

print(date_range[0].strftime("%Y"))

Write one or more statements in the cell below that creates a string variable named date_range_str. The value of date_range_str should look like Apr 01, 2021 - Apr 05, 2021, where the two dates are replaced by the values in date_range. Use the strftime documentation to determine the correct formatting codes.

Print out the value of date_range_str once you've created it.
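If you want to check a format string before applying it to date_range, a quick experiment with a made-up Timestamp works well; %b is the abbreviated month name, %d the zero-padded day of the month, and %Y the four-digit year:

ts = pd.Timestamp("2021-04-01")
print(ts.strftime("%b %d, %Y"))   # prints Apr 01, 2021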

format_with_year = "%b %d, %Y"
date_range_str = f"{date_range[0].strftime(format_with_year)} - {date_range[1].strftime(format_with_year)}"
print(date_range_str)

Write code in the cell below so that if the year components of the two values in date_range are the same, date_range_str only includes the year at the end.

For example, Apr 01 - Apr 05, 2021 would be the output for the example above. Otherwise, leave the value of date_range_str as it was set in the cell above. Print the value of date_range_str at the end of the cell.

If ts is a Timestamp object, you can get the year associated with ts like this:

ts.year

format_without_year = "%b %d"
if date_range[0].year == date_range[1].year:
    date_range_str = f"{date_range[0].strftime(format_without_year)} - {date_range[1].strftime(format_with_year)}"
print(date_range_str)

Use the "recipe" described in the textbook on page 279 in the cell below to produce a horizontal bar chart showing the number of requests per top-level value for those top 5 most frequent top-level folders.

  • The chart title should be Most frequent top-level folders date_range, where date_range is replaced with the contents of the date_range_str variable you created above.

  • The color of the bars should be the string #036 which is a Hope College blue color.

  • The x-axis label should be Number of requests.

  • The y-axis label should be Top-level folder.

I set the chart title and color using arguments to the Pandas plotting function. For the axis labels, I saved the axes object returned by the Pandas plotting function into a variable, and then called methods on that object to specify the x-axis and y-axis labels. There is example code to set the x-axis label on page 268 of the textbook; similar code can be used to set the y-axis label.

import matplotlib.pyplot as plt

axes = most_frequent_top_level_values.plot.barh(
    title=f"Most frequent top-level folders {date_range_str}", color="#036")
axes.set_xlabel("Number of requests")
axes.set_ylabel("Top-level folder")

Part 3 - Recreating location_summary using GroupBy

In our previous work with this data, we computed two columns named On campus and Off campus that have boolean values. Pandas calls these dummy variables. To simplify some of our computations in the following cells, write code in the cell below that creates a new column named Request location. The values in Request location should be On campus if the value of the On campus column is True, otherwise the value should be Off campus.

I did this using map, but it can also be done using a series of assignment statements. If you choose to do it using map, you will find conditional expressions useful in implementing a lambda function. Or you can also define a function to be passed to map instead of using a lambda function.
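For reference, here is a sketch of the assignment-statement alternative, using loc with the boolean On campus column directly as the mask; it produces the same result as the map solution below:

logs['Request location'] = "Off campus"
logs.loc[logs["On campus"], "Request location"] = "On campus"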

logs['Request location'] = logs["On campus"].map(lambda r: "On campus" if r else "Off campus")

Write a statement in the cell below that displays only the values of the IP address, On campus, Off campus, and Request location columns for the first 10 rows in logs.

logs[["IP address", "On campus", "Off campus", "Request location"]].head(n=10)

Write one or more statements in the cell below that use groupby and count to create a Series named location_count. location_count should contain the number of requests made from each type of location. Print out the value of location_count after you create it.

You should see approximately 38K requests from off-campus, and 31K requests from on-campus.

grouped_by_location = logs.groupby("Request location")
location_count = grouped_by_location['IP address'].count()
print(location_count)

Repeat the above to create Series objects named location_total, location_min, location_max, and location_mean that compute the corresponding statistics for the values in the Size column using the groupby method.

Print the value in location_total in millions to check your work. You should get approximately 2625M bytes from off campus, and 3376M bytes from on campus.

location_total = grouped_by_location['Size'].sum()
location_min = grouped_by_location['Size'].min()
location_max = grouped_by_location['Size'].max()
location_mean = grouped_by_location['Size'].mean()
print(location_total / 1e6)

Now create a new data frame named location_summary with the columns Count, Min size, Max size, Mean size, and Total containing the Series objects created above.

Use the head method to display both rows.

location_summary = pd.DataFrame({
    'Count': location_count,
    'Min size': location_min,
    'Max size': location_max,
    'Mean size': location_mean,
    'Total': location_total
})
location_summary.head(n=2)

Part 4 - Grouping by multiple columns

The following code illustrates the process of grouping by multiple columns, and then generating a bar chart from that data.

First, we generate a new DataFrame that contains only those rows whose Top-Level folder is one of the 5 most frequent folders.

parts = [logs[logs['Top-Level'] == value] for value in most_frequent_top_level_values.index]
top_folder_requests = pd.concat(parts, axis=0)

Now we can

  • Group top_folder_requests by the Top-Level and Request location columns.

  • Compute a count for each combination of values in these columns.

  • Display the resulting DataFrame.

grouped_by_level_and_request = top_folder_requests.groupby(['Top-Level', 'Request location'])
counts = grouped_by_level_and_request.count()
counts[['IP address', 'Date']]

Note that the index for counts is a multi-level index (also called a hierarchical index).

counts.index

In order to generate a stacked bar chart, we need to take one level of the index and create a column for each value of that level. This is what the unstack method does. In this case, it will create new columns named Off campus and On campus, containing the counts associated with those values.
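To see what unstack does, here is a small self-contained example (with made-up counts, independent of logs):

idx = pd.MultiIndex.from_tuples([
    ("academics", "Off campus"), ("academics", "On campus"),
    ("news", "Off campus"), ("news", "On campus")])
s = pd.Series([30, 40, 10, 20], index=idx)
print(s.unstack())
#            Off campus  On campus
# academics          30         40
# news               10         20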

unstacked = counts["IP address"].unstack()

Now that we have columns for Off campus and On campus, it's straightforward to create a couple of bar charts from the unstacked data.

fig, axes = plt.subplots(1, 2)
unstacked.plot.bar(ax=axes[0], stacked=True)
unstacked.plot.bar(ax=axes[1], stacked=False)

Your final task is to generate the chart shown below.

grouped_by_device_type_and_request_location = logs.groupby(["Device type", "Request location"])
counts = grouped_by_device_type_and_request_location["IP address"].count() / 1000
unstacked = counts.unstack()
unstacked.head()

plot = unstacked.transpose()[["Desktop", "Mobile"]].plot.bar(
    stacked=False,
    title="Distribution of requests by device type and request location",
    rot=0)
plot.set_ylabel("Number of requests (thousands)")

The two cells below aren't part of the solution; they were just used to generate an image and data file for students to use while completing the assignment.

plot.get_figure().savefig("Location_By_Device_Type.png")
logs.to_pickle('final-logs.pkl')
most_frequent_top_level_values.to_pickle('most_frequest_top_level_values.pkl')