
Part 1 - Complete the original access log exercise

As part of this assignment, you should complete the processing-access-logs.ipynb Jupyter notebook contained in the data-science/in-class-exercises/processing-access-logs folder (the one we've been working on in class).

Once you have that notebook completed, you should copy it from the folder it's currently in to the data-science/homework-4 folder. You can either use CoCalc's file manager to perform the copy, or you can execute the following command from a Terminal session.

cp ~/data-science/in-class-exercises/processing-access-logs/processing-access-logs.ipynb ~/data-science/homework-4

Part 2 - Working with the status code column

Execute the 2 cells below to import our usual data science libraries, and then load the final version of the access log DataFrame from our access log processing exercise.

import pandas as pd
import numpy as np
logs = pd.read_pickle('access_logs.pkl')

Despite our best efforts in the previous exercise, it appears that some rows weren't processed correctly. To see this, write and execute a statement in the cell below that displays each of the different values in the Status column. Look back at the code from the original exercise if you need a refresher on how to do this.

logs['Status'].unique()

You should see a mixture of numbers, strings containing numbers, and what looks like an HTTP version number. Write a statement in the cell below that will display the number of rows whose Status column is set to this HTTP version number. The output should look like this:

There are 5 rows with the bad status value 'HTTP/1.1"'
bad_status = logs['Status'] == 'HTTP/1.1"'
print(f"There are {sum(bad_status)} rows with the bad status value 'HTTP/1.1\"'")

Since there are only 5 of them, write a statement in the cell below to display all 5 of the rows to get a sense for what they look like.

logs[bad_status].head(n=5)

You should see that they are all requests from the same IP address and for similar, but not identical, resources. To see this, write code in the cell below that displays just those two columns for these 5 rows.

logs[bad_status][['IP address', 'Resource']].head(n=5)

Review the process used in the first part of the exercise to create the Referer_Domain, Referer_Path, and Referer_Query columns. Then use a similar process to break the Resource column into Resource_Path and Resource_Query portions, separated at the ? character.

Display the values of the Resource_Path and Resource_Query columns just for the 5 rows you've been examining.

from urllib.parse import urlparse

def parse_resource(url):
    if not pd.isnull(url):
        parsed_url = urlparse(url)
        return (parsed_url.path, parsed_url.query)
    else:
        return (np.nan, np.nan)

parts = logs.loc[:, 'Resource'].map(parse_resource)
logs.loc[:, 'Resource_Path'] = parts.map(lambda r: r[0])
logs.loc[:, 'Resource_Query'] = parts.map(lambda r: r[1])
logs.loc[bad_status, ['Resource', 'Resource_Path', 'Resource_Query']]
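If you'd prefer to split at the literal ? character rather than reuse urlparse, a minimal pandas-only sketch like the one below should also work. One difference to be aware of: rows whose resource has no ? end up with None in the query column instead of an empty string.

# Split each Resource at the first '?' only; expand=True produces two columns.
resource_parts = logs['Resource'].str.split('?', n=1, expand=True)
logs['Resource_Path'] = resource_parts[0]
logs['Resource_Query'] = resource_parts[1]  # None (not '') when there is no '?'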

Next, use the apply function on the 5 bad rows from above to create a new Series object named address_and_resource (not a new column in the DataFrame named logs). The values in address_and_resource should be tuples whose first component is the value of the IP address column, and whose second component is the value of the Resource_Path column.

Write 2 separate print statements to display the components of the first tuple in address_and_resource, like this:

IP address: XX.YY.WW.ZZ
Resource path: RRRRR

where XX.YY.WW.ZZ is replaced with the originating IP address, and RRRRR is replaced with the requested resource.

address_and_resource = logs[bad_status].apply(lambda r: (r['IP address'], r['Resource_Path']), axis=1)
print(f"IP address: {address_and_resource.iloc[0][0]}")
print(f"Resource path: {address_and_resource.iloc[0][1]}")

Finally, write a loop that iterates over the unique tuples in address_and_resource, printing out a statement like the one below for each unique combination of IP address and Resource_Path:

XX.YY.WW.ZZ requested resource RRRRR

where XX.YY.WW.ZZ is replaced with the originating IP address, and RRRRR is replaced with the requested resource. You should find that there's only one unique combination of IP address and resource.

for (IP_address, Resource_Path) in address_and_resource.unique():
    print(f'{IP_address} requested resource {Resource_Path}')

If you're interested in what's happening here, a quick Google search indicates this is likely an attempt to execute malicious code on our web server!

Write a statement in the cell below so that logs contains all the rows except for these 5 bad requests. Print out the number of rows in logs when you are done as a sanity check. It should be 69,995.

logs = logs[np.logical_not(bad_status)]
print(f"{len(logs):,}")

Write a statement in the cell below that uses the DataFrame method astype to convert the Status column to be an int64 instead of its current type of object.

Read the documentation on DataFrame.astype carefully before starting. In particular, note you'll want to pass a dictionary as the dtype argument to the astype method.

logs = logs.astype({'Status': 'int64'})
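If you want to see how the dictionary form behaves, here's a small self-contained demo; the demo DataFrame is made up purely for illustration and is separate from logs.

# Only the columns named in the dictionary are converted; others keep their type.
demo = pd.DataFrame({'Status': ['200', '404'], 'Method': ['GET', 'POST']})
demo = demo.astype({'Status': 'int64'})
print(demo.dtypes)  # Status is now int64, Method is still object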

Write a statement in the cell below that executes the info method on the logs DataFrame to verify the Status column's type has been changed.

logs.info()

Write one or more statements in the cell below to display the distinct values in the Status column of logs that represent errors. Any status in the range 400-499 represents an error. You should see 3 distinct error statuses: 404, 403, and 401.

errors = logs[(logs['Status'] >= 400) & (logs['Status'] < 500)]
errors['Status'].unique()

In the cell below, write one or more statements that create a new DataFrame named error_df containing a single column named Number of errors, with one row for each of these error codes. The index for error_df should be the status code values, and the index's name should be Status Code. Use the DataFrame.sort_values method to sort error_df in descending order by the number of requests in logs with that status code.

Do not assume the 3 codes you found will be the only error codes present. Your code should work regardless of how many distinct error codes there are. In my implementation, I used a list comprehension to create a list of counts for each value, and passed that list when creating error_df.

Use the head method to display the values in error_df after constructing it. It should look like this:

             Number of errors
Status Code
404                     24078
403                       326
401                        31
counts = [sum(logs['Status'] == code) for code in errors['Status'].unique()]
error_df = pd.DataFrame(
    data={"Number of errors": counts},
    index=errors['Status'].unique()
).sort_values(by="Number of errors", ascending=False)
error_df.index.name = "Status Code"
error_df.head()
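As an alternative sketch (not the approach described above), value_counts already counts each status and sorts the result in descending order, so the same table can be built in one short chain; this assumes the errors DataFrame created earlier.

# value_counts returns counts sorted in descending order by default.
error_df = errors['Status'].value_counts().to_frame(name='Number of errors')
error_df.index.name = 'Status Code'
error_df.head()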

Part 3 - Looking at where our requests came from

Next you'll use the IP address to categorize whether the device making each request was on-campus or off-campus. IP addresses are generally represented as 4 numbers separated by periods. This format is often called dotted decimal notation. In reality, each IP address is a distinct series of 32 zeros and ones; we use dotted decimal notation to make IP addresses easier to write and remember!
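To make the connection between the two representations concrete, here's a short illustrative sketch (not part of the assignment); the specific address is just an example.

# A dotted-decimal address is a readable spelling of a single 32-bit number.
addr = '209.140.192.21'
octets = [int(part) for part in addr.split('.')]
as_int = (octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]
print(f'{addr} is the 32-bit value {as_int} ({as_int:032b} in binary)')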

All of Hope's IP addresses start with 209.140. The 3rd value is 192 or higher. The final value can be anything.

In the cell below, write a function named isHopeAddress which takes a single string value as an argument, and returns True if it meets the requirements described above, and False otherwise. My implementation:

  • uses split to break the string into parts

  • uses int to convert each part to an integer

  • uses if statements to decide whether the address is a Hope address or not.

def isHopeAddress(addr):
    parts = addr.split(".")
    if int(parts[0]) != 209 or int(parts[1]) != 140:
        return False
    else:
        return int(parts[2]) >= 192

You can execute the cell below to test your implementation of isHopeAddress.

print(f'isHopeAddress("209.140.192.21") should be True, yours is {isHopeAddress("209.140.192.21")}')
print(f'isHopeAddress("209.140.195.210") should be True, yours is {isHopeAddress("209.140.195.210")}')
print(f'isHopeAddress("209.140.191.21") should be False, yours is {isHopeAddress("209.140.191.21")}')
print(f'isHopeAddress("209.104.195.100") should be False, yours is {isHopeAddress("209.104.195.100")}')
print(f'isHopeAddress("208.140.200.1") should be False, yours is {isHopeAddress("208.140.200.1")}')
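As an aside, Python's standard-library ipaddress module can express the same rule as a network-membership test, since "starts with 209.140 and has a third number of 192 or higher" is exactly the CIDR block 209.140.192.0/18. A sketch of that alternative:

import ipaddress

# All addresses whose first 18 bits match 209.140.192.0,
# i.e. 209.140.192.0 through 209.140.255.255.
HOPE_NETWORK = ipaddress.ip_network('209.140.192.0/18')

def isHopeAddress(addr):
    return ipaddress.ip_address(addr) in HOPE_NETWORK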

Now that you have a working isHopeAddress, use it in combination with map to create two new columns in logs:

  • On campus: Has a value of True for addresses that are on Hope's campus, and a value of False for off-campus addresses

  • Off campus: Has values that are the opposite of On campus

logs['On campus'] = logs['IP address'].map(isHopeAddress)
logs['Off campus'] = np.logical_not(logs['On campus'])

Write a statement in the cell below that displays the IP address, On campus, and Off campus columns for the first 10 rows of logs. Coincidentally, you'll see the first 2 and last 2 requests are from on-campus devices, while the middle 6 are from off-campus devices.

logs[['IP address', 'On campus', 'Off campus']].head(n=10)

In the cell below, write code to display a summary of some of the columns in logs broken down into on-campus and off-campus requests. Store the output in a new DataFrame named location_summary. The values of location_summary's index should be the values On campus and Off campus. The columns in location_summary should be:

  • Number of requests

  • Number of distinct resources

  • Distinct IPs

  • Smallest response

  • Largest response

  • Average response

In addition to these columns, location_summary should also contain columns that display the percentage of requests made for each distinct value in the Method column. You shouldn't hard-code these column names; instead, use the value_counts method to get a listing of the possible values. You may assume that both on- and off-campus requests contain the same set of request methods.

After creating location_summary, use the head method to display its contents. Here's the output I got:

            Number of requests  Number of distinct resources  Distinct IPs  Smallest response  Largest response  Average response        GET      POST      HEAD
On campus                31608                          2490           307                273           9750825     106820.740192  99.958871  0.025310  0.015819
Off campus               38387                          7756          3212                  0          54742549      68383.381692  99.184620  0.054706  0.760674

on_campus_mask = logs['On campus']
off_campus_mask = np.logical_not(on_campus_mask)
on_campus_requests = logs[on_campus_mask]
off_campus_requests = logs[off_campus_mask]

data = {
    "Number of requests": [0, 0],
    "Number of distinct resources": [0, 0],
    "Distinct IPs": [0, 0],
    "Smallest response": [0, 0],
    "Largest response": [0, 0],
    "Average response": [0, 0]
}
location_summary = pd.DataFrame(data, index=["On campus", "Off campus"])

request_types = [('On campus', on_campus_requests), ('Off campus', off_campus_requests)]
for location, requests in request_types:
    location_summary.loc[location, 'Number of requests'] = len(requests)
    location_summary.loc[location, 'Number of distinct resources'] = len(requests['Resource'].unique())
    location_summary.loc[location, 'Distinct IPs'] = len(requests['IP address'].unique())
    location_summary.loc[location, 'Smallest response'] = requests['Size'].min()
    location_summary.loc[location, 'Largest response'] = requests['Size'].max()
    location_summary.loc[location, 'Average response'] = requests['Size'].mean()
    method_counts = requests['Method'].value_counts()
    for method in method_counts.index:
        location_summary.loc[location, method] = method_counts[method] / location_summary.loc[location, 'Number of requests'] * 100

location_summary.head(n=2)
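As a closing aside, value_counts(normalize=True) returns each method's share as a fraction, so the inner percentage loop above could be written a bit more directly. A sketch, reusing request_types and location_summary from the cell above:

for location, requests in request_types:
    # normalize=True yields fractions; multiply by 100 to get percentages.
    method_pcts = requests['Method'].value_counts(normalize=True) * 100
    for method, pct in method_pcts.items():
        location_summary.loc[location, method] = pct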
on_campus_mask = logs['On campus'] == True off_campus_mask = np.logical_not(on_campus_mask) on_campus_requests = logs[on_campus_mask] off_campus_requests = logs[off_campus_mask] data = { "Number of requests": [0, 0], "Number of distinct resources": [0, 0], "Distinct IPs": [0, 0], "Smallest response": [0, 0], "Largest response": [0, 0], "Average response": [0, 0] } location_summary = pd.DataFrame(data, index = ["On campus", "Off campus"]) request_types = [('On campus', on_campus_requests), ('Off campus', off_campus_requests)] for location, requests in request_types: location_summary.loc[location, 'Number of requests'] = len(requests) location_summary.loc[location, 'Number of distinct resources'] = len(requests['Resource'].unique()) location_summary.loc[location, 'Distinct IPs'] = len(requests['IP address'].unique()) location_summary.loc[location, 'Smallest response'] = requests['Size'].min() location_summary.loc[location, 'Largest response'] = requests['Size'].max() location_summary.loc[location, 'Average response'] = requests['Size'].mean() method_counts = requests['Method'].value_counts() for method in method_counts.index: location_summary.loc[location, method] = method_counts[method] / location_summary.loc[location, 'Number of requests'] * 100 location_summary.head(n=2)