
Overview

This exercise will give you some practice working with Pandas, and DataFrames in particular. You will be loading and processing information from one week's worth of server logs for www.hope.edu. Server logs contain information about the pages that are downloaded from a web server, including

  • the date and time the request was made

  • the specific resource that was requested

  • a status code indicating whether the request was successful or not

  • how much data was returned

  • the IP address of the device making the request

  • and more

You will first work with exploring and processing a single day's logs. Once that process can be completed successfully you'll create one DataFrame instance per day and concatenate them together into a single DataFrame. Finally, you'll perform some queries on the data in the aggregated DataFrame.

Part 1 - Exploring the data

Add a statement to the cell below that imports the pandas library using the alias pd

import pandas as pd

Now it's time to do some exploration of the data in the file. This is often necessary when you're importing data from an external source: you first have to get a sense of the data before you know exactly which parameter values to pass to one of Pandas' read functions.

Write a statement in the blank cell below that calls pd.read_csv passing the following named arguments:

  • filepath_or_buffer: the string www_hope_edu_access.log.1.gz

  • delim_whitespace: True. This option causes read_csv to break the file up into columns based on the presence of whitespace characters such as Space and Tab.

  • nrows: 5. This will restrict how many rows are read by read_csv.

Assign the DataFrame returned by pd.read_csv to a variable named sample and display the contents of sample to see what it looks like.

sample = pd.read_csv(filepath_or_buffer='www_hope_edu_access.log.1.gz', delim_whitespace=True, nrows=5)
sample

One thing that may surprise you is that the 6th column (labeled GET /news/2010/...) is treated as one column, even though its values contain spaces. This happens because those values are contained within double quotes in the data file, and read_csv will treat anything in double-quotes as a single value.

To see this, here are the first few rows from the "raw" file. The request is the quoted string starting with GET.

209.140.194.21 - - [12/Apr/2021:06:25:14 -0400] "GET /news/2010/09/07/index.xml HTTP/1.1" 404 46089 "-" "-"
209.140.194.21 - - [12/Apr/2021:06:25:14 -0400] "GET /_resources/php/news/aggregator.php?path=/news/2010/09/07 HTTP/1.1" 200 22338 "-" "-"
207.178.216.39 - - [12/Apr/2021:06:25:14 -0400] "GET /news/2010/09/07/visiting-writers-series-to-feature-wayne-miller.html HTTP/1.1" 200 14785 "-" "Mozilla/4.0 (compatible; T-H-U-N-D-E-R-S-T-O-N-E)"

It should be clear from the example data above that the data file doesn't contain a header row at the beginning. By default, read_csv uses the first row in the file as the column names, so we will need to assign the column names manually.
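If you'd like to see the quoting behavior in isolation, here is a minimal sketch using a made-up in-memory string instead of the log file. It shows that a double-quoted field stays together even when splitting on whitespace (the names argument, which supplies column names, is introduced in the next step):

import io
import pandas as pd

raw = '1.2.3.4 "GET /index.html HTTP/1.1" 200\n'
pd.read_csv(io.StringIO(raw), delim_whitespace=True, names=['IP', 'Request', 'Status'])
# One row, three columns: '1.2.3.4', 'GET /index.html HTTP/1.1', 200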

Copy and paste the contents of the previous code cell into the cell below, after the declaration of column_headers. Add the argument names with the value column_headers to the call to pd.read_csv and execute the cell.

column_headers = ['IP address', 'Unknown 1', 'Unknown 2', 'Date', 'TZ offset', 'Request', 'Status', 'Size', 'Unknown 3', 'User Agent']
sample = pd.read_csv(filepath_or_buffer='www_hope_edu_access.log.1.gz', delim_whitespace=True, nrows=5, names=column_headers)
sample

That's better!

It seems likely that the log file is using the character - to represent a missing value. Copy and paste the two lines that read and display the file from the cell above into the blank cell below. Then do the following:

  • Add the named argument na_values with the string value '-' to the call to pd.read_csv. This tells Pandas to treat that string as a missing value.

  • Change the value passed as the nrows argument to be 5000.

  • Instead of displaying the whole DataFrame after the call to pd.read_csv, call the head method on sample, to restrict how many rows we see.

Execute the cell once you've made the above changes.

sample = pd.read_csv(filepath_or_buffer='www_hope_edu_access.log.1.gz', delim_whitespace=True, nrows=5000, names=column_headers, na_values='-')
sample.head()

Let's see how many of the values in the 3 "Unknown" columns contain values other than NaN. In the blank cell below, write and execute a statement that calls the info method on the sample DataFrame.

sample.info()

Examining the output from the info method, it seems clear we don't need to keep the columns Unknown 1 and Unknown 2, since all of the values in those columns are missing (at least in this sample of 5000 rows).

The column Unknown 3, however, is probably worth keeping because only about half of its values are missing.

Write several statements in the cell below that:

  • Creates a set named columns_to_include from the list named column_headers. You create a set just like you would create a list.

  • Calls the remove method on columns_to_include to remove the value Unknown 1.

  • Calls the remove method on columns_to_include to remove the value Unknown 2.

  • Prints out the value of columns_to_include.

The output should look like this:

{'Date', 'IP address', 'Request', 'Size', 'Status', 'TZ offset', 'Unknown 3', 'User Agent'}

columns_to_include = set(column_headers)
columns_to_include.remove("Unknown 1")
columns_to_include.remove("Unknown 2")
columns_to_include

Now you can use columns_to_include to tell pd.read_csv which columns you want to keep in the DataFrame it creates.

Copy and paste the code you wrote the last time you called pd.read_csv into the blank cell below. Then edit the call to pd.read_csv to add the named argument usecols whose value is columns_to_include.

sample = pd.read_csv(filepath_or_buffer='www_hope_edu_access.log.1.gz', delim_whitespace=True, nrows=5000, names=column_headers, usecols=columns_to_include, na_values='-')
sample

Determining the contents of column "Unknown 3"

Write code in the cell below to get a Series object containing just the Unknown 3 column values. Store the Series in a variable named Unknown_3.

Unknown_3 = sample['Unknown 3']

Now write a statement that calls the unique method on Unknown_3. unique returns a NumPy array. Use an array slice so you see only the first 10 values from that column.

Unknown_3.unique()[0:10]

It looks like the values in the Unknown_3 column are URLs. They are likely the referer value, which is set when a hyperlink is followed to a page on the Hope web site.

Write code in the cell below to replace the value Unknown 3 in the list column_headers with the value Referer. You can do this by:

  • Calling the index method on column_headers, passing it the string argument Unknown 3.

  • Writing an assignment statement to assign the value in column_headers at the index returned by the index method to Referer.

  • Printing out the value of column_headers to verify your code worked correctly.

index = column_headers.index("Unknown 3")
column_headers[index] = "Referer"
print(column_headers)

You also need to remove Unknown 3 from the set columns_to_include and add Referer to that same set. Write code in the cell below to do these things. Print out the value of columns_to_include when you're done to verify it worked correctly.

columns_to_include.remove('Unknown 3')
columns_to_include.add('Referer')
print(columns_to_include)

Copy and paste the code you last used to read in the contents of the first log file into the cell below and execute it to verify it does what you want at this point.

sample = pd.read_csv(filepath_or_buffer='www_hope_edu_access.log.1.gz', delim_whitespace=True, nrows=5000, names=column_headers, usecols=columns_to_include, na_values='-')
sample

Execute this code to see some of the log entries where the Referer column has a value.

# referer_present will have the same number of rows as sample, with a True or False value
referer_present = pd.notnull(sample['Referer'])
# This will get us just the entries in sample where the corresponding value in referer_present is True
has_referer = sample[referer_present]
has_referer.head()
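If the boolean-mask pattern above is new to you, here is a tiny self-contained sketch of the same idea with made-up values (not part of the exercise):

import pandas as pd
import numpy as np

urls = pd.Series([np.nan, 'http://example.com', np.nan])
mask = pd.notnull(urls)   # Series of True/False values: [False, True, False]
urls[mask]                # keeps only the row where mask is True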

Fixing the Date and TZ offset columns

You may have noticed that the Date column values begin with a '[' character, and the TZ offset column values end with a ']' character. Execute the cell below to fix up the Date column.

sample['Date'] = sample['Date'].map(lambda value: value[1:])
sample.head()

The code lambda value: value[1:] is called a lambda function and is a shorthand way to do this:

def strip_leading_character(value):
    return value[1:]
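Either form works with map. As a small illustrative aside (using a made-up two-element Series, not part of the exercise), these two calls produce the same result:

import pandas as pd

def strip_leading_character(value):
    return value[1:]

dates = pd.Series(['[12/Apr/2021', '[13/Apr/2021'])
dates.map(strip_leading_character)   # same result as the next line
dates.map(lambda value: value[1:])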

Write and execute code similar to the cell above in the cell below to remove the trailing ']' from the TZ offset column.

sample['TZ offset'] = sample['TZ offset'].map(lambda v: v[:-1])
sample.head()

Execute the cell below to convert the values in the Date column to be actual dates, rather than strings. The format string in the call to pd.to_datetime uses the same conversion characters as strftime, which we saw in Homework 2.

The output of sample.info() should indicate that the Date column now has the datetime64 data type.

sample['Date'] = pd.to_datetime(sample['Date'], format="%d/%b/%Y:%H:%M:%S")
sample.info()
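If it helps to see the format string in isolation, here is a minimal sketch that parses one made-up timestamp string with the same conversion characters:

import pandas as pd

# %d = day of month, %b = abbreviated month name, %Y = 4-digit year
pd.to_datetime("12/Apr/2021:06:25:14", format="%d/%b/%Y:%H:%M:%S")
# -> Timestamp('2021-04-12 06:25:14')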

Breaking up the URL into parts

Review and then execute the code in the cell below. It creates a function named url_to_parts, which uses a function named urlparse from the urllib.parse module to break up a string representing a URL into its various components. It returns a tuple containing those parts.

from urllib.parse import urlparse
import numpy as np

def url_to_parts(url):
    if not pd.isnull(url):
        parsed_url = urlparse(url)
        return (parsed_url.netloc, parsed_url.path, parsed_url.query)
    else:
        return (np.nan, np.nan, np.nan)

url_to_parts('http://www.hope.edu/academics/computer-science/')
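For reference, this small sketch shows what urlparse reports for the example URL above, i.e. the three attributes that url_to_parts keeps:

from urllib.parse import urlparse

parts = urlparse('http://www.hope.edu/academics/computer-science/')
parts.netloc   # 'www.hope.edu'
parts.path     # '/academics/computer-science/'
parts.query    # '' (this URL has no query string)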

Now you can set up a new column named Referer_Domain using the code below.

# What type will referer_parts be?
referer_parts = sample['Referer'].map(url_to_parts)
sample['Referer_Domain'] = referer_parts.map(lambda ref: ref[0])
sample[pd.notnull(sample['Referer'])].head()

Add code in the cell below that will add two more columns named Referer_Path and Referer_Query to the sample DataFrame. The values for these two columns will be the second two values in referer_parts.

sample['Referer_Path'] = referer_parts.map(lambda ref: ref[1])
sample['Referer_Query'] = referer_parts.map(lambda ref: ref[2])
# Add your code BEFORE this line
sample[pd.notnull(sample['Referer'])].head()

Breaking up the Request into its parts

Now we want to similarly separate the values in the Request column, which consists of 3 parts, separated by a space:

  • the HTTP request method (e.g. GET)

  • the resource being requested (e.g. /admissions)

  • the HTTP version (e.g. HTTP/1.1)

Write code in the cells below to:

  • Create a Series named request_parts by mapping the Request column with a lambda function that splits the Request into parts

  • Use the pattern from above to create new columns in sample named Method, Resource, and HTTP_Version

  • Display the first few rows in sample

request_parts = sample['Request'].map(lambda r: r.split(' '))
sample['Method'] = request_parts.map(lambda r: r[0])
sample['Resource'] = request_parts.map(lambda r: r[1])
sample['HTTP_Version'] = request_parts.map(lambda r: r[2])
sample.head()

Write code in the blank cell below to use the drop method to remove the original Request column from the sample DataFrame. You'll want to give a value for the axis argument to make sure a column is dropped rather than a row, since rows are the default axis.

By default, drop doesn't alter the DataFrame it operates on. You'll have to assign the value returned by .drop back to sample to actually change sample.

sample = sample.drop('Request', axis=1)
sample.head()
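As an aside, drop also accepts a columns keyword, which avoids having to remember the axis number. The cell above could equivalently have been written as shown below (for reference only; running it a second time would fail, since Request has already been dropped):

sample = sample.drop(columns='Request')   # same effect as sample.drop('Request', axis=1)
sample.head()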

Finally, examine the code in the cell below and make a prediction about what it will do. Then execute and check your prediction.

rearranged_columns = list(sample.columns)
rearranged_columns.remove("Resource")
rearranged_columns.insert(0, "Resource")
sample = sample[rearranged_columns]
sample.head()

Part 2 -- Building a function to process a log file

Your next task is to complete the function in the cell below, which puts all of the pieces you wrote above together so they can be executed as a single step. In particular, we want to:

  1. Use pd.read_csv to read in rowsToRead lines of an access log file, assigning the final column names we determined above, and using '-' as a NaN character. The function definition below gives the default value None to rowsToRead. When pd.read_csv is given None as its nrows argument, it will read all the rows.

  2. Uses map to remove the leading '[' from the Date column.

  3. Uses pd.to_datetime to convert the values in the Date column to datetime64.

  4. Uses map to remove the trailing ']' from the TZ offset column.

  5. Creates the columns Referer_Domain, Referer_Path, and Referer_Query.

  6. Creates the columns Method, Resource, and HTTP_Version. There may be requests that don't include an HTTP_Version, so when you split the Request column value, you only get 2 results rather than 3. Use the code below to generate the values in the HTTP_Version column.

    request_parts.map(lambda r: r[2] if len(r) == 3 else np.nan)
  7. Drops the Request column.

  8. Reorders the columns by making the Resource column the first column in the DataFrame.

  9. Returns the completed DataFrame.

def load_access_log(filepath, rowsToRead=None):
    column_headers = ['IP address', 'Unknown 1', 'Unknown 2', 'Date', 'TZ offset', 'Request', 'Status', 'Size', 'Referer', 'User Agent']
    columns_to_include = ['IP address', 'Date', 'TZ offset', 'Request', 'Status', 'Size', 'Referer', 'User Agent']
    df = pd.read_csv(filepath_or_buffer=filepath, delim_whitespace=True, names=column_headers, usecols=columns_to_include, na_values='-', nrows=rowsToRead)
    df['Date'] = pd.to_datetime(df['Date'].apply(lambda d: d[1:]), format="%d/%b/%Y:%H:%M:%S")
    df['TZ offset'] = df['TZ offset'].map(lambda tz: tz[:-1])
    referer_parts = df['Referer'].map(url_to_parts)
    df['Referer_Domain'] = referer_parts.apply(lambda ref: ref[0])
    df['Referer_Path'] = referer_parts.apply(lambda ref: ref[1])
    df['Referer_Query'] = referer_parts.apply(lambda ref: ref[2])
    request_parts = df['Request'].map(lambda r: r.split(' '))
    df['Method'] = request_parts.map(lambda r: r[0])
    df['Resource'] = request_parts.map(lambda r: r[1])
    df['HTTP_Version'] = request_parts.map(lambda r: r[2] if len(r) == 3 else np.nan)
    df = df.drop('Request', axis=1)
    # Make Resource the first column (step 8)
    rearranged_columns = list(df.columns)
    rearranged_columns.remove('Resource')
    rearranged_columns.insert(0, 'Resource')
    df = df[rearranged_columns]
    return df

In the cell below, call the load_access_log function, passing it the string www_hope_edu_access.log.1.gz as the argument, storing the returned DataFrame in a variable named access_log, and then calling the head method on access_log to see the first 5 rows.

You will likely get a warning when running the function. For the purposes of this exercise, you can safely ignore the warning.

access_log = load_access_log('www_hope_edu_access.log.1.gz')
access_log.head()

Part 3 -- Loading multiple files

Now that you have a function that can successfully load a single file, you'll use it to read and combine several different files into a single DataFrame.

Getting a list of the files to process

We want to process all the files in the current directory whose names contain access.log. Rather than hard coding the file names, you'll ask Python for a list containing those files. By using Python code to get the list, you can be sure your code will work regardless of how many files are present.

The code in the cell below uses a list comprehension to accomplish this task. It:

  • Executes the os.listdir function to get a list of all the files in the current directory

  • Only includes files whose name contains access.log

We haven't seen the use of an if within a list comprehension before. It does what you would hopefully expect: only includes items that pass the test specified in the if statement.

import os

log_files = sorted([file for file in os.listdir() if "access.log" in file])
print(log_files)
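An alternative approach (not used in this exercise) is the glob module, which matches filename patterns directly:

import glob

# Matches any file in the current directory whose name contains "access.log"
log_files = sorted(glob.glob('*access.log*'))
print(log_files)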

Writing a function to process a list of files

Fill in the body of the load_logs function below to process a list of files. The function takes two arguments:

  • the list of files to be processed

  • the number of rows within each file to process. As the load_access_log function did, load_logs defines a default value of None for the number of rows to process (nrows), indicating all rows should be processed.

The body of load_logs should:

  1. Use a list comprehension to create a list of DataFrame objects, one per file in log_files, by calling load_access_log for each file. Pass the name of the current file and the value of nrows to load_access_log.

  2. Call pd.concat to concatenate the DataFrame objects created by the list comprehension into a single DataFrame named access_log.

  3. Return access_log to the caller.

def load_logs(log_files, nrows=None):
    all_logs = [load_access_log(file, nrows) for file in log_files]
    access_log = pd.concat(all_logs, axis=0)
    return access_log
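One design note, as an aside: pd.concat keeps each file's original row labels by default, so the combined DataFrame's index restarts at 0 for every file. That's fine for this exercise, but if you wanted a fresh 0..N-1 index for the combined DataFrame, the concat line inside the function could instead be written as:

access_log = pd.concat(all_logs, axis=0, ignore_index=True)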

Now you're ready to call load_logs for the files you stored in log_files earlier. Write code in the cell below to do this, storing the resulting data frame in a variable named access_logs_mini. Pass the value 2 as the value of the nrows argument to load_logs.

access_logs_mini = load_logs(log_files, nrows=2)

Write a statement in the cell below that calls the head method on access_logs_mini, passing 14 as the value of the n argument.

If things worked successfully, you should see 2 entries from each of the days in the range April 6 - April 12, 2021.

access_logs_mini.head(n=14)

Assuming the above worked, repeat the process to create a DataFrame named all_logs containing ten thousand rows from each of the files (due to memory and processing limits on CoCalc, we can't grab them all, unfortunately).

all_logs = load_logs(log_files, nrows=1e4)

Part 4 - Analyzing the data

Now that we have all our data loaded, write statements in each of the cells below to complete the requested task.

Display the total number of rows in all_logs.

print(len(all_logs))

List the unique values in the Method column.

all_logs['Method'].unique()

Display the first 5 rows for which the Status column has the value 404, which corresponds to Not Found.

not_found = all_logs[all_logs['Status'] == 404]
not_found[0:5]

Of those values for which the Status was 404, which of them had a Referer_Domain of hope.edu? These are broken links/images within pages hosted by Hope College!

bad_hope_links = not_found[not_found['Referer_Domain'] == 'hope.edu']
bad_hope_links

Of the set of requests from above, how many different requested Resources were there?

print(len(bad_hope_links['Resource'].unique()))

Execute the code in the cell below this one. Then explain what the output represents in the cell below the code.

sum(all_logs['Resource'].map(lambda res: "computer-science" in res))

Replace this cell's contents with an explanation of the above output.

This code counts how many rows in all_logs have a Resource value containing the string computer-science. For the most part, these are requests for pages within the computer science portion of the Hope web site, although it's not a guarantee that other pages don't contain computer-science somewhere in their URL.
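For what it's worth, the same count can also be written with pandas' string accessor; the na=False argument makes any missing Resource values count as False instead of producing NaN (an illustrative alternative, not part of the exercise):

all_logs['Resource'].str.contains('computer-science', na=False).sum()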