Overview
This exercise will give you some practice working with Pandas, and with DataFrames in particular. You will be loading and processing information from one week's worth of server logs for www.hope.edu. Server logs contain information about the pages that are downloaded from a web server, including:

- the date and time the request was made
- the specific resource that was requested
- a status code indicating whether the request was successful or not
- how much data was returned
- the IP address of the device making the request
- and more
You will first work with exploring and processing a single day's logs. Once that process can be completed successfully, you'll create one DataFrame instance per day and concatenate them together into a single DataFrame. Finally, you'll perform some queries on the data in the aggregated DataFrame.
Part 1 - Exploring the data
Add a statement to the cell below that imports the pandas library using the alias pd.
Now it's time to do some exploration of the data in the file. This is often necessary when you're importing data from an external source: you first have to get a sense for the data before you know exactly which parameter values to pass to one of Pandas' read functions.
Write a statement in the blank cell below that calls pd.read_csv, passing the following named arguments:

- filepath_or_buffer: the string www_hope_edu_access.log.1.gz
- delim_whitespace: True. This option causes read_csv to break the file up into columns based on the presence of whitespace characters such as Space and Tab.
- nrows: 5. This will restrict how many rows are read by read_csv.

Assign the DataFrame returned by pd.read_csv to a variable named sample and display the contents of sample to see what it looks like.
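A minimal sketch of this first read follows. Because the real log file isn't available here, the sketch writes a tiny synthetic gzipped log first; also note that delim_whitespace=True is equivalent to sep=r"\s+" (recent pandas versions deprecate the former, so the sketch uses the latter).

```python
import gzip

import pandas as pd

# Write a tiny synthetic log so the sketch is self-contained; in the
# exercise you would read the real file, www_hope_edu_access.log.1.gz.
lines = (
    '1.2.3.4 - - [06/Apr/2021:00:00:01 -0400] "GET /news/2010/story.html HTTP/1.1" 200 5120 "-"\n'
    '5.6.7.8 - - [06/Apr/2021:00:00:02 -0400] "GET /admissions HTTP/1.1" 404 310 -\n'
)
with gzip.open("sample_access.log.gz", "wt") as f:
    f.write(lines)

# sep=r"\s+" splits on runs of spaces/tabs (same effect as
# delim_whitespace=True); quoted fields such as the request string
# are kept together as a single column.
sample = pd.read_csv(
    filepath_or_buffer="sample_access.log.gz",
    sep=r"\s+",
    nrows=5,
)
print(sample)
```

With no header arguments, read_csv treats the first log line as the column labels, which is exactly the problem discussed next.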
One thing that may surprise you is that the 6th column (labeled GET /news/2010/...) is treated as one column, even though its values contain spaces. This happens because those values are contained within double quotes in the data file, and read_csv will treat anything in double quotes as a single value.
To see this, here are the first few rows from the "raw" file. The request is the quoted string starting with GET.
It should be clear from the example data above that the data file doesn't contain any header rows at the beginning of the file. read_csv uses the first row in the file as the column index by default. We will need to manually assign the column names.
Copy and paste the contents of the previous code cell into the cell below, after the declaration of column_headers. Add the argument names with the value column_headers to the call to pd.read_csv and execute the cell.
That's better!
It seems likely that the log file is using the character - to represent a missing value. Copy and paste the two lines that read and display the file from the cell above into the blank cell below. Then do the following:

- Add the named argument na_values with the string value '-' to the call to pd.read_csv. This tells Pandas to treat that string as a missing value.
- Change the value passed as the nrows argument to be 5000.
- Instead of displaying the whole DataFrame after the call to pd.read_csv, call the head method on sample to restrict how many rows we see.

Execute the cell once you've made the above changes.
Let's see how many of the values in the 3 "Unknown" columns contain values other than NaN. In the blank cell below, write and execute a statement that calls the info method on the sample DataFrame.
Examining the output from the info method, it seems clear we don't need to keep the columns Unknown 1 and Unknown 2, since all of the values in those columns are missing (at least, all of them in this sample of 5000 rows). The column Unknown 3, however, is probably worth keeping, because only about half of its values are missing.
Write several statements in the cell below that:

- Create a set named columns_to_include from the list named column_headers. You create a set just like you would create a list.
- Call the remove method on columns_to_include to remove the value Unknown 1.
- Call the remove method on columns_to_include to remove the value Unknown 2.
- Print out the value of columns_to_include.

The output should look like this:
Now you can use columns_to_include to tell pd.read_csv which columns you want to keep in the DataFrame it creates. Copy and paste the code you wrote the last time you called pd.read_csv into the blank cell below. Then edit the call to pd.read_csv to add the named argument usecols whose value is columns_to_include.
Determining the contents of column "Unknown 3"
Write code in the cell below to get a Series object containing just the Unknown 3 column values. Store the Series in a variable named Unknown_3.
Now write a statement that calls the unique method on Unknown_3. unique returns a NumPy array; use an array slice so you see only the first 10 values from that column.
It looks like the values in the Unknown_3 column are URLs. They are likely the referer value, which is set when a hyperlink is followed to a page on the Hope web site.
Write code in the cell below to replace the value Unknown 3 in the list column_headers with the value Referer. You can do this by:

- Calling the index method on column_headers, passing it the string argument Unknown 3.
- Writing an assignment statement that assigns the value Referer to the element of column_headers at the index returned by the index method.
- Printing out the value of column_headers to verify your code worked correctly.
You also need to remove Unknown 3 from the set columns_to_include and add Referer to that same set. Write code in the cell below to do these things. Print out the value of columns_to_include when you're done to verify it worked correctly.
Copy and paste the code you last used to read in the contents of the first log file into the cell below and execute it to verify it does what you want at this point.
Execute this code to see some of the log entries where the Referer column has a value.
Fixing the Date and TZ offset columns
You may have noticed that the Date column values begin with a '[' character, and the TZ offset column values end with a ']' character. Execute the cell below to fix up the Date column.
The code lambda value: value[1:] is called a lambda function and is a shorthand way to do this:
Write and execute code similar to the cell above in the cell below to remove the trailing ']' from the TZ offset column.
Execute the cell below to convert the values in the Date column to be actual dates, rather than strings. The format string in the call to pd.to_datetime uses the same conversion characters as strftime, which we saw in Homework 2.
The output of sample.info() should indicate that the Date column now has the datetime64 data type.
Breaking up the URL into parts
Review and then execute the code in the cell below. It creates a function named url_to_parts, which uses a function named urlparse from the urllib.parse module to break up a string representing a URL into its various components. It returns a tuple containing those parts.
Now you can set up a new column named Referer_Domain using the code below.
Add code in the cell below that will add two more columns named Referer_Path and Referer_Query to the sample DataFrame. The values for these two columns will be the second two values in referer_parts.
Breaking up the Request into its parts
Now we want to similarly separate the values in the Request column, which consists of 3 parts, separated by a space:

- the HTTP request method (e.g. GET)
- the resource being requested (e.g. /admissions)
- the HTTP version (e.g. HTTP/1.1)
Write code in the cells below to:

- Create a Series named request_parts by mapping the Request column with a lambda function that splits the Request into parts.
- Use the pattern from above to create new columns in sample named Method, Resource, and HTTP_Version.
- Display the first few rows in sample.

Write code in the blank cell below to use the drop method to remove the original Request column from the sample DataFrame. You'll want to give a value for the axis argument to make sure a column is dropped, rather than a row, which is the default axis.
By default, drop doesn't alter the DataFrame it operates on. You'll have to assign the value returned by drop back to sample to actually change sample.
Finally, examine the code in the cell below and make a prediction about what it will do. Then execute and check your prediction.
Part 2 -- Building a function to process a log file
Your next task will be to complete the function in the cell below, which puts together all of the pieces you wrote above so they can be executed as a single task. In particular, we want to:

- Use pd.read_csv to read in rowsToRead lines of an access log file, assigning the final column names we determined above and using '-' as a NaN character. The function definition below gives the default value None to rowsToRead. When pd.read_csv is given None as its nrows argument, it will read all the rows.
- Use map to remove the leading '[' from the Date column.
- Use pd.to_datetime to convert the values in the Date column to datetime64.
- Use map to remove the trailing ']' from the TZ offset column.
- Create the columns Referer_Domain, Referer_Path, and Referer_Query.
- Create the columns Method, Resource, and HTTP_Version. There may be requests that don't include an HTTP version, so when you split the Request column value, you only get 2 results rather than 3. Use the code below to generate the values in the HTTP_Version column.
- Drop the Request column.
- Reorder the columns by making the Resource column the first column in the DataFrame.
- Return the completed DataFrame.
In the cell below, call the load_access_log function, passing it the string www_hope_edu_access.log.1.gz as the argument, storing the returned DataFrame in a variable named access_log, and then calling the head method on access_log to see the first 5 rows.
Part 3 -- Loading multiple files
Now that you have a function that can successfully load a single file, you'll use it to read and combine several different files into a single DataFrame.
Getting a list of the files to process
We want to process all the files in the current directory whose names contain access.log. Rather than hard coding the file names, you'll ask Python for a list containing those files. By using Python code to get the list, you can be sure your code will work regardless of how many files are present.
The code in the cell below uses a list comprehension to accomplish this task. It:

- Executes the os.listdir function to get a list of all the files in the current directory.
- Only includes files whose names contain access.log.
We haven't seen the use of an if within a list comprehension before. It does what you would hopefully expect: it only includes items that pass the test specified in the if clause.
Writing a function to process a list of files
Fill in the body of the load_logs function below to process a list of files. The function takes two arguments:

- the list of files to be processed
- the number of rows within each file to process. As the load_access_log function did, load_logs defines a default value of None for the number of rows to process (nrows), indicating all rows should be processed.
The body of load_logs should:

- Use a list comprehension to create a list of DataFrame objects, one per file in log_files, by calling load_access_log for each file. Pass the name of the current file and the value of nrows to load_access_log.
- Call pd.concat to concatenate the DataFrame objects created by the list comprehension into a single DataFrame named access_log.
- Return access_log to the caller.
Now you're ready to call load_logs for the files you stored in log_files earlier. Write code in the cell below to do this, storing the resulting data frame in a variable named access_logs_mini. Pass the value 2 as the value of the nrows argument to load_logs.
Write a statement in the cell below that calls the head method on access_logs_mini, passing 14 as the value of the n argument.
If things worked successfully, you should see 2 entries from each of the days in the range April 6 - April 12, 2021.
Assuming the above worked, repeat the process to create a DataFrame named all_logs containing ten thousand rows from all the files (due to memory and processing limits on CoCalc, we can't grab them all, unfortunately).
Part 4 - Analyzing the data
Now that we have all our data loaded, write statements in each of the cells below to complete the requested task.
Display the total number of rows in all_logs.
List the unique values in the Method column.
Display the first 5 rows for which the Status column has the value 404, which corresponds to Not Found.
Of those values for which the Status was 404, which of them had a Referer_Domain of hope.edu? These are broken links/images within pages hosted by Hope College!
Of the set of requests from above, how many different requested Resources were there?
Execute the code in the cell below this one. Then explain what the output represents in the cell below the code.
Replace this cell's contents with an explanation of the above output.
This cell finds the rows in the logs data frame whose Resource column contains the string computer-science. For the most part, this means pages within the computer science portion of the Hope web site, although it's not a guarantee that other pages don't contain computer-science somewhere in their URL.