Part 1 - Complete the original access log exercise
As part of this assignment, you should complete the `processing-access-logs.ipynb` Jupyter notebook contained in the `data-science/in-class-exercises/processing-access-logs` folder (the one we've been working on in class).

Once you have that notebook completed, you should copy it from the folder it's currently in to the `data-science/homework-4` folder. You can either use CoCalc's file manager to perform the copy, or you can execute the following command from a Terminal session.
Part 2 - Working with the status code column
Execute the 2 cells below to import our usual data science libraries, and then load the final version of the access log `DataFrame` from our access log processing exercise.
Despite our best efforts in the previous exercise, it appears that some rows weren't processed correctly. To see this, write and execute a statement in the cell below that displays each of the distinct values in the `Status` column. Look back at the code from the original exercise if you need a refresher on how to do this.
You should see a mixture of numbers, strings containing numbers, and what looks like an HTTP version number. Write a statement in the cell below that will display the number of rows whose `Status` column is set to this HTTP version number. The output should look like this:
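One way to get that count, sketched with toy data (your log supplies the real rows):

```python
import pandas as pd

# Invented data; in the real log the stray value appears in exactly 5 rows.
logs = pd.DataFrame({"Status": ["200", "404", "HTTP/1.1", "HTTP/1.1", "200"]})

# Comparing the column to the value yields a boolean Series;
# summing it counts the matching rows.
n_bad = (logs["Status"] == "HTTP/1.1").sum()
print(n_bad)
```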
Since there are only 5 of them, write a statement in the cell below to display all 5 of the rows to get a sense for what they look like.
You should see that they are all requests from the same IP address and for similar, but not identical, resources. To see this, write code in the cell below that displays just the `IP address` and `Resource` columns for these 5 rows.
Review the process used in the first part of the exercise to create the `Referer_Domain`, `Referer_Path`, and `Referer_Query` columns. Then use a similar process to break the `Resource` column into `Resource_Path` and `Resource_Query` portions, separated at the `?` character.
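A minimal sketch of the splitting step, using invented `Resource` values:

```python
import pandas as pd

# Hypothetical Resource values; the real ones come from the access log.
logs = pd.DataFrame({"Resource": ["/index.html?page=2", "/about.html"]})

# str.split with n=1 and expand=True produces one column per piece;
# rows with no '?' get NaN for the query portion.
parts = logs["Resource"].str.split("?", n=1, expand=True)
logs["Resource_Path"] = parts[0]
logs["Resource_Query"] = parts[1]
print(logs[["Resource_Path", "Resource_Query"]])
```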
Display the values of the `Resource_Path` and `Resource_Query` columns just for the 5 rows you've been examining.
Next, use the `apply` function on the 5 bad rows from above to create a new `Series` object named `address_and_resource` (not a new column in the `DataFrame` named `logs`). The values in `address_and_resource` should be tuples whose first component is the value of the `IP address` column, and whose second component is the value of the `Resource_Path` column.
Write 2 separate `print` statements to display the components of the first tuple in `address_and_resource`, like this:
where XX.YY.WW.ZZ is replaced with the originating IP address, and RRRRR is replaced with the requested resource.
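The `apply`-and-print steps above might look like the following sketch, with invented rows standing in for the 5 bad requests:

```python
import pandas as pd

# Invented bad rows; the column names follow the assignment.
bad_rows = pd.DataFrame({
    "IP address": ["10.0.0.1", "10.0.0.1"],
    "Resource_Path": ["/a.php", "/a.php"],
})

# axis=1 hands apply one row at a time; returning a tuple from the
# lambda yields a Series of tuples.
address_and_resource = bad_rows.apply(
    lambda row: (row["IP address"], row["Resource_Path"]), axis=1
)

# Two separate print statements for the first tuple's components.
first = address_and_resource.iloc[0]
print(first[0])
print(first[1])
```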
Finally, write a loop that iterates over the unique tuples in `address_and_resource`, printing out a statement like the one below for each unique combination of IP address and `Resource_Path`:
where XX.YY.WW.ZZ is replaced with the originating IP address, and RRRRR is replaced with the requested resource. You should find that there's only one unique combination of IP address and resource.
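A sketch of the loop, assuming a toy `Series` of tuples like the one built in the previous step:

```python
import pandas as pd

# Assumed (IP, resource) tuples; the real Series comes from apply above.
address_and_resource = pd.Series([
    ("10.0.0.1", "/a.php"),
    ("10.0.0.1", "/a.php"),
])

# unique() also works on object values like tuples, so the loop only
# visits each distinct (address, resource) pair once.
for ip, resource in address_and_resource.unique():
    print(f"{ip} requested {resource}")
```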
If you're interested in what's happening here, a quick Google search indicates this is likely an attempt to execute malicious code on our web server!
Write a statement in the cell below so that `logs` contains all the rows except for these 5 bad requests. Print out the number of rows in `logs` when you are done as a sanity check. It should be 69,995.
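One way to drop the bad rows is boolean filtering, sketched here with a toy frame standing in for the real 70,000-row log:

```python
import pandas as pd

# Toy data; the real filter keeps 69,995 of 70,000 rows.
logs = pd.DataFrame({"Status": ["200", "HTTP/1.1", "404"]})

# Keep only the rows whose Status is NOT the stray value, then
# print the remaining row count as a sanity check.
logs = logs[logs["Status"] != "HTTP/1.1"]
print(len(logs))
```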
Write a statement in the cell below that uses the `DataFrame` method `astype` to convert the `Status` column to be an `int64` instead of its current type of `object`. Read the documentation on `DataFrame.astype` carefully before starting. In particular, note you'll want to pass a dictionary as the `dtype` argument to the `astype` method.
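A minimal sketch of the dictionary form of `astype`:

```python
import pandas as pd

logs = pd.DataFrame({"Status": ["200", "404", "301"]})  # currently object dtype

# The dict maps column names to target types, so only Status is converted;
# any other columns would be left untouched.
logs = logs.astype({"Status": "int64"})
print(logs["Status"].dtype)
```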
Write a statement in the cell below that executes the `info` method on the `logs` `DataFrame` to verify the `Status` column's type has been changed.
Write one or more statements in the cell below to display the distinct values in the `Status` column of `logs` that represent errors. Any status in the range 400-499 represents an error. You should see 3 distinct error statuses: 404, 403, and 401.
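One possible approach, sketched with invented statuses, is to mask the 4xx range and then ask for the unique values:

```python
import pandas as pd

# Invented data; the real log has many more rows.
logs = pd.DataFrame({"Status": [200, 404, 403, 404, 401, 500]})

# between() builds a boolean mask for the 400-499 range (inclusive);
# unique() then lists each distinct error code once.
errors = logs.loc[logs["Status"].between(400, 499), "Status"]
print(sorted(errors.unique()))
```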
In the cell below, write one or more statements that create a new `DataFrame` named `error_df` containing a single column named Number of errors, with one row for each of these error codes. The index for `error_df` should be the status code values, and the index's name should be Status Code. Use the `DataFrame.sort_values` method to sort `error_df` in descending order by the number of requests in `logs` with that status code.

Do not assume the 3 codes you found will be the only error codes present. Your code should work regardless of how many distinct error codes there are. In my implementation, I used a list comprehension to create a list of counts for each value, and passed that list when creating `error_df`.
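A sketch of that list-comprehension approach, using toy data (the real counts come from the full log):

```python
import pandas as pd

# Invented statuses standing in for the real log.
logs = pd.DataFrame({"Status": [404, 404, 403, 401, 404, 200]})

# Distinct error codes, a list comprehension of per-code counts,
# then a descending sort on that count.
error_codes = logs.loc[logs["Status"].between(400, 499), "Status"].unique()
counts = [(logs["Status"] == code).sum() for code in error_codes]
error_df = pd.DataFrame(
    {"Number of errors": counts},
    index=pd.Index(error_codes, name="Status Code"),
).sort_values("Number of errors", ascending=False)
print(error_df)
```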
Use the `head` method to display the values in `error_df` after constructing it. It should look like this:
| Status Code | Number of errors |
| --- | --- |
| 404 | 24078 |
| 403 | 326 |
| 401 | 31 |
Part 3 - Looking at where our requests came from
Next you'll use the IP address to categorize whether the device making each request was on campus or off campus. IP addresses are generally represented as 4 numbers, separated by periods. This format is often called dotted decimal notation. In reality, each IP address is a distinct series of 32 zeros and ones. We use dotted decimal notation to make it easier for us to write and remember IP addresses!
All of Hope's IP addresses start with 209.140. The 3rd value is 192 or higher. The final value can be any value.
In the cell below, write a function named `isHopeAddress` which takes a single string value as an argument, and returns `True` if it meets the requirements described above, and `False` otherwise. My implementation:

- uses `split` to break the string into parts
- uses `int` to convert each part to an integer
- uses `if` statements to decide whether the address is a Hope address or not.
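A sketch following that hinted approach (split, convert, compare), using the rules stated above:

```python
# Hope addresses start with 209.140 and have a third value of 192 or higher;
# the fourth value can be anything.
def isHopeAddress(address):
    parts = address.split(".")
    if len(parts) != 4:
        return False
    first, second, third = int(parts[0]), int(parts[1]), int(parts[2])
    return first == 209 and second == 140 and third >= 192

print(isHopeAddress("209.140.195.12"))   # True: matches all three rules
print(isHopeAddress("209.140.100.12"))   # False: third value below 192
```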
You can execute the cell below to test your implementation of `isHopeAddress`.
Now that you have a working `isHopeAddress`, use it in combination with `map` to create two new columns in `logs`:

- `On campus`: Has a value of `True` for addresses that are on Hope's campus, and a value of `False` for off-campus addresses
- `Off campus`: Has values that are the opposite of `On campus`
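The `map` step might look like this sketch, with a stand-in `isHopeAddress` and invented addresses:

```python
import pandas as pd

def isHopeAddress(address):
    # Assumed implementation from the previous step.
    parts = address.split(".")
    return parts[0] == "209" and parts[1] == "140" and int(parts[2]) >= 192

# Toy frame; the real logs DataFrame has many more rows and columns.
logs = pd.DataFrame({"IP address": ["209.140.200.5", "8.8.8.8"]})

# map applies the function to every value in the column; the second
# column is just the element-wise negation of the first.
logs["On campus"] = logs["IP address"].map(isHopeAddress)
logs["Off campus"] = ~logs["On campus"]
print(logs)
```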
Write a statement in the cell below that displays the `IP address`, `On campus`, and `Off campus` columns for the first 10 rows of `logs`. Coincidentally, you'll see the first 2 and last 2 requests are from on campus devices, while the middle 6 are from off campus.
In the cell below, write code to display a summary of some of the columns in `logs` broken down into on-campus and off-campus requests. Store the output in a new `DataFrame` named `location_summary`. The values of `location_summary`'s index should be the values On campus and Off campus. The columns in `location_summary` should be:

- Number of requests
- Number of distinct resources
- Distinct IPs
- Smallest response
- Largest response
- Average response
In addition to these columns, `location_summary` should also contain columns that display the percentage of requests made for each distinct value in the `Method` column. You shouldn't hard code these column names; instead, use the `value_counts` method to get a listing of the possible values. You may assume that both on and off campus requests contain the same set of request methods.
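One possible shape for this summary, sketched with a tiny invented log (the `Size` column name for the response size is an assumption; use whatever your `logs` frame actually calls it):

```python
import pandas as pd

# Tiny invented log; column names follow the assignment, except Size,
# which is an assumed name for the response-size column.
logs = pd.DataFrame({
    "On campus":  [True, True, False, False],
    "Resource":   ["/a", "/a", "/b", "/c"],
    "IP address": ["209.140.200.1", "209.140.200.1", "8.8.8.8", "1.2.3.4"],
    "Size":       [100, 300, 50, 150],
    "Method":     ["GET", "GET", "GET", "POST"],
})

rows = {}
for label, mask in [("On campus", logs["On campus"]),
                    ("Off campus", ~logs["On campus"])]:
    subset = logs[mask]
    row = {
        "Number of requests": len(subset),
        "Number of distinct resources": subset["Resource"].nunique(),
        "Distinct IPs": subset["IP address"].nunique(),
        "Smallest response": subset["Size"].min(),
        "Largest response": subset["Size"].max(),
        "Average response": subset["Size"].mean(),
    }
    # One percentage column per request method, with the method names
    # taken from value_counts rather than hard-coded.
    for method in logs["Method"].value_counts().index:
        row[method] = 100 * (subset["Method"] == method).mean()
    rows[label] = row

location_summary = pd.DataFrame.from_dict(rows, orient="index")
print(location_summary)
```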
After creating `location_summary`, use the `head` method to display its contents. Here's the output I got:
|  | Number of requests | Number of distinct resources | Distinct IPs | Smallest response | Largest response | Average response | GET | POST | HEAD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| On campus | 31608 | 2490 | 307 | 273 | 9750825 | 106820.740192 | 99.958871 | 0.025310 | 0.015819 |
| Off campus | 38387 | 7756 | 3212 | 0 | 54742549 | 68383.381692 | 99.184620 | 0.054706 | 0.760674 |