Getting Started

Execute the two cells below to import our usual data science libraries, and then load the final version of the access log DataFrame from our access log processing exercise.
Part 1 - Decoding the User Agent Column
Later in this notebook you'll be doing some analysis of requests by user agent, which is a more technical term for browser. To prepare for this, you'll standardize the values in the `User Agent` column to a restricted set of values.
First, use the `fillna` method to replace all missing values in the `User Agent` column with the string Unknown.
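A minimal sketch of this step, using a tiny hypothetical DataFrame in place of the real `logs`:

```python
import pandas as pd

# Tiny hypothetical stand-in for the real access-log DataFrame.
logs = pd.DataFrame({"User Agent": ["Mozilla/5.0", None, "curl/7.68"]})

# Replace missing user agents with the string "Unknown".
logs["User Agent"] = logs["User Agent"].fillna("Unknown")
```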
Now, create a column in `logs` named `User Agent Lower` which is the result of converting the values in the `User Agent` column to lower case. See pages 220-222 (end of Section 7.3) to see how to do this efficiently. You should not write a for loop to accomplish this task.

Display the first 10 values in the `User Agent Lower` column only to verify your code worked correctly.
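One way this can look, sketched on hypothetical sample data; the `.str` accessor applies the string operation to every value at once, so no loop is needed:

```python
import pandas as pd

# Hypothetical sample; the real logs DataFrame comes from the earlier exercise.
logs = pd.DataFrame({"User Agent": ["Mozilla/5.0 (Windows NT 10.0)", "CURL/7.68"]})

# Vectorized lower-casing of the whole column.
logs["User Agent Lower"] = logs["User Agent"].str.lower()
logs["User Agent Lower"].head(10)
```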
Next, write one or more statements in the cell below that add a column named `Platform` to `logs`. The contents of `Platform` will initially be the value Other.

After creating `Platform`, display the contents of the `User Agent` and `Platform` columns for the first 10 rows of `logs` to verify it worked correctly.
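A sketch of the idea on hypothetical data: assigning a scalar to a new column broadcasts the value to every row.

```python
import pandas as pd

logs = pd.DataFrame({"User Agent": ["Mozilla/5.0", "curl/7.68"]})  # hypothetical sample

# The scalar "Other" is broadcast to every row.
logs["Platform"] = "Other"
logs[["User Agent", "Platform"]].head(10)
```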
In the cell below, write code that looks for the following strings in each of the values in `User Agent Lower`, setting the value in the `Platform` column to the associated value.

`User Agent Lower` Contains | Set `Platform` to |
---|---|
android | Android |
iphone | iPhone |
ipad | iPad |
macintosh | Mac |
windows | Windows |
bot | Web crawler |
spider | Web crawler |
You can again reference pages 220-222 (end of Section 7.3) from the course textbook for more information on how to efficiently process string values in a Pandas column. Do not write a for loop over the rows in `logs` to accomplish this task!

Use the `loc` method when setting the values in the `Platform` column to avoid a warning from Pandas about setting values on a copy of a slice.

Display the `User Agent Lower` and `Platform` columns for the first 10 rows in `logs` when you are done to verify your work.
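One possible approach, sketched on hypothetical sample data. The loop below iterates over the seven search strings from the table, not over the rows of `logs`, so it stays within the no-row-loop rule; `str.contains` builds a boolean mask and `loc` sets the matching rows:

```python
import pandas as pd

# Hypothetical sample user agents.
logs = pd.DataFrame({"User Agent Lower": [
    "mozilla/5.0 (iphone; cpu iphone os 14_4)",
    "mozilla/5.0 (compatible; googlebot/2.1)",
    "mozilla/5.0 (windows nt 10.0)",
]})
logs["Platform"] = "Other"

# Search string -> Platform value, in the order given in the table.
patterns = [("android", "Android"), ("iphone", "iPhone"), ("ipad", "iPad"),
            ("macintosh", "Mac"), ("windows", "Windows"),
            ("bot", "Web crawler"), ("spider", "Web crawler")]
for substring, platform in patterns:
    mask = logs["User Agent Lower"].str.contains(substring, regex=False)
    logs.loc[mask, "Platform"] = platform
```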
Now we want to set the `Platform` column to Linux for those values of `User Agent Lower` that contain linux but do not contain android.

First, print out the number of rows in `logs` where the `Platform` column has the value Android. We will want to verify this number doesn't change when we set values for Linux.
Now write code in the cell below to change the `Platform` value to Linux when the `User Agent Lower` column contains linux and does not contain android. I used `np.logical_not` to help me with this.

Again, use the `loc` method when setting values.
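A sketch of how the two conditions can be combined, again on hypothetical data; `np.logical_not` inverts the android mask:

```python
import numpy as np
import pandas as pd

logs = pd.DataFrame({"User Agent Lower": [
    "mozilla/5.0 (x11; linux x86_64)",
    "mozilla/5.0 (linux; android 11)",
]})
logs["Platform"] = ["Other", "Android"]  # as left by the previous step

has_linux = logs["User Agent Lower"].str.contains("linux", regex=False)
no_android = np.logical_not(logs["User Agent Lower"].str.contains("android", regex=False))

# Only rows that contain "linux" but not "android" become Linux.
logs.loc[has_linux & no_android, "Platform"] = "Linux"
```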
Now copy/paste your code that printed out the number of rows whose `Platform` column is Android into the cell below. Verify it is the same as it was previously. If it's not, check your statement above.
Use the technique for using a dictionary to map values from one column into a new column at the top of page 203 (Section 7.2, under the heading Transforming Data Using a Function or Mapping) to create a new column named `Device type` from the `Platform` column according to the following rules:

- Android, iPhone, and iPad should be given the value Mobile.
- Windows, Mac, and Linux should be given the value Desktop.
- Unknown, Other, and Web crawler should be given the value Unknown.
To help verify the correctness of the above cell, write code in the cell below that displays the `Platform` and `Device type` columns for the first 20 rows whose `Platform` value is not Other.
Part 2 - Generating a column for the top-level folder for the request
Next we will want to do some analysis by the "top-level" folder for each request. This top-level folder will include values like admissions, news, and academics.
To get started, write the definition of a function named `parse_resource` in the cell below. `parse_resource` takes a single argument `r`, which is a string, and returns a value according to the following rules:

- If `r` is the string "/", then `r` itself is returned.
- Otherwise, `r` is split into `parts` based on the / character, and the 2nd component of `parts` is examined. Call that 2nd component `base`:
  - If `base` contains a period, then "/" is returned.
  - If `base` contains a question mark, then the portion of `base` before the question mark is returned.
  - Otherwise, `base` is returned.

The `index` method on a string can be used to find the first index in a string where a particular character appears.
Now use `map` to apply the `parse_resource` function to the values in the `Resource` column. Store the resulting `Series` object into `logs` with the name `Top-Level`.
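One possible implementation of the rules above, together with the `map` call, sketched on hypothetical resource paths:

```python
import pandas as pd

def parse_resource(r):
    # The bare root path is returned unchanged.
    if r == "/":
        return r
    # For "/admissions/apply", r.split("/") is ["", "admissions", "apply"].
    base = r.split("/")[1]
    if "." in base:          # e.g. "/index.html" -> "/"
        return "/"
    if "?" in base:          # e.g. "/news?id=3" -> "news"
        return base[:base.index("?")]
    return base

# Hypothetical sample; the real Resource column comes from the access log.
logs = pd.DataFrame({"Resource": ["/", "/index.html", "/news?id=3", "/academics/majors"]})
logs["Top-Level"] = logs["Resource"].map(parse_resource)
```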
Write a statement in the cell below that prints out the number of distinct values in the `Top-Level` column. The answer should be 208, but you (hopefully obviously) shouldn't just print that number literally.
Write one or more statements in the cell below that print out how many requests were made for the top-level folder named academics.
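Both counts can be sketched like this on a tiny hypothetical column; `nunique` gives the number of distinct values, and summing a boolean comparison counts matching rows:

```python
import pandas as pd

# Hypothetical Top-Level values.
logs = pd.DataFrame({"Top-Level": ["news", "academics", "news", "academics", "admissions"]})

num_distinct = logs["Top-Level"].nunique()              # distinct top-level folders
academics_requests = (logs["Top-Level"] == "academics").sum()  # requests for academics
```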
Write one or more statements in the cell below to get a `Series` object that summarizes how many requests occurred for each of the top 5 most frequently accessed values of `Top-Level`. Exclude the two values `_resources` and `graphics` when picking the top 5 most frequently accessed top-level folders.

Hint: first create a subset of `logs` that excludes the rows where the `Top-Level` column is either `_resources` or `graphics`. Then use the `value_counts` method on that subset, followed by a slicing operation. `value_counts` always returns values sorted in descending order.
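The hint above can be sketched as follows (hypothetical data; `isin` plus `~` drops the excluded folders):

```python
import pandas as pd

logs = pd.DataFrame({"Top-Level":
    ["news"] * 3 + ["_resources"] * 5 + ["academics"] * 2 + ["graphics"] * 4})

# Drop the two excluded folders, then count and slice off the top 5.
subset = logs[~logs["Top-Level"].isin(["_resources", "graphics"])]
top5 = subset["Top-Level"].value_counts()[:5]
```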
Write one or more statements in the cell below that create a tuple named `date_range` which contains the smallest and largest values in the `Date` column of `logs`. Print out the value of `date_range` after creating the tuple.

The values in `date_range` will be Pandas `Timestamp` objects. These objects have a `strftime` method that works like the standard `datetime.strftime` Python method we have used before.

For example, we can print out just the year for the first element in `date_range` like this:
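A minimal illustration of the idea, with a hypothetical `date_range`:

```python
import pandas as pd

# Hypothetical date range; yours comes from the Date column.
date_range = (pd.Timestamp("2021-04-01"), pd.Timestamp("2021-04-05"))
year_str = date_range[0].strftime("%Y")
```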
Write one or more statements in the cell below that create a string variable named `date_range_str`. The value of `date_range_str` should look like Apr 01, 2021 - Apr 05, 2021, where the two dates are replaced by the values in `date_range`. Use the strftime documentation to determine the correct formatting codes.

Print out the value of `date_range_str` once you've created it.
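One possible way to build the string, assuming `%b`, `%d`, and `%Y` are the codes you settle on:

```python
import pandas as pd

date_range = (pd.Timestamp("2021-04-01"), pd.Timestamp("2021-04-05"))  # hypothetical

fmt = "%b %d, %Y"  # abbreviated month, zero-padded day, 4-digit year
date_range_str = date_range[0].strftime(fmt) + " - " + date_range[1].strftime(fmt)
```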
Write code in the cell below so that if the year components of the two values in `date_range` are the same, `date_range_str` only includes the year at the end.

For example, Apr 01 - Apr 05, 2021 would be the output for the example above. Otherwise, leave the value of `date_range_str` as it was set in the cell above. Print the value of `date_range_str` at the end of the cell.

If `ts` is a `Timestamp` object, you can get the year associated with `ts` like this:
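A sketch of the year attribute and the conditional rebuild (hypothetical dates; format codes are an assumption, as above):

```python
import pandas as pd

date_range = (pd.Timestamp("2021-04-01"), pd.Timestamp("2021-04-05"))  # hypothetical
date_range_str = "Apr 01, 2021 - Apr 05, 2021"  # as built in the previous cell

# Timestamp objects expose the year as an integer attribute.
if date_range[0].year == date_range[1].year:
    date_range_str = (date_range[0].strftime("%b %d") + " - "
                      + date_range[1].strftime("%b %d, %Y"))
```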
Use the "recipe" described in the textbook on page 279 in the cell below to produce a horizontal bar chart showing the number of requests per top-level value for those top 5 most frequent top-level folders.

- The chart title should be Most frequent top-level folders date_range, where date_range is replaced with the contents of the `date_range_str` variable you created above.
- The color of the bars should be the string #036, which is a Hope College blue color.
- The x-axis label should be Top-level folder.
- The y-axis label should be Number of requests.

I set the chart title and color using arguments to the Pandas plotting function. For the axis labels, I saved the axes object returned by the Pandas plotting function into a variable, and then called methods on that object to specify the x-axis and y-axis labels. There is example code to set the x-axis label on page 268 of the textbook; similar code can be used to set the y-axis label.
Part 3 - Recreating `location_summary` using GroupBy
In our previous work with this data, we computed two columns named On campus and Off campus that have boolean values. Pandas calls these dummy variables. To simplify some of our computations in the following cells, write code in the cell below that creates a new column named `Request location`. The values in `Request location` should be On campus if the value of the `On campus` column is True; otherwise the value should be Off campus.
I did this using `map`, but it can also be done using a series of assignment statements. If you choose to do it using `map`, you will find conditional expressions useful in implementing a lambda function. Or you can define a function to be passed to `map` instead of using a lambda function.
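The `map`-with-lambda approach can be sketched like this (hypothetical data):

```python
import pandas as pd

logs = pd.DataFrame({"On campus": [True, False, True]})  # hypothetical dummy column

# A conditional expression inside the lambda handles both cases.
logs["Request location"] = logs["On campus"].map(
    lambda on: "On campus" if on else "Off campus")
```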
Write a statement in the cell below that displays only the values of the `IP address`, `On campus`, `Off campus`, and `Request location` columns for the first 10 rows in `logs`.
Write one or more statements in the cell below that use `groupby` and `count` to create a `Series` named `location_count`. `location_count` should contain the number of requests made from each type of location. Print out the value of `location_count` after you create it.

You should see approximately 38K requests from off campus, and 31K requests from on campus.
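The shape of this step, on hypothetical data; selecting a single column after `groupby` makes `count` return a `Series` rather than a `DataFrame`:

```python
import pandas as pd

# Hypothetical rows; the real logs has ~69K of them.
logs = pd.DataFrame({
    "Request location": ["On campus", "Off campus", "Off campus"],
    "Size": [100, 200, 300],
})

# One count per location value.
location_count = logs.groupby("Request location")["Size"].count()
```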
Repeat the above to create `Series` objects named `location_total`, `location_min`, `location_max`, and `location_mean` that compute the corresponding statistics for the values in the `Size` column using the `groupby` method.

Print the value in `location_total` in millions to check your work. You should get approximately 2625M bytes from off campus, and 3376M bytes from on campus.
Now create a new data frame named `location_summary` with the columns Count, Min size, Max size, Mean size, and Total containing the `Series` objects created above.

Use the `head` method to display both rows.
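The statistics and the summary frame together can be sketched as follows (hypothetical sizes); passing a dictionary of `Series` to the `DataFrame` constructor lines them up by their shared index:

```python
import pandas as pd

logs = pd.DataFrame({
    "Request location": ["On campus", "Off campus", "Off campus"],
    "Size": [100, 200, 400],
})

# Group once, then compute each statistic on the Size column.
grouped = logs.groupby("Request location")["Size"]
location_count = grouped.count()
location_total = grouped.sum()
location_min = grouped.min()
location_max = grouped.max()
location_mean = grouped.mean()

# The Series share the same index, so they align into one frame.
location_summary = pd.DataFrame({"Count": location_count, "Min size": location_min,
                                 "Max size": location_max, "Mean size": location_mean,
                                 "Total": location_total})
location_summary.head()
```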
Part 4 - Grouping by multiple columns
The following code illustrates the process of grouping by multiple columns, and then generating a bar chart from that data.
First, we generate a new `DataFrame` that contains only those rows whose `Top-Level` folder is one of the 5 most frequent folders.
Now we can:

- Group `top_folder_requests` by the `Top-Level` and `Request location` columns.
- Compute a count for each combination of values in these columns.
- Display the resulting `DataFrame`.
Note that the index for `counts` is a multi-level index (also called a hierarchical index).

In order to generate a stacked bar chart, we need to take one level of the index and create a column for each value of that level. This is what the `unstack` method does. In this case, it will create new columns named `Off campus` and `On campus`, containing the counts associated with those values.

Now that we have columns for Off campus and On campus, it's pretty straightforward for us to create a couple of bar charts from the `counts` data frame.
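The grouping and unstacking steps can be sketched like this on hypothetical data (counting here with `size`; the notebook's code may instead call `count` on a column):

```python
import pandas as pd

# Hypothetical subset of rows for the most frequent folders.
top_folder_requests = pd.DataFrame({
    "Top-Level": ["news", "news", "admissions", "news"],
    "Request location": ["On campus", "Off campus", "On campus", "On campus"],
})

# One count per (Top-Level, Request location) pair; the result has a multi-level index.
counts = top_folder_requests.groupby(["Top-Level", "Request location"]).size()

# unstack() moves the inner index level into columns: Off campus, On campus.
wide = counts.unstack()
```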
Your final task is to generate the chart shown below.
The two cells below aren't part of the solution; they were just used to generate an image and data file for students to use while completing the assignment.