Path: blob/main/Lessons/Lesson 04 - Combining Predictive and Prescriptive Analytics (Project)/Lesson_04.ipynb
870 views
Lesson 4: Combining Predictive and Prescriptive Analytics (Project)
This course has three lessons that are project lessons. Typically, projects combine skills and knowledge from multiple weeks (and sometimes multiple courses). Project weeks generally do not involve textbook readings and have presentations that only cover the project scenario and any additional information you might need to complete the project.
This week's project involves combining predictive and prescriptive analytics. We're using the simplest of predictive analytics, linear regression. Everyone should have learned linear regression in DS705 (which is a prerequisite for this course). However, in 705, you used R, whereas in 775, we're using Python. We'll walk you through using the statsmodel
package in Python to do linear regression, describe how it could be used in predictive analytics, and introduce you to the scenario for your project.
The Sample Problem - Wyndor Revisited
Once again, let's go back to the Wyndor problem. When Wyndor decided to build new products, they needed to enlist the help of key personnel in various units of the company. After discussing production hours with plant managers and engineers, they felt they had solid estimates for the number of hours of production time they'd need and the number of hours available. But, they still weren't sure about how to estimate the profit for each batch of windows and doors.
To set their estimated profit, they worked with the engineers, marketing division, and the accountants. We don't know what data they used to make their decisions. But, if they'd previously introduced other new products, it's feasible that they may have used some predictive analytics to model the estimated profit for each new product and fill in the question marks in the image above.
The Sample Data
Let's assume that Wyndor had introduced new, similar, doors and windows previously, as a paired product launch, and they've done some market research of competitors. Wyndor has gathered the following data, showing profit of batches of doors and windows and overall profit.
We can plot the data and note that it appears that both door and window batch profits share a linear relationship with overall profit, based on this historical data.
Running Linear Regression with Statsmodel
Remember that the objective function of a linear program is really nothing more than a linear regression equation, fitted through the origin. We can use this historical data to estimate our coefficients for our linear program. In Python, we'll use statsmodel
to run our linear regression.
Here we can see that our model accounts for 99.7% of the variability in the overall profit, which is quite good, and both of our predictors are significant with p-values less than .05. (Of course, this data is fairly contrived, and we probably have some issues with multi-collinearity that we're not addressing. But we're just trying to give you a general idea of how this might be used here.)
Utilizing Linear Regression Results in Optimization
So how do we use this in optimization? Remember that multiple linear regression with 2 variables and no intercept results in an equation of the form:
Where corresponds to the coefficient for the first variable (in this case doors) and corresponds to the coefficient for the second variable (in this case windows).
Doesn't that look exactly like what our optimization objective function looked like when we hard-coded the coefficients for windows and doors?
Maximize
In fact, we can just substitute the coefficients from our linear regression to generate an objective function which uses the predicted coefficients. Like so:
Maximize
Let's see what that looks like in code.
Now that you've seen the basic concepts, we'd like to introduce you to a more complicated scenario that you'll be using for your homework. Don't panic, though. It all works the same way. You'll use predictive analytics to generate estimated coefficients. Once your regression models are built, you'll set the historical data aside, and proceed with linear programming, using those modeled coefficients.
The Scenario
The following problem takes place in the United States in the late 1990s, when many major US cities were facing issues with airport congestion, partly as a result of the 1978 deregulation of airlines. Both fares and routes were freed from regulation, and low-fare carriers such as Southwest (SW) began competing on existing routes and starting non-stop service on routes that previously lacked it. Building new airports is not generally feasible, but sometimes decommissioned military bases or smaller municipal airports can be reconfigured as regional or larger commercial airports. There are numerous players and interests involved in the issue (airlines, city, state, and federal authorities, civic groups, the military, airport operators), and an aviation consulting firm is seeking advisory contracts with these players.
A consulting firm wishes to determine the maximum average fare (FARE) as a function of three variables: COUPON, HI, and DISTANCE. Moreover, they need to impose constraints on
the number of passengers on that route (PAX)
the starting city’s average personal income (S_INCOME)
the ending city’s average personal income (E_INCOME)
However, the variables PAX, S_INCOME, and E_INCOME are not decision variables so the firm first model these variables using COUPON, HI, and DISTANCE as predictors using linear regression (predictive analytics). They'll also use linear regression to model a linear relation between FARE and COUPON, HI, and DISTANCE. Armed with these predictive models, the firm will build a linear program (prescriptive analytics) to maximize the average fare.
Suppose you are in the aviation consulting firm and you want to maximize airfares for the particular set circumstances described below. The file Airfares.xlsx contains real data that were collected between Q3-1996 and Q2-1997. The first sheet contains variable descriptions while the second sheet contains the data. A csv file of the data is also provided (called Airfares.csv).
NOTE: This problem scenario is developed from pp. 170-171 in Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, by Shmueli, Bruce, Yahav, Patel, and Lichtendahl, Wiley, 2017)
Predictive Analytics
You will use multiple linear regression through the origin to fit airfare (FARE) as a linear function of the average number of coupons (COUPON) for that route, the Herfindel Index (HI), and the distance between the two endpoint airports in miles (DISTANCE).
You will build three more linear regression models with COUPON, HI, and DISTANCE as predictors to fit separate regression equations through the origin for response variables:
the number of passengers on that route (PAX)
the starting city’s average personal income (S_INCOME)
the ending city’s average personal income (E_INCOME)
Since each of these models uses the same predictors and the only thing that varies is the response variable, you'll write a function that takes in the predictors and response variables which:
runs the linear regression
returns the model
prints the regression equation.
Prescriptive Analytics
Linear Programming
Use the fitted regression equation for airfare (FARE) as a linear function of the average number of coupons (COUPON) for that route, the Herfindel Index (HI), and the distance between the two endpoint airports in miles (DISTANCE) as the objective function.
The three linear regression equations for the number of passengers on that route (PAX), the starting city’s average personal income (S_INCOME), the ending city’s average personal income (E_INCOME) as functions of the average number of coupons (COUPON) for that route, the Herfindel Index (HI), and the distance between the two endpoint airports in miles (DISTANCE) are to be used as three of the constraint equations.
the number of passengers on that route (PAX)
the starting city’s average personal income (S_INCOME)
the ending city’s average personal income (E_INCOME)
For additional constraints:
restrict COUPON to no more than 1.5
limit HI to between 4000 and 8000, inclusive
consider only routes with DISTANCE between 500 and 1000 miles, inclusive.
Complete the Homework Quiz
We've provided a Jupyter notebook that walks you through each of the questions in the homework quiz. Use the notebook to run your code. Transfer your answers to the Canvas quiz when complete.