Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
DataScienceUWL
GitHub Repository: DataScienceUWL/DS775
Path: blob/main/Homework/Lesson 04 HW - Predictive & Prescriptive Project/Homework_04.ipynb
870 views
Kernel: Python 3 (system-wide)
# execute to import notebook styling for tables and width etc. from IPython.core.display import HTML import statsmodels.api as sm import pandas as pd import copy from pyomo.environ import * import warnings warnings.filterwarnings('ignore')

Lesson 4: Homework - Combining Predictive and Prescriptive Analytics

When asking questions about homework in Piazza please use a tag in the subject line like HW1.3 to refer to Homework 1, Question 3. So the subject line might be HW1.3 question. Note there are no spaces in "HW1.3". This really helps keep Piazza easily searchable for everyone!

For full credit, all code in this notebook must be both executed in this notebook and copied to the Canvas quiz where indicated.

Question 1 Why must your linear regression equations be fit through the origin? (2 points)

  • It is not feasible to have zero passengers, therefore the line must originate away from (0,0).

  • The multiple regressions run through the origin, i.e (0,0), due to the additivity assumption of linear programming.

  • The regression equations are all fit through the origin (no intercept term) so that we maintain proportionality without an additive constant in our objective function and to simplify the construction of the constraints.

  • If the intercept is not constant then the certainty assumption is violated. We must fit through the origin to make it certain.

Problem Set up

Note: this information is not included in the Canvas quiz.

The file Airfares.xlsx contains real data that were collected between Q3-1996 and Q2-1997. A csv file of the data is also provided (called Airfares.csv).

We're copying the instructions from the presentation file here for ease of use.

The following problem takes place in the United States in the late 1990s, when many major US cities were facing issues with airport congestion, partly as a result of the 1978 deregulation of airlines. Both fares and routes were freed from regulation, and low-fare carriers such as Southwest (SW) began competing on existing routes and starting non-stop service on routes that previously lacked it. Building new airports is not generally feasible, but sometimes decommissioned military bases or smaller municipal airports can be reconfigured as regional or larger commercial airports. There are numerous players and interests involved in the issue (airlines, city, state, and federal authorities, civic groups, the military, airport operators), and an aviation consulting firm is seeking advisory contracts with these players.

A consulting firm wishes to determine the maximum average fare (FARE) as a function of three variables: COUPON, HI, and DISTANCE. COUPON, HI, and DISTANCE are things that an airline could control, when determining where to locate new routes.

Moreover, they need to impose constraints on

  • the number of passengers on that route (PAX) 20000\leq 20000

  • the starting city’s average personal income (S_INCOME) 30000\leq 30000

  • the ending city’s average personal income (E_INCOME) 30000\geq 30000

For additional constraints:

  • restrict COUPON to no more than 1.5

  • limit HI to between 4000 and 8000, inclusive

  • consider only routes with DISTANCE between 500 and 1000 miles, inclusive.

However, the variables PAX, S_INCOME, and E_INCOME are not decision variables so the firm must first model these variables using COUPON, HI, and DISTANCE as predictors using linear regression (predictive analytics). They'll also use linear regression to model a linear relation between FARE and COUPON, HI, and DISTANCE. Armed with these predictive models the firm will build a linear program (prescriptive analytics) to maximize the average fare.

Suppose you are in the aviation consulting firm and you want to maximize airfares for the particular set circumstances described below. The file Airfares.xlsx contains real data that were collected between Q3-1996 and Q2-1997. The first sheet contains variable descriptions, while the second sheet contains the data. A csv file of the data is also provided (called Airfares.csv).

NOTE: This problem scenario is developed from pp. 170-171 in Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, by Shmueli, Bruce, Yahav, Patel, and Lichtendahl, Wiley, 2017)

Part 1: The Predictive Models

Since each of these models uses the same predictors and the only thing that varies is the response variable, write a function that takes in the dataframe, a list of predictors and a response variable string which:

  • runs the linear regression based on the

  • returns the model

  • prints the regression equation.

Use a non-repetitive approach to run multiple linear regression through the origin using the average number of coupons (COUPON) for that route, the Herfindel Index (HI), and the distance between the two endpoint airports in miles (DISTANCE) as predictors. You'll build 4 multiple linear regression models, one for each of the following response variables:

  • the average fare (FARE)

  • the number of passengers on that route (PAX)

  • the starting city’s average personal income (S_INCOME)

  • the ending city’s average personal income (E_INCOME)

For each of the models, you'll need to:

  • print the resulting linear equation. For instance: FARE=X1COUPON+X2HI+X3DISTANCEFARE = X_1COUPON + X_2HI + X_3DISTANCE with the XnX_n coefficients filled in.

  • print the R2R^2 for each model. (Hint, it's stored in a variable that can be accessed by calling .rsquared on whatever variable you created when you fit the model.)

  • store the data in such a way that you can use the coefficients directly in the linear program.

To solve this, start by completing the regModel() function below. Then call the regModel function 4 times to produce the coefficients for each model. You should store the coefficients in such a way that you can access them easily to build the linear programs that are needed later (we suggest you store all the coefficients in one data structure). Here is a template for the function

def regModel(df, X, Y): """ find linear regression coefficients for Y ~ X using the data in df Parameters: df (pandas data frame): contains the response and predictor variables Y (str): a string matching the column name of the response variable X (list of str): column names of the predictor variables Returns: (list of floats or similar): linear regression model coefficients for each predictor variable """ # find model # print output # return coefs

For example this code:

airfares = pd.read_csv("data/Airfares.csv") predictors = ['COUPON', 'HI', 'DISTANCE'] coef = regModel( airfares, predictors, 'FARE') print('\nThe coefficients of the regression model are:') print(coef)

Would produce output similar to this:

Model is: FARE = 22.5900 COUPON + 0.0118 HI + 0.0833 DISTANCE (R^2 = 0.91) The coefficients of the regression model are: COUPON 22.590019 HI 0.011798 DISTANCE 0.083336

Question 2 - Your regModel() function (manually graded) (5 points)

In the following cell, write your non-repetitive code and run your 4 linear regressions.

#your code here

Question 3 (2 points)

In the model that predicts FARE, what is the coefficient for COUPON, rounded to 2 digits?

Answer here

Question 4 (2 points)

In the model that predicts PAX, what is the coefficient for HI, rounded to 2 digits?

Answer here

Question 5 (2 points)

In the model that predicts S_INCOME, what is the coefficient for DISTANCE, rounded to 2 digits?

Answer here

Question 6 (2 points)

Match the models with their R2R^2 values:

A. FARE

  • .97

  • .96

  • .91

  • .42

  • .87

B. PAX

  • .97

  • .96

  • .91

  • .42

  • .87

C. S_INCOME

  • .97

  • .96

  • .91

  • .42

  • .87

D. E_INCOME

  • .97

  • .96

  • .91

  • .42

  • .87

Optimal LP Solution

Question 7 Generate the linear program (manually graded) 8 points

Generate a linear program to find the optimal maximum value of FARE, given the constraints noted in the introduction and the results of your linear regression modeling. Be sure to use the values directly from your linear regression without rounding. Do not hard-code the coefficient values. Access them directly from your saved models, as demonstrated in the lesson.

For full credit, you must use an abstract approach. To make it easier to group the constraints, you can rewrite the E_INCOME constraint so that it uses \leq instead of \geq. A simple way to do this is to multiply both sides of the inequality by (-1). For example 2x3y5 2x - 3y \geq 5 becomes 2x+3y5. -2x + 3y \leq -5.

Question 8 (2 points)

What is the maximum airfare, rounded to 2 digits. (Do not input the dollar sign.)

Answer here

Question 9 (2 points)

What is the optimal value for COUPON, rounded to 2 digits.

Answer here

Question 10 (2 points)

What is the optimal value for HI, rounded to 2 digits.

Answer here

Question 11 (2 points)

What is the optimal value for DISTANCE, rounded to 2 digits.

Answer here

Sensitivity Analysis

Run your linear program multiple times, modifying your resource constraints to determine which constraints are binding. You'll be making the following changes. (Remember to always reset your variables back to the baseline after each change.)

Increment the following constraints by one (setting them back before incrementing the next one):

  • S_INCOME (from 30000 to 30001)

  • E_INCOME (from 30000 to 30001)

  • PAX (from 20000 to 20001)

  • COUPON (from 1.5 to 2.5)

  • HI (from 8000 to 8001)

  • DISTANCE (from 1000 to 1001)

Use the values of the objective function from each run, and the value from Question 8, to estimate the shadow prices for each constraint. To get accurate values you should store the optimized objective function values in memory and use those to compute the shadow prices.

You don't need to produce a sensitivity report, but you can if want to use it to check your work.

# use this cell to do your work

Question 12 (3 points)

Which of the following are binding constraints? Check all the apply.

  • S_INCOME

  • E_INCOME

  • PAX

  • COUPON

  • HI

  • DISTANCE

Question 13 (2 points)

What is the shadow price for S_INCOME (rounded to 2 digits)?

  • .00

  • .01

  • .09

  • 1.00

  • 1.01

Question 14 (2 points)

What is the shadow price for COUPON (rounded to 2 digits)?

  • .00

  • .01

  • .09

  • 1.00

  • 1.01

Question 15 (2 points)

What is the shadow price for DISTANCE (rounded to 2 digits)?

  • .00

  • .01

  • .09

  • 1.00

  • 1.01

Question 16 (Manually graded) (5 points)

Briefly summarize the main conclusion of this project, state what you see as any limitations of the methods used here, and suggest other possible methods of addressing the maximizing of airfare in this problem scenario. To get full credit, you should address any limitations you see in the regression models and provide at least one recommendation for improvement.

Add your answer

Question 17 (Manually graded) (5 points)

Show the mathematical formulation for the linear programming problem used in this project. You can either use LaTeX and markdown or take a clean, cropped picture of neatly handwritten equations and upload it. (Note: both the equation editor and the image upload are hidden behind the 3 vertical dot "more" menu in Canvas.)