# EXECUTE FIRST
# computational imports
import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import plotly.express as px
import plotly.io as pio

# display imports
from IPython.core.display import HTML

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

Lesson 4: Combining Predictive and Prescriptive Analytics (Project)

This course has three lessons that are project lessons. Typically, projects combine skills and knowledge from multiple weeks (and sometimes multiple courses). Project weeks generally do not involve textbook readings and have presentations that only cover the project scenario and any additional information you might need to complete the project.

This week's project involves combining predictive and prescriptive analytics. We're using the simplest of predictive analytics, linear regression. Everyone should have learned linear regression in DS705 (which is a prerequisite for this course). However, in 705 you used R, whereas in 775 we're using Python. We'll walk you through using the statsmodels package in Python to do linear regression, describe how it can be used in predictive analytics, and introduce you to the scenario for your project.

The Sample Problem - Wyndor Revisited

Once again, let's go back to the Wyndor problem. When Wyndor decided to build new products, they needed to enlist the help of key personnel in various units of the company. After discussing production hours with plant managers and engineers, they felt they had solid estimates for the number of hours of production time they'd need and the number of hours available. But they still weren't sure how to estimate the profit for each batch of windows and doors.

[Image: Problem setup for Wyndor, including a table of production time needed and available for the two products across the three plants.]

To set their estimated profit, they worked with the engineers, the marketing division, and the accountants. We don't know what data they used to make their decisions, but if they'd previously introduced other new products, it's plausible that they used some predictive analytics to model the estimated profit for each new product and fill in the question marks in the image above.

The Sample Data

Let's assume that Wyndor had previously introduced similar doors and windows as a paired product launch, and that they've done some market research on competitors. Wyndor has gathered the following data, showing the per-batch profit for doors and windows along with the overall profit.

df = pd.DataFrame({
    'doors': [2, 3, 4, 6, 7, 8, 8, 10],
    'windows': [3, 3, 3, 4, 4, 5, 5, 5],
    'profit': [25, 30, 35, 43, 50, 55, 63, 70]
})
df

We can plot the data and note that both door and window batch profits appear to have a linear relationship with overall profit, based on this historical data.

# uncomment the next line to see the plot in a VS Code Jupyter notebook
# pio.renderers.default = "plotly_mimetype+notebook"
fig = px.line(df, x='profit', y=['doors', 'windows'])

# edit the layout
fig.update_layout(title='Relationship of Door and Window Batch Profit to Overall Profit',
                  xaxis_title='Overall Profit',
                  yaxis_title='Profit per Batch')
fig.show()

Running Linear Regression with Statsmodels

Remember that the objective function of a linear program has exactly the same form as a linear regression equation fitted through the origin. We can use this historical data to estimate the coefficients for our linear program. In Python, we'll use statsmodels to run our linear regression.

# this is the response variable - what we're trying to predict
Y = df['profit']

# these are the predictor variables - note that we're using 2 predictor
# variables, so this is multiple linear regression
X = df[['doors', 'windows']]

# fitting the regression through the origin is as simple as calling the
# OLS (ordinary least squares) function, passing in the Y and X (in that
# order) and fitting the data with the fit function
model_obj = sm.OLS(Y, X).fit()

# we can extract just the coefficients from model_obj.params
coefs = model_obj.params

# we can print the whole model summary to review our model by calling the summary function
print(model_obj.summary())
print(f'\nThe coefficients are:\n{coefs}')
print(f'R-squared is {model_obj.rsquared}')
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                  profit   R-squared (uncentered):                  0.997
Model:                             OLS   Adj. R-squared (uncentered):             0.996
Method:                  Least Squares   F-statistic:                             1017.
Date:                 Sun, 25 Aug 2024   Prob (F-statistic):                   2.55e-08
Time:                         18:13:29   Log-Likelihood:                        -19.127
No. Observations:                    8   AIC:                                     42.25
Df Residuals:                        6   BIC:                                     42.41
Df Model:                            2
Covariance Type:             nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
doors          4.0289      0.762      5.284      0.002       2.163       5.894
windows        5.5221      1.218      4.534      0.004       2.542       8.503
==============================================================================
Omnibus:                        1.599   Durbin-Watson:                   2.285
Prob(Omnibus):                  0.450   Jarque-Bera (JB):                0.971
Skew:                          -0.757   Prob(JB):                        0.615
Kurtosis:                       2.211   Cond. No.                         10.2
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The coefficients are:
doors      4.028878
windows    5.522124
dtype: float64
R-squared is 0.9970577496364398

Here we can see that our model accounts for 99.7% of the variability in the overall profit, which is quite good, and both of our predictors are significant with p-values less than 0.05. (Of course, this data is fairly contrived, and we probably have some multicollinearity issues that we're not addressing. But we're just trying to give you a general idea of how this might be used.)
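If you wanted to check that multicollinearity concern, one quick diagnostic is the variance inflation factor (VIF). Below is a minimal sketch using statsmodels' variance_inflation_factor and the X defined above; it's an optional aside, not part of the required lesson code.

# optional sketch: variance inflation factors for the two predictors
# (reuses the X DataFrame defined above; not part of the required lesson code)
from statsmodels.stats.outliers_influence import variance_inflation_factor

for i, col in enumerate(X.columns):
    print(f'VIF for {col}: {variance_inflation_factor(X.values, i):.1f}')

Values well above 5 or 10 would confirm that the predictors are strongly correlated, which mainly clouds the interpretation of the individual coefficients.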

Utilizing Linear Regression Results in Optimization

So how do we use this in optimization? Remember that multiple linear regression with two predictor variables and no intercept results in an equation of the form:

$$y = \beta_1 * x_1 + \beta_2 * x_2$$

where $\beta_1$ corresponds to the coefficient for the first variable (in this case doors) and $\beta_2$ corresponds to the coefficient for the second variable (in this case windows).

Doesn't that look exactly like what our optimization objective function looked like when we hard-coded the coefficients for windows and doors?

Maximize $Z = 3*d + 5*w$

In fact, we can simply substitute the coefficients from our linear regression to generate an objective function that uses the predicted coefficients, like so:

Maximize $Z = \beta_1 * d + \beta_2 * w$

Let's see what that looks like in code.

from pyomo.environ import *

# abstract Wyndor model, using the predictions from linear regression to
# replace the objective coefficients
products = ['Doors', 'Windows']
plants = ['Plant1', 'Plant2', 'Plant3']

# we simply replace the formerly static profit rates for doors and windows
# with the values returned from the linear regression model
profit_rate = {'Doors': coefs['doors'], 'Windows': coefs['windows']}
hours_available = {'Plant1': 4, 'Plant2': 12, 'Plant3': 18}
hours_per_batch = {
    'Plant1': {'Doors': 1, 'Windows': 0},
    'Plant2': {'Doors': 0, 'Windows': 2},
    'Plant3': {'Doors': 3, 'Windows': 2}
}

# Concrete Model
model = ConcreteModel(name="WyndorWithPredictions")

# Decision Variables
model.weekly_prod = Var(products, domain=NonNegativeReals)

# Objective
model.profit = Objective(expr=sum(profit_rate[pr] * model.weekly_prod[pr]
                                  for pr in products),
                         sense=maximize)

# Constraints: plant capacity
model.capacity = ConstraintList()
for pl in plants:
    model.capacity.add(
        sum(hours_per_batch[pl][pr] * model.weekly_prod[pr]
            for pr in products) <= hours_available[pl])

# Solve
solver = SolverFactory('glpk')
solver.solve(model)

# display solution (again, we've changed to f-strings)
print(f"Maximum Profit = ${1000*model.profit():,.2f}")
for j in products:
    print(f"Batches of {j} = {model.weekly_prod[j]():.1f}")
Maximum Profit = $41,190.50
Batches of Doors = 2.0
Batches of Windows = 6.0
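As a quick sanity check (not part of the original notebook), we can reproduce the solver's objective value directly from the regression coefficients and the optimal batch counts:

# sanity check: recompute the objective from the fitted coefficients
# (Wyndor profit is expressed in thousands of dollars, hence the factor of 1000)
check = 1000 * (coefs['doors'] * 2 + coefs['windows'] * 6)
print(f'${check:,.2f}')  # $41,190.50, matching the solver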

Now that you've seen the basic concepts, we'd like to introduce you to a more complicated scenario that you'll be using for your homework. Don't panic, though. It all works the same way. You'll use predictive analytics to generate estimated coefficients. Once your regression models are built, you'll set the historical data aside, and proceed with linear programming, using those modeled coefficients.

The Scenario

The following problem takes place in the United States in the late 1990s, when many major US cities were facing issues with airport congestion, partly as a result of the 1978 deregulation of airlines. Both fares and routes were freed from regulation, and low-fare carriers such as Southwest (SW) began competing on existing routes and starting non-stop service on routes that previously lacked it. Building new airports is not generally feasible, but sometimes decommissioned military bases or smaller municipal airports can be reconfigured as regional or larger commercial airports. There are numerous players and interests involved in the issue (airlines, city, state, and federal authorities, civic groups, the military, airport operators), and an aviation consulting firm is seeking advisory contracts with these players.

A consulting firm wishes to determine the maximum average fare (FARE) as a function of three variables: COUPON, HI, and DISTANCE. Moreover, they need to impose constraints on:

  • the number of passengers on that route (PAX) $\leq 20000$

  • the starting city's average personal income (S_INCOME) $\leq 30000$

  • the ending city's average personal income (E_INCOME) $\geq 30000$

However, the variables PAX, S_INCOME, and E_INCOME are not decision variables, so the firm will first model these variables using COUPON, HI, and DISTANCE as predictors via linear regression (predictive analytics). They'll also use linear regression to model a linear relationship between FARE and COUPON, HI, and DISTANCE. Armed with these predictive models, the firm will build a linear program (prescriptive analytics) to maximize the average fare.

Suppose you are in the aviation consulting firm and you want to maximize airfares for the particular set of circumstances described below. The file Airfares.xlsx contains real data that were collected between Q3 1996 and Q2 1997. The first sheet contains variable descriptions, while the second sheet contains the data. A CSV file of the data (called Airfares.csv) is also provided.

NOTE: This problem scenario is developed from pp. 170-171 of Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, by Shmueli, Bruce, Yahav, Patel, and Lichtendahl (Wiley, 2017).

Predictive Analytics

You will use multiple linear regression through the origin to fit airfare (FARE) as a linear function of the average number of coupons (COUPON) for that route, the Herfindahl Index (HI), and the distance between the two endpoint airports in miles (DISTANCE).
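As an illustration of that first model, the fit might look something like the sketch below. It assumes Airfares.csv sits in your working directory and that the columns are named COUPON, HI, DISTANCE, and FARE as described above; verify the actual names against the variable descriptions in the workbook.

# a minimal sketch, assuming Airfares.csv is in the working directory and
# uses the column names described above (verify against the data dictionary)
airfares = pd.read_csv('Airfares.csv')
predictors = airfares[['COUPON', 'HI', 'DISTANCE']]
fare_model = sm.OLS(airfares['FARE'], predictors).fit()  # no constant = through the origin
print(fare_model.params)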

You will build three more linear regression models with COUPON, HI, and DISTANCE as predictors to fit separate regression equations through the origin for response variables:

  • the number of passengers on that route (PAX)

  • the starting city’s average personal income (S_INCOME)

  • the ending city’s average personal income (E_INCOME)

Since each of these models uses the same predictors and only the response variable changes, you'll write a function (a sketch follows this list) that takes in the predictors and a response variable and:

  • runs the linear regression

  • returns the model

  • prints the regression equation.
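A minimal sketch of such a function appears below. The name fit_through_origin and its arguments are illustrative, not prescribed by the homework, and the usage example reuses the hypothetical predictors DataFrame from the sketch above.

# a minimal sketch: fit a no-intercept OLS model, print its equation, return the model
# (the function name and arguments are illustrative, not prescribed by the homework)
def fit_through_origin(X, y):
    model = sm.OLS(y, X).fit()
    equation = ' + '.join(f'{coef:.4f}*{name}' for name, coef in model.params.items())
    print(f'{y.name} = {equation}')
    return model

# hypothetical usage for one of the three response variables
# pax_model = fit_through_origin(predictors, airfares['PAX'])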

Prescriptive Analytics

Linear Programming

Use the fitted regression equation for airfare (FARE) as a linear function of COUPON, HI, and DISTANCE as the objective function.

The three linear regression equations, which model the number of passengers on that route (PAX), the starting city's average personal income (S_INCOME), and the ending city's average personal income (E_INCOME) as functions of COUPON, HI, and DISTANCE, are to be used as three of the constraint equations:

  • the number of passengers on that route (PAX) $\leq 20000$

  • the starting city's average personal income (S_INCOME) $\leq 30000$

  • the ending city's average personal income (E_INCOME) $\geq 30000$

For additional constraints (the sketch after this list shows how everything fits together):

  • restrict COUPON to no more than 1.5

  • limit HI to between 4000 and 8000, inclusive

  • consider only routes with DISTANCE between 500 and 1000 miles, inclusive.
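To preview how these pieces fit together, here is a minimal, hypothetical sketch of the linear program. The names fare_coefs, pax_coefs, s_income_coefs, and e_income_coefs are assumptions standing in for the params of your four fitted regression models, and the variable bounds come from the constraint lists above.

# a minimal sketch, assuming fare_coefs, pax_coefs, s_income_coefs, and
# e_income_coefs hold the params from the four fitted regression models
from pyomo.environ import *

model = ConcreteModel(name="Airfares")

# decision variables with the bounds listed above
# (a lower bound of 0 for COUPON is an assumption)
model.COUPON = Var(bounds=(0, 1.5))
model.HI = Var(bounds=(4000, 8000))
model.DISTANCE = Var(bounds=(500, 1000))

# linear combination of the decision variables using one model's fitted coefficients
def predicted(coefs, m):
    return (coefs['COUPON'] * m.COUPON + coefs['HI'] * m.HI +
            coefs['DISTANCE'] * m.DISTANCE)

# objective: maximize predicted FARE
model.fare = Objective(expr=predicted(fare_coefs, model), sense=maximize)

# constraints built from the other three regression equations
model.pax = Constraint(expr=predicted(pax_coefs, model) <= 20000)
model.s_income = Constraint(expr=predicted(s_income_coefs, model) <= 30000)
model.e_income = Constraint(expr=predicted(e_income_coefs, model) >= 30000)

solver = SolverFactory('glpk')
solver.solve(model)
print(f'Maximum predicted FARE = {model.fare():.2f}')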

Complete the Homework Quiz

We've provided a Jupyter notebook that walks you through each of the questions in the homework quiz. Use the notebook to run your code. Transfer your answers to the Canvas quiz when complete.