Introduction
In this notebook, we will use data from here to investigate whether there is discrimination in professor salaries.
Problem Type
This is standard linear regression! For those who have been trained in stats, this shouldn't be too difficult to understand.
Data structure
To use it with this model, the data should be structured as follows:
Each row is one measurement.
The columns should indicate, at a minimum:
The explanatory variables, each in its own column.
The dependent variable, in one column.
Extensions to the model
None
Reporting summarized findings
Here are examples of how to summarize the findings.
For every unit increase in explanatory_variable_i, the dependent variable increases by beta (95% HPD: [lower, upper]).
Other notes
None
The data are from here, and have the following columns:
Independent variables:
sx: the biological sex of the professor. 0 for male, 1 for female.
rk: the rank of the professor. 1 for assistant professor, 2 for associate professor, 3 for full professor.
dg: the highest degree attained. 0 for masters degree, 1 for doctorate degree.
yd: years since degree obtained. Essentially an 'experience' term.
Dependent variable:
sl: annual salary
Read Data
Let's read in the data and do some preprocessing to make all of the data numerical.
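A sketch of that preprocessing might look like the following. The dataframe here is an inline stand-in with a few made-up rows (the raw string labels are assumptions of mine); in the notebook you would load the actual file with pd.read_csv instead.

```python
import pandas as pd

# Illustrative stand-in for the raw data file; in practice,
# replace this with pd.read_csv(...) on the actual data.
df = pd.DataFrame({
    'sx': ['male', 'female', 'male'],
    'rk': ['full', 'assistant', 'associate'],
    'dg': ['doctorate', 'masters', 'doctorate'],
    'yd': [25, 3, 10],
    'sl': [36350, 18000, 24800],
})

# Encode the categorical columns numerically, matching the scheme above.
df['sx'] = df['sx'].map({'male': 0, 'female': 1})
df['rk'] = df['rk'].map({'assistant': 1, 'associate': 2, 'full': 3})
df['dg'] = df['dg'].map({'masters': 0, 'doctorate': 1})
```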
Model
We will perform linear regression on the salary data. Here are some of the modelling choices that go into it.
Choice of priors:
Intercept: Normal distribution, very wide.
Errors: the error scale can only be positive, therefore use a HalfNormal distribution, again very wide.
Choices for salary likelihood function:
The salary is modelled as a linear combination of the independent variables.
We assume that each salary is normally distributed around its expected value (the linear combination of independent variables), with the same variance throughout.
That is how we get the code below.
With the recipe above, you'll have a general starting point for linear regression (and its variants, e.g. Poisson regression). The key idea, which you'll see later on, is swapping out the likelihood function.
The awesome PyMC3 developers also provide a GLM module that lets you write the above more concisely:
However, I have given you the more verbose version, as I want you to see the code at the level of abstraction that will let you flexibly modify the model as you need it.
Borrowing shamelessly from Thomas Wiecki, we hit the Inference Button (TM) below.
Let's visualize the traceplots.
The traceplots give us a visual diagnostic on the convergence of the MCMC sampler. The ADVI initialization gets us pretty darn close to the places of highest likelihood. Sampling converges pretty soon after, so let's use a burn-in of ~1000 steps and re-check.
It should be pretty clear: very good convergence. Let's look at a forestplot of the inferred variables.
Interpretation
The interpretation here is as follows.
Given the data on hand,
a professor's baseline salary is in the range of ${{ intercept_percs[0] }} to ${{ intercept_percs[2] }}
every increase in rank gives a ${{rank_percs[0]}} to ${{rank_percs[2]}} increase in salary
females earn ${{sex_percs[0]}} to ${{sex_percs[2]}} more than males
every extra year of work earns the professor ${{year_percs[0]}} to ${{year_percs[2]}} more in salary
having an advanced degree earns the professor ${{degree_percs[0]}} to ${{degree_percs[2]}} more in salary
every year since the degree was obtained earns the professor ${{experience_percs[0]}} to ${{experience_percs[2]}} more in salary.
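For reference, the percentile ranges templated above can be computed from the posterior samples roughly like this. The array here is synthetic (in the notebook you would pass e.g. trace['intercept']), and a percentile interval is used as a stand-in for the HPD, which it approximates for roughly symmetric posteriors.

```python
import numpy as np

# Synthetic stand-in for a posterior sample of the intercept.
rng = np.random.RandomState(42)
samples = rng.normal(loc=10000, scale=500, size=2000)

# 2.5th / 50th / 97.5th percentiles: a 95% credible range plus median.
intercept_percs = np.percentile(samples, [2.5, 50, 97.5])
```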
Conclusions
Overall, rank and years of work are the best predictors of professor salary.