Linear Regression and Predictions

  • We want to model the relationship between a dependent (target) variable, "y", and one or more explanatory variables, "x"
  • The lm() function is used to fit linear models in R
  • Use "Wilkinson notation" to define statistical models in R
    • [response variable] ~ [predictor variables]

There are 6 steps involved in the analysis of a linear model:

  • get to know the data;
  • suggest a suitable model;
  • fit the model to the data;
  • subject the model to criticism;
  • analyse for influential points;
  • simplify the model to its bare essentials

With regression data, the first step involves the visual inspection of plots of the response variable against the explanatory variable.

We need answers to the following questions:

  • is there a trend in the data?
  • what is the slope of the trend (positive or negative)?
  • is the trend linear or curved?
  • is there any pattern to the scatter around the trend?
In [ ]:
Cgrowth <- c(12,10,8,11,6,7,2,3)
Ctannin <- seq(1,8)

How do you "get to know the data"?

In [ ]:
plot(Ctannin, Cgrowth)
In [ ]:
# What other steps are needed?
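Beyond the scatterplot, a couple of standard numeric checks help with "getting to know the data" (a sketch using base R functions on the vectors defined above):

```r
# data from the cells above
Cgrowth <- c(12, 10, 8, 11, 6, 7, 2, 3)
Ctannin <- 1:8
summary(Cgrowth)       # five-number summary plus the mean
cor(Ctannin, Cgrowth)  # strength and sign of the linear association (about -0.89)
```

A strongly negative correlation already suggests a downward linear trend, which the plot confirms.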

Note that in the plot directive the arguments are in the order x then y. The fitted line is added by abline like this:

In [ ]:
plot(Ctannin, Cgrowth)
abline(lm(Cgrowth~Ctannin))

Where lm means linear model, ~ is pronounced tilde, and the order of the variables is y ~ x (in contrast to the plot directive, above).

Statistical modelling of regression proceeds as follows: we pick a name, say model, for the regression object, then choose a fitting procedure.

We could use any one of several but let’s keep it simple and use lm, the linear model that we have already used in abline, earlier. All we do is this:

In [ ]:
model <- lm(Cgrowth ~ Ctannin)

which is read: “model gets a linear model in which growth is modelled as a function of tannin”. Now we can do lots of different things with the object called model. The first thing we might do is summarise it:

In [ ]:
summary(model)

Please look at all of the output and think about what each part means for the regression model!
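The summary table is not the only way in: individual components of the fitted object can be extracted directly with the standard extractor functions (a sketch, refitting the model from the cells above):

```r
Cgrowth <- c(12, 10, 8, 11, 6, 7, 2, 3)
Ctannin <- 1:8
model <- lm(Cgrowth ~ Ctannin)
coef(model)       # intercept and slope
fitted(model)     # predicted y at each observed x
residuals(model)  # observed minus fitted values
```

These extractors are what the diagnostic plots below are built from.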

Next, we might plot the object called model. This produces a useful series of diagnostics.

In [ ]:
par(mfrow=c(2,2))
plot(model)

Cook's distance plot: it is often a good idea to repeat the modelling exercise with the most influential points omitted, in order to assess the magnitude of their impact on the structure of the model.
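One way to act on the Cook's distance plot is to refit without the single most influential point and compare the coefficients (a sketch; model2 is a name chosen here):

```r
Cgrowth <- c(12, 10, 8, 11, 6, 7, 2, 3)
Ctannin <- 1:8
model <- lm(Cgrowth ~ Ctannin)
worst <- which.max(cooks.distance(model))        # index of the most influential point
model2 <- lm(Cgrowth[-worst] ~ Ctannin[-worst])  # refit without it
coef(model)   # all points
coef(model2)  # most influential point omitted: how much do the estimates move?
```

If the estimates barely move, the influential point is not distorting the structure of the model.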

Suppose we want to know the predicted value of y at tannin = 5.5%. We could write out the equation in calculator mode, using the parameter estimates from the summary table, like this:

In [ ]:
13.3214 - 1.3214*5.5

Alternatively, we could use the predict directive with the object called model. We provide the value Ctannin = 5.5 in a list, in order to protect the vector of x values (called Ctannin, of course) from being adulterated. The name of the x values used for prediction (Ctannin in this case) must be exactly the same as the name of the explanatory variable in the model.

In [ ]:
predict(model, list(Ctannin=5.5))

Summary

The steps in the regression analysis were:

  • data inspection
  • model specification
  • model fitting
  • model criticism

We conclude that the present example is well described by a linear model with normally distributed errors. There are some large residuals, but no obvious patterns in the residuals that might suggest any systematic inadequacy of the model. The slope of the regression line is highly significantly different from zero, and we can be confident that, for the caterpillars in question, increasing dietary tannin reduces weight gain and that, over the range of tannin concentrations considered, the relationship is reasonably linear.

Whether the model could be used accurately for predicting what would happen with much higher concentrations of tannin, would need to be tested by further experimentation. It is extremely unlikely to remain linear, however, as this would soon begin to predict substantial weight losses (negative gains), and these could not be sustained in the medium term.
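The danger of extrapolation is easy to demonstrate: asking the fitted line for a prediction well outside the observed tannin range immediately yields a biologically impossible negative growth (a sketch):

```r
Cgrowth <- c(12, 10, 8, 11, 6, 7, 2, 3)
Ctannin <- 1:8
model <- lm(Cgrowth ~ Ctannin)
predict(model, list(Ctannin = 15))  # far outside the observed 1-8 range: -6.5
```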

Transformation of non-linear responses

The next example involves data that cannot be represented by a straight line. The response variable is the dry mass of organic matter remaining after different periods of time in a decomposition experiment. The data (originally from a file called decay.txt) are entered directly as vectors below.

In [4]:
decay.x <- seq(0,30)
decay.y <- c(125, 100.2488583, 70, 83.47079531,100,65.90786956,66.53371457,53.58808673,61.33235104,43.92743547,40.29544843,44.71345876,32.53314287,36.64033626,40.15471123,23.08029549,39.86792844,22.84978589,35.01464535,17.97726709,21.1591801,27.99827279,21.88573499,14.27396172,13.66596918,11.81643542,25.18901636,8.195643809,17.19133663,24.28335424,17.72277625)
plot(decay.x,decay.y)

This does not look particularly like a straight-line relationship. Indeed, it makes no scientific sense to fit a linear relationship to these data, because it would begin to predict negative dry matter once time (x) got bigger than 30 or so. Let’s put a straight line through the data using abline to get a better impression of the curvature.

In [5]:
plot(decay.x,decay.y) 
abline(lm(decay.y ~ decay.x))

What we see are “groups of residuals”. Below about x = 5 most of the residuals are positive, as is the case for the residuals for x bigger than about 25; in between, most of the residuals are negative. This is what we mean by “evidence of curvature”, or, more forthrightly, “systematic inadequacy of the model”. Let’s do a linear regression anyway, then look at what the model checking tells us. We call the model object “result” and fit the model as before:

In [6]:
result<-lm(decay.y~decay.x)
summary(result)
Call:
lm(formula = decay.y ~ decay.x)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.065 -10.029  -2.058   5.107  40.447 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  84.5534     5.0277   16.82  < 2e-16 ***
decay.x      -2.8272     0.2879   -9.82 9.94e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.34 on 29 degrees of freedom
Multiple R-squared:  0.7688,	Adjusted R-squared:  0.7608 
F-statistic: 96.44 on 1 and 29 DF,  p-value: 9.939e-11

Everything is highly significant. But is the model any good? To test this we carry out the diagnostic plots.

In [ ]:
par(mfrow=c(2,2))
plot(result)

The first plot (residuals against fitted values) shows the problem at once. This should look like the sky at night (i.e. no pattern) but in fact it looks like a letter U. The positive residuals occur only for small and large fitted values of y (as we suspected from our earlier inspection of the abline plot). The message here is simple: a U-shaped plot of residuals against fitted values means that the linear model is inadequate. We need to account for curvature in the relationship between y and x.

The second diagnostic plot (the normal Q-Q plot) is banana-shaped, not straight as it should be. It gets much steeper above theoretical quantiles of about +0.5, further evidence that something is wrong with the model.

The first response to curvature is often transformation, and that is what we shall do here.

Looking back at the plot and thinking about the fact that it is a decay curve leads us first to try a model in which log(y) is a linear function of x.

In [ ]:
plot(decay.x,decay.y,log="y")

That is, indeed, much straighter, but now a new problem has arisen: the variance in y increases as y gets smaller (the scatter of the data around the line is fan-shaped). We shall deal with that later. For the moment, we fit a different linear model.

Instead of y ~ x we shall fit log(y) ~ x

This is fantastically easy to do: simply replace “y” in the model formula by “log(y)”. Let’s call the new model object “transformed”, like this:

In [ ]:
transformed <- lm(log(decay.y)~decay.x)
summary(transformed)

As before, everything is highly significant. But does the model behave any better? Let’s look at the diagnostic plots:

In [ ]:
par(mfrow=c(2,2))
plot(transformed)
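Once the model is fitted on the log scale, predictions on the original scale are obtained by exponentiating. A sketch, using simulated stand-in data with the same exponential-decay shape as the values above:

```r
# simulated stand-in for the decay data above (an assumption, not the original values)
set.seed(42)
decay.x <- 0:30
decay.y <- 100 * exp(-0.07 * decay.x + rnorm(31, 0, 0.2))
transformed <- lm(log(decay.y) ~ decay.x)
# predict on the log scale, then back-transform with exp()
exp(predict(transformed, list(decay.x = 15)))
```

Note that back-transformed predictions from a log-scale model are always positive, which is exactly the property the straight-line model lacked.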

Simple linear regression

  • An example of the form:
    • y ~ x, or y(i) = c0 + c1 * x(i) + e(i), where e(i) is the error on the prediction

Another example:

In [8]:
x <- 0:100 # create a vector from 0 to 100 in steps of 1
e <- rnorm(length(x), 0, 10)
y <- 0.5 + 2 * x + e
In [9]:
model <- lm(y ~ x)
summary(model)
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-27.830  -5.358   1.741   6.151  20.734 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.26688    1.94848  -1.677   0.0968 .  
x            2.02435    0.03366  60.133   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.864 on 99 degrees of freedom
Multiple R-squared:  0.9734,	Adjusted R-squared:  0.9731 
F-statistic:  3616 on 1 and 99 DF,  p-value: < 2.2e-16
In [10]:
plot(x, y)
abline(model)

Other Models

Wilkinson notation              Equation
y ~ x                           y(i) = c0 + c1 * x(i) + e(i)
y ~ x^3                         y(i) = c0 + c1 * x(i)^3 + e(i)
y ~ x + z,  y ~ x * z - x:z     y(i) = c0 + c1 * x(i) + c2 * z(i) + e(i)
y ~ x * z                       y(i) = c0 + c1 * x(i) + c2 * z(i) + c3 * x(i)*z(i) + e(i)
y ~ x : z,  y ~ x * z - x - z   y(i) = c0 + c3 * x(i)*z(i) + e(i)
y ~ .                           include all variables

(Note: in R, arithmetic on a variable inside a formula must be protected with I(), e.g. y ~ I(x^3) for a literal cubic term; a bare ^ is interpreted as formula crossing.)
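The expansions in the table can be checked directly: model.matrix() shows exactly which design-matrix columns a Wilkinson formula generates (a sketch with toy data):

```r
# toy data chosen here purely to illustrate the formula expansion
x <- c(1, 2, 3, 4, 5)
z <- c(2, 1, 4, 3, 5)
y <- c(1, 3, 2, 5, 4)
colnames(model.matrix(y ~ x * z))  # "(Intercept)" "x" "z" "x:z"
colnames(model.matrix(y ~ x : z))  # "(Intercept)" "x:z"
```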

Quadratic Models

These three formulae all specify the same model (the main effects of w, x and z plus all of their two-way interactions):

  • y ~ (w + x + z)^2
  • y ~ (w + x + z) + (w : x + w : z + x : z)
  • y ~ (w * x * z) - (w : x : z)
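That these formulae specify the same model can be verified without any data, by asking terms() which terms each one generates (a sketch comparing two of them):

```r
f1 <- ~ (w + x + z)^2
f2 <- ~ w * x * z - w:x:z
# both expand to the three main effects plus all two-way interactions
attr(terms(f1), "term.labels")
attr(terms(f2), "term.labels")
```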

Doing Predictions

  • The predict() function can make predictions of the target variable, y, for new data
  • Use se.fit=TRUE to get standard errors
In [11]:
new_data <- data.frame(x=c(40, 50, 60, 120))

predict(model, new_data, se.fit=TRUE)
$fit
        1         2         3         4 
 77.70730  97.95084 118.19438 239.65565 

$se.fit
        1         2         3         4 
1.0376144 0.9814850 1.0376144 2.5527490 

$df
[1] 99

$residual.scale
[1] 9.863802

Exercise 01

  • In the airquality data, compute a linear model with Ozone as the dependent variable, and Temp and Wind as the predictor variables
    • Include interactions between Temp and Wind in your model
  • Predict the Ozone level on a day when Temp=80 degrees F and Wind=15 mph
    • Hint: do not use dollar signs in your model formula
      • lm(Ozone ~ ..., airquality)
      • NOT: lm(airquality$Ozone ~ ...)
In [ ]:
new_data <- data.frame(Temp=80, Wind=15)
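A sketch of one way to complete the exercise (model_aq is a name chosen here; lm() drops the rows where Ozone is missing by default):

```r
# airquality ships with base R; Temp * Wind gives both main effects and the interaction
model_aq <- lm(Ozone ~ Temp * Wind, data = airquality)
new_data <- data.frame(Temp = 80, Wind = 15)
predict(model_aq, new_data)
```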