lm() is used to fit linear models in R. There are six steps involved in the analysis of a linear model.
With regression data, the first step involves the visual inspection of plots of the response variable against the explanatory variable.
We need answers to questions such as: how do you "get to know the data"? Here are the example data:

Cgrowth <- c(12,10,8,11,6,7,2,3)
Ctannin <- seq(1,8)
plot(Ctannin, Cgrowth)
# Any other steps needed?
Note that in the plot directive the arguments are in the order x then y. The fitted line is added by abline like this:
plot(Ctannin, Cgrowth)
abline(lm(Cgrowth~Ctannin))
Where lm means linear model, ~ is pronounced tilde, and the order of the variables is y ~ x (in contrast to the plot directive, above).
Statistical modelling of regression proceeds as follows: We pick a name, say model, for the regression object, then choose a fitting procedure.
We could use any one of several, but let's keep it simple and use lm, the linear model that we have already used in abline, earlier. All we do is this:
model<-lm(Cgrowth~Ctannin)
which is read: “model gets a linear model in which growth is modelled as a function of tannin”. Now we can do lots of different things with the object called model. The first thing we might do is summarise it:
summary(model)
Please look at all of the output and think of what each means/relates to for the model regression!
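Rather than reading values off the printed summary, the individual components can also be extracted from the fitted object; a brief sketch (repeating the data definitions so it runs on its own):

```r
Cgrowth <- c(12,10,8,11,6,7,2,3)
Ctannin <- seq(1,8)
model <- lm(Cgrowth ~ Ctannin)   # as fitted above

coef(model)               # intercept and slope
coef(summary(model))      # estimates, std. errors, t and p values
summary(model)$r.squared  # proportion of variance explained
confint(model)            # 95% confidence intervals for the parameters
```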
Next, we might plot the object called model. This produces a useful series of diagnostics.
par(mfrow=c(2,2))
plot(model)
Cook's distance plot: it is often a good idea to repeat the modelling exercise with the most influential points omitted, in order to assess the magnitude of their impact on the structure of the model.
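One way to do this refit is with lm's subset argument via update(). Which observation to drop should be read off your own Cook's distance plot; observation 7 below is chosen purely for illustration:

```r
Cgrowth <- c(12,10,8,11,6,7,2,3)
Ctannin <- seq(1,8)
model <- lm(Cgrowth ~ Ctannin)

# Refit with observation 7 omitted (an assumed example; use the point
# flagged as most influential in your Cook's distance plot)
model2 <- update(model, subset = -7)

coef(model)   # original parameter estimates
coef(model2)  # estimates with the influential point omitted
```

Comparing the two sets of coefficients shows how much that single point moves the fitted line.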
Suppose we want to know the predicted value y at tannin = 5.5 %. We could write out the equation in calculator mode, using the parameter estimates from the summary table, like this:
11.7556-1.2167*5.5
Alternatively, we could use the predict directive with the object called model. We provide the value for tannin = 5.5 in a list in order to protect the vector of x values (called tannin, of course) from being adulterated. The name of the x values to be used for prediction (tannin in this case) must be exactly the same as the name of the explanatory variable in the model.
predict(model,list(tannin=5.5))
Summary
The steps in the regression analysis were:
* inspect a plot of the response against the explanatory variable;
* fit the model with lm;
* summarise the model object with summary;
* check the model with the diagnostic plots;
* use the model for prediction with predict.
We conclude that the present example is well described by a linear model with normally distributed errors. There are some large residuals, but no obvious patterns in the residuals that might suggest any systematic inadequacy of the model. The slope of the regression line is highly significantly different from zero, and we can be confident that, for the caterpillars in question, increasing dietary tannin reduces weight gain and that, over the range of tannin concentrations considered, the relationship is reasonably linear.
Whether the model could be used accurately for predicting what would happen with much higher concentrations of tannin, would need to be tested by further experimentation. It is extremely unlikely to remain linear, however, as this would soon begin to predict substantial weight losses (negative gains), and these could not be sustained in the medium term.
The next example involves data that cannot be represented by a straight line. The response variable is dry mass of organic matter remaining after different periods of time in a decomposition experiment. The data come from a file called decay.txt; here they are entered directly:
decay.x <- seq(0,30)
decay.y <- c(125, 100.2488583, 70, 83.47079531,100,65.90786956,66.53371457,53.58808673,61.33235104,43.92743547,40.29544843,44.71345876,32.53314287,36.64033626,40.15471123,23.08029549,39.86792844,22.84978589,35.01464535,17.97726709,21.1591801,27.99827279,21.88573499,14.27396172,13.66596918,11.81643542,25.18901636,8.195643809,17.19133663,24.28335424,17.72277625)
plot(decay.x,decay.y)
This does not look particularly like a straight-line relationship. Indeed, it makes no scientific sense to fit a linear relationship to these data, because it would begin to predict negative dry matter once time (x) got bigger than 30 or so. Let’s put a straight line through the data using abline to get a better impression of the curvature.
plot(decay.x,decay.y)
abline(lm(decay.y ~ decay.x))
What we see are "groups of residuals". Below about x = 5 most of the residuals are positive, as is the case for x bigger than about 25. In between, most of the residuals are negative. This is what we mean by "evidence of curvature", or, more forthrightly, "systematic inadequacy of the model". Let's do a linear regression anyway, then look at what the model checking tells us. We call the model object "result" and fit the model as before:
result<-lm(decay.y~decay.x)
summary(result)
Everything is highly significant. But is the model any good? To test this we carry out the diagnostic plots.
par(mfrow=c(2,2))
plot(result)
The first plot (residuals against fitted values) shows the problem at once. This should look like the sky at night (i.e. no pattern) but in fact it looks like a letter U. The positive residuals occur only for small and large fitted values of y (as we suspected from our earlier inspection of the abline plot). The message here is simple. A U-shaped plot of residuals against fitted values means that the linear model is inadequate. We need to account for curvature in the relationship between y and x.
The second diagnostic plot (the normal Q-Q plot) is banana-shaped, not straight as it should be. It gets much steeper above theoretical quantiles of about +0.5, further evidence that something is wrong with the model.
The first response to curvature is often transformation, and that is what we shall do here.
Looking back at the plot and thinking about the fact that it is a decay curve leads us first to try a model in which log(y) is a linear function of x.
plot(decay.x,decay.y,log="y")
That is, indeed, much straighter, but now a new problem has arisen: the variance in y increases as y gets smaller (the scatter of the data around the line is fan-shaped). We shall deal with that later. For the moment, we fit a different linear model.
Instead of y ~ x we shall fit log(y) ~ x
This is fantastically easy to do: just replace "y" in the model formula by "log(y)". Let's call the new model object "transformed", like this:
transformed <- lm(log(decay.y)~decay.x)
summary(transformed)
As before, everything is highly significant. But does the model behave any better? Let's look at the diagnostic plots:
par(mfrow=c(2,2))
plot(transformed)
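Remember that a model fitted to log(y) makes its predictions on the log scale; to get predicted dry mass in the original units, back-transform with exp(). A brief sketch, assuming the decay data and the "transformed" model fitted above:

```r
# Predictions from the log-scale model at a few chosen times
pred_log <- predict(transformed, list(decay.x = c(0, 10, 20, 30)))

# Back-transform to the original units of dry mass
exp(pred_log)
```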
y(i) = c0 + c1 * x(i) + e(i), where e(i) is the error in the prediction. Another example:
x <- 0:100 # create a vector from 0 to 100 in steps of 1
e <- rnorm(length(x), 0, 10)
y <- 0.5 + 2 * x + e
model <- lm(y ~ x)
summary(model)
plot(x, y)
abline(model)
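Since we simulated these data ourselves, we can check that the fitted coefficients recover the true values (intercept 0.5, slope 2); a sketch, with a seed added so the simulation is reproducible:

```r
set.seed(1)                  # added for reproducibility
x <- 0:100
e <- rnorm(length(x), 0, 10)
y <- 0.5 + 2 * x + e
model <- lm(y ~ x)

# The true values 0.5 and 2 should usually lie inside these intervals
confint(model)
```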
| Wilkinson notation | Equation |
|---|---|
| y ~ x | y(i) = c0 + c1 * x(i) + e(i) |
| y ~ I(x^3) | y(i) = c0 + c1 * x(i)^3 + e(i) |
| y ~ x + z, y ~ x * z - x : z | y(i) = c0 + c1 * x(i) + c2 * z(i) + e(i) |
| y ~ x * z | y(i) = c0 + c1 * x(i) + c2 * z(i) + c3 * x(i)*z(i) + e(i) |
| y ~ x : z, y ~ x * z - x - z | y(i) = c0 + c3 * x(i)*z(i) + e(i) |
| y ~ . | Include all other variables in the data frame as predictors |
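One way to see exactly which terms a formula generates is to inspect the columns of its design matrix with model.matrix(); a small sketch using made-up data:

```r
# Toy data purely for illustrating formula expansion
d <- data.frame(x = 1:4, z = c(2, 5, 3, 7))

colnames(model.matrix(~ x + z, d))  # "(Intercept)" "x" "z"
colnames(model.matrix(~ x * z, d))  # "(Intercept)" "x" "z" "x:z"
colnames(model.matrix(~ x : z, d))  # "(Intercept)" "x:z"
```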
These formulae refer to the same model, namely all three main effects plus all two-way interactions:

y ~ (w + x + z)^2
y ~ (w * x * z) - (w : x : z)

The predict() function can make predictions about the value of the target variable, y, for new data; supply se.fit=TRUE to get standard errors as well:

new_data <- data.frame(x=c(40, 50, 60, 120))
predict(model, new_data, se.fit=TRUE)
Exercise: using the airquality data (built into R), compute a linear model with Ozone as the dependent variable and Temp and Wind as the predictor variables. Interpret the coefficients of Temp and Wind in your model, then predict the Ozone level on a day when Temp = 80 degrees F and Wind = 15 mph. Hints:

lm(Ozone ~ ..., airquality)
lm(airquality$Ozone ~ ...)
new_data <- data.frame(Temp=80, Wind=15)
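One possible solution sketch for this exercise (the interpretation of the coefficients is left to the reader; note that lm silently drops the rows of airquality with missing Ozone values):

```r
# Fit Ozone as a function of Temp and Wind using the built-in airquality data
fit <- lm(Ozone ~ Temp + Wind, data = airquality)
summary(fit)

# Predict Ozone for Temp = 80 degrees F and Wind = 15 mph, with standard error
new_data <- data.frame(Temp = 80, Wind = 15)
predict(fit, new_data, se.fit = TRUE)
```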