Linear Models

A linear model is sometimes known as a least-squares fit or a linear fit, among various other names.

In this worksheet we'll go through the process of reading in data and fitting a linear model. The data is a classic example of ammonia loss up a chimney varying according to the operating conditions of the plant.

First read the data file, and check the first few rows.

stackdata = read.table("./Data/stack.txt")
head(stackdata)
##   Air.Flow Water.Temp Acid.Conc. stack.loss
## 1       80         27         89         42
## 2       80         27         88         37
## 3       75         25         90         37
## 4       62         24         87         28
## 5       62         22         87         18
## 6       62         23         87         18

We only have a few variables, so a scatterplot matrix is a good place to start.

plot(stackdata)
[Figure: scatterplot matrix of the stack loss data]

This shows some trends and correlations. In particular, the bottom row shows how the stack loss varies with each of our explanatory variables.
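
To put rough numbers on those relationships, we can also look at the correlation matrix (a quick sketch; all four columns are numeric, so cor() applies directly):

round(cor(stackdata), 2)  # pairwise correlations, rounded for readability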

Next we can fit a model for how the loss depends on one of the explanatory variables.

lm(stack.loss ~ Air.Flow, data = stackdata)
## 
## Call:
## lm(formula = stack.loss ~ Air.Flow, data = stackdata)
## 
## Coefficients:
## (Intercept)     Air.Flow  
##      -44.13         1.02

That prints the fitted model and shows us the fitted parameters. We can save the fit in a variable:

fitAir = lm(stack.loss ~ Air.Flow, data = stackdata)

This means we can then pass the model to further functions. For example:

summary(fitAir)
## 
## Call:
## lm(formula = stack.loss ~ Air.Flow, data = stackdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.290  -1.127  -0.046   1.117   8.873 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -44.13       6.11   -7.23  7.3e-07 ***
## Air.Flow        1.02       0.10   10.21  3.8e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 4.1 on 19 degrees of freedom
## Multiple R-squared: 0.846,	Adjusted R-squared: 0.838 
## F-statistic:  104 on 1 and 19 DF,  p-value: 3.77e-09

The table of coefficients shows us the parameter estimates and standard errors, together with t-values and significance levels.
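
The fitted object can also be passed to other standard extractor and plotting functions. For example (a small sketch using base R helpers):

coef(fitAir)     # just the estimated coefficients
confint(fitAir)  # 95% confidence intervals for them
plot(stack.loss ~ Air.Flow, data = stackdata)
abline(fitAir)   # add the fitted line to the scatterplot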

We can also fit a model with all the explanatory variables:

fitAll = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackdata)
summary(fitAll)
## 
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., 
##     data = stackdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.238 -1.712 -0.455  2.361  5.698 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -39.920     11.896   -3.36   0.0038 ** 
## Air.Flow       0.716      0.135    5.31  5.8e-05 ***
## Water.Temp     1.295      0.368    3.52   0.0026 ** 
## Acid.Conc.    -0.152      0.156   -0.97   0.3440    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 3.24 on 17 degrees of freedom
## Multiple R-squared: 0.914,	Adjusted R-squared: 0.898 
## F-statistic: 59.9 on 3 and 17 DF,  p-value: 3.02e-09

The acid concentration parameter isn't significant here, so we can drop it. We should probably do a formal likelihood ratio test.

fit3 = lm(stack.loss ~ Air.Flow + Water.Temp, data = stackdata)
summary(fit3)$coef
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept) -50.3588     5.1383  -9.801 1.216e-08
## Air.Flow      0.6712     0.1267   5.298 4.898e-05
## Water.Temp    1.2954     0.3675   3.525 2.419e-03
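
As the formal check suggested above, we can compare the reduced and full models directly. For nested Gaussian linear models the F test reported by anova() plays the role of the likelihood ratio test (a sketch using the fits above):

anova(fit3, fitAll)  # does adding Acid.Conc. significantly improve the fit?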

Automatic model selection can be done with the step() function. Let's see if it comes to the same conclusion.

stepFit = step(fitAll)
## Start:  AIC=52.98
## stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.
## 
##              Df Sum of Sq RSS  AIC
## - Acid.Conc.  1        10 189 52.1
## <none>                    179 53.0
## - Water.Temp  1       130 309 62.5
## - Air.Flow    1       296 475 71.5
## 
## Step:  AIC=52.12
## stack.loss ~ Air.Flow + Water.Temp
## 
##              Df Sum of Sq RSS  AIC
## <none>                    189 52.1
## - Water.Temp  1       130 319 61.1
## - Air.Flow    1       294 483 69.9
coef(summary(stepFit))
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept) -50.3588     5.1383  -9.801 1.216e-08
## Air.Flow      0.6712     0.1267   5.298 4.898e-05
## Water.Temp    1.2954     0.3675   3.525 2.419e-03

Terms beyond the raw variables can also be added to a model. Here we fit quadratic polynomials to see if there is significant non-linearity in the response:

fitP = lm(stack.loss ~ poly(Air.Flow, 2) + poly(Water.Temp, 2), data = stackdata)
summary(fitP)
## 
## Call:
## lm(formula = stack.loss ~ poly(Air.Flow, 2) + poly(Water.Temp, 
##     2), data = stackdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.685 -1.685 -0.034  1.315  5.933 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            17.524      0.646   27.11  8.4e-15 ***
## poly(Air.Flow, 2)1     19.172      6.492    2.95  0.00935 ** 
## poly(Air.Flow, 2)2     -1.622      3.972   -0.41  0.68851    
## poly(Water.Temp, 2)1   24.945      5.995    4.16  0.00074 ***
## poly(Water.Temp, 2)2    9.403      4.688    2.01  0.06212 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 2.96 on 16 degrees of freedom
## Multiple R-squared: 0.932,	Adjusted R-squared: 0.915 
## F-statistic:   55 on 4 and 16 DF,  p-value: 3.79e-09

This doesn't show much evidence of non-linearity in those variables.
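
For a formal version of that check, note that the purely linear model is nested within the polynomial one, so anova() can again compare them (a sketch using the fits above):

anova(fit3, fitP)  # do the quadratic terms significantly improve the fit?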

GLMs

Suppose instead of a continuous response variable we had a binary measure. Perhaps all we know is whether the loss was over 14. Let's construct that variable, and then delete the original.

stackdata$alert = stackdata$stack.loss > 14
stackdata$stack.loss = NULL

Now to fit the model we use a generalised linear model (GLM): the binary outcome is modelled as a logistic transform of an underlying linear predictor.

g1 = glm(alert ~ Air.Flow, data = stackdata, family = binomial)
summary(g1)
## 
## Call:
## glm(formula = alert ~ Air.Flow, family = binomial, data = stackdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9781  -0.0618   0.0006   0.4172   2.0334  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -42.341     23.173   -1.83    0.068 .
## Air.Flow       0.722      0.395    1.83    0.068 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 29.065  on 20  degrees of freedom
## Residual deviance: 11.742  on 19  degrees of freedom
## AIC: 15.74
## 
## Number of Fisher Scoring iterations: 8
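
The coefficients of a logistic regression are on the log-odds scale. As a quick interpretive sketch, exponentiating them gives odds ratios:

exp(coef(g1))  # multiplicative change in the odds of an alert per unit of Air.Flow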

Now try adding the other terms. Why does this fail? A scatterplot might help; it's always a good idea to do an exploratory plot first.
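
As a starting point for that exercise (the formula and the name g2 below are just the obvious extension of g1, not a prescribed answer):

g2 = glm(alert ~ Air.Flow + Water.Temp + Acid.Conc., data = stackdata, family = binomial)
summary(g2)      # inspect the output, and any warnings, to see what goes wrong
plot(stackdata)  # the exploratory plot suggested above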