# Linear Models

A linear model is sometimes known as a least-squares fit, or a linear fit or various other names.

In this worksheet we'll go through the process of reading in data and fitting a linear model. The data is a classic example of ammonia loss up a chimney varying according to the operating conditions of the plant.

First read the data file, and check the first few rows.

```stackdata = read.table("./Data/stack.txt")
```
```##   Air.Flow Water.Temp Acid.Conc. stack.loss
## 1       80         27         89         42
## 2       80         27         88         37
## 3       75         25         90         37
## 4       62         24         87         28
## 5       62         22         87         18
## 6       62         23         87         18
```

We only have a few variables, so a scatterplot matrix is a good start

```plot(stackdata)
``` This shows some trends and correlations. Specifically the bottom row shows how the stack loss varies with each of our explanatory variables.

Next we can fit a model for how the loss depends on one of the variables

```lm(stack.loss ~ Air.Flow, data = stackdata)
```
```##
## Call:
## lm(formula = stack.loss ~ Air.Flow, data = stackdata)
##
## Coefficients:
## (Intercept)     Air.Flow
##      -44.13         1.02
```

That prints out the model fit, and shows us the model fit parameters. We can save it in a variable:

```fitAir = lm(stack.loss ~ Air.Flow, data = stackdata)
```

This means we can then use the model in further function. For example:

```summary(fitAir)
```
```##
## Call:
## lm(formula = stack.loss ~ Air.Flow, data = stackdata)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -12.290  -1.127  -0.046   1.117   8.873
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -44.13       6.11   -7.23  7.3e-07 ***
## Air.Flow        1.02       0.10   10.21  3.8e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.1 on 19 degrees of freedom
## Multiple R-squared: 0.846,	Adjusted R-squared: 0.838
## F-statistic:  104 on 1 and 19 DF,  p-value: 3.77e-09
```

The table of coefficients here shows us the parameters and standard errors, together with t-values and significance values

We can fit a model with all the explanatory variables

```fitAll = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackdata)
summary(fitAll)
```
```##
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
##     data = stackdata)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -7.238 -1.712 -0.455  2.361  5.698
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -39.920     11.896   -3.36   0.0038 **
## Air.Flow       0.716      0.135    5.31  5.8e-05 ***
## Water.Temp     1.295      0.368    3.52   0.0026 **
## Acid.Conc.    -0.152      0.156   -0.97   0.3440
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.24 on 17 degrees of freedom
## Multiple R-squared: 0.914,	Adjusted R-squared: 0.898
## F-statistic: 59.9 on 3 and 17 DF,  p-value: 3.02e-09
```

The acid concentration parameter isn't significant here, so we can drop it. We should probably do a formal likelihood ratio test.

```fit3 = lm(stack.loss ~ Air.Flow + Water.Temp, data = stackdata)
summary(fit3)\$coef
```
```##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept) -50.3588     5.1383  -9.801 1.216e-08
## Air.Flow      0.6712     0.1267   5.298 4.898e-05
## Water.Temp    1.2954     0.3675   3.525 2.419e-03
```

Automatic model selection can be done by the `step` function. Let's see if it comes to the same conclusion

```stepFit = step(fitAll)
```
```## Start:  AIC=52.98
## stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.
##
##              Df Sum of Sq RSS  AIC
## - Acid.Conc.  1        10 189 52.1
##                     179 53.0
## - Water.Temp  1       130 309 62.5
## - Air.Flow    1       296 475 71.5
##
## Step:  AIC=52.12
## stack.loss ~ Air.Flow + Water.Temp
##
##              Df Sum of Sq RSS  AIC
##                     189 52.1
## - Water.Temp  1       130 319 61.1
## - Air.Flow    1       294 483 69.9
```
```coef(summary(stepFit))
```
```##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept) -50.3588     5.1383  -9.801 1.216e-08
## Air.Flow      0.6712     0.1267   5.298 4.898e-05
## Water.Temp    1.2954     0.3675   3.525 2.419e-03
```

Model terms beyond the raw data can be added to a model. Here we fit a polynomial to see if there is significant non-linearity in the response:

```fitP = lm(stack.loss ~ poly(Air.Flow, 2) + poly(Water.Temp, 2), data = stackdata)
summary(fitP)
```
```##
## Call:
## lm(formula = stack.loss ~ poly(Air.Flow, 2) + poly(Water.Temp,
##     2), data = stackdata)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.685 -1.685 -0.034  1.315  5.933
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)            17.524      0.646   27.11  8.4e-15 ***
## poly(Air.Flow, 2)1     19.172      6.492    2.95  0.00935 **
## poly(Air.Flow, 2)2     -1.622      3.972   -0.41  0.68851
## poly(Water.Temp, 2)1   24.945      5.995    4.16  0.00074 ***
## poly(Water.Temp, 2)2    9.403      4.688    2.01  0.06212 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.96 on 16 degrees of freedom
## Multiple R-squared: 0.932,	Adjusted R-squared: 0.915
## F-statistic:   55 on 4 and 16 DF,  p-value: 3.79e-09
```

This doesn't show much evidence of non-linearity in those variables.

## GLMs

Suppose instead of a continuous response variable we had a binary measure. Perhaps all we know is if the loss was over 14. Let's simulate that, and then delete the original variable.

```stackdata\$alert = stackdata\$stack.loss > 14
stackdata\$stack.loss = NULL
```

Now to fit the model we use a generalised linear response. Our binary outcome is modelled as a logistic transform of an underlying linear model

```g1 = glm(alert ~ Air.Flow, data = stackdata, family = binomial)
summary(g1)
```
```##
## Call:
## glm(formula = alert ~ Air.Flow, family = binomial, data = stackdata)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.9781  -0.0618   0.0006   0.4172   2.0334
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -42.341     23.173   -1.83    0.068 .
## Air.Flow       0.722      0.395    1.83    0.068 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 29.065  on 20  degrees of freedom
## Residual deviance: 11.742  on 19  degrees of freedom
## AIC: 15.74
##
## Number of Fisher Scoring iterations: 8
```

Now try adding the other terms. Why does this fail? A scatterplot might help. Its always good to do an exploratory plot first.