# Simple Statistics

## Simple Statistics

This presentation will illustrate:

• Data summaries
• Basic plots
• Linear models

## Read In The Otter Data

Some otter data has been simulated for this example. It consists of weight/kg, length/m, and sex (male/female) for a number of sea otters.

First read in the data and do some simple summaries to see what we've got:

```> otters = read.csv("./Data/ottersurvey.csv")
> nrow(otters)
```
```[1] 44
```
```> head(otters)
```
```  weight length    sex
1     27   1.29   male
2     24   1.20 female
3     24   1.24 female
4     29   1.29   male
5     32   1.36 female
6     36   1.36   male
```
```> summary(otters)
```
```     weight         length         sex
Min.   :16.0   Min.   :1.00   female:20
1st Qu.:24.0   1st Qu.:1.21   male  :24
Median :27.0   Median :1.27
Mean   :27.6   Mean   :1.26
3rd Qu.:32.0   3rd Qu.:1.33
Max.   :40.0   Max.   :1.47
```

## Plots...

```> plot(otters\$length, otters\$weight,
pch = as.character(otters\$sex))
```

Building nice plots with the simple graphics functions can be a pain. Can we use something else?

• Add new functions and data to R
• Can modify existing functions
• Install with `install.packages("ggplot2")`
• Use per-session with `require(ggplot2)`
• Don't ever call them libraries.

## Use ggplot goodies

```> require(ggplot2)
> g = ggplot(otters, aes(x = length,
y = weight))
```

Nothing happens until we add layers.

```> g + geom_point()
```
```> g + geom_point(aes(colour = sex))
```
```> g + geom_point() + facet_wrap(~sex)
```

## Densities and distributions

Using base graphics

```> hist(otters\$weight)
```
```> boxplot(weight ~ sex, otters)
```

Using ggplot2

```> ggplot(otters, aes(sex, weight)) +
geom_boxplot()
```

## Model fitting - using ggplot

```> g = ggplot(otters, aes(x = length,
y = weight, colour = sex))
> g + geom_point()
```
```> g + geom_point() + stat_smooth(method = "lm")
```

Seems to be evidence of a different slope for male and female sea otters. Let's investigate formally.

## Overall model

Use `lm`

```> m1 = lm(weight ~ length, otters)
> summary(m1)
```
```Call:
lm(formula = weight ~ length, data = otters)

Residuals:
Min     1Q Median     3Q    Max
-5.113 -1.514 -0.072  1.469  4.445

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   -36.51       3.94   -9.27    1e-11 ***
length         50.82       3.12   16.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.14 on 42 degrees of freedom
Multiple R-squared: 0.864,	Adjusted R-squared: 0.86
F-statistic:  266 on 1 and 42 DF,  p-value: <2e-16
```

```> plot(otters\$length, otters\$weight)
> abline(m1)
```

## Residuals

```> plot(fitted(m1), residuals(m1))
```

## Non-linearity?

```> m2 = lm(weight ~ poly(length, 2),
otters)
> m3 = lm(weight ~ poly(length, 3),
otters)
> summary(m2)
```
```Call:
lm(formula = weight ~ poly(length, 2), data = otters)

Residuals:
Min     1Q Median     3Q    Max
-5.467 -1.228 -0.063  1.137  4.769

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)        27.545      0.308   89.48   <2e-16 ***
poly(length, 2)1   34.982      2.042   17.13   <2e-16 ***
poly(length, 2)2    4.712      2.042    2.31    0.026 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.04 on 41 degrees of freedom
Multiple R-squared: 0.879,	Adjusted R-squared: 0.873
F-statistic:  149 on 2 and 41 DF,  p-value: <2e-16
```
```> summary(m3)
```
```Call:
lm(formula = weight ~ poly(length, 3), data = otters)

Residuals:
Min     1Q Median     3Q    Max
-5.668 -1.208 -0.121  1.174  4.591

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)        27.545      0.309   89.24   <2e-16 ***
poly(length, 3)1   34.982      2.047   17.09   <2e-16 ***
poly(length, 3)2    4.712      2.047    2.30    0.027 *
poly(length, 3)3   -1.807      2.047   -0.88    0.383
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.05 on 40 degrees of freedom
Multiple R-squared: 0.882,	Adjusted R-squared: 0.873
F-statistic: 99.3 on 3 and 40 DF,  p-value: <2e-16
```

```> par(mfrow = c(1, 2))
> plot(fitted(m1), residuals(m1))
> plot(fitted(m2), residuals(m2))
```
```> par(mfrow = c(1, 1))
```

## Compare models

```> m0 = lm(weight ~ 1, otters)
> anova(m0, m1, m2, m3)
```
```Analysis of Variance Table

Model 1: weight ~ 1
Model 2: weight ~ length
Model 3: weight ~ poly(length, 2)
Model 4: weight ~ poly(length, 3)
Res.Df  RSS Df Sum of Sq      F Pr(>F)
1     43 1417
2     42  193  1      1224 291.92 <2e-16 ***
3     41  171  1        22   5.30  0.027 *
4     40  168  1         3   0.78  0.383
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

## Separate models by sex

```> mmodel = lm(weight ~ length, data = otters,
subset = (sex == "male"))
> fmodel = lm(weight ~ length, data = otters,
subset = (sex == "female"))
> summary(mmodel)\$coef
```
```            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   -52.08      7.314   -7.12 3.857e-07
length         63.18      5.600   11.28 1.290e-10
```
```> summary(fmodel)\$coef
```
```            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   -22.13      2.914  -7.593 5.111e-07
length         38.38      2.402  15.979 4.456e-12
```

## Plot model results

We can make a coloured scatterplot by colour, and add the regression lines using the `abline` function that draws a line from the slope/intercept of a model:

```> otters\$col = ifelse(otters\$sex ==
"male", "#8080FF", "#FF8080")
> plot(otters\$length, otters\$weight,
col = otters\$col, pch = 19)
> abline(fmodel, col = "pink", lwd = 2,
lty = 2)
> abline(mmodel, col = "blue", lwd = 2,
lty = 2)
```

## Increasing complexity

We want to build our model incrementally from simple to complex. Here's our strategy:

• Same slope, same intercept (`m1` above)
• Same slope, separate intercepts for sex
• Different slopes and intercepts for each sex

We fit the model with separate intercepts by adding the factor as a term in the formula:

```> commonSlope = lm(weight ~ length +
sex, otters)
> summary(commonSlope)
```
```Call:
lm(formula = weight ~ length + sex, data = otters)

Residuals:
Min     1Q Median     3Q    Max
-3.920 -1.182  0.104  1.315  4.187

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   -33.15       4.08   -8.12  4.6e-10 ***
length         47.49       3.36   14.14  < 2e-16 ***
sexmale         1.52       0.70    2.17    0.036 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.06 on 41 degrees of freedom
Multiple R-squared: 0.878,	Adjusted R-squared: 0.872
F-statistic:  147 on 2 and 41 DF,  p-value: <2e-16
```

All the parameters are significantly non-zero, but the change in the intercept for male otters has the largest p-value but it is still <0.05.

## Plot fits

We can't plot one line for this fit, so we take out the coefficients of the model and plot two lines.

```> coefs = coefficients(commonSlope)
> coefs
```
```(Intercept)      length     sexmale
-33.145      47.493       1.518
```
```> plot(weight ~ length, data = otters,
col = otters\$col, pch = 19)
> abline(coefs[1], coefs[2], col = "pink",
lwd = 2)
> abline(coefs[1] + coefs[3], coefs[2],
col = "blue", lwd = 2)
```

Note that `coefs[3]` is added to `coefs[1]` to get the intercept for the male otters. Our test for a significant difference is a test that this parameter is zero. We can specify models in R where the two intercepts are the parameters but that would mean testing for a non-zero difference in the parameters. With this formula we can just use the test statistic in the summary table.

## Two slopes, two intercepts

To fit this model we use `*` to say we want all the interaction terms between length and sex.

```> twoSlopes = lm(formula = weight ~
length * sex, data = otters)
> summary(twoSlopes)
```
```Call:
lm(formula = weight ~ length * sex, data = otters)

Residuals:
Min     1Q Median     3Q    Max
-3.687 -0.989 -0.117  1.074  4.000

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)      -22.13       4.31   -5.13  7.7e-06 ***
length            38.38       3.55   10.80  2.0e-13 ***
sexmale          -29.95       7.46   -4.02  0.00025 ***
length:sexmale    24.80       5.86    4.23  0.00013 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.73 on 40 degrees of freedom
Multiple R-squared: 0.916,	Adjusted R-squared: 0.909
F-statistic:  145 on 3 and 40 DF,  p-value: <2e-16
```

Once again all our parameters are significantly non-zero, all of them extremely so.

## Plot

Again we have two lines to plot. Again our test for difference between sexes is a test for some parameters being zero, so we can consult our summary table and conclude there is a significant difference.

```> plot(weight ~ length, data = otters,
col = otters\$col, pch = 19)
> coefs = coefficients(twoSlopes)
> coefs
```
```   (Intercept)         length        sexmale length:sexmale
-22.13          38.38         -29.95          24.80
```
```> abline(coefs[1], coefs[2], col = "pink",
lwd = 2)
> abline(coefs[1] + coefs[3], coefs[2] +
coefs[4], col = "blue", lwd = 2)
```

## Residual Check

```> plot(fitted(twoSlopes), residuals(twoSlopes))
```

## Anova test for comparison

People seem to like analysis of variance tables, so here's one. It confirms all the conclusions we already made, but without actually showing the parameter values which I always think are the interesting thing...

```> anova(m1, commonSlope, twoSlopes)
```
```Analysis of Variance Table

Model 1: weight ~ length
Model 2: weight ~ length + sex
Model 3: weight ~ length * sex
Res.Df RSS Df Sum of Sq     F  Pr(>F)
1     42 193
2     41 173  1      19.9  6.65 0.01372 *
3     40 120  1      53.6 17.91 0.00013 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

## Thoughts

• Most of this applies to generalised linear models with the `glm` function - for when you have count, binary, or binomial response data
• Confused by the two plotting systems? Don't worry, there's really at least three. Or four...