Simple Statistics

Simple Statistics

This presentation will illustrate:

  • Loading data
  • Data summaries
  • Basic plots
  • Linear models

Read In The Otter Data

Some otter data has been simulated for this example. It consists of weight/kg, length/m, and sex (male/female) for a number of sea otters.

First read in the data and do some simple summaries to see what we've got:

> otters = read.csv("./Data/ottersurvey.csv")
> nrow(otters)
[1] 44
> head(otters)
  weight length    sex
1     27   1.29   male
2     24   1.20 female
3     24   1.24 female
4     29   1.29   male
5     32   1.36 female
6     36   1.36   male
> summary(otters)
     weight         length         sex    
 Min.   :16.0   Min.   :1.00   female:20  
 1st Qu.:24.0   1st Qu.:1.21   male  :24  
 Median :27.0   Median :1.27              
 Mean   :27.6   Mean   :1.26              
 3rd Qu.:32.0   3rd Qu.:1.33              
 Max.   :40.0   Max.   :1.47              

Plots...

> plot(otters$length, otters$weight, 
     pch = as.character(otters$sex))
plot of chunk unnamed-chunk-2

Building nice plots with the simple graphics functions can be a pain. Can we use something else?

Add-on packages

  • Add new functions and data to R
  • Can modify existing functions
  • Download mostly from a CRAN mirror site
  • Install with install.packages("ggplot2")
  • Use per-session with require(ggplot2)
  • Don't ever call them libraries.

Use ggplot goodies

> require(ggplot2)
> g = ggplot(otters, aes(x = length, 
     y = weight))

Nothing happens until we add layers.

> g + geom_point()
plot of chunk addlayers
> g + geom_point(aes(colour = sex))
plot of chunk addlayers
> g + geom_point() + facet_wrap(~sex)
plot of chunk addlayers

Densities and distributions

Using base graphics

> hist(otters$weight)
plot of chunk unnamed-chunk-4
> boxplot(weight ~ sex, otters)
plot of chunk unnamed-chunk-4

Using ggplot2

> ggplot(otters, aes(sex, weight)) + 
     geom_boxplot()
plot of chunk unnamed-chunk-5

Model fitting - using ggplot

> g = ggplot(otters, aes(x = length, 
     y = weight, colour = sex))
> g + geom_point()
plot of chunk unnamed-chunk-6
> g + geom_point() + stat_smooth(method = "lm")
plot of chunk unnamed-chunk-6

Seems to be evidence of a different slope for male and female sea otters. Let's investigate formally.

Overall model

Use lm

> m1 = lm(weight ~ length, otters)
> summary(m1)
Call:
lm(formula = weight ~ length, data = otters)

Residuals:
   Min     1Q Median     3Q    Max 
-5.113 -1.514 -0.072  1.469  4.445 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -36.51       3.94   -9.27    1e-11 ***
length         50.82       3.12   16.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 2.14 on 42 degrees of freedom
Multiple R-squared: 0.864,	Adjusted R-squared: 0.86 
F-statistic:  266 on 1 and 42 DF,  p-value: <2e-16 

Plot and add line

> plot(otters$length, otters$weight)
> abline(m1)
plot of chunk unnamed-chunk-8

Residuals

> plot(fitted(m1), residuals(m1))
plot of chunk unnamed-chunk-9

Non-linearity?

Try adding a quadratic or cubic term

> m2 = lm(weight ~ poly(length, 2), 
     otters)
> m3 = lm(weight ~ poly(length, 3), 
     otters)
> summary(m2)
Call:
lm(formula = weight ~ poly(length, 2), data = otters)

Residuals:
   Min     1Q Median     3Q    Max 
-5.467 -1.228 -0.063  1.137  4.769 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        27.545      0.308   89.48   <2e-16 ***
poly(length, 2)1   34.982      2.042   17.13   <2e-16 ***
poly(length, 2)2    4.712      2.042    2.31    0.026 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 2.04 on 41 degrees of freedom
Multiple R-squared: 0.879,	Adjusted R-squared: 0.873 
F-statistic:  149 on 2 and 41 DF,  p-value: <2e-16 
> summary(m3)
Call:
lm(formula = weight ~ poly(length, 3), data = otters)

Residuals:
   Min     1Q Median     3Q    Max 
-5.668 -1.208 -0.121  1.174  4.591 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        27.545      0.309   89.24   <2e-16 ***
poly(length, 3)1   34.982      2.047   17.09   <2e-16 ***
poly(length, 3)2    4.712      2.047    2.30    0.027 *  
poly(length, 3)3   -1.807      2.047   -0.88    0.383    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 2.05 on 40 degrees of freedom
Multiple R-squared: 0.882,	Adjusted R-squared: 0.873 
F-statistic: 99.3 on 3 and 40 DF,  p-value: <2e-16 

Compare Residuals for Quadratic

> par(mfrow = c(1, 2))
> plot(fitted(m1), residuals(m1))
> plot(fitted(m2), residuals(m2))
plot of chunk unnamed-chunk-11
> par(mfrow = c(1, 1))

Compare models

> m0 = lm(weight ~ 1, otters)
> anova(m0, m1, m2, m3)
Analysis of Variance Table

Model 1: weight ~ 1
Model 2: weight ~ length
Model 3: weight ~ poly(length, 2)
Model 4: weight ~ poly(length, 3)
  Res.Df  RSS Df Sum of Sq      F Pr(>F)    
1     43 1417                               
2     42  193  1      1224 291.92 <2e-16 ***
3     41  171  1        22   5.30  0.027 *  
4     40  168  1         3   0.78  0.383    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Separate models by sex

> mmodel = lm(weight ~ length, data = otters, 
     subset = (sex == "male"))
> fmodel = lm(weight ~ length, data = otters, 
     subset = (sex == "female"))
> summary(mmodel)$coef
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   -52.08      7.314   -7.12 3.857e-07
length         63.18      5.600   11.28 1.290e-10
> summary(fmodel)$coef
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   -22.13      2.914  -7.593 5.111e-07
length         38.38      2.402  15.979 4.456e-12

Plot model results

We can make a coloured scatterplot by colour, and add the regression lines using the abline function that draws a line from the slope/intercept of a model:

> otters$col = ifelse(otters$sex == 
     "male", "#8080FF", "#FF8080")
> plot(otters$length, otters$weight, 
     col = otters$col, pch = 19)
> abline(fmodel, col = "pink", lwd = 2, 
     lty = 2)
> abline(mmodel, col = "blue", lwd = 2, 
     lty = 2)
plot of chunk unnamed-chunk-14

Increasing complexity

We want to build our model incrementally from simple to complex. Here's our strategy:

  • Same slope, same intercept (m1 above)
  • Same slope, separate intercepts for sex
  • Different slopes and intercepts for each sex

We fit the model with separate intercepts by adding the factor as a term in the formula:

> commonSlope = lm(weight ~ length + 
     sex, otters)
> summary(commonSlope)
Call:
lm(formula = weight ~ length + sex, data = otters)

Residuals:
   Min     1Q Median     3Q    Max 
-3.920 -1.182  0.104  1.315  4.187 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -33.15       4.08   -8.12  4.6e-10 ***
length         47.49       3.36   14.14  < 2e-16 ***
sexmale         1.52       0.70    2.17    0.036 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 2.06 on 41 degrees of freedom
Multiple R-squared: 0.878,	Adjusted R-squared: 0.872 
F-statistic:  147 on 2 and 41 DF,  p-value: <2e-16 

All the parameters are significantly non-zero, but the change in the intercept for male otters has the largest p-value but it is still <0.05.

Plot fits

We can't plot one line for this fit, so we take out the coefficients of the model and plot two lines.

> coefs = coefficients(commonSlope)
> coefs
(Intercept)      length     sexmale 
    -33.145      47.493       1.518 
> plot(weight ~ length, data = otters, 
     col = otters$col, pch = 19)
> abline(coefs[1], coefs[2], col = "pink", 
     lwd = 2)
> abline(coefs[1] + coefs[3], coefs[2], 
     col = "blue", lwd = 2)
plot of chunk unnamed-chunk-16

Note that coefs[3] is added to coefs[1] to get the intercept for the male otters. Our test for a significant difference is a test that this parameter is zero. We can specify models in R where the two intercepts are the parameters but that would mean testing for a non-zero difference in the parameters. With this formula we can just use the test statistic in the summary table.

Two slopes, two intercepts

To fit this model we use * to say we want all the interaction terms between length and sex.

> twoSlopes = lm(formula = weight ~ 
     length * sex, data = otters)
> summary(twoSlopes)
Call:
lm(formula = weight ~ length * sex, data = otters)

Residuals:
   Min     1Q Median     3Q    Max 
-3.687 -0.989 -0.117  1.074  4.000 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -22.13       4.31   -5.13  7.7e-06 ***
length            38.38       3.55   10.80  2.0e-13 ***
sexmale          -29.95       7.46   -4.02  0.00025 ***
length:sexmale    24.80       5.86    4.23  0.00013 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 1.73 on 40 degrees of freedom
Multiple R-squared: 0.916,	Adjusted R-squared: 0.909 
F-statistic:  145 on 3 and 40 DF,  p-value: <2e-16 

Once again all our parameters are significantly non-zero, all of them extremely so.

Plot

Again we have two lines to plot. Again our test for difference between sexes is a test for some parameters being zero, so we can consult our summary table and conclude there is a significant difference.

> plot(weight ~ length, data = otters, 
     col = otters$col, pch = 19)
> coefs = coefficients(twoSlopes)
> coefs
   (Intercept)         length        sexmale length:sexmale 
        -22.13          38.38         -29.95          24.80 
> abline(coefs[1], coefs[2], col = "pink", 
     lwd = 2)
> abline(coefs[1] + coefs[3], coefs[2] + 
     coefs[4], col = "blue", lwd = 2)
plot of chunk unnamed-chunk-18

Residual Check

> plot(fitted(twoSlopes), residuals(twoSlopes))
plot of chunk unnamed-chunk-19

Anova test for comparison

People seem to like analysis of variance tables, so here's one. It confirms all the conclusions we already made, but without actually showing the parameter values which I always think are the interesting thing...

> anova(m1, commonSlope, twoSlopes)
Analysis of Variance Table

Model 1: weight ~ length
Model 2: weight ~ length + sex
Model 3: weight ~ length * sex
  Res.Df RSS Df Sum of Sq     F  Pr(>F)    
1     42 193                               
2     41 173  1      19.9  6.65 0.01372 *  
3     40 120  1      53.6 17.91 0.00013 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Thoughts

  • Most of this applies to generalised linear models with the glm function - for when you have count, binary, or binomial response data
  • Confused by the two plotting systems? Don't worry, there's really at least three. Or four...