Lancaster University
This presentation will illustrate: reading in and summarising data, plotting with base graphics and ggplot2, and fitting and comparing linear models.
Some otter data has been simulated for this example. It consists of weight/kg, length/m, and sex (male/female) for a number of sea otters.
First read in the data and do some simple summaries to see what we've got:
> otters = read.csv("./Data/ottersurvey.csv")
> nrow(otters)
[1] 44
> head(otters)
  weight length    sex
1     27   1.29   male
2     24   1.20 female
3     24   1.24 female
4     29   1.29   male
5     32   1.36 female
6     36   1.36   male
> summary(otters)
     weight         length          sex
 Min.   :16.0   Min.   :1.00   female:20
 1st Qu.:24.0   1st Qu.:1.21   male  :24
 Median :27.0   Median :1.27
 Mean   :27.6   Mean   :1.26
 3rd Qu.:32.0   3rd Qu.:1.33
 Max.   :40.0   Max.   :1.47
> plot(otters$length, otters$weight, pch = as.character(otters$sex))
Building nice plots with the simple graphics functions can be a pain. Can we use something else?
install.packages("ggplot2")
require(ggplot2)
> require(ggplot2)
> g = ggplot(otters, aes(x = length, y = weight))
Nothing happens until we add layers.
> g + geom_point()
> g + geom_point(aes(colour = sex))
> g + geom_point() + facet_wrap(~sex)
Using base graphics
> hist(otters$weight)
> boxplot(weight ~ sex, otters)
Using ggplot2
> ggplot(otters, aes(sex, weight)) + geom_boxplot()
> g = ggplot(otters, aes(x = length, y = weight, colour = sex))
> g + geom_point()
> g + geom_point() + stat_smooth(method = "lm")
Seems to be evidence of a different slope for male and female sea otters. Let's investigate formally.
Use lm
> m1 = lm(weight ~ length, otters)
> summary(m1)
Call:
lm(formula = weight ~ length, data = otters)

Residuals:
   Min     1Q Median     3Q    Max 
-5.113 -1.514 -0.072  1.469  4.445 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -36.51       3.94   -9.27    1e-11 ***
length         50.82       3.12   16.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.14 on 42 degrees of freedom
Multiple R-squared:  0.864,	Adjusted R-squared:  0.86 
F-statistic:  266 on 1 and 42 DF,  p-value: <2e-16
> plot(otters$length, otters$weight)
> abline(m1)
> plot(fitted(m1), residuals(m1))
Try adding a quadratic or cubic term
> m2 = lm(weight ~ poly(length, 2), otters)
> m3 = lm(weight ~ poly(length, 3), otters)
> summary(m2)
Call:
lm(formula = weight ~ poly(length, 2), data = otters)

Residuals:
   Min     1Q Median     3Q    Max 
-5.467 -1.228 -0.063  1.137  4.769 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        27.545      0.308   89.48   <2e-16 ***
poly(length, 2)1   34.982      2.042   17.13   <2e-16 ***
poly(length, 2)2    4.712      2.042    2.31    0.026 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.04 on 41 degrees of freedom
Multiple R-squared:  0.879,	Adjusted R-squared:  0.873 
F-statistic:  149 on 2 and 41 DF,  p-value: <2e-16
> summary(m3)
Call:
lm(formula = weight ~ poly(length, 3), data = otters)

Residuals:
   Min     1Q Median     3Q    Max 
-5.668 -1.208 -0.121  1.174  4.591 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        27.545      0.309   89.24   <2e-16 ***
poly(length, 3)1   34.982      2.047   17.09   <2e-16 ***
poly(length, 3)2    4.712      2.047    2.30    0.027 *  
poly(length, 3)3   -1.807      2.047   -0.88    0.383    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.05 on 40 degrees of freedom
Multiple R-squared:  0.882,	Adjusted R-squared:  0.873 
F-statistic: 99.3 on 3 and 40 DF,  p-value: <2e-16
> par(mfrow = c(1, 2))
> plot(fitted(m1), residuals(m1))
> plot(fitted(m2), residuals(m2))
> par(mfrow = c(1, 1))
> m0 = lm(weight ~ 1, otters)
> anova(m0, m1, m2, m3)
Analysis of Variance Table

Model 1: weight ~ 1
Model 2: weight ~ length
Model 3: weight ~ poly(length, 2)
Model 4: weight ~ poly(length, 3)
  Res.Df  RSS Df Sum of Sq      F Pr(>F)    
1     43 1417                               
2     42  193  1      1224 291.92 <2e-16 ***
3     41  171  1        22   5.30  0.027 *  
4     40  168  1         3   0.78  0.383    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
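An alternative (not shown in the session above) is to compare the same nested fits with an information criterion rather than F-tests; a quick sketch using the models fitted earlier:

```r
# Compare the four nested models by AIC (smaller is better).
# Uses m0, m1, m2, m3 as fitted above; if the F-tests are a guide,
# the quadratic model should come out best.
AIC(m0, m1, m2, m3)
```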
> mmodel = lm(weight ~ length, data = otters, subset = (sex == "male"))
> fmodel = lm(weight ~ length, data = otters, subset = (sex == "female"))
> summary(mmodel)$coef
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   -52.08      7.314   -7.12 3.857e-07
length         63.18      5.600   11.28 1.290e-10
> summary(fmodel)$coef
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   -22.13      2.914  -7.593 5.111e-07
length         38.38      2.402  15.979 4.456e-12
We can make a scatterplot coloured by sex, and add the regression lines using the abline function, which draws a line from the intercept and slope of a model:
> otters$col = ifelse(otters$sex == "male", "#8080FF", "#FF8080")
> plot(otters$length, otters$weight, col = otters$col, pch = 19)
> abline(fmodel, col = "pink", lwd = 2, lty = 2)
> abline(mmodel, col = "blue", lwd = 2, lty = 2)
We want to build our model incrementally from simple to complex. Here's our strategy: start with a single common line (m1 above), then allow separate intercepts with a common slope, then allow separate slopes and intercepts. We fit the model with separate intercepts by adding the factor as a term in the formula:
> commonSlope = lm(weight ~ length + sex, otters)
> summary(commonSlope)
Call:
lm(formula = weight ~ length + sex, data = otters)

Residuals:
   Min     1Q Median     3Q    Max 
-3.920 -1.182  0.104  1.315  4.187 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -33.15       4.08   -8.12  4.6e-10 ***
length         47.49       3.36   14.14  < 2e-16 ***
sexmale         1.52       0.70    2.17    0.036 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.06 on 41 degrees of freedom
Multiple R-squared:  0.878,	Adjusted R-squared:  0.872 
F-statistic:  147 on 2 and 41 DF,  p-value: <2e-16
All the parameters are significantly non-zero; the change in intercept for male otters has the largest p-value, but it is still < 0.05.
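An equivalent way to read off this evidence (a small aside, not part of the original session) is through confidence intervals for the coefficients:

```r
# 95% confidence intervals for the common-slope model's coefficients.
# The interval for sexmale excluding zero corresponds to its
# p-value being below 0.05 in the summary table.
confint(commonSlope, level = 0.95)
```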
We can't plot this fit with a single line, so we extract the coefficients of the model and plot two lines.
> coefs = coefficients(commonSlope)
> coefs
(Intercept) length sexmale -33.145 47.493 1.518
> plot(weight ~ length, data = otters, col = otters$col, pch = 19)
> abline(coefs[1], coefs[2], col = "pink", lwd = 2)
> abline(coefs[1] + coefs[3], coefs[2], col = "blue", lwd = 2)
Note that coefs[3] is added to coefs[1] to get the intercept for the male otters. Our test for a significant difference is a test that this parameter is zero. We could specify models in R where the two intercepts are the parameters, but that would mean testing for a non-zero difference between the parameters. With this formula we can just use the test statistic in the summary table.
To fit the separate-slopes model we use * to say we want all the interaction terms between length and sex.
> twoSlopes = lm(formula = weight ~ length * sex, data = otters)
> summary(twoSlopes)
Call:
lm(formula = weight ~ length * sex, data = otters)

Residuals:
   Min     1Q Median     3Q    Max 
-3.687 -0.989 -0.117  1.074  4.000 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -22.13       4.31   -5.13  7.7e-06 ***
length            38.38       3.55   10.80  2.0e-13 ***
sexmale          -29.95       7.46   -4.02  0.00025 ***
length:sexmale    24.80       5.86    4.23  0.00013 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.73 on 40 degrees of freedom
Multiple R-squared:  0.916,	Adjusted R-squared:  0.909 
F-statistic:  145 on 3 and 40 DF,  p-value: <2e-16
Once again all our parameters are significantly non-zero, all of them extremely so.
Again we have two lines to plot. Again our test for difference between sexes is a test for some parameters being zero, so we can consult our summary table and conclude there is a significant difference.
> plot(weight ~ length, data = otters, col = otters$col, pch = 19)
> coefs = coefficients(twoSlopes)
> coefs
   (Intercept)         length        sexmale length:sexmale 
        -22.13          38.38         -29.95          24.80 
> abline(coefs[1], coefs[2], col = "pink", lwd = 2)
> abline(coefs[1] + coefs[3], coefs[2] + coefs[4], col = "blue", lwd = 2)
> plot(fitted(twoSlopes), residuals(twoSlopes))
People seem to like analysis of variance tables, so here's one. It confirms all the conclusions we already made, but without actually showing the parameter values, which I always think are the interesting thing...
> anova(m1, commonSlope, twoSlopes)
Analysis of Variance Table

Model 1: weight ~ length
Model 2: weight ~ length + sex
Model 3: weight ~ length * sex
  Res.Df RSS Df Sum of Sq     F  Pr(>F)    
1     42 193                               
2     41 173  1      19.9  6.65 0.01372 *  
3     40 120  1      53.6 17.91 0.00013 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The glm function: for when you have count, binary, or binomial response data.
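As a purely illustrative taste of the same formula interface (this model is a sketch invented here, not an analysis from the presentation), a logistic regression of sex on length would look like:

```r
# Illustrative only: model the binary response sex as a function of
# length using a binomial GLM (logistic regression).
sexmod = glm(sex ~ length, data = otters, family = binomial)
summary(sexmod)
```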