R - the basics

Basics

  • Starting
  • Simple Arithmetic
  • Save your work
  • Simple Graphics
  • Writing Functions

Start

You can run R in various ways...

RStudio and Emacs/ESS

The prompt

>

This is where you type stuff

The other prompt

+

Keep typing stuff

The Editor (in RStudio)

This is where you type stuff you only want to type once

Type Some Stuff

> 2 + 2
[1] 4
> 1 + 2 * 3/4^5
[1] 1.006
> pi * 2
[1] 6.283
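
The second expression relies on R's operator precedence: `^` is applied before `*` and `/`, which are applied before `+`. A small sketch that spells the grouping out with parentheses (the variable names are chosen just for this example):

```r
# ^ is evaluated first, then * and / from left to right, then +
implicit <- 1 + 2 * 3/4^5
explicit <- 1 + ((2 * 3) / (4^5))
print(implicit == explicit)   # TRUE

# The transcripts here show 4 significant digits; print() can do the same
print(implicit, digits = 4)   # 1.006
```

When in doubt, adding parentheses costs nothing and makes the intent obvious.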

Scientific functions

> sqrt(2)
[1] 1.414
> exp(2)
[1] 7.389
> log(7.389)  # base e
[1] 2
> exp(log(2))
[1] 2
> log10(1000)
[1] 3
> sin(pi/3)
[1] 0.866
> cos(pi/3)
[1] 0.5
> tan(pi/3)
[1] 1.732
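
Two details are worth knowing about these functions: `log` takes an optional `base` argument, and the trigonometric functions work in radians. A brief sketch (the helper `deg2rad` is a made-up name, not part of R):

```r
# log() is the natural log by default; other bases go in the base argument
log(8, base = 2)
log2(8)                # shortcut for base 2

# Trig functions expect radians, so convert degrees first
deg2rad <- function(deg) deg * pi / 180   # hypothetical helper
sin(deg2rad(60))       # same angle as sin(pi/3)
```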

Assigning Results

> a = sqrt(4)
> b = 3
> c = -2
> d = b^2 - 4 * a * c
> r1 = (-b + sqrt(d))/(2 * a)
> r2 = (-b - sqrt(d))/(2 * a)

Showing Results

> ls()
[1] "a"  "b"  "c"  "d"  "r1" "r2"
> r1
[1] 0.5
> r2
[1] -2
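
The same calculation can be wrapped in a function so that it works for any coefficients. A minimal sketch, assuming the roots are real (the name `quad_roots` is invented for this example). It uses `<-`, the more conventional R assignment operator, which behaves like the `=` used in these slides:

```r
# Roots of a*x^2 + b*x + c = 0; quad_roots is a hypothetical name
quad_roots <- function(a, b, c) {
  d <- b^2 - 4 * a * c               # discriminant
  if (d < 0) stop("no real roots")   # complex roots are not handled in this sketch
  # c() still finds the combine function: R skips non-function
  # objects when it looks up the name of a function being called
  c((-b + sqrt(d)) / (2 * a), (-b - sqrt(d)) / (2 * a))
}

quad_roots(sqrt(4), 3, -2)   # 0.5 and -2, matching r1 and r2
```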

Help!

How do I get help on R things?

  • help(sqrt) - gets the help page for a function - ?sqrt for short.
  • help.search("model") - searches the help pages
  • help.start() - interactive help system
  • RDocumentation.org - nice new R help site
  • The R-Help mailing list
  • Other R mailing lists (R-sig-Ecology, R-sig-Geo) for specialists
  • StackOverflow for programming questions
  • Cross-Validated for stats questions

Saving Objects

You can save objects

> save(a, b, c, r1, r2, file = "output.rda")
> rm(a, b, c, r1, r2)
> r1
Error: object 'r1' not found
> load("output.rda")
> r1
[1] 0.5
> r2
[1] -2

R has save.image() which saves all your objects to a file called .RData

R will offer to do this when you quit: q()

R will reload .RData when you start in that folder.
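
Because `load` restores objects under their original names, it can silently overwrite variables already in the session. One cautious pattern, sketched here with a temporary file rather than the `output.rda` above, is to load into a separate environment and inspect it first:

```r
r1 <- 0.5
r2 <- -2
path <- tempfile(fileext = ".rda")   # temporary file, for illustration only
save(r1, r2, file = path)

# Load into a fresh environment instead of the workspace
e <- new.env()
loaded <- load(path, envir = e)      # load() returns the names it restored
print(loaded)
print(e$r1)
```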

Sea Otter Weight Data

Suppose we have sea otter weights in kg for males and females in two files:

Data/males.dat
27 28 39 32 26 28 25 42 28 38


Data/females.dat
22 25 24 31 26 30 14 17 21 30


We want to test if the sexes have different mean weights - a classic two-sample t-test. But first:

  • Always have a look at your data!

Read In Data

scan reads space-separated numbers from a file and returns...

> m = scan("./Data/males.dat")
> f = scan("./Data/females.dat")
> m
 [1] 27 28 39 32 26 28 25 42 28 38
> f
 [1] 22 25 24 31 26 30 14 17 21 30

...one-dimensional vectors.
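
To try this without the original files, the sketch below writes the male weights to a temporary file and reads them back. The temporary path stands in for `./Data/males.dat`.

```r
path <- tempfile(fileext = ".dat")
writeLines("27 28 39 32 26 28 25 42 28 38", path)

m <- scan(path, quiet = TRUE)   # quiet = TRUE suppresses the "Read 10 items" message
is.vector(m)                    # TRUE
class(m)                        # "numeric"
```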

Vectors

> length(m)
[1] 10
> length(f)
[1] 10
> summary(m)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   25.0    27.2    28.0    31.3    36.5    42.0 
> summary(f)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   14.0    21.2    24.5    24.0    29.0    31.0 
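
Most R functions and operators work on whole vectors at once, which is why no loops were needed above. A short sketch using the male weights:

```r
m <- c(27, 28, 39, 32, 26, 28, 25, 42, 28, 38)

mean(m)             # 31.3
sd(m)               # sample standard deviation
m - mean(m)         # arithmetic is applied element by element
m[m > mean(m)]      # logical indexing keeps the weights above the mean
range(m)            # smallest and largest value
```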

Histograms, boxplots

Plotting functions generally have the side effect of making a graphic:

> hist(m)
> hist(f)
> boxplot(m, f)
> boxplot(m, f, names = c("males", 
     "females"), main = "Sea Otters", 
     ylab = "Weight/kg", xlab = "Sex")
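
To keep a plot rather than just view it, one option is to open a file device, draw, and then close the device. A sketch using `pdf()`; the output path is a temporary file chosen for illustration:

```r
m <- c(27, 28, 39, 32, 26, 28, 25, 42, 28, 38)
f <- c(22, 25, 24, 31, 26, 30, 14, 17, 21, 30)

out <- tempfile(fileext = ".pdf")    # illustrative output path
pdf(out, width = 8, height = 4)
par(mfrow = c(1, 2))                 # two plots side by side
hist(m, main = "Males", xlab = "Weight/kg")
hist(f, main = "Females", xlab = "Weight/kg")
invisible(dev.off())                 # close the device so the file is written
```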

A Tale of Two t-tests

In SPSS, the first import loads one set of data in

Now repeat for the second file. Or save the format. Or edit and run the 'syntax'

Now we have two datasets.

Can't see how to work across datasets, so let's combine - cut 'n' paste:

Now, all I can do with this is a paired t-test

But that's wrong! These are not paired observations! Need to get my data in 'long' form.

Some more cut n paste action later...

And my data looks like this:

Then I can use the independent sample t-test with Sex as grouping:

And get some correct output!

In R

Three lines

> m = scan("./Data/males.dat")
> f = scan("./Data/females.dat")
> t.test(m, f)
	Welch Two Sample t-test

data:  m and f 
t = 2.768, df = 17.89, p-value = 0.01274
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
  1.756 12.844 
sample estimates:
mean of x mean of y 
     31.3      24.0 
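
`t.test` returns an object whose parts can be extracted by name, which is handy when the results feed into further code. Note also that R defaults to the Welch test, which does not assume equal variances; the classic pooled-variance test needs `var.equal = TRUE`. A brief sketch:

```r
m <- c(27, 28, 39, 32, 26, 28, 25, 42, 28, 38)
f <- c(22, 25, 24, 31, 26, 30, 14, 17, 21, 30)

res <- t.test(m, f)          # Welch test (the default)
res$statistic                # the t value
res$p.value
res$conf.int                 # 95% confidence interval for the difference in means

pooled <- t.test(m, f, var.equal = TRUE)   # classic Student two-sample t-test
pooled$parameter             # degrees of freedom: 10 + 10 - 2 = 18
```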

Note: if typing filenames is too much bother, file.choose() lets you pick the file interactively (on Windows this opens a file dialog):

> m = scan(file.choose())

Unfair!

What if the data is already in long form, ready for SPSS?

Data/otterweight.csv
weight,sex
22,female
28,male
32,male
25,female
24,female
31,female
26,female
30,female
14,female
...


Use read.csv (or read.table) to input

> otters = read.csv("./Data/otterweight.csv")
> summary(otters)
     weight         sex    
 Min.   :14.0   female:10  
 1st Qu.:24.8   male  :10  
 Median :27.5              
 Mean   :27.6              
 3rd Qu.:30.2              
 Max.   :42.0              
> otters
   weight    sex
1      22 female
2      28   male
3      32   male
4      25 female
5      24 female
6      31 female
7      26 female
8      30 female
9      14 female
10     17 female
11     21 female
12     30 female
13     27   male
14     39   male
15     26   male
16     28   male
17     25   male
18     42   male
19     28   male
20     38   male
> head(otters)
  weight    sex
1     22 female
2     28   male
3     32   male
4     25 female
5     24 female
6     31 female
> names(otters)
[1] "weight" "sex"   
> dim(otters)
[1] 20  2

This is a data frame...
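
One version-dependent detail: the `summary` output above counts `female` and `male`, which means `sex` was read as a factor. Since R 4.0.0, `read.csv` keeps text columns as character by default, so reproducing that output requires `stringsAsFactors = TRUE`. A sketch using a few rows of inline text in place of the CSV file:

```r
csv_text <- "weight,sex
22,female
28,male
32,male
25,female"

# text = lets read.csv parse a string instead of a file
otters <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(otters$sex)      # "factor"
levels(otters$sex)     # "female" "male"
summary(otters)
```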

Data Frames - use when:

  • Regular, tabular data
  • A row is a record
  • A column is a measurement
  • Much like a spreadsheet grid - but no formulae!
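
If the data start out as two separate vectors, as with `m` and `f`, they can be stacked into a long-form data frame in one step. A minimal sketch (the name `otters_long` is chosen for illustration):

```r
m <- c(27, 28, 39, 32, 26, 28, 25, 42, 28, 38)
f <- c(22, 25, 24, 31, 26, 30, 14, 17, 21, 30)

otters_long <- data.frame(
  weight = c(m, f),
  # one label per observation: all the males, then all the females
  sex = factor(rep(c("male", "female"), times = c(length(m), length(f))))
)
dim(otters_long)          # 20 rows, 2 columns
table(otters_long$sex)    # 10 of each
```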

Slicing and dicing

You can extract columns:

> # by column number
> otters[, 1]
 [1] 22 28 32 25 24 31 26 30 14 17 21 30 27 39 26 28 25
[18] 42 28 38
> # by name
> otters$weight
 [1] 22 28 32 25 24 31 26 30 14 17 21 30 27 39 26 28 25
[18] 42 28 38
> # and it's just a vector
> otters$weight > 36
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [9] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[17] FALSE  TRUE FALSE  TRUE

You can subset rows of data frames to get smaller data frames:

> # by row number
> otters[1, ]
  weight    sex
1     22 female
> otters[2:4, ]
  weight    sex
2     28   male
3     32   male
4     25 female
> # by true/false values:
> otters[otters$weight > 36, ]
   weight  sex
14     39 male
18     42 male
20     38 male
> # or use subset:
> subset(otters, weight > 36)
   weight  sex
14     39 male
18     42 male
20     38 male
> subset(otters, sex == "male")
   weight  sex
2      28 male
3      32 male
13     27 male
14     39 male
15     26 male
16     28 male
17     25 male
18     42 male
19     28 male
20     38 male
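
Conditions can be combined with `&` (and) and `|` (or), and a grouping column can drive per-group summaries. A sketch that rebuilds the otter data inline so that it runs on its own:

```r
otters <- data.frame(
  weight = c(22, 28, 32, 25, 24, 31, 26, 30, 14, 17,
             21, 30, 27, 39, 26, 28, 25, 42, 28, 38),
  # same row order as the listing above
  sex = rep(c("female", "male", "female", "male"), times = c(1, 2, 9, 8))
)

# Females heavier than 25 kg
otters[otters$sex == "female" & otters$weight > 25, ]

# Mean weight per sex
tapply(otters$weight, otters$sex, mean)   # female 24.0, male 31.3
```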

t-test from a data frame

I could subset and do what I did before

> t.test(subset(otters, sex == "male")$weight, 
     subset(otters, sex == "female")$weight)
	Welch Two Sample t-test

data:  subset(otters, sex == "male")$weight and subset(otters, sex == "female")$weight 
t = 2.768, df = 17.89, p-value = 0.01274
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
  1.756 12.844 
sample estimates:
mean of x mean of y 
     31.3      24.0 

That works, but it's ugly - data frames give us a nicer way...

t-test with a data frame

Use a formula notation:

> t.test(weight ~ sex, otters)
	Welch Two Sample t-test

data:  weight by sex 
t = -2.768, df = 17.89, p-value = 0.01274
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -12.844  -1.756 
sample estimates:
mean in group female   mean in group male 
                24.0                 31.3 
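
The sign of `t` has flipped relative to `t.test(m, f)` because the formula interface orders the groups by factor level, and `female` sorts before `male`, so the difference is female minus male. If the other direction is preferred, the reference level can be changed. A sketch with the data rebuilt inline:

```r
otters <- data.frame(
  weight = c(22, 28, 32, 25, 24, 31, 26, 30, 14, 17,
             21, 30, 27, 39, 26, 28, 25, 42, 28, 38),
  sex = rep(c("female", "male", "female", "male"), times = c(1, 2, 9, 8))
)

otters$sex <- relevel(factor(otters$sex), ref = "male")   # put "male" first
levels(otters$sex)                                        # "male" "female"
t.test(weight ~ sex, data = otters)$statistic             # positive now
```

The p-value and the width of the confidence interval are unchanged; only the direction of the difference differs.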

Data frame boxplots

> boxplot(weight ~ sex, otters, col = c("pink", 
     "cyan"), ylab = "Weight/kg")

Note how the axis labels come from the data

Wrapping This Up

If we put these commands in a new file called ottertest.R:

data = read.csv(filename)
print(summary(data))
boxplot(weight ~ sex, data)
t.test(weight ~ sex, data)

Then we can repeat the analysis on a new file...

> filename = "newdata.csv"
> source("ottertest.R")

You could automate this to hundreds of data files with a loop...
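
A minimal sketch of such a loop. Instead of setting `filename` and calling `source`, it wraps the analysis in a function and collects one p-value per file. The name `otter_test`, the file names, and the temporary folder are assumptions made for this example:

```r
# Hypothetical helper: run the analysis on one CSV file, return the p-value
otter_test <- function(filename) {
  data <- read.csv(filename)
  t.test(weight ~ sex, data = data)$p.value
}

# Create two small example files in a temporary folder (illustrative data only)
folder <- tempfile("otters")
dir.create(folder)
example <- data.frame(
  weight = c(27, 28, 39, 32, 26, 22, 25, 24, 31, 26),
  sex = rep(c("male", "female"), each = 5)
)
write.csv(example, file.path(folder, "site1.csv"), row.names = FALSE)
write.csv(example, file.path(folder, "site2.csv"), row.names = FALSE)

# Loop over every CSV file in the folder
files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
pvalues <- numeric(length(files))
for (i in seq_along(files)) {
  pvalues[i] <- otter_test(files[i])
}
names(pvalues) <- basename(files)
print(pvalues)
```

Returning a value from a function, rather than printing inside a sourced script, makes it easier to gather the results into one vector or table.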

Ultimate Report Writing

Using the knitr package:

  • You can put chunks of plain R code into a document
  • Process the document - run the code, create outputs
  • Generate a new document with results and embedded plots
  • The output may be text, Web page, LaTeX, PDF - maybe Word?!
  • All these web pages have been made with it!
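
As a rough sketch of the idea, the snippet below feeds a tiny document containing one R chunk to `knitr::knit` and prints the processed result. It assumes the knitr package is installed; the document text is made up for this example.

```r
if (requireNamespace("knitr", quietly = TRUE)) {
  fence <- strrep("`", 3)                  # three backticks mark a code chunk
  doc <- c(
    "Mean male otter weight:",
    "",
    paste0(fence, "{r}"),
    "mean(c(27, 28, 39, 32, 26, 28, 25, 42, 28, 38))",
    fence
  )
  # knit() runs the chunk and returns the document with the output inserted
  out <- knitr::knit(text = doc, quiet = TRUE)
  cat(out, sep = "\n")
}
```

In practice the document would live in its own file, and the plain text around the chunks would carry the narrative of the report.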

Thoughts

  • Don't fear the command line
  • Use your environment's features to help
  • The power of a programming language
  • Repeatability
  • Repeatability