Programming in R

R Is A Programming Language

  • A dialect of S
  • Data Structures
  • Flow Control
  • Conditional Execution
  • Partly "Functional"
  • Partly "Object-oriented"

Data Structures

Simple scalar (single) values

Can be numbers, character strings, TRUE/FALSE, dates... any "atomic" value. In R a scalar is simply a vector of length one.

Vectors

Can be numeric

Or character string

Or true/false

Or any "atomic" value

Next Dimensions

2-d matrix

3-d array

  • All elements are the same type
  • Can be any atomic type
  • Can have row and column names

Data Frame

  • Every row has the same columns, so all columns have the same length
  • Each column is one type
  • A bit like a spreadsheet table
  • But stricter!

Irregular Data - lists

  • A number of elements
  • Each element can have a name
  • Each element can be anything

Nested Irregular Data

  • Elements can be lists!
  • Careful thought needed for designing list structures...

Scalars and Vectors

Scalars

> x = 1
> x = 1.2
> x = "Hello World"
> x = TRUE

Vectors

> v = c(1, 4, 9, 16)
> v
[1]  1  4  9 16
> v[3]
[1] 9
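
Vectors of the other atomic types work the same way. A small sketch (the variable names s, b and mixed are made up for illustration) of character and logical vectors, and of what happens when types are mixed inside c():

```r
# Character and logical vectors are built with c(), just like numeric ones
s = c("red", "green", "blue")
b = c(TRUE, FALSE, TRUE)

length(s)      # number of elements: 3
s[2]           # "green"
sum(b)         # TRUE counts as 1, so this is 2

# A vector holds one type only: mixing types converts everything
# to the most general type present
mixed = c(1, "a", TRUE)
class(mixed)   # "character"
```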

Matrices and arrays

> m = matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)
> m
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> m[, 2]
[1] 4 5 6
> a = array(1:24, dim = c(2, 3, 4))
> a[1, , ]
     [,1] [,2] [,3] [,4]
[1,]    1    7   13   19
[2,]    3    9   15   21
[3,]    5   11   17   23
> a[1, 2, ]
[1]  3  9 15 21
> a[1, 2, 3]
[1] 15
> a
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

, , 3

     [,1] [,2] [,3]
[1,]   13   15   17
[2,]   14   16   18

, , 4

     [,1] [,2] [,3]
[1,]   19   21   23
[2,]   20   22   24
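
Row and column names, and the "all elements are the same type" rule, can be tried out along these lines. This is only a sketch: the names r1, low, high and m2 are invented for the example.

```r
# Same 3 x 2 matrix as above, now with row and column names
m = matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)
rownames(m) = c("r1", "r2", "r3")
colnames(m) = c("low", "high")

m["r2", "high"]   # index by name instead of position: 5
m[, "low"]        # a whole column by name, keeping the row names

# A matrix is one type throughout: assigning a string
# converts every element to character
m2 = m
m2[1, 1] = "one"
class(m2[2, 2])   # "character"
```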

Data Frames

> d = data.frame(x = 1:5, n = c("a", "b", "b", "c", 
     "d"))
> d
  x n
1 1 a
2 2 b
3 3 b
4 4 c
5 5 d
> d$x
[1] 1 2 3 4 5
> d[, 1]
[1] 1 2 3 4 5
> d[2:3, ]
  x n
2 2 b
3 3 b
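
The "stricter than a spreadsheet" point can be checked with something like the sketch below. The name bad is made up, and try() is used only so that the deliberate error does not stop the script.

```r
d = data.frame(x = 1:5, n = c("a", "b", "b", "c", "d"))

# One class per column. x is integer; n is character in R >= 4.0
# (it was a factor by default in older versions)
sapply(d, class)
nrow(d)   # 5
ncol(d)   # 2

# Columns must all have the same length, so this call fails
bad = try(data.frame(x = 1:5, n = c("a", "b", "c")), silent = TRUE)
inherits(bad, "try-error")   # TRUE

# Rows can also be selected by a condition on a column
d[d$x > 3, ]
```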

Lists - for irregular data

Suppose each person can take a test any number of times...

> e1 = list(name = "Fred", scores = c(23, 74, 12))
> e1
$name
[1] "Fred"

$scores
[1] 23 74 12
> names(e1)
[1] "name"   "scores"
> e1$name
[1] "Fred"
> mean(e1$scores)
[1] 36.33
> mean(e1[[2]])
[1] 36.33
> e2 = list(name = "Joe", scores = c(27, 65, 17, 
     19, 32))
> exams = list(e1, e2)
> exams[[1]]
$name
[1] "Fred"

$scores
[1] 23 74 12
> exams[[1]]$name
[1] "Fred"
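
With every person stored in the same list layout, one value per person can be pulled out in a single call. A possible sketch, using sapply() with small in-place functions (who, avg and n_tests are names chosen for this example):

```r
e1 = list(name = "Fred", scores = c(23, 74, 12))
e2 = list(name = "Joe", scores = c(27, 65, 17, 19, 32))
exams = list(e1, e2)

# Apply a function to each person and collect one value per person
who = sapply(exams, function(e) e$name)
avg = sapply(exams, function(e) mean(e$scores))
n_tests = sapply(exams, function(e) length(e$scores))

names(avg) = who
avg       # Fred about 36.33, Joe exactly 32
n_tests   # 3 5
```

This only works smoothly because e1 and e2 use the same element names, which is one reason list structures deserve some design thought.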

Program Flow

Loop

> for (i in c(1, 2, 3, 4, 5)) {
     cat("square root of ", i, " is ", sqrt(i), 
         "\n")
 }
square root of  1  is  1 
square root of  2  is  1.414 
square root of  3  is  1.732 
square root of  4  is  2 
square root of  5  is  2.236 

Don't loop...

In some programming languages you loop a lot.

> # divide every element by two
> x = c(1, 2, 4, 8, 16)
> for (i in 1:5) {
     x[i] = x[i]/2
 }
> x
[1] 0.5 1.0 2.0 4.0 8.0

But in R many basic operations work on whole vectors, so no loop is needed.

> x = c(1, 2, 4, 8, 16)
> x = x/2
> x
[1] 0.5 1.0 2.0 4.0 8.0
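
Division is not special here. A few more whole-vector operations, as a quick sketch:

```r
x = c(1, 2, 4, 8, 16)

x * 2 + 1                   # arithmetic: 3 5 9 17 33
sqrt(x)                     # most maths functions take whole vectors
x > 3                       # comparisons give a logical vector
x[x > 3]                    # which can pick out elements: 4 8 16
sum(x)                      # 31
x + c(10, 20, 30, 40, 50)   # element by element: 11 22 34 48 66
```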

If-Then-Else

> for (i in 1:100) {
     if (sqrt(i) == as.integer(sqrt(i))) {
         cat("sqrt(", i, ") is integer\n")
     }
 }
sqrt( 1 ) is integer
sqrt( 4 ) is integer
sqrt( 9 ) is integer
sqrt( 16 ) is integer
sqrt( 25 ) is integer
sqrt( 36 ) is integer
sqrt( 49 ) is integer
sqrt( 64 ) is integer
sqrt( 81 ) is integer
sqrt( 100 ) is integer
> random = runif(1)  # one random number
> if (random > 0.5) {
     cat("Heads you win!\n")
 } else {
     cat("Tails you lose!\n")
 }
Heads you win!
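
Both examples can also be written without a loop. The sketch below assumes nothing beyond base R; the seed value and the names r, flips and coin are arbitrary choices for illustration.

```r
i = 1:100
r = sqrt(i)

# Perfect squares, found without a loop
i[r == floor(r)]   # 1 4 9 16 25 36 49 64 81 100

# if() looks at a single value; ifelse() works element by element
set.seed(1)        # arbitrary seed, only so the run is repeatable
flips = runif(5)
coin = ifelse(flips > 0.5, "Heads", "Tails")
coin
```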

While...

> count = 0
> while (runif(1) < 0.99) {
     count = count + 1
 }
> cat("Took ", count, " iterations\n")
Took  128  iterations
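
The count above changes on every run because the loop depends on random numbers. Two variations, offered as a sketch: a loop whose result is fixed, and the random loop made repeatable with set.seed(). The seed and the max_iter safety limit are assumptions added for the example.

```r
# A deterministic loop: how many halvings take 100 below 1?
x = 100
steps = 0
while (x >= 1) {
    x = x/2
    steps = steps + 1
}
steps   # 7

# The random loop, made repeatable and given an upper limit
set.seed(42)       # arbitrary seed
count = 0
max_iter = 10000   # hypothetical safety limit
while (runif(1) < 0.99 && count < max_iter) {
    count = count + 1
}
cat("Took ", count, " iterations\n")
```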

Writing functions

> quadrat = function(x, a, b, c) {
     return(a * x^2 + b * x + c)
 }
> quadrat(c(1, 2, 3, 4), 1, 0.5, -2)
[1] -0.5  3.0  8.5 16.0
> qsolve = function(a, b, c) {
     disc = b^2 - 4 * a * c  # the discriminant
     if (disc < 0) {
         stop("Complex roots")
     }
     rplus = (-b + sqrt(disc))/(2 * a)
     rminus = (-b - sqrt(disc))/(2 * a)
     return(c(rplus, rminus))
 }
> qsolve(1, 0.5, -2)
[1]  1.186 -1.686
> quadrat(qsolve(1, 0.5, -2), 1, 0.5, -2)
[1] 0 0
> qsolve(3, -2, 1)
Error: Complex roots
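
An error raised by stop() halts the calling code unless the caller traps it. One way to do that is tryCatch(), sketched here with a self-contained copy of the solver; returning c(NA, NA) from the handler is just one possible choice.

```r
qsolve = function(a, b, c) {
    disc = b^2 - 4 * a * c   # the discriminant
    if (disc < 0) {
        stop("Complex roots")
    }
    c((-b + sqrt(disc))/(2 * a), (-b - sqrt(disc))/(2 * a))
}

# tryCatch() runs the handler instead of halting when stop() is called
roots = tryCatch(qsolve(3, -2, 1), error = function(e) {
    cat("Could not solve:", conditionMessage(e), "\n")
    c(NA, NA)
})
roots   # NA NA
```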

Function formality

A function:

  • Has a name
  • Has zero or more arguments
  • Arguments can be named or positional
  • Arguments can have default values
  • Returns a single value (which can be a vector, or a list holding several things)
  • May cause side-effects
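
The points above can be seen in one small made-up function. raise() and stats() are hypothetical names for this sketch, not part of R.

```r
# 'power' and 'verbose' have default values, so callers may leave them out
raise = function(x, power = 2, verbose = FALSE) {
    if (verbose) {
        cat("raising to power", power, "\n")   # a side effect: console output
    }
    x^power   # the value of the last expression is returned
}

raise(3)                   # 9, using the default power
raise(3, 3)                # 27, positional match
raise(power = 3, x = 2)    # 8, named arguments in any order
raise(2, verbose = TRUE)   # prints a message, then returns 4

# To hand back several results, wrap them in one list
stats = function(v) list(mean = mean(v), range = range(v))
stats(c(1, 4, 9))$range    # 1 9
```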

I'd like to have an argument...

> args(log)  # inspect arguments
function (x, base = exp(1)) 
NULL
> log(100)  # default base-e natural log
[1] 4.605
> log(100, base = 10)  # name match
[1] 2
> log(100, 10)  # position match
[1] 2
> log(100, b = 10)  # partial name
[1] 2
> log(100, z = 10)  # wrong
Error: unused argument(s) (z = 10)
> log(base = 10, 100)  # args backwards
[1] 2

Tips (or "knowing your args from your elbow")

  • Don't mess with the order unless you have a very good reason
  • Using named arguments helps with clarity
  • Don't shorten argument names - clarity is a good thing
  • The help page for a function, e.g. help(log), should explain its arguments
  • When writing functions, think carefully about arguments

Functional Programming

Using functions as arguments...

> m
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> apply(m, 1, sum)
[1] 5 7 9
> apply(m, 2, mean)
[1] 2 5
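
The function passed in does not have to be a built-in one. A short sketch (twice() is a made-up helper for illustration):

```r
m = matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)

# Pass a function of your own, defined in place
apply(m, 1, function(r) max(r) - min(r))   # spread within each row: 3 3 3

# sapply() calls a function once per element of a vector or list
sapply(c(1, 4, 9), sqrt)                   # 1 2 3

# Functions are ordinary values: store them, pass them, call them later
twice = function(f, x) f(f(x))
twice(sqrt, 16)                            # 2
```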

Object-oriented what?

A programming methodology.

  • Model things as 'objects' of a class
  • Have a nested hierarchy of classes
  • Specify attributes of objects
  • Specify "methods" - what you can do with those objects

We've already seen this in action.

> x = 1:10
> y = rnorm(10)
> m = lm(y ~ x)
> class(m)
[1] "lm"

This is an object of the lm (linear model) class. If I do residuals(m) I'm calling the residuals method for lm objects.

> x = 1:10
> p = rpois(10, 3)
> gm = glm(p ~ x, family = "poisson")
> class(gm)
[1] "glm" "lm" 

This is an object of class glm and lm. A GLM is a generalisation of an LM, so it should have all the same methods, and possibly new ones.

Now when I do residuals(gm) I am calling the method for GLMs. If there were no GLM method, R would fall back to the lm method.
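
A rough way to check this for yourself, assuming only the stats and utils packages that R loads by default. The seed is arbitrary and only makes the simulated counts repeatable.

```r
set.seed(1)   # arbitrary seed
x = 1:10
p = rpois(10, 3)
gm = glm(p ~ x, family = "poisson")

inherits(gm, "glm")   # TRUE
inherits(gm, "lm")    # TRUE: a glm object is also an lm

# residuals() is a generic; class-specific versions are named generic.class.
# getS3method() looks one up and fails if it does not exist.
is.function(getS3method("residuals", "glm"))   # TRUE, used for residuals(gm)
is.function(getS3method("residuals", "lm"))    # TRUE, used for plain lm objects

length(residuals(gm))   # 10, one residual per observation
```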

Objects Everywhere

Once you start working seriously with R you will not only be using objects everywhere (data frames are a class), but you might start defining your own classes.
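
Defining a class can be as light as attaching a class name to a list. A minimal sketch of that (S3) style, reusing the exam data from earlier; the class name "exam" and its methods are invented for illustration.

```r
# Constructor: an ordinary list with a class attribute attached
exam = function(name, scores) {
    obj = list(name = name, scores = scores)
    class(obj) = "exam"
    obj
}

# Methods are functions named generic.class
print.exam = function(x, ...) {
    cat(x$name, "took", length(x$scores), "tests\n")
    invisible(x)
}
mean.exam = function(x, ...) mean(x$scores)

fred = exam("Fred", c(23, 74, 12))
print(fred)   # Fred took 3 tests
mean(fred)    # about 36.33
```

Calling print(fred) or mean(fred) picks the exam method automatically, in the same way that residuals(gm) picked the GLM one.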