In the first of two talks I'm going to tell you about the basic spatial data operations in R.
I always like to give some history in my talks. I won't be going back to 1066 this time.
R's genealogy can be traced back to the S Language from 1976, developed partly out of the frustration of scientists with Fortran... By 1988 it had been commercialised into S-Plus, an environment for statistics and graphics.
The graphics capabilities of S-Plus meant you could produce maps with some effort. There were functions for drawing points, lines, and polygons.
Fairly early on, the map package appeared – this could draw geographic maps such as those made from polygonal regions.
Useful maps could be made with this package – data could be overlaid, maps could be projected and so on. Coloured maps could also be produced.
However, the data that the map package needed was in a strict format. Initially you needed a .line, a .gon, and a .names file...
The .line file consists of a number of records separated by EOR markers. The first line of each record says which polygons lie on either side of the line, followed by a series of coordinates for the line.
The .gon file is also composed of a number of records separated by EOR. Each record lists which of the lines (from the .line file) are needed to draw the polygon. If a line needs to be drawn in the reverse direction from its order in the .line file, the index is a negative number.
Finally the .names file gives each polygon a name. The .gon file records are strictly single rings, so regions composed of more than one ring (such as a country with islands) are identified by having the same name. A convention of name:subname is used to identify parts of a larger region, such as the Scottish island of Harris here.
Got those files ready? Now run them through mapgetl and mapgetg to produce binary versions of the files. These can be accessed more quickly than the text files.
The map package still exists, and has been ported and maintained for R. Several geographic databases in this format exist and it still finds some use.
Around the year 2000 there was a lot of activity in the spatial arena. GIS software was in wide use and maps were appearing on the web from providers such as MultiMap (Google Maps was still five years away). To try to make sense of all this, the Open Geospatial Consortium created standards for data formats and protocols.
By now R was getting lots of use. And people were wanting to do more with maps, and to integrate with their GIS software. A few small projects developed, and some specific spatial statistical packages appeared, most of which used their own spatial data format, incompatible with anything else.
So a bunch of us came together to work out how better to do spatial data in R.
The Big Idea was to implement something like the OGC Simple Features Specification – this defines data for points, lines, and polygons with associated attribute data.
This all came together in the sp package and the book by Roger, Edzer, and Virgilio, “Applied Spatial Data Analysis with R”.
Now I'll go through the basics of making spatial data objects with the sp package and classes. In these examples x and y are vectors of coordinates and coords is a 2-column matrix.
Points are easy. Just call SpatialPoints with a 2-column matrix of coordinates and you get your first sp-class object.
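As a minimal sketch (the coordinate values here are invented):

```r
library(sp)

x <- c(1, 2, 3)
y <- c(4, 5, 6)
coords <- cbind(x, y)   # a 2-column matrix of coordinates

pts <- SpatialPoints(coords)
class(pts)   # "SpatialPoints"
```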
You can give each point some attributes by creating a SpatialPointsDataFrame with either the 2-column matrix or the previous SpatialPoints object and a dataframe.
Another way is to use the coordinates function to promote a data frame to a spatial data frame.
There are two ways (at least) of getting back to a plain data frame from a SpatialPointsDataFrame.
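Sketched out with invented data, these routes look something like this:

```r
library(sp)

coords <- cbind(x = c(1, 2, 3), y = c(4, 5, 6))
df <- data.frame(name = c("a", "b", "c"), value = c(10, 20, 30))

# From a 2-column matrix (or a SpatialPoints object) plus a data frame:
spdf <- SpatialPointsDataFrame(coords, df)

# Or promote a plain data frame in place by telling sp which
# columns hold the coordinates:
df2 <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6), value = c(10, 20, 30))
coordinates(df2) <- ~ x + y   # df2 is now a SpatialPointsDataFrame

# Two ways back to a plain data frame:
back1 <- as.data.frame(spdf)                  # attributes plus coordinate columns
back2 <- cbind(coordinates(spdf), spdf@data)  # or assemble it by hand
```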
From a database point of view, we've created a table where one of the columns is the point coordinate, and the other columns are the other data.
We can manipulate a SpatialPointsDataFrame in much the same way as an ordinary data frame.
Let's go up a dimension. Line objects are single chains of points. Here are three of them, created from three pairs of x and y coordinate vectors.
Lines L2 and L3 are part of the same river network. We can create Lines objects as lists of Line objects. Each Lines object has an ID.
Currently these things are just geometric. To make them geographic, we create SpatialLines objects. Our two Lines objects are now wrapped in one SpatialLines object.
We can then make a SpatialLinesDataFrame where each of our two Lines objects has a corresponding row in a data frame.
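Here's a sketch of the whole chain, with invented coordinates and IDs:

```r
library(sp)

# Three single chains of points (Line objects)
l1 <- Line(cbind(c(0, 1, 2), c(0, 1, 1)))
l2 <- Line(cbind(c(2, 3, 4), c(1, 2, 2)))
l3 <- Line(cbind(c(3, 3, 4), c(0, 1, 2)))

# L2 and L3 belong to the same river network, so they share one Lines object
ls1 <- Lines(list(l1), ID = "canal")
ls2 <- Lines(list(l2, l3), ID = "river")

# Wrap both Lines objects in one SpatialLines object
sl <- SpatialLines(list(ls1, ls2))

# One row of attributes per Lines object, matched by ID via row names
sldf <- SpatialLinesDataFrame(sl,
          data.frame(type = c("canal", "river"),
                     row.names = c("canal", "river")))
```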
So from a database point of view, we have tables where there is a column describing the Lines object underlying each feature.
Time to add another dimension...
A Polygon is a single ring, constructed from coordinates with an identical start and end coordinate.
A Polygons object is a list of Polygon objects with an ID, like Lines objects are to Line objects.
Just as we can make Lines objects with more than one Line, we can make Polygons objects with more than one Polygon.
You can use the hole flag to create holes in Polygons objects. A Polygons object can have multiple islands and holes and nested holes on islands...
Similarly to Lines, we can create SpatialPolygons as a list of Polygons, and then also create SpatialPolygonsDataFrame objects by adding a data frame.
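A sketch with made-up rings, showing the hole flag and multiple rings per Polygons object:

```r
library(sp)

# An outer ring: the first and last coordinates must be identical
outer <- Polygon(cbind(c(0, 4, 4, 0, 0), c(0, 0, 4, 4, 0)))
# A hole inside it, flagged with hole = TRUE
lake  <- Polygon(cbind(c(1, 2, 2, 1, 1), c(1, 1, 2, 2, 1)), hole = TRUE)
# A separate island ring
isle  <- Polygon(cbind(c(5, 6, 6, 5, 5), c(5, 5, 6, 6, 5)))

# One Polygons object can hold several rings, holes included
p1 <- Polygons(list(outer, lake, isle), ID = "region1")

spoly <- SpatialPolygons(list(p1))
spdf  <- SpatialPolygonsDataFrame(spoly,
           data.frame(name = "Region 1", row.names = "region1"))
```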
Getting back to the coordinates from a SpatialPolygonsDataFrame is possible, but messy. If you do this a lot, you'll want to write some functions to help.
So, to complete the database picture, each row of the table has a coordinates column with a complex description of a polygonal feature, plus its attributes.
This is the main reason for making Spatial things. Coordinates.
If you make a SpatialPoints object and add a proj4string argument with a CRS (Coordinate Reference System) object you've told R that this is using a particular coordinate system.
And you can do the same for SpatialAnythingDataFrames.
PROJ.4 is a C library for performing projections and coordinate transformations. It offers a number of ways of specifying coordinate systems, such as EPSG codes or explicitly specifying transformations and projections.
This is quite important. If someone gives you coordinates from their GPS, they are most likely in the EPSG:4326 system (WGS84 longitude/latitude). Setting the CRS of your SpatialPoints to anything else is wrong.
Once you've set the right CRS, you can use spTransform to convert to something else, such as the UK metric grid system.
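A sketch of both steps, with invented longitude/latitude pairs (note that newer versions of sp and PROJ may prefer other ways of writing the CRS, but the `+init=epsg:` form is the classic one):

```r
library(sp)

# GPS-style coordinates: tell R they are WGS84 longitude/latitude
pts <- SpatialPoints(cbind(c(-3.2, -3.1), c(55.9, 56.0)),
                     proj4string = CRS("+init=epsg:4326"))

# The same proj4string argument works for the *DataFrame variants
spdf <- SpatialPointsDataFrame(coordinates(pts), data.frame(id = 1:2),
                               proj4string = CRS("+init=epsg:4326"))

# Reproject to the UK metric grid (British National Grid, EPSG:27700)
pts_bng <- spTransform(pts, CRS("+init=epsg:27700"))
```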
That's enough for vector data. There's also raster data, which is data that exists on regular grids...
You get this from photography – which might be in visible wavelengths or invisible ones such as infrared – radar height sensing, rainfall radar measurements, and so on. Sometimes you'll get a raster of counts in each grid cell, perhaps from animal or plant surveys.
Another common source is from modelled values, such as the output from climate models or interpolations from an irregularly-spaced set of sensors.
R has had some raster capability by working with matrices. The image function lets you draw raster plots and can take x and y vectors to position the raster in some coordinate system.
The volcano dataset is supplied with R and is used in several examples. I've always wondered what this volcano looks like...
..and it turns out to be a green space in Auckland. You can park your car on the top.
Not the rocky, fuming mess of boiling lava I was expecting.
Nowadays we have the raster package. It defines classes for various types of raster data.
These raster objects can be made in several ways, perhaps the simplest is directly from a matrix and some x and y coordinates. Plot this, and you get a raster map with a legend.
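A minimal sketch, using random values in place of real data:

```r
library(raster)

m <- matrix(runif(100), nrow = 10)   # invented values
r <- raster(m, xmn = 0, xmx = 10, ymn = 0, ymx = 10)

plot(r)   # a raster map with a legend
```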
The raster function can also read rasters from standard GIS format raster files. Here I've loaded a digital elevation model.
You can operate on raster objects almost like they were matrices. Arithmetic operations, functions, and conditional tests may all work. Here I've computed the hill shading of my DEM and produced a coloured plot with hill shading for texture.
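The hill-shaded plot can be sketched like this (assuming `dem` is a RasterLayer of elevations, e.g. read with `raster("dem.tif")`); this is the standard terrain/hillShade idiom from the raster package:

```r
library(raster)

slope  <- terrain(dem, opt = "slope")
aspect <- terrain(dem, opt = "aspect")
hs <- hillShade(slope, aspect, angle = 40, direction = 270)

# Grey hill shading underneath, semi-transparent elevation colours on top
plot(hs, col = grey(0:100 / 100), legend = FALSE)
plot(dem, col = terrain.colors(25, alpha = 0.4), add = TRUE)
```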
Some rasters are categorical – each number refers to a particular type. This example is a land cover raster. Red are urban regions, greens are various types of woodland or forest, blue is lakes, and dark blue/purple are boggy marshland, most of which I've walked over.
Notice there's no legend on the plot. You can get the values of the raster and tabulate it.
The rasterVis package lets you do 3d interactive plots – this can be grabbed with the mouse and spun around and zoomed.
What about projections? Here is a plot of some settlements in this region, and the raster. The numbers are clearly different. On the left are UK Grid system coordinates. What's on the right?
The land use raster is a rectangular crop from data defined on a European grid.
The coordinate system for this grid is held in the data file and read in by the raster package. You can extract it with the projection function. This tells us it's a Lambert Azimuthal Equal Area projection with particular parameters. All we need to know is that this is a proj4 string...
We can use the projectRaster function to transform the raster to another coordinate system, in this case to UK grid coordinates. However, raster seems to lose the categorical nature of the raster at this point and treats it as continuous values. Also, reprojecting a raster like this can be a lossy process – each pixel of the projected raster comes from the overlap of several pixels in the original. Various interpolation schemes are possible, but none are perfect...
The better option is to transform vector data – in this case the points – to the coordinate system of the raster.
We can use spTransform and get the projection directly from the land use raster.
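As a sketch (assuming `landuse` is the raster read from file and `pts` are the settlement points with their CRS already set):

```r
library(raster)
library(sp)

# The CRS came in with the data file; extract it as a proj4 string
projection(landuse)

# Transform the vector data to the raster's coordinate system,
# so no raster pixels are resampled
pts_laea <- spTransform(pts, CRS(projection(landuse)))
```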
Now the two data sets match up.
That's the fundamentals of vector and raster data in the sp and raster packages... Now we'll do something with them.
So we have a SpatialPointsDataFrame of settlements and a raster of land use codes.
The simplest thing to do is to find out what land use category is under each point. The vector that comes out of extract is the same length as the number of settlements, and gives the land use code.
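Sketched out (assuming `landuse` and `settlements` are in the same CRS):

```r
library(raster)

# One land use code per settlement, in the same order as the points
codes <- extract(landuse, settlements)
length(codes) == length(settlements)   # TRUE
```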
Now let's try to categorise the land use within a kilometre of each settlement. The rgeos package has a buffer algorithm that computes buffer zones around features.
This first attempt, however, returns a single feature composed of lots of regions, so when we extract the land use we actually get a summary of the land use within 1 km of all the settlements combined.
If we add the byid=TRUE argument, then we get individual buffers for each settlement. Since we have point features, the buffers are circular.
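The difference is just the byid argument (width is in the units of the CRS, here metres):

```r
library(rgeos)

# One merged feature covering all the settlements:
buf_all  <- gBuffer(settlements, width = 1000)

# One circular buffer per settlement:
buf_each <- gBuffer(settlements, width = 1000, byid = TRUE)
```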
Summarizing the land use in each buffer is a two-step process – first extract the pixel values, then tabulate them for each buffer...
The extract function in this case returns a list, because buffers may be different sizes and so return different numbers of pixels. Here are the first two elements of that list.
Next we apply the table function to each of the elements of that list. We can see the make-up of the pixels around the 55th and 56th town here.
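The two steps together, as a sketch:

```r
library(raster)

# Step 1: pixel values within each buffer - a list, one element per buffer
vals <- extract(landuse, buf_each)

# Step 2: tabulate the land use codes in each buffer
counts <- lapply(vals, table)
counts[[55]]   # the make-up of the pixels around the 55th settlement
```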
With a bit of manipulation we can create a new SpatialPointsDataFrame with the coordinates of the settlements and the grid cell counts in the land use categories we are interested in. We can use spplot to make a coloured map of the amount of certain categories.
Now we have a substantive question.
We can test the attributes of SpatialPointsDataFrame to find those with any conifer forest nearby. Then we plot them.
As ever in R, it can save time to write a flexible function to do all this. At some point you are going to be asked to do it again with a radius of 2km, or 2.5km, or on a new set of settlements or in a different area altogether.
Putting it into a function means that all you need do is change a number and run it...
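A hedged sketch of what such a function might look like – the function name, arguments, and the `count` column are all illustrative, not a fixed API:

```r
# Map the number of grid cells of one land use category near each point.
# points:  a SpatialPointsDataFrame in the same CRS as the raster
# landuse: a categorical RasterLayer
# radius:  buffer radius in CRS units (metres here)
# category: the land use code of interest
landuse_near <- function(points, landuse, radius = 1000, category) {
  buf  <- rgeos::gBuffer(points, width = radius, byid = TRUE)
  vals <- raster::extract(landuse, buf)
  points$count <- sapply(vals, function(v) sum(v == category))
  sp::spplot(points, "count")
}

# Re-run with a different radius is now a one-liner:
# landuse_near(settlements, landuse, radius = 2000, category = conifer_code)
```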
...and out pops your map. Combine this with knitr or Sweave for instant report writing!
Part two will talk about the wider spatial data infrastructure, and how R fits into that. We're going off-road!