Week 1, Lesson 5: Intro To Raster

Today's learning objectives

At the end of the lecture, you should be able to:

  • Read/write raster data
  • Perform basic raster file operations/conversions
  • Perform simple raster calculations

ToDo today

  • Morning: Read the following tutorial
  • 13:30 to 14:30:
    • Feedback on yesterday's exercise,
    • Clarifications about today's tutorial
    • Explanation of the exercise
  • Rest of the afternoon: Do the exercise

Assumed knowledge from previous lectures

  • Understand system architecture (R, R packages, libraries, drivers, bindings)
  • Good scripting habits

Reminder on overall system architecture

R system architecture graph

Tip: RKWard is an alternative R GUI, which is well-suited for beginners and those who are familiar with traditional statistical packages like SPSS and STATISTICA, since it is menu-driven. It is usually easier to install on Linux than RStudio as well. You can work in whichever R GUI you prefer.

In a previous lecture you got introduced to the overall system architecture of R programming (see figure above) and you saw how to read and write vector data from/to file into your R environment. These vector read/write operations were made possible thanks to the OGR library. The OGR library is interfaced with R thanks to the rgdal package/binding. By analogy, raster data can be read/written thanks to the GDAL library. The figure above provides an overview of the connections between these elements. GDAL stands for Geospatial Data Abstraction Library. You can check the project home page at http://www.gdal.org/ and you will be surprised to see that a lot of the software you have used in the past to read gridded geospatial data use GDAL (i.e.: ArcGIS, QGIS, GRASS, etc). In this lesson, we will use GDAL indirectly via the raster package, which uses rgdal extensively. However, it is possible to call GDAL functionalities directly through the command line from a terminal, which is equivalent to calling a system() command directly from within R. In addition, if you are familiar with R and its string handling utilities, it may facilitate the building of the expressions that have to be passed to GDAL. (Note: This is also doable in Bash scripting, as learned in the previous lesson, and you can even combine the two.)

Let's start working with rgdal by performing system setup checks.
# Example to perform system set-up checks
## Loading required package: sp
## rgdal: version: 1.2-5, (SVN revision 648)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.1.0, released 2016/04/25
##  Path to GDAL shared files: /usr/share/gdal/2.1
##  Loaded PROJ.4 runtime: Rel. 4.9.2, 08 September 2015, [PJ_VERSION: 492]
##  Path to PROJ.4 shared files: (autodetected)
##  Linking to sp version: 1.2-3
## [1] "GDAL 2.1.0, released 2016/04/25"

The previous function should return the version number of the current version of GDAL installed on your machine. 2.1.0 is the most recent stable release. Starting with GDAL 2.0 vector processing becomes incorporated into GDAL. In case the function above returns an error, or if you cannot install rgdal at all, you should verify that all required software and libraries are properly installed. Please refer to the system setup page.

Overview of the raster package

The raster package is the reference R package for raster processing, Robert J. Hijmans is the original developer of the package. The introduction of the raster package to R has been a revolution for geo-processing and analysis using R. Among other things the raster package allows to:

  • Read and write raster data of most commonly used formats (thanks to extensive use of rgdal).
  • Perform most raster operations, such ascreation of raster objects, performing spatial/geometric operations (re-projections, resampling, etc), filtering and raster calculations.
  • Work on large raster datasets thanks to its built-in block processing functionalities.
  • Perform fast operations thanks to optimized back-end C code.
  • Visualize and interact with the data.
  • etc...

Check the home page of the raster package, the package is extremely well documented, including vignettes and demos.

Explore the raster objects

The raster package produces and uses R objects of three different classes. The RasterLayer, the RasterStack and the RasterBrick. A RasterLayer is the equivalent of a single-layer raster, as an R workspace variable. The data themselves, depending on the size of the grid can be loaded in memory or on disk. The same stands for RasterBrick and RasterStack objects, which are the equivalent of multi-layer RasterLayer objects. RasterStack and RasterBrick are very similar, the difference being in the virtual characteristic of the RasterStack. While a RasterBrick has to refer to one multi-layer file or is in itself a multi-layer object with data loaded in memory, a RasterStack may ''virtually'' connect several raster objects written to different files or in memory. Processing will be more efficient for a RasterBrick than for a RasterStack, but RasterStack has the advantage of facilitating pixel based calculations on separate raster layers.

Let's take a look into the structure of these objects.
## Generate a RasterLayer object
r <- raster(ncol=40, nrow=20)
## [1] "RasterLayer"
## attr(,"package")
## [1] "raster"
# Simply typing the object name displays its general properties / metadata
## class       : RasterLayer 
## dimensions  : 20, 40, 800  (nrow, ncol, ncell)
## resolution  : 9, 9  (x, y)
## extent      : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0

From the metadata displayed above, we can see that the RasterLayer object contains all the properties that geo-data should have; that is to say a projection, an extent and a pixel resolution.

RasterBrick and RasterStack objects can also fairly easily be generated directly in R, as shown in the example below. Being able to generate such objects without reading them from files is particularly important for the generation of reproducible examples which was covered in general terms in Lesson 3.

# Using the previously generated RasterLayer object
# Let's first put some values in the cells of the layer
r[] <- rnorm(n=ncell(r))
# Create a RasterStack object with 3 layers
s <- stack(x=c(r, r*2, r))
# The exact same procedure works for creating a RasterBrick
b <- brick(x=c(r, r*2, r))
# Let's look at the properties of of one of these two objects
## class       : RasterBrick 
## dimensions  : 20, 40, 800, 3  (nrow, ncol, ncell, nlayers)
## resolution  : 9, 9  (x, y)
## extent      : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## data source : in memory
## names       :   layer.1,   layer.2,   layer.3 
## min values  : -3.012138, -6.024277, -3.012138 
## max values  :  3.736007,  7.472015,  3.736007

The RasterBrick metadata displayed above are mostly similar to what we saw earlier for the RasterLayer object, with the exception that these are multi-layer objects.

Raster objects manipulations

Reading and writing from/to file

The actual data used in geo-processing projects often comes as geo-data, stored on files such as GeoTIFF or other commonly used file formats. Reading data directly from these files into the R working environment (as objects belonging to one of the 3 raster objects classes) is made possible thanks to the raster package. The three main commands for reading raster objects from files are the raster(), stack(), and brick() functions, refering to RasterLayer, RasterStack and RasterBrick objects respectively. Writing one of the three raster object classes to file is achieved with the writeRaster() function.

To illustrate the reading and writing of raster files, we will use data subsets that we have prepared for the course and need to be downloaded from the repository. For that, make sure your working directory is set properly and run the following line; it will handle the download:

# Start by making sure that your working directory is properly set
# If not you can set it using setwd()
download.file(url = 'https://raw.githubusercontent.com/GeoScripting-WUR/IntroToRaster/gh-pages/data/gewata.zip', destfile = 'gewata.zip', method = 'auto')
# In case the download code doesn't work, use method = 'wget'
## Unpack the archive

Gewata is the name of the data set added, it is a multi-layer GeoTIFF object, its file name is LE71700552001036SGS00_SR_Gewata_INT1U.tif, informing us that this is a subset from a scene acquired by the Landsat 7 sensor. Let's not worry about the region that the data covers for now, we will find a nice way to discover that later on in the tutorial. See the example below.

Now that we have downloaded and unpacked the GeoTIFF file, it should be present in our working directory. We can investigate the content of the working directory (or any directory) using the list.files() function.

# When passed without arguments, list.files() returns a character vector, listing the content of the working directory
# To get only the files with .tif extension
list.files(pattern = glob2rx('*.tif'))
# Or if you are familiar with regular expressions
list.files(pattern = '^.*\\.tif$')

We can now load this object in R, since it is a multi-layer raster object, we need to use the brick() function to do that.

gewata <- brick('LE71700552001036SGS00_SR_Gewata_INT1U.tif')
Let's take a look at the structure of this object.
## class       : RasterBrick 
## dimensions  : 593, 653, 387229, 6  (nrow, ncol, ncell, nlayers)
## resolution  : 30, 30  (x, y)
## extent      : 829455, 849045, 825405, 843195  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=36 +ellps=WGS84 +units=m +no_defs 
## data source : /home/ubuntu/userdata/IntroToRaster/data/LE71700552001036SGS00_SR_Gewata_INT1U.tif 
## names       : LE7170055//ta_INT1U.1, LE7170055//ta_INT1U.2, LE7170055//ta_INT1U.3, LE7170055//ta_INT1U.4, LE7170055//ta_INT1U.5, LE7170055//ta_INT1U.6 
## min values  :                     4,                     6,                     3,                    18,                     6,                     2 
## max values  :                    39,                    56,                    71,                   102,                   138,                   408

The metadata above informs us that the gewata object is a relatively small (593x653 pixels) RasterBrick with 6 layers. Similarly, single-layer objects can be read using the raster() function. Or if you try using the raster() function on a multi-layer object, by default the first layer only will be read.

gewataB1 <- raster('LE71700552001036SGS00_SR_Gewata_INT1U.tif')
## class       : RasterLayer 
## band        : 1  (of  6  bands)
## dimensions  : 593, 653, 387229  (nrow, ncol, ncell)
## resolution  : 30, 30  (x, y)
## extent      : 829455, 849045, 825405, 843195  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=36 +ellps=WGS84 +units=m +no_defs 
## data source : /home/ubuntu/userdata/IntroToRaster/data/LE71700552001036SGS00_SR_Gewata_INT1U.tif 
## names       : LE71700552001036SGS00_SR_Gewata_INT1U 
## values      : 4, 39  (min, max)

Note that in addition to supporting most commonly used geodata formats, the raster package has its own format. Saving a file using the .grd extension ('filename.grd') will automatically save the object to the raster package format. This format has some advantages when performing geo processing in R (one advantage for instance is that it conserves original filenames as layer names in multilayer objects), however, it also has disadvantages, since those files are not compressed and thus very large, and GDAL itself does not have drivers for that file format (it is only readable by raster).

Geo processing, in memory vs. on disk

When looking at the documentation of most functions of the raster package, you will notice that the list of arguments is almost always ended by .... These 'three dots' are called an ellipsis; it means that extra arguments can be passed to the function. Often these arguments are those that can be passed to the writeRaster() function; meaning that most geo-processing functions are able to write their output directly to file, on disk. This reduces the number of steps and is always a good consideration when working with big raster objects that tend to overload the memory if not written directly to file.

Data type is (still) important

When writing files to disk using writeRaster() or the filename = argument in most raster processing functions, you should set an appropriate data type. Use the datatype = argument, it will save some precious disk space, and increase read and write speed.

See details in ?dataType.

Cropping a raster object

crop() is the raster package function that allows you to crop data to smaller spatial extents. A great advantage of the crop function is that it accepts almost all spatial object classes in R as its extent input argument. But the extent argument also simply accepts objects of class extent. One way of obtaining such an extent object interactively is by using the drawExtent() function. In the example below, we will manually draw a regular extent that we will use later to crop the gewata RasterBrick.

## Plot the first layer of the RasterBrick
plot(gewata, 1)
e <- drawExtent(show=TRUE)

Now you have to define a rectangular bounding box that will define the spatial extent of the extent object. Click twice, for the two opposite corners of the rectangle. Now we can crop the data following the boundaries of this extent.

## Crop gewata using e
gewataSub <- crop(gewata, e)
## Now visualize the new cropped object
plot(gewataSub, 1)

You should see on the resulting plot that the original image has been cropped.

Creating layer stacks

To end this section on general files and raster object manipulations, we will see how multi-layer objects can be created from single-layer objects. The object created as part of the example below is the same that we will use later in the course to perform time series analysis on raster objects. It is composed of NDVI layers derived from Landsat acquisitions at different dates. The objective is therefore to create a multi-layer NDVI object, for which each layer corresponds to a different date. But first we need to fetch the data, similarly to how we did it for the gewata brick.

# Again, make sure that your working directory is properly set
## Download the data
download.file(url='https://raw.githubusercontent.com/GeoScripting-WUR/IntroToRaster/gh-pages/data/tura.zip', destfile='tura.zip', method='auto')
## Retrieve the content of the tura sub-directory
list <- list.files(path='tura/', full.names=TRUE)

The object list contains the file names of all the single layers we have to stack. Let's open the first one to visualize it.


We see an NDVI layer, with the clouds masked out. Now let's create the RasterStack, the function for doing that is called stack(). Looking at the help page of the function , you can see that it can accept a list of file names as argument, which is what the object list represents. So we can very simply create the layer stack by running the function.

turaStack <- stack(list)

Now that we have our 166 layers RasterStack in memory, let's write it to disk using the writeRaster() function. Note that we decide here to save it as .grd file (the native format of the raster package); the reason for that is that this file format conserves original file names (in which information on dates is written) in the individual band names. The data range is comprised between -10000 and +10000, therefore such a file can be stored as signed 2 byte integer (INT2S).

# Write this file at the root of the working directory
writeRaster(x=turaStack, filename='turaStack.grd', datatype='INT2S')

Now this object is stored on your computer, ready to be archived for later use.

Simple raster arithmetic

Adding, subtracting, multiplying and dividing RasterLayers

Performing simple raster operations with raster objects is fairly easy. For instance, if you want to subtract two RasterLayers of same extent, r1 and r2; simply doing r1 - r2 will give the expected output, which is, every pixel value of r2 will be subtracted from the matching pixel value of r1. These types of pixel-based operations almost always require a set of conditions to be met in order to be executed; the two RasterLayers need to be identical in term of extent, resolution, projection, etc.

Subsetting layers from RasterStack and RasterBrick

Different spectral bands of a same satellite scene are often stored in multi-layer objects. This means that you will very likely import them in your R working environment as RasterBrick or RasterStack objects. As a consequence, to perform calculations between these bands, you will have to write an expression refering to individual layers of the object. Referring to individual layers in a RasterBrick or RasterStack object is done by using double square brackets [[]]. Let's look for instance at how the famous NDVI index would have to be calculated from the gewata RasterBrick object read earlier, and that contains the spectral bands of the Landsat 7 sensor. And in case you have forgotten, the NDVI formula is as follows.

$$ NDVI=\frac{NIR-Red}{NIR+Red} $$

with NIR and Red being band 4 and 3 of Landsat 7 respectively.

ndvi <- (gewata[[4]] - gewata[[3]]) / (gewata[[4]] + gewata[[3]])

The plot() function automatically recognises the objects of Raster* classes and returns an appropriate spatial plot.


The resulting NDVI can be viewed in the above figure. As expected the NDVI ranges from about 0.2, which corresponds to nearly bare soils, to 0.9 which means that there is some dense vegetation in the area.

Although this is a quick way to perform the calculation, directly adding, subtracting, multiplying, etc, the layers of big raster objects is not recommended. When working with big objects, it is advisable to use the calc() function to perform these types of calculaions. The reason is that R needs to load all the data first into its internal memory before performing the calculation and then runs everything in one block. It is really easy to run out of memory when doing that. A big advantage of the calc() function is that it has a built-in block processing option for any vectorized function, allowing such calculations to be fully "RAM friendly". The example below illustrates how to calculate NDVI from the same date set using the calc() function.

## Define the function to calculate NDVI from 
ndvCalc <- function(x) {
    ndvi <- (x[[4]] - x[[3]]) / (x[[4]] + x[[3]])
ndvi2 <- calc(x=gewata, fun=ndvCalc)

Note that overlay() can also be used in that case to obtain the same result, with the same level of RAM friendlyness. The advantage of overlay() is that the number of input RasterLayers is less limiting. As a consequence specifying the layers does not happen in the function call but in the overlay() call instead.

ndvOver <- function(x, y) {
    ndvi <- (y - x) / (x + y)
ndvi3 <- overlay(x=gewata[[3]], y=gewata[[4]], fun=ndvOver)

We can verify that the three layers ndvi, ndvi2 and ndvi3 are actually identical using the all.equal() function from the raster package.

all.equal(ndvi, ndvi2)
## [1] TRUE
all.equal(ndvi, ndvi3)
## [1] TRUE

In the simple case of calculating NDVI, we were easily able to produce the same result with calc() and overlay(), however, it is often the case that one function is preferable to the other. As a general rule, a calculation that needs to refer to multiple individual layers separately will be easier to set up in overlay() than in calc().


By the way, we still don't know where this area is. In order to investigate that, we are going to try projecting it in Google Earth. As you know Google Earth is all in Lat/Long, so we have to get our data re-projected to Lat/Long first. The projectRaster() function allows re-projection of raster objects to any projection one can think of. As the function uses the PROJ.4 library (the reference library, external to R, that handles cartographic projections and performs projections transformations; the rgdal package is the interface between that library and R) to perform that operation, the crs= argument should receive a proj4 expression. proj4 expressions are strings that provide the projection parameters of cartographic projections. A central place to search for projections is the spatial reference website (http://spatialreference.org/), from this database you will be able to query almost any reference and retrieve it in any format, including its proj4 expression.

## One single line is sufficient to project any raster to any projection
ndviLL <- projectRaster(ndvi, crs='+proj=longlat')

Note that if re-projecting and mosaicking is really a large part of your project, you may want to consider using the gdalwarp command line utility (gdalwarp) directly. The gdalUtils R package provides utilities to run GDAL commands from R, including gdalwarp, for reprojection, resampling and mosaicking.

Now that we have our NDVI layer in Lat/Long, let's write it to a KML file, which is one of the two Google Earth formats.

# Since this function will write a file to your working directory
# you want to make sure that it is set where you want the file to be written
# It can be changed using setwd()
# Note that we are using the filename argument, contained in the ellipsis (...) of 
# the function, since we want to write the output directly to file.
KML(x=ndviLL, filename='gewataNDVI.kml')

Note that you need to have Google Earth installed on your system in order to perform the following step. Now let's find that file that we have just written and double click it, and watch how Google Earth brings us all the way to ... Ethiopia. More information will come later in the course about that specific area.

We are done with this data set for this lesson. So let's explore another data set, from the Landsat sensors. This dataset will allow us to find other interesting raster operations to perform.

More raster arithmetics: performing simple value replacements

Since 2014, the USGS has started releasing Landsat data processed to surface reflectance. This means that they are taking care of important steps such as atmospheric correction and conversion from sensor radiance to reflectance factors. Additionally, they provide a cloud mask with this product. The cloud mask is an extra raster layer, at the same resolution as the surface reflectance bands, that contains information about the presence or absence of cloud as well as shadowing effects from the clouds. The cloud mask of Landsat surface reflectance product is named cfmask, after the name of the algorithm used to detect the clouds. For more information about cloud detection, see the algorithm page, and the publication by Zhu and Woodcock (2012). In the following section we will use that cfmask layer to mask out remaining clouds in a Landsat scene.

About the area

The area selected for this exercise covers most of the South Pacific island of Tahiti, French Polynesia. It is a mountaneous, volcanic island, and according to Wikipedia about 180,000 people live on the island. For convenience, the Landsat scene was subsetted to cover only the area of interest and is stored online.

## Download the data
download.file(url='https://raw.githubusercontent.com/GeoScripting-WUR/IntroToRaster/gh-pages/data/tahiti.zip', destfile='tahiti.zip', method='auto')

## Load the data as a RasterBrick object and investigate its content
tahiti <- brick('LE70530722000126_sub.grd')

## Display names of each individual layer

## Visualize the data
plotRGB(tahiti, 3,4,5)
We can also visualize the cloud mask layer (layer 7).
plot(tahiti, 7)

According to the algorithm description, water is coded as 1, cloud as 4 and cloud shadow as 2.

Does the cloud mask fit with the visual interpretation of the RGB image we plotted before?

We can also plot the two on top of each other, but before that we need to assign no values (NA) to the 'clear land pixels' so that they appear transparent on the overlay plot.

## Extract cloud layer from the brick
cloud <- tahiti[[7]]

## Replace 'clear land' with 'NA'
cloud[cloud == 0] <- NA

## Plot the stack and the cloud mask on top of each other
plotRGB(tahiti, 3,4,5)
plot(cloud, add = TRUE, legend = FALSE)

Applying a cloud mask to a dataset simply consists in performing value replacement. In this case, a condition on the 7th layer of the stack (the fmask layer) will determine whether values in the other layers are kept, or replaced by NA, which is equivalent to masking them. It is more convenient to work on the cloud mask as a separate RasterLayer, we will therefore split the RasterBrick using the dropLayer() function.

## Extract cloud mask RasterLayer
fmask <- tahiti[[7]]
## Remove fmask layer from the Landsat stack
tahiti6 <- dropLayer(tahiti, 7)

We will first do the masking using simple vector arithmetic, as if tahiti6 and fmask were simple vectors. We want to keep any value with a 'clean land pixel' flag in the cloud mask; or rather, since we are assigning NAs, we want to discard any value of the stack which has a corresponding cloud mask pixel different from 0. This can be done in one line of code.

## Perform value replacement
tahiti6[fmask != 0] <- NA

However, this is possible here because both objects are relatively small and the values can all be loaded in the computer memory without any risk of overloading it. When working with very large raster objects, you will very likely run into problems if you do that. It is then preferable, as presented earlier in this tutorial to use calc() or overlay(). overlay() in this case is the appropriate function, since we are working with two distinct raster objects.

## First define a value replacement function
cloud2NA <- function(x, y){
    x[y != 0] <- NA

The value replacement function takes two arguments, x and y. Similarly to what we did earlier, x corresponds to the RasterBrick, and y to the cloud mask.

# Let's create a new 6 layers object since tahiti6 has been masked already
tahiti6_2 <- dropLayer(tahiti, 7)

## Apply the function on the two raster objects using overlay
tahitiCloudFree <- overlay(x = tahiti6_2, y = fmask, fun = cloud2NA)

## Visualize the output
plotRGB(tahitiCloudFree, 3,4,5)