WUR Geoscripting

Week 2, Lesson 7: Advanced Raster Analysis

Learning outcomes of today:

Being able to:

  • carry out a supervised classification on a series of raster layers
  • construct a raster sieve using the clump() function
  • deal with thematic (categorical) maps

Advanced Raster Analysis

Introduction to Landsat data used here

Since being released to the public, the Landsat data archive has become an invaluable tool for environmental monitoring. With a historical archive reaching back to the 1970s, the release of these data has spurred the development of time series based methods. In this tutorial, we will work with data from the Landsat 7 Enhanced Thematic Mapper (ETM+) sensor. Landsat scenes are delivered via the USGS as a number of image layers representing the different bands captured by the sensor. In the case of the ETM+ sensor, the bands are shown in the figure below. Using different combinations of these bands can be useful in describing land features and change processes.

Landsat 7 ETM+ bands

Part of a Landsat scene, including bands 2-4, is included in the data provided here. These data have been processed using the LEDAPS framework, so the values contained in this dataset represent surface reflectance, scaled by 10000 (i.e. divide by 10000 to get a reflectance value between 0 and 1).

We will begin exploring these data simply by downloading and visualizing them. The data are the same Gewata data from Lesson 5, but in this case each band is saved as a separate file. We also have VCF (Vegetation Continuous Field) data and training polygons, which will be useful later in this lesson. We will be making use of the RasterBrick object.

# check for raster package and install if missing
if(!"raster" %in% rownames(installed.packages())){install.packages("raster")}

# check for sf package and install if missing
if(!"sf" %in% rownames(installed.packages())){install.packages("sf")}

## Libraries
library(raster)
library(sf)
## Download, unzip and load the data
tempdir = "temp"
if (!dir.exists(tempdir)) dir.create(tempdir)
download.file(url = 'https://github.com/GeoScripting-WUR/AdvancedRasterAnalysis/archive/gh-pages.zip', destfile = file.path(tempdir, 'data.zip'), method = 'auto')
unzip(file.path(tempdir, 'data.zip'), exdir=tempdir)
# Move from the temp dir to your working directory and clean up
file.rename(file.path(tempdir, 'AdvancedRasterAnalysis-gh-pages', 'data'), 'data')
unlink(file.path(tempdir, 'AdvancedRasterAnalysis-gh-pages'), recursive = TRUE)
file.remove(file.path(tempdir, 'data.zip'))
# file.remove() cannot remove directories; use unlink() with recursive = TRUE
unlink(tempdir, recursive = TRUE)

## Load data
load("data/GewataB2.rda")
load("data/GewataB3.rda")
load("data/GewataB4.rda")

## Check out the attributes
GewataB2
## Some basic statistics using cellStats() from the raster package
cellStats(GewataB2, stat=max)
cellStats(GewataB2, stat=mean)
# The max above is equivalent to:
maxValue(GewataB2)
## What is the maximum value of all three bands?
max(c(maxValue(GewataB2), maxValue(GewataB3), maxValue(GewataB4)))
## summary() is a useful function for a quick overview
summary(GewataB2)

## Put the 3 bands into a RasterBrick object to summarize together
gewata <- brick(GewataB2, GewataB3, GewataB4)
# 3 histograms in one window (automatic, if a RasterBrick is supplied)
hist(gewata, maxpixels=1000)

When we plot the histogram of the RasterBrick, the scales of the axes and the bin sizes are not equivalent, which could be problematic. This can be solved by adjusting these parameters in hist(). The raster hist() function inherits arguments from the function of the same name in the graphics package. To view the additional arguments, type:

?graphics::hist

To ensure that our histograms are of the same scale, we should consider the xlim, ylim and breaks arguments.

par(mfrow = c(1, 1)) # reset plotting window
hist(gewata, xlim = c(0, 5000), ylim = c(0, 750000), breaks = seq(0, 5000, by = 100))
## Warning in .hist1(raster(x, y[i]), maxpixels = maxpixels, main = main[y[i]], :
## 5% of the raster cells were used. 100000 values used.
## (the same warning is repeated for each of the three layers)

Note that the values of these bands have been rescaled by a factor of 10,000. This is done for file storage considerations. For example, a value of 0.5643 stored as a float takes up more disk space than a value of 5643 stored as an integer. If you prefer reflectance values in their original scale (from 0 to 1), this can easily be done using raster algebra or calc(). We will do this later.

A scatterplot matrix can be helpful in exploring relationships between raster layers. This can be done with the pairs() function of the raster package, which (like hist()) is a wrapper for the function of the same name in the graphics package.

pairs(gewata)

Note that both hist() and pairs() compute histograms and scatterplots based on a random sample of raster pixels. The size of this sample can be changed with the argument maxpixels in either function.
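For instance, we can draw the scatterplots from a smaller random sample; the value 5000 here is just an example:

```r
## Base the scatterplot matrix on a random sample of 5000 pixels
pairs(gewata, maxpixels = 5000)
```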

Calling pairs() on a RasterBrick reveals potential correlations between the layers themselves. In the case of bands 2-4 of the gewata subset, we can see that bands 2 and 3 (in the visible part of the EM spectrum) are highly correlated, while band 4 contains significant non-redundant information.

Question 1: Given what we know about the location of these bands along the EM spectrum, how could these scatterplots be explained?

ETM+ band 4 (nearly equivalent to band 5 in the Landsat 8 OLI sensor) is situated in the near infrared (NIR) region of the EM spectrum and is often used to describe vegetation-related features.

We observe a strong correlation between two of the Landsat bands of the gewata subset, but a very different distribution of values in band 4 (NIR). This distribution stems from the fact that vegetation reflects much more strongly in the NIR range than in the visible range of the EM spectrum. However, note that NIR reflectance saturates in densely vegetated areas. A commonly used metric for assessing vegetation dynamics, the normalized difference vegetation index (NDVI), explained in the previous lesson, takes advantage of this fact and is computed from Landsat bands 3 (visible red) and 4 (near infrared).

In the previous lesson, we explored several ways to calculate NDVI, using direct raster algebra, calc() or overlay(). Since we will be using NDVI again later in this tutorial, let’s calculate it again and store it in our workspace using overlay().

par(mfrow = c(1, 1)) # reset plotting window
ndvi <- overlay(GewataB4, GewataB3, fun=function(x,y){(x-y)/(x+y)})
plot(ndvi)

Aside from the advantages of calc() and overlay() regarding memory usage, these functions also allow the result to be written immediately to file by including the filename = "..." argument, after which you can reload the result in subsequent sessions without having to repeat your analysis.
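As a sketch of this pattern (the filename "ndvi.grd" is just an example; any format supported by writeRaster() will do):

```r
## Compute NDVI and write the result to file in one step
ndvi <- overlay(GewataB4, GewataB3,
                fun = function(x, y) { (x - y) / (x + y) },
                filename = "ndvi.grd", overwrite = TRUE)
## In a later session, reload it without recomputing:
# ndvi <- raster("ndvi.grd")
```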

Classifying raster data

One of the most important tasks in remote sensing image analysis is image classification. In classifying the image, we use the information contained in the various bands (possibly including synthetic bands such as NDVI or principal components) to assign each pixel to a thematic class. There are two approaches to image classification: supervised and unsupervised. In this tutorial we will explore supervised classification based on the Random Forest method.

Supervised classification: Random Forest

The Random Forest classification algorithm is an ensemble learning method that is used for both classification and regression. In our case, we will use the method for classification purposes. Here, the Random Forest method takes random subsets from a training dataset and constructs classification trees using each of these subsets. Trees consist of branches and leaves.

Branches represent nodes of the decision trees, which are often thresholds defined on the measured (known) variables in the dataset. Leaves are the class labels assigned at the termini of the trees. Sampling many subsets at random will result in many trees being built. A class is then assigned by majority rule over all of these trees, as if each class assigned by a decision tree were a vote.
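The majority rule itself is easy to illustrate in R. As a toy example (the vote vector below is made up, not derived from our data):

```r
## Suppose five decision trees cast these votes for one pixel
votes <- c("forest", "forest", "cropland", "forest", "wetland")
## Tally the votes and pick the class with the most
tallied <- table(votes)
names(tallied)[which.max(tallied)]  # "forest"
```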

The figure below gives a simple demonstration of how the random forest method works in principle. For an introduction to the Random Forest algorithm, see this presentation.

One major advantage of the Random Forest method is that an Out-Of-Bag (OOB) error estimate and an estimate of variable importance are produced. For each classification tree assembled, a fraction of the training data is left out and used to compute the error for that tree, by predicting the class of each held-out sample and comparing it with the known class. This process results in a confusion matrix, which we will explore in our analysis. In addition, an importance score is computed for each variable in two forms: the mean decrease in accuracy, and the Gini impurity criterion, which will also be explored in our analysis.

We should first prepare the data on which the classification will be done. So far, we have prepared three bands from an ETM+ image in 2001 (bands 2, 3 and 4) as a RasterBrick, and have also calculated NDVI. In addition, there is a Vegetation Continuous Field (VCF) product available for the same period (2000).

For more information on the Landsat VCF product, see here. This product is also based on Landsat ETM+ data, and represents an estimate of tree cover (in %). Since this layer could also be useful in classifying land cover types, we will also include it as a potential covariate in the Random Forest classification.

## Load the data and check it out
load("data/vcfGewata.rda")
vcfGewata
## class      : RasterLayer 
## dimensions : 1177, 1548, 1821996  (nrow, ncol, ncell)
## resolution : 30, 30  (x, y)
## extent     : 808755, 855195, 817635, 852945  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=36 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
## source     : memory
## names      : vcf2000Gewata 
## values     : 0, 254  (min, max)
plot(vcfGewata)
summary(vcfGewata)
##         vcf2000Gewata
## Min.                0
## 1st Qu.            32
## Median             64
## 3rd Qu.            75
## Max.              254
## NA's             8289
hist(vcfGewata)

In the vcfGewata RasterLayer there are some values much greater than 100 (the maximum tree cover), which are flags for water, cloud or cloud shadow pixels. To exclude these values, we can assign them NA so they are not used in the classification.

vcfGewata[vcfGewata > 100] <- NA
plot(vcfGewata)
summary(vcfGewata)
##         vcf2000Gewata
## Min.                0
## 1st Qu.            32
## Median             64
## 3rd Qu.            75
## Max.              100
## NA's            13712
hist(vcfGewata)

To perform the classification in R, it is best to assemble all covariate layers (i.e. those layers containing predictor variable values) into one RasterBrick object. In this case, we can simply append these new layers (NDVI and VCF) to our existing RasterBrick (currently consisting of bands 2, 3, and 4).

First, let’s rescale the reflectance values back to their original scale (between 0 and 1). This step is not required for the RF classification, but it might help with the interpretation if you are used to thinking of reflectance as a value between 0 and 1. (On the other hand, for very large raster bricks it might be preferable to leave the values in their integer scale, but we won’t go into more detail about that here.)

gewata <- calc(gewata, fun=function(x) x / 10000)
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on WGS84 ellipsoid in CRS definition
## Make a new RasterStack of covariates by 'stacking' together the existing gewata brick and NDVI and VCF layers
covs <- stack(gewata, ndvi, vcfGewata)
plot(covs)

You’ll notice that we didn’t give our NDVI layer a name yet. It’s good to make sure that the raster layer names make sense, so you don’t forget which band is which later on. Let’s change all the layer names (make sure you get the order right!).

names(covs) <- c("band2", "band3", "band4", "NDVI", "VCF")
plot(covs)

Training data preparation

For this exercise, we will do a very simple classification for 2001 using three classes: forest, cropland and wetland. While for other purposes it is usually better to define more classes (and possibly fuse classes later), a simple classification like this one could be useful, for example, to construct a forest mask for the year 2001.

## Load the training polygons as a simple features object
trainingPoly <- readRDS("data/trainingPoly.rds")
# Note that alternatively we could load a file that is not specific to R using st_read:
#trainingPoly <- st_read("data/trainingPoly.csv")
## Superimpose training polygons onto NDVI plot
plot(ndvi)
plot(trainingPoly, add = TRUE)
## Warning in plot.sf(trainingPoly, add = TRUE): ignoring all but the first
## attribute

The training classes are labelled with character strings. For this exercise, we will need to work with integer classes, so we first need to ‘relabel’ our training classes. There are several approaches that could be used to convert these classes to integer codes. In this case, we will convert the factor representing the land cover classes into integers based on the existing factor levels.

## Inspect the trainingPoly object
trainingPoly
## Simple feature collection with 16 features and 2 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 813300.2 ymin: 822891.5 xmax: 846719.1 ymax: 848830.4
## projected CRS:  UTM Zone 36, Northern Hemisphere
## First 10 features:
##   OBJECTID   Class                       geometry
## 0        1 wetland POLYGON ((821935.6 833523.3...
## 1        2 wetland POLYGON ((824714 831022.8, ...
## 2        3 wetland POLYGON ((830913.4 832637.4...
## 3        4 wetland POLYGON ((832451.1 833023.5...
## 4        5 wetland POLYGON ((834905.6 837919.2...
## 5        6  forest POLYGON ((833159.4 846635.4...
## 6        7  forest POLYGON ((831922.1 848830.4...
## 7        8  forest POLYGON ((842202 832724.6, ...
## 8        9  forest POLYGON ((840860.9 829199.5...
## 9       10  forest POLYGON ((839926.8 824919.4...
# The 'Class' column is actually a factor
str(trainingPoly$Class)
##  Factor w/ 3 levels "cropland","forest",..: 3 3 3 3 3 2 2 2 2 2 ...
# If you had to load the training polygons from a generic vector file and not an .RDS,
# you would need to manually convert the 'Class' column to a factor, i.e.
#trainingPoly$Class <- as.factor(trainingPoly$Class)
# We can make a new 'Code' column by converting the factor levels to integer by using the as.numeric() function,
trainingPoly$Code <- as.numeric(trainingPoly$Class)
# Inspect the new 'Code' column
trainingPoly
## Simple feature collection with 16 features and 3 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 813300.2 ymin: 822891.5 xmax: 846719.1 ymax: 848830.4
## projected CRS:  UTM Zone 36, Northern Hemisphere
## First 10 features:
##   OBJECTID   Class                       geometry Code
## 0        1 wetland POLYGON ((821935.6 833523.3...    3
## 1        2 wetland POLYGON ((824714 831022.8, ...    3
## 2        3 wetland POLYGON ((830913.4 832637.4...    3
## 3        4 wetland POLYGON ((832451.1 833023.5...    3
## 4        5 wetland POLYGON ((834905.6 837919.2...    3
## 5        6  forest POLYGON ((833159.4 846635.4...    2
## 6        7  forest POLYGON ((831922.1 848830.4...    2
## 7        8  forest POLYGON ((842202 832724.6, ...    2
## 8        9  forest POLYGON ((840860.9 829199.5...    2
## 9       10  forest POLYGON ((839926.8 824919.4...    2

To relate the training data to the raster data, we need to convert the training polygons to raster format using the rasterize() function. This function takes a spatial object (in this case a polygon object) and transfers its values to the raster cells defined by a template raster object. Here, we will define a new raster containing the class codes.

## Assign 'Code' values to raster cells (where they overlap)
classes <- rasterize(trainingPoly, ndvi, field='Code')

Note: there is a handy progress = "text" argument, which can be passed to many of the raster package functions and can help to monitor processing. Try passing this argument to the rasterize() command above.
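For example:

```r
## Same rasterize() call as above, now with a text progress bar
classes <- rasterize(trainingPoly, ndvi, field = 'Code', progress = "text")
```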

## Plotting
# Define a colour scale for the classes (as above)
# corresponding to: cropland, forest, wetland
cols <- c("orange", "dark green", "light blue")
## Plot without a legend
plot(classes, col=cols, legend=FALSE)
## Add a customized legend
legend("topright", legend=c("cropland", "forest", "wetland"), fill=cols, bg="white")

Our goal in preprocessing these data is to have a table of values representing all layers (covariates) with known values/classes. To do this, we will first need to create a version of our RasterBrick only representing the training pixels. Here, we use the mask() function from the raster package.

covmasked <- mask(covs, classes)
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on WGS84 ellipsoid in CRS definition
plot(covmasked)
## Combine this new brick with the classes layer to make our input training dataset
names(classes) <- "class"
trainingstack <- addLayer(covmasked, classes)
plot(trainingstack)
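The goal stated above was a table of covariate values with known classes. One possible next step is sketched below (getValues() is from the raster package; the object name valuetable is our own choice):

```r
## Extract all pixel values into a matrix: one row per pixel, one column per layer
valuetable <- getValues(trainingstack)
## Convert to a data.frame and keep only pixels inside the training polygons
## (pixels outside the polygons have NA in the 'class' column after mask())
valuetable <- as.data.frame(valuetable)
valuetable <- valuetable[!is.na(valuetable$class), ]
head(valuetable)
```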