Ben Devries, Jan Verbesselt, Loïc Dutrieux, Dainius Masiliūnas, Astrid Bos, Nandika Tsendbazar

2024-08-22

WUR Geoscripting

Advanced Raster Analysis

Learning objectives

  • Carry out a supervised classification on a SpatRaster
  • Construct a raster sieve using the patches() function
  • Deal with thematic (categorical) maps

Advanced Raster Analysis

Introduction to Sentinel-2 data used here

We will carry out a supervised classification using Sentinel-2 data for the Gewata region in Ethiopia. To do this, we use atmospherically corrected Level 2A data acquired on December 27, 2020. These data were downloaded from ESA’s online data hub, part of the European Copernicus programme. As it is freely available, Sentinel data is commonly used alongside Landsat data for environmental monitoring.

Sentinel bands in comparison to Landsat bands

Data exploration

Download the data to your computer and open your preferred R IDE to the directory of this tutorial.

After downloading the data we begin with visualization. The data consists of all the Sentinel-2 bands at a spatial resolution (or pixel size) of 20 m, meaning that each pixel on the scene corresponds to a ground distance of 20 m by 20 m. We will also make use of training polygons for the land cover classification, which will be introduced later.

# Check for packages and install if missing
if(!"terra" %in% installed.packages()){install.packages("terra")}
if(!"sf" %in% installed.packages()){install.packages("sf")}
if(!"ranger" %in% installed.packages()){install.packages("ranger")}

library(terra)
library(sf)
library(ranger)

# Create data directory
if (!dir.exists("data")) {
  dir.create("data")
}

# Create output directory
if (!dir.exists("output")) {
  dir.create("output")
}

# Download data and unzip
download.file("https://github.com/GeoScripting-WUR/AdvancedRasterAnalysis/releases/download/advrast-data/AdvancedRaster.zip", "data/AdvancedRaster.zip")
unzip("data/AdvancedRaster.zip", exdir="data")

# Load data and rename the layers
Gewata <- rast("data/S2B2A_T36NZP_20201227T075239_20m_gewata_crop.tif")
names(Gewata) <- readLines("data/S2B2A_T36NZP_20201227T075239_20m_gewata_bands.txt")

# The image is cloud-free, so drop the scene classification (SCL) layer
Gewata <- subset(Gewata, "SCL", negate = TRUE)

# Check out the attributes
Gewata$B02

# Some basic statistics using global() from the terra package
global(Gewata$B02, fun = "max")$max
global(Gewata$B02, fun = "mean")$mean

# What is the maximum value of all bands?
global(Gewata, fun = "max")$max

# summary() is a useful function for a quick overview
summary(Gewata$B02)

# Histograms for all the bands in one window (automatic, if a SpatRaster is supplied)
hist(Gewata, maxpixels = 1000)

Note that the values of these bands have been rescaled by a factor of 10,000. This is done for file storage considerations. For example, a value of 0.5643 stored as a float takes up more disk space than a value of 5643 stored as an integer. If you prefer reflectance values in their original scale (from 0 to 1), this can easily be done using raster algebra or app(). We will do this later.
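As a quick illustration of the raster algebra approach (we apply the same operation with app() later in this tutorial), the rescaling can be done in one line; GewataRefl is just an illustrative name:

```r
# Rescale reflectance values from the integer scale back to 0-1;
# terra applies arithmetic element-wise to every layer of a SpatRaster
GewataRefl <- Gewata / 10000
global(GewataRefl$B02, fun = "max")$max # maximum is now between 0 and 1
```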

A scatterplot matrix can be helpful in exploring relationships between the layers in a SpatRaster. This can be done with the pairs() function of the terra package, which (like hist()) is a wrapper for the same function found in the graphics package.

pairs(Gewata, maxpixels = 1000)

Note that both hist() and pairs() compute histograms and scatterplots based on a random sample of raster pixels. The size of this sample can be changed with the argument maxpixels.

Calling pairs() on a SpatRaster reveals potential correlations between the layers, which gives an indication of which information may be redundant.
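To put numbers on what the scatterplots suggest, terra also provides layerCor(), which computes pairwise statistics between layers (the exact structure of the returned object may differ between terra versions):

```r
# Pairwise Pearson correlations between all layers of the SpatRaster
corMat <- layerCor(Gewata, "pearson")
corMat # values close to 1 indicate redundant layers
```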

Question 1: In the case of the bands of the Gewata subset, list two pairs of bands that contain redundant information and two bands that have significant non-redundant information.

In the previous tutorial, we explored two ways to calculate NDVI, using direct raster algebra or using app(). Since we will be using NDVI again later in this tutorial, let’s calculate it again and store it in our workspace using app().

par(mfrow = c(1, 1)) # reset plotting window
ndvi <- app(c(Gewata$B8A, Gewata$B04), fun = function(x){(x[1] - x[2]) / (x[1] + x[2])})
plot(ndvi)

Aside from the memory-usage advantages of app(), an additional advantage is that the result can be written to file immediately via the filename = "..." argument. This allows you to save your results as they are computed and reload them in subsequent sessions without having to repeat your analysis.
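For example, the NDVI calculation above could be written to the output/ directory we created earlier (the filename here is just a suggestion):

```r
# Compute NDVI and write the result to disk in one step; the file can be
# reloaded later with rast("output/ndvi.tif") without redoing the analysis
ndvi <- app(c(Gewata$B8A, Gewata$B04),
            fun = function(x) {(x[1] - x[2]) / (x[1] + x[2])},
            filename = "output/ndvi.tif", overwrite = TRUE)
```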

Question 2: What is the advantage of including the NDVI layer in the land cover classification? Hint: For information on NDVI, check out this source.

Classifying raster data

One of the most important tasks in the analysis of remote sensing imagery is image classification. In classifying an image, we use the information contained in its various bands (possibly including synthetic bands such as NDVI or principal components) to assign each pixel to a class. There are two approaches to image classification: supervised and unsupervised. In this tutorial we will explore supervised classification based on the Random Forest method.

Supervised classification: Random Forest

The Random Forest classification algorithm is an ensemble learning method that is used for both classification and regression. In our case, we will use the method for classification purposes. Here, the Random Forest method takes random subsets from a training dataset and constructs classification trees using each of these subsets. Trees consist of branches and leaves.

Branches represent nodes of the decision trees, which are often thresholds defined for the measured (known) variables in the dataset. Leaves are the class labels assigned at the termini of the trees. Sampling many subsets at random results in many trees being built. The final class is then assigned by majority rule over all of these trees: each tree's prediction counts as one vote, and the class with the most votes wins.
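The majority rule can be illustrated with a toy example that is independent of any raster data (the class labels here are made up):

```r
# Toy illustration of majority voting: each 'tree' casts one vote per pixel
votes <- c("forest", "forest", "cropland", "forest", "wetland")
tally <- table(votes)
names(tally)[which.max(tally)] # "forest" wins with 3 of 5 votes
```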

The figure below gives a simple demonstration of how the random forest method works in principle. For an introduction to the Random Forest algorithm, see this presentation. For more information on random forest implementation in R see this tutorial.

Schematic showing how the Random Forest method constructs classification trees from random subsets of a training dataset. Each tree determines the labels assigned based on the training dataset. Once all trees are assembled, classes are assigned to unknown pixels based on the class which receives the majority of votes based on all the decision trees constructed.

One major advantage of the Random Forest method is that it provides an out-of-bag (OOB) cross-validation error estimate along with estimates of variable importance. For each classification tree assembled, a fraction of the training data is left out ('out of the bag') and used to compute the error for that tree, by predicting the class of each left-out sample and comparing it with the known class. This process results in a confusion matrix, which we will explore in our analysis. In addition, an importance score is computed for each variable in two forms: the mean decrease in accuracy, and the Gini impurity criterion, both of which will also be explored in our analysis.
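To see what these outputs look like in practice, here is a minimal sketch using ranger on the built-in iris dataset, as a stand-in for the raster-based training data we assemble below:

```r
library(ranger)

# Fit a Random Forest; importance = "permutation" gives the mean decrease
# in accuracy per variable (use "impurity" for the Gini criterion)
rf <- ranger(Species ~ ., data = iris, importance = "permutation", seed = 42)

rf$prediction.error    # OOB cross-validation error estimate
rf$confusion.matrix    # OOB confusion matrix
rf$variable.importance # importance score per variable
```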

To perform the classification in R, it is best to assemble all covariate layers (i.e. those layers containing predictor variable values) into one SpatRaster object. In this case, we can simply append the new layer (NDVI) to our existing SpatRaster (currently consisting of the different bands).

First, let’s rescale the reflectance values back to their original scale (from 0 to 1). This step is not required for the RF classification, but it may help with interpretation if you are used to thinking of reflectance as a value between 0 and 1. (On the other hand, for very large rasters it might be preferable to leave them in their integer scale, but we won’t go into more detail about that here.)

# Rescale to original scale
gewata <- app(Gewata, fun = function(x) x / 10000)

# Make a new SpatRaster by combining the Gewata and NDVI SpatRasters
covs <- c(gewata, ndvi)
plot(covs)

You’ll notice that we didn’t give our NDVI layer a name yet, so it was automatically assigned the layer name ‘lyr.1’. It’s good to make sure that the SpatRaster layer names make sense, so you don’t forget which band is which later on. Let’s change all the layer names (make sure you get the order right!).

names(covs) <- c(names(Gewata), "NDVI")
names(covs)
##  [1] "B02"  "B03"  "B04"  "B05"  "B06"  "B07"  "B11"  "B12"  "B8A"  "NDVI"

Training data preparation

For this exercise, we will do a very simple classification for 2020 using three classes: forest, cropland and wetland. While for other purposes it is usually better to define more classes (and possibly fuse classes later), a simple classification like this one could be useful, for example, to construct a forest mask for the year 2020.
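As a preview, once a classified raster has been produced (here called classified, which we will create later in this tutorial; forest is assumed to be coded as 2, following the alphabetical factor levels defined below), such a forest mask could be derived with a simple comparison:

```r
# Hypothetical sketch: TRUE where the predicted class is forest, FALSE elsewhere
forestMask <- classified == 2
plot(forestMask)
```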

# Download training polygons
download.file("https://github.com/GeoScripting-WUR/AdvancedRasterAnalysis/releases/download/advrast-data/trainingPoly.csv", "data/trainingPoly.csv")

# Load the training polygons from a csv file using st_read:
trainingPoly <- st_read("data/trainingPoly.csv")
## Reading layer `trainingPoly' from data source 
##   `/home/osboxes/Documents/Tutorials/AdvancedRasterAnalysis/data/trainingPoly.csv' 
##   using driver `CSV'
## Simple feature collection with 16 features and 3 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 813300.2 ymin: 822891.5 xmax: 846719.1 ymax: 848830.4
## CRS:           NA
# Superimpose training polygons onto NDVI plot
par(mfrow = c(1, 1)) # reset plotting window
plot(ndvi)
plot(trainingPoly, add = TRUE)

The training classes are labelled as character strings. For this exercise, we will need to work with integer classes, so we first need to ‘relabel’ our training classes. There are several approaches that could be used to convert these classes to integer codes. In this case, we will convert the character strings representing land cover classes into integers based on their factor levels.

# Inspect the trainingPoly object
trainingPoly <- trainingPoly[, c(2:4)] #remove an unused column
trainingPoly
## Simple feature collection with 16 features and 2 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 813300.2 ymin: 822891.5 xmax: 846719.1 ymax: 848830.4
## CRS:           NA
## First 10 features:
##    OBJECTID   Class                       geometry
## 1         1 wetland POLYGON ((821935.6 833523.3...
## 2         2 wetland POLYGON ((824714 831022.8, ...
## 3         3 wetland POLYGON ((830913.4 832637.4...
## 4         4 wetland POLYGON ((832451.1 833023.5...
## 5         5 wetland POLYGON ((834905.6 837919.2...
## 6         6  forest POLYGON ((833159.4 846635.4...
## 7         7  forest POLYGON ((831922.1 848830.4...
## 8         8  forest POLYGON ((842202 832724.6, ...
## 9         9  forest POLYGON ((840860.9 829199.5...
## 10       10  forest POLYGON ((839926.8 824919.4...
# The 'Class' column is a character but should be converted to factor 
summary(trainingPoly$Class)
##    Length     Class      Mode 
##        16 character character
trainingPoly$Class <- as.factor(trainingPoly$Class)
summary(trainingPoly$Class)
## cropland   forest  wetland 
##        6        5        5
# We can make a new 'Code' column by converting the factor levels to integer by using the as.numeric() function,
trainingPoly$Code <- as.numeric(trainingPoly$Class)

# Inspect the new 'Code' column
summary(trainingPoly$Code)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.938   3.000   3.000
# Define a colour scale for the classes (as above) corresponding to: cropland, forest, wetland
cols <- c("orange", "dark green", "light blue")

# Superimpose training polygons (colour by class) onto NDVI plot
plot(ndvi)
plot(trainingPoly["Class"], add = TRUE, pal = cols)

# Add a customised legend
legend("topright", legend = c("cropland", "forest", "wetland"), fill = cols, bg = "white")