Ben Devries, Jan Verbesselt, Loïc Dutrieux, Dainius Masiliūnas, Astrid Bos, Nandika Tsendbazar

2023-09-05

WUR Geoscripting WUR logo

Advanced Raster Analysis

Learning objectives

  • Carry out a supervised classification on a SpatRaster
  • Construct a raster sieve using the patches() function
  • Deal with thematic (categorical maps)

Advanced Raster Analysis

Introduction to Sentinel-2 data used here

We will carry out a supervised classification using Sentinel 2 data for the Gewata region in Ethiopia. To do this, we use atmospherically corrected Level 2A data acquired on December 27, 2020. These data were downloaded from ESA’s online data hub, a part of the Copernicus European Programme. As it is freely available, Sentinel data has been commonly used next to Landsat data for environmental monitoring. Sentinel bands in comparison to Lansat bands

Data exploration

Download the data to your computer and open your preferred R IDE to the directory of this tutorial.

After downloading the data we begin with visualization. The data consists of all the Sentinel-2 bands at a spatial resolution (or pixel size) of 20 m, meaning that each pixel on the scene corresponds to a ground distance of 20 m by 20 m. We will also make use of training polygons for the land cover classification, which will be introduced later.

# Check for packages and install if missing
if(!"terra" %in% installed.packages()){install.packages("terra")}
if(!"sf" %in% installed.packages()){install.packages("sf")}
if(!"ranger" %in% installed.packages()){install.packages("ranger")}

library(terra)
library(sf)
library(ranger)

# Create data directory,
if (!dir.exists("data")) {
  dir.create("data")
}

# Create output directory
if (!dir.exists("output")) {
  dir.create("output")
}

# Download data and unzip
download.file("https://github.com/GeoScripting-WUR/AdvancedRasterAnalysis/releases/download/advrast-data/AdvancedRaster.zip", "data/AdvancedRaster.zip")
unzip("data/AdvancedRaster.zip", exdir="data")

# Load data and rename the layers
Gewata <- rast("data/S2B2A_T36NZP_20201227T075239_20m_gewata_crop.tif")
names(Gewata) <- readLines("data/S2B2A_T36NZP_20201227T075239_20m_gewata_bands.txt")

# The image is cloud-free, so drop the cloud mask layer
Gewata <- subset(Gewata, "SCL", negate = TRUE)

# Check out the attributes
Gewata$B02

# Some basic statistics using global() from the terra package
global(Gewata$B02, fun = "max")$max
global(Gewata$B02, fun = "mean")$mean

# What is the maximum value of all three bands?
global(Gewata, fun = "max")$max

# summary() is useful function for a quick overview
summary(Gewata$B02)

# Histograms for all the bands in one window (automatic, if a SpatRaster is supplied)
hist(Gewata, maxpixels = 1000)

Note that the values of these bands have been rescaled by a factor of 10,000. This is done for file storage considerations. For example, a value of 0.5643 stored as a float takes up more disk space than a value of 5643 stored as an integer. If you prefer reflectance values in their original scale (from 0 to 1), this can easily be done using raster algebra or app(). We will do this later.

A scatterplot matrix can be helpful in exploring relationships between the layers in a SpatRaster. This can be done with the pairs() function of the terra package, which (like hist()) is a wrapper for the same function found in the graphics packages.

pairs(Gewata, maxpixels = 1000)

Note that both hist() and pairs() compute histograms and scatterplots based on a random sample of raster pixels. The size of this sample can be changed with the argument maxpixels.

Calling pairs() on a SpatRaster reveals potential correlations between the layers themselves, that give an indication of which information may be redundant.

Question 1: In the case of the bands of the Gewata subset, list two pairs of bands that contain redundant information and two bands that have significant non-redundant information.

In the previous tutorial, we explored two ways to calculate NDVI, using direct raster algebra or using app(). Since we will be using NDVI again later in this tutorial, let’s calculate it again and store it in our workspace using app().

par(mfrow = c(1, 1)) # reset plotting window
ndvi <- app(c(Gewata$B8A, Gewata$B04), fun = function(x){(x[1] - x[2]) / (x[1] + x[2])})
plot(ndvi)

Aside from the advantages of app() regarding memory usage, an additional advantage of this function is the fact that the result can be written immediately to the file by including the filename = "..." argument, which will allow you to write your results to file immediately, after which you can reload it in subsequent sessions without having to repeat your analysis.

Question 2: What is the advantage of including the NDVI layer in the land cover classification? Hint: For information on NDVI, check out this source.

Classifying raster data

One of the most important tasks in analysis of remote sensing image analysis is image classification. In classifying the image, we take the information contained in the various bands (possibly including other synthetic bands such as NDVI or principal components). There are two approaches for image classification: supervised and unsupervised. In this tutorial we will explore supervised classification based on the Random Forest method.

Supervised classification: Random Forest

The Random Forest classification algorithm is an ensemble learning method that is used for both classification and regression. In our case, we will use the method for classification purposes. Here, the Random Forest method takes random subsets from a training dataset and constructs classification trees using each of these subsets. Trees consist of branches and leaves.

Branches represent nodes of the decision trees, which are often thresholds defined for the measured (known) variables in the dataset. Leaves are the class labels assigned at the termini of the trees. Sampling many subsets at random will result in many trees being built. Classes are then assigned based on classes assigned by all of these trees based on a majority rule, as if each class assigned by a decision tree were considered to be a vote.

The figure below gives a simple demonstration of how the random forest method works in principle. For an introduction to the Random Forest algorithm, see this presentation. For more information on random forest implementation in R see this tutorial.

Schematic showing how the Random Forest method constructs classification trees from random subsets of a training dataset. Each tree determines the labels assigned based on the training dataset. Once all trees are assembled, classes are assigned to unknown pixels based on the class which receives the majority of votes based on all the decision trees constructed.
Schematic showing how the Random Forest method constructs classification trees from random subsets of a training dataset. Each tree determines the labels assigned based on the training dataset. Once all trees are assembled, classes are assigned to unknown pixels based on the class which receives the majority of votes based on all the decision trees constructed.

One major advantage of the Random Forest method is the fact that an Out Of the Bag (OOB) cross-validation error estimate and an estimate of variable performance are performed. For each classification tree assembled, a fraction of the training data are left out and used to compute the error for each tree by predicting the class associated with that value and comparing with the already known class. This process results in a confusion matrix, which we will explore in our analysis. In addition an importance score is computed for each variable in two forms: the mean decrease in accuracy for each variable, and the Gini impurity criterion, which will also be explored in our analysis.

To perform the classification in R, it is best to assemble all covariate layers (ie. those layers containing predictor variable values) into one SpatRaster object. In this case, we can simply append the new layer (NDVI) to our existing SpatRaster (currently consisting of different bands).

First, let’s rescale the original reflectance values to their original scale. This step is not required for the RF classification, but it might help with the interpretation, if you are used to thinking of reflectance as a value between 0 and 1. (On the other hand, for very large raster bricks, it might be preferable to leave them in their integer scale, but we won’t go into more detail about that here.)

# Rescale to original scale
gewata <- app(Gewata, fun = function(x) x / 10000)

# Make a new SpatRaster by combining the Gewata and NDVI SpatRasters
covs <- c(gewata, ndvi)
plot(covs)

You’ll notice that we didn’t give our NDVI layer a name yet, and it automatically copied over the layer name B8A as a result. It’s good to make sure that the SpatRaster names make sense, so you don’t forget which band is which later on. Let’s change all the layer names (make sure you get the order right!).

names(covs) <- c(names(Gewata), "NDVI")
names(covs)
##  [1] "B02"  "B03"  "B04"  "B05"  "B06"  "B07"  "B11"  "B12"  "B8A"  "NDVI"

Training data preparation

For this exercise, we will do a very simple classification for 2020 using three classes: forest, cropland and wetland. While for other purposes it is usually better to define more classes (and possibly fuse classes later), a simple classification like this one could be useful, for example, to construct a forest mask for the year 2020.

# Download training polygons
download.file("https://github.com/GeoScripting-WUR/AdvancedRasterAnalysis/releases/download/advrast-data/trainingPoly.csv", "data/trainingPoly.csv")

# Load the training polygons from a csv file using st_read:
trainingPoly <- st_read("data/trainingPoly.csv")
## Reading layer `trainingPoly' from data source 
##   `/home/osboxes/Documents/Geoscripting/Tutorials2/AdvancedRasterAnalysis/data/trainingPoly.csv' 
##   using driver `CSV'
## Simple feature collection with 16 features and 3 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 813300.2 ymin: 822891.5 xmax: 846719.1 ymax: 848830.4
## CRS:           NA
# Superimpose training polygons onto NDVI plot
par(mfrow = c(1, 1)) # reset plotting window
plot(ndvi)
plot(trainingPoly, add = TRUE)

The training classes are labelled as string labels. For this exercise, we will need to work with integer classes, so we will need to first ‘relabel’ our training classes. There are several approaches that could be used to convert these classes to integer codes. In this case, we will first make a function that will reclassify the character strings representing land cover classes into integers based on the existing factor levels.

# Inspect the trainingPoly object
trainingPoly <- trainingPoly[, c(2:4)] #remove an unused column
trainingPoly
## Simple feature collection with 16 features and 2 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 813300.2 ymin: 822891.5 xmax: 846719.1 ymax: 848830.4
## CRS:           NA
## First 10 features:
##    OBJECTID   Class                       geometry
## 1         1 wetland POLYGON ((821935.6 833523.3...
## 2         2 wetland POLYGON ((824714 831022.8, ...
## 3         3 wetland POLYGON ((830913.4 832637.4...
## 4         4 wetland POLYGON ((832451.1 833023.5...
## 5         5 wetland POLYGON ((834905.6 837919.2...
## 6         6  forest POLYGON ((833159.4 846635.4...
## 7         7  forest POLYGON ((831922.1 848830.4...
## 8         8  forest POLYGON ((842202 832724.6, ...
## 9         9  forest POLYGON ((840860.9 829199.5...
## 10       10  forest POLYGON ((839926.8 824919.4...
# The 'Class' column is a character but should be converted to factor 
summary(trainingPoly$Class)
##    Length     Class      Mode 
##        16 character character
trainingPoly$Class <- as.factor(trainingPoly$Class)
summary(trainingPoly$Class)
## cropland   forest  wetland 
##        6        5        5
# We can make a new 'Code' column by converting the factor levels to integer by using the as.numeric() function,
trainingPoly$Code <- as.numeric(trainingPoly$Class)

# Inspect the new 'Code' column
summary(trainingPoly$Code)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.938   3.000   3.000
# Define a colour scale for the classes (as above) corresponding to: cropland, forest, wetland
cols <- c("orange", "dark green", "light blue")

# Superimpose training polygons (colour by class) onto NDVI plot
plot(ndvi)
plot(trainingPoly["Class"], add = TRUE, pal = cols)

# Add a customised legend
legend("topright", legend = c("cropland", "forest", "wetland"), fill = cols, bg = "white")

Our goal in preprocessing this data is to have a table of values representing all layers (covariates) with known values/classes. To do this, we will first need to know the values of the covariates at our training polygon locations. We can use extract function of terra package for this. Next we convert these data to a data.frame representing all training data.

# Extract pixel values below the polygons into a dataframe
trainingData <- extract(covs, trainingPoly)

# Add a column specifying the class based on the polygon ID
trainingData$Class <- trainingPoly$Class[trainingData$ID]

# Remove the training polygon ID's from the dataframe
trainingData$ID <- NULL

This data.frame will be used as an input into the RandomForest classification function. Let’s inspect the first and last 10 rows.

head(trainingData, n = 10)
##       B02    B03    B04    B05    B06    B07    B11    B12    B8A      NDVI
## 1  0.0492 0.0746 0.0680 0.1105 0.1999 0.2280 0.2425 0.1496 0.2608 0.5863747
## 2  0.0518 0.0759 0.0705 0.1168 0.2093 0.2397 0.2505 0.1576 0.2706 0.5866315
## 3  0.0523 0.0772 0.0710 0.1189 0.2073 0.2418 0.2501 0.1567 0.2717 0.5856434
## 4  0.0482 0.0714 0.0674 0.1120 0.1971 0.2265 0.2331 0.1394 0.2617 0.5903981
## 5  0.0476 0.0732 0.0675 0.1126 0.1959 0.2257 0.2364 0.1421 0.2586 0.5860166
## 6  0.0481 0.0716 0.0671 0.1117 0.1932 0.2231 0.2360 0.1433 0.2550 0.5833592
## 7  0.0483 0.0726 0.0678 0.1128 0.1997 0.2287 0.2398 0.1483 0.2559 0.5810936
## 8  0.0517 0.0760 0.0700 0.1161 0.2057 0.2356 0.2484 0.1558 0.2678 0.5855536
## 9  0.0528 0.0757 0.0717 0.1176 0.2049 0.2360 0.2501 0.1570 0.2699 0.5802108
## 10 0.0502 0.0745 0.0679 0.1145 0.2065 0.2351 0.2480 0.1502 0.2676 0.5952310
##      Class
## 1  wetland
## 2  wetland
## 3  wetland
## 4  wetland
## 5  wetland
## 6  wetland
## 7  wetland
## 8  wetland
## 9  wetland
## 10 wetland
tail(trainingData, n = 10)
##          B02    B03    B04    B05    B06    B07    B11    B12    B8A      NDVI
## 87195 0.0352 0.0544 0.0563 0.0986 0.1997 0.2351 0.2193 0.1324 0.2652 0.6497667
## 87196 0.0377 0.0586 0.0651 0.1053 0.1789 0.2219 0.2357 0.1534 0.2433 0.5778210
## 87197 0.0400 0.0594 0.0736 0.1108 0.1637 0.1842 0.2491 0.1729 0.2144 0.4888889
## 87198 0.0445 0.0750 0.0807 0.1316 0.2074 0.2370 0.2476 0.1602 0.2707 0.5406944
## 87199 0.0334 0.0537 0.0528 0.0960 0.2105 0.2668 0.2187 0.1242 0.2930 0.6946212
## 87200 0.0385 0.0653 0.0605 0.1127 0.2108 0.2402 0.2329 0.1482 0.2709 0.6348823
## 87201 0.0475 0.0758 0.0835 0.1279 0.2009 0.2237 0.2624 0.1716 0.2552 0.5069383
## 87202 0.0578 0.0941 0.1053 0.1617 0.2302 0.2592 0.2904 0.1873 0.2959 0.4750748
## 87203 0.0342 0.0559 0.0585 0.0957 0.1683 0.1825 0.2304 0.1466 0.2152 0.5725247
## 87204 0.0364 0.0588 0.0689 0.1169 0.1619 0.1887 0.2472 0.1672 0.2074 0.5012667
##          Class
## 87195 cropland
## 87196 cropland
## 87197 cropland
## 87198 cropland
## 87199 cropland
## 87200 cropland
## 87201 cropland
## 87202 cropland
## 87203 cropland
## 87204 cropland

We have our training dataset as a data.frame with the class column as a factor. If it is integer, random forest regression will be run, instead of classification. So, good to check on that.

Now we have a convenient training data table which contains, for each of the three defined classes, values for all covariates. Let’s visualize the distribution of some of these covariates for each class. To make this easier, we will create 3 different data.frames for each of the classes. This is just for plotting purposes, and we will not use these in the actual classification.

val_crop <- subset(trainingData, Class == "cropland")
val_forest <- subset(trainingData, Class == "forest")
val_wetland <- subset(trainingData, Class == "wetland")

# NDVI
par(mfrow = c(3, 1))
hist(val_crop$NDVI, main = "cropland", xlab = "NDVI", xlim = c(0, 1), col = "orange")
hist(val_forest$NDVI, main = "forest", xlab = "NDVI", xlim = c(0, 1), col = "dark green")
hist(val_wetland$NDVI, main = "wetland", xlab = "NDVI", xlim = c(0, 1), col = "light blue")

par(mfrow = c(1, 1))

Note that other covariates such as the bands can also be plotted like above.

We can also create useful scatterplots this way:

# Scatterplot of bands 8a and 11 for the three classes
plot(B8A ~ B11, data = val_crop, pch = ".", col = "orange", xlim = c(0, 0.4), ylim = c(0, 0.5))
points(B8A ~ B11, data = val_forest, pch = ".", col = "dark green")
points(B8A ~ B11, data = val_wetland, pch = ".", col = "light blue")
legend("topright", legend = c("cropland", "forest", "wetland"), fill = c("orange", "dark green", "light blue"), bg = "white")

Question 3: Try to produce the same scatterplot plot as above looking at the relationship between other bands. Try B02 & B05, B07 & NDVI (you might have adjust the xlim to incorporate the NDVI values) and another of your choice. What can you say about the relationships between these bands? Which ones give a clear distinction between classes, and where is this less clear?

We can see from these distributions that these covariates may do well in classifying forest pixels, but we may expect some confusion between cropland and wetland (although the individual bands may help to separate these classes). You can save the training data using the write.csv() command, in case something goes wrong after this point and you need to start over again.

Run Random Forest classification

We build the Random Forest model using the training data. For this, we will use the ranger package in R. There is also randomForest package available in R. However, ranger is is implemented in C++ with multithreading and thus is much faster. Using the ranger() function, we will build a model based on a matrix of predictors or covariates (ie. the first 10 columns of trainingData) related to the response (the Class column of trainingData).

Construct a random forest model.

Covariates (x) are found in columns 1 to 10 of trainingData. Training classes (y) are found in the ‘Class’ column of trainingData. Caution: this step takes fairly long! but can be shortened by limiting the number of trees to 100 and by setting importance = FALSE.

library(ranger)

modelRF <- ranger(x = trainingData[, 1:ncol(trainingData)-1], y = trainingData$Class, num.trees = 100,
                  importance = "permutation", seed = 0xfedbeef)

Since the random forest method involves the building and testing of many classification trees (the ‘forest’), it is a computationally expensive step (and could take a lot of memory for especially large training datasets). When this step is finished, it would be a good idea to save the resulting object with the saveRDS() command. Any R object can be saved as an .rds file and reloaded into future sessions using readRDS().

Note: there is a similar functionality using the save() and load() commands, but those can save more than one object and don’t tell you their names, you have to know them. That is why saveRDS()/readRDS() is preferred, but in this tutorial in a lot of cases load is still being used.

The resulting object from the ranger() function is a specialized object of class ranger, which is a large list-type object packed full of information about the model output. Elements of this object can be called and inspected like any list object.

# Inspect the structure and element names of the resulting model
modelRF
class(modelRF)
str(modelRF)
names(modelRF)

# Inspect the prediction error
modelRF$prediction.error

# Calculate the overall accuracy
1 - modelRF$prediction.error

# Inspect the confusion matrix of the OOB error assessment
modelRF$confusion.matrix

The overall accuracy and the confusion matrix are often used to evaluate the results of a supervised classification. However, the confusion matrix can provide more detailed information by giving per-class accuracies.

Note: If you wish to learn how to read a confusion matrix, check out this tutorial.

Earlier we provided a brief explanation of OOB error, though it can be a valuable metric for evaluating your model, it can overestimate the true prediction error depending on the parameters presents in the model.

Since we set importance = "permutation", we now also have information on the statistical importance of each of our covariates which we can retrieve using the importance() command.

importance(modelRF)
##        B02        B03        B04        B05        B06        B07        B11 
## 0.34769114 0.17501099 0.20448235 0.09613555 0.08109066 0.08211339 0.12176549 
##        B12        B8A       NDVI 
## 0.21966728 0.08925907 0.13129564

The above shows the variable importance for a Random Forest model showing the mean decrease in accuracy for each variable.

The mean decrease in accuracy indicates the amount by which the classification accuracy decreased based on the OOB assessment. In this case, it seems that Gewata bands 2, 4 and 12 have the highest impact on accuracy. For large datasets, it may be helpful to know this information, and leave out less important variables for subsequent runs of the ranger() function.

Since the NDVI layer scores relatively low according to the mean accuracy decrease criterion, try to construct an alternate Random Forest model as above, but excluding this layer, you can name it something like ‘modelRF2’.

Question 4: What effect does this have on the accuracy of the results? Hint: Compare the overall accuracies (or the confusion matrices) of the original and new outputs.

Question 5: What effect does leaving this variable out have on the processing time? Hint: Use system.time().

Now we apply this model to the rest of the image and assign classes to all pixels. Note that for this step, the names of the layers in the input SpatRaster (here covs) must correspond exactly to the column names of the training table. We will use the predict() function from the terra package. This function uses a pre-defined model to predict values of raster cells based on a SpatRaster. This model can be derived by a linear regression, for example. In our case, we will use the model provided by the ranger() function.

# Double-check layer and column names to make sure they match
names(covs)
##  [1] "B02"  "B03"  "B04"  "B05"  "B06"  "B07"  "B11"  "B12"  "B8A"  "NDVI"
names(trainingData)
##  [1] "B02"   "B03"   "B04"   "B05"   "B06"   "B07"   "B11"   "B12"   "B8A"  
## [10] "NDVI"  "Class"
# Predict land cover using the RF model
predLC <- predict(covs, modelRF, fun = function(...) predict(...)$predictions)
# Plot the results
# Recall: 1 = cropland, 2 = forest, 3 = wetland
cols <- c("orange", "dark green", "light blue")
plot(predLC, col = cols, legend = FALSE)
legend("bottomright",
       legend = c("cropland", "forest", "wetland"),
       fill = cols, bg = "white")

Note that the predict() function also takes arguments that can be passed to writeRaster() (eg. filename = "", so it is a good idea to write to file as you perform this step (rather than keeping all output in memory).

Applying a raster sieve by identifying patches

Although the land cover raster resulted from the Random Forest has limited number of thematic classes, and we observed some confusion between wetland and cropland classes, it could be useful for constructing a forest mask (since that class performed quite well). To do so, we have to fuse (and remove) non-forest classes, and then clean up the remaining pixels by applying a sieve. We will make use of the patches() function (detecting patches of connected cells) in the terra package.

# Make an NA-value raster based on the LC raster attributes
formask <- setValues(predLC, NA)

# Assign 1 to formask to all cells corresponding to the forest class
formask[predLC == 2] <- 1
plot(formask, col = "dark green", legend = FALSE)

The forest mask here can be used to isolate forest pixels for further analysis. Forest pixels (from the Random Forest classification) have a value of 1, and non-forest pixels have a value of NA.

For some applications, we may only be interested in larger forest areas. We may especially want to remove single forest pixels, as they may be a result of errors, or may not fit our definition of forest.

In this section, we will construct 2 types of sieves to remove these types of pixels, following 2 definitions of adjacency. In the first approach, the so-called Queen’s Case, neighbours in all 8 directions are considered to be adjacent. If any pixel cell has no neighbours in any of these 8 directions, we will remove that pixel by assigning an NA value.

First, we will use the patches() function in the terra package to identify patches of raster cells. This function arbitrarily assigns an ID to these patches.

# Group raster cells into patches based on the Queen's Case
if(!file.exists(fn <- "output/clumformask.tif")) {
  forestpatches <- patches(formask, directions = 8, filename = fn)
} else {
  forestpatches <- rast(fn)
}

plot(forestpatches, col = topo.colors(nrow(forestpatches)))

When we inspect the frequency table with freq(), we can see the number of raster cells included in each of these patch IDs.

# Assign frequency table (to a dataframe)
patchFreq <- freq(forestpatches)

# Inspect the dataframe
head(patchFreq)
tail(patchFreq)

We can use the count column of this frequency table to select patch IDs with only 1 pixel - these are the pixel “islands” that we want to remove from our original forest mask.

# Which rows of the data.frame are only represented by one cell?
str(which(patchFreq$count == 1))

# Which values do these correspond to?
str(patchFreq$value[which(patchFreq$count == 1)])

# Put these into a vector of patch ID's to be removed
excludeID <- patchFreq$value[which(patchFreq$count == 1)]

# Make a new forest mask to be sieved
formaskSieve <- formask

# Assign NA to all patches whose IDs are found in excludeID
formaskSieve[forestpatches %in% excludeID] <- NA

## Zoom in to a small extent to check the results
# Note: you can also define your own zoom by using e <- drawExtent()
e <- ext(c(830000, 834000, 830000, 834000))
par(mfrow = c(1, 2)) # allow 2 plots side-by-side
plot(formask, ext = e, col="dark green", legend = FALSE, main = 'formask')
plot(formaskSieve, ext = e, col="dark green", legend = FALSE, main = 'formaskSieve')

# Reset plotting window
par(mfrow = c(1, 1)) 

We have successfully removed all island pixels from the forest mask using the patches() function. We can adjust our sieve criteria to only directly adjacent (NESW) neighbours: the so-called Rook’s Case. To accomplish this, simply repeat the code above, but supply the argument directions = 4 when calling patches().

We could take this approach further and apply a minimum mapping unit (MMU) to our forest mask.

Question 6: How could you adjust the above sieve to remove all forest pixels with an area below 0.5 hectares? Consider the fact that the pixels in formask are 20m by 20m (see res(formask)), and that one hectare is equal to 10000m2.

Working with thematic rasters

As we have seen with the land cover rasters we derived using the random forest above, the values of a raster may be categorical, meaning they relate to a thematic class (e.g. ‘forest’ or ‘wetland’) rather than a quantitative value (e.g. NDVI or % Tree Cover). The raster dataset ‘lulcGewata’ is a raster with integer values representing Land Use and Land Cover (LULC) classes from a 2011 classification (using SPOT5 and ASTER source data).

# Download and unzip data
download.file("https://github.com/GeoScripting-WUR/AdvancedRasterAnalysis/releases/download/thematic-data/lulcGewata.zip", "data/lulcGewata.zip")
unzip("data/lulcGewata.zip", exdir = "data")

lulcGewata <- rast("data/lulcGewata.tif")

# Check out the distribution of the values
freq(lulcGewata)
##   layer value  count
## 1     1     1 396838
## 2     1     2  17301
## 3     1     3    943
## 4     1     4  13645
## 5     1     5 470859
## 6     1     6 104616
hist(lulcGewata)
## Warning: [hist] a sample of55% of the cells was used (of which 45% was NA)

This is a raster with integer values between 1 and 6, but for this raster to be meaningful at all, we need a lookup table (LUT) to identify these classes. A data.frame defining these classes is also included in the download above:

LUTGewata <- read.csv("data/LUTGewata.csv")
LUTGewata[[1]] <- NULL # Remove the name column
LUTGewata
##   ID             Class
## 1  1          cropland
## 2  2            bamboo
## 3  3         bare soil
## 4  4 coffee plantation
## 5  5            forest
## 6  6           wetland

This data.frame represents a lookup table for the raster we just loaded. The $ID column corresponds to the values taken on by the lulc raster, and the $Class column describes the LULC classes assigned. In R it is possible to add an attribute table to a raster. In order to do this, we need to coerce the raster values to a factor from an integer and add a raster attribute table.

# Set SpatRaster to categorical
lulc <- as.factor(lulcGewata)

# Assign category names to the raster values
levels(lulc) <- LUTGewata
lulc
## class       : SpatRaster 
## dimensions  : 1177, 1548, 1  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : 808755, 855195, 817635, 852945  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 36N (EPSG:32636) 
## source      : lulcGewata.tif 
## categories  : Class 
## name        :    Class 
## min value   : cropland 
## max value   :  wetland

In some cases it might be more useful to visualize only one class at a time. The segregate() function in the terra package does this by producing a SpatRaster object with each layer representing the class membership of each class as a boolean.

classes <- segregate(lulc)

# Layer names follow the order of classes in the LUT
names(classes) <- LUTGewata$Class
plot(classes, legend = FALSE)

Now each class is represented by a separate layer representing class membership of each pixel with 0’s and 1’s. If we want to construct a forest mask as we did above. This is easily done by extracting the fifth layer of this SpatRaster and replacing 0’s with NA’s.

# Subset the fifth layer
forest <- subset(classes, 5)

# note that this is equivalent to:
forest <- classes[[5]]

# or (since the layers are named):
forest <- classes$forest

# Replace 0's (non-forest) with NA's and plot the result
forest[forest == 0] <- NA
plot(forest, col = "dark green", legend = FALSE)

Today’s summary

Today you performed a supervised classification, you identified patches and sieve connected cells, and you learned to deal with thematic raster data. Some functions to remember:

Data exploration

  • hist(): Create a histogram for each layer of a SpatRaster.
  • pairs(): Create a scatterplot for each pair of layers of a SpatRaster.
  • app(): Apply a (custom) function to all pixels of a SpatRaster more efficiently.

Training data preparation

  • extract(): Retrieve a value for the raster below a vector (polygon, line, or point).

Contruct a Random Forest model

  • ranger(): Build a Random Forest model based on a data frame of predictors and a response variable. importance = "permutation" and importance() to get the statistical importance of each predictor.

Run a model on the data

  • predict(): Predict raster values based on a trained model.

Applying a raster sieve by identifying patches

  • setValues(): Set all values of a raster to a certain value or certain values. This is equivalent to MySpatRaster[] = MyValue.
  • patches(): Identify patches of raster cells and attribute an ID to each. Specify neighbors with direction = 4or direction = 8.
  • freq(): Count the number of raster cells per ID.

Working with thematic rasters

  • segregate(): Create a SpatRaster object for each layer or class.