Intro to functions and refresher on R
Learning objectives
- Learn to write a function
- Know how to visualize data via a spatial map using an R script
- Use control flow for efficient function writing
Basic R knowledge useful for Geo-Scripting
Scripting with R to handle spatial problems requires a core set of basic R skills. To prepare you well for the coming weeks, we’ve summarized what we think are important skills for geoscripting below.
Note that this tutorial uses the basic programming structure with lists, strings, and more. For a refresher on these elements click here.
Package installation and creating new directories
To code efficiently, we use packages that contain ready-made features for specific applications. The relevant packages must first be installed before being able to use their features. In this tutorial, we will be using the terra and sf packages. We can therefore go ahead and install them now.
# Check for the terra and sf packages and install if missing
if(!"terra" %in% installed.packages()){install.packages("terra")}
if(!"sf" %in% installed.packages()){install.packages("sf")}
We first check if the package has already been installed, to avoid unnecessary steps. If the package is missing, then we install it.
## terra 1.7.78
## Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
Similarly, we can create new directories for storing our data using
the functions dir.exists
and dir.create
.
Again, we check if these directories are missing before creating them,
to avoid redundancy.
Vector handling and vector arithmetic
In R we call a one-dimensional array of values a vector, and a two-dimensional array of values a matrix.
This can be a bit confusing, because R vectors have nothing to do
with vectors as you know them in GIS. The latter will be introduced in a
later tutorial. You’ll also be introduced in a later tutorial to the
terra package and the SpatRaster
class that has
been developed as part of that package. Since rasters are 2D grids of
values, all objects that belong to the class SpatRaster
are
matrices with a few more spatial attributes.
As a consequence, handling vectors and matrices is a crucial skill for processing raster data in R.
A reference manual for vector handling and vector arithmetic can be found here.
## [1] 3 6 8 1
## [1] 6 12 16 2
## [1] 9 12 14 7
## [1] 9 18 24 3
## [1] 9 18 14 13
Vectors of numbers are called numeric vectors, and matrices of numbers are numeric matrices. Calculations with matrices work the same way as with vectors, i.e. element-wise. So a multiplication between two matrices is element-wise value multiplication, not mathematical matrix multiplication.
Value replacement
## Values in a vector that satisfy a certain condition can be replaced by other values
a <- c(2, 5, 2, 5, 6, 9, 2, 12)
## Values inferior or equal to 5 are replaced by 0
a[a <= 5] <- 0
a
## [1] 0 0 0 0 6 9 0 12
## Condition can be defined using another vector of equal length
a <- c(2, 5, 2, 5, 6, 9, 2, 12)
b <- c(1, 1, 0, 1, 0, 0, 1, 0)
## Change the values of a based on b values
a[b == 0] <- NA
a
## [1] 2 5 NA 5 NA NA 2 NA
More complex value replacement:
a <- c(2, 5, 2, 5, 6, 9, 2, 12)
b <- c(1, 1, 2, 1, 0, 0, 1, 2)
## a values at which b is equal to either 0 or 1 are replaced by NA
a[b %in% c(0, 1)] <- NA
a
## [1] NA NA 2 NA NA NA NA 12
Question 1: How can you replace the values in
b
with 3 at the positions where the corresponding values ina
are either 6, 9, or 12?
Character vector handling
When working with real data, such as satellite imagery or KML files, the data always needs to be read from files. File names and file paths define the location of a file on a computer hard drive. A great advantage of scripting is that locating data, reading and writing data can be fairly easily automated (generate automatic file names, recognize patterns in file names, etc). That requires from the user some basic character vector (string) handling skills.
Key functions for handling character strings are listed below:
list.files()
: to list files in a directory.glob2rx()
: to select files using a wildcard. Note: if you know regular expressions, you do not need that function. But most people are more comfortable working with wildcards. For a list of wildcards click here.paste()
,paste0()
,sprintf()
: useful to combine vectors e.g. to create a file name.strsplit()
: to split up a string.switch()
: to return a value based on a match.
Example of list.files()
Question 2: List the directories in your working directory.
Example of glob2rx()
Example of paste()
and paste0()
## [1] "Today is Wed Sep 11 14:18:09 2024"
## [1] "A1" "A2" "A3" "A4" "A5" "A6"
Question 3: Create one variable containing a sentence
“geoscripting is fun”
, by combininga
,b
, andc
:
Example of strsplit()
# I have the following character string
name <- 'today_is_friday_12-12-2014'
# I want to extract the date contained in it, I can split it based on the underscores and the fourth element should be the date
date0 <- unlist(strsplit('today_is_friday_12-12-2014', split = '_'))[4]
# Which can then be formatted as a date object (until now it is a character string)
class(date0)
## [1] "character"
## [1] "2014-12-12"
## [1] "Date"
Question 4: How do we select
friday
from thename
variable?
Reading and writing data
In later tutorials we will show you how you can read and write different spatial objects (e.g. vector and raster files).
Here, an example is given how you can read (import into R) and write (export from R) a text file.
The most common way to read in spreadsheet tables is with the
read.csv()
command. However, you can read in virtually any
type of text file. Type ?read.table
in your console for
some other examples.
## [1] "/home/osboxes/Scripting4GeoIntro"
## [1] "1" "2" "3" "4" "5" "6, 7" "8, 9, 10"
# Write to your working directory
write.csv(test, file = "testing.csv")
# Remove the variable "test" from the R working environment
rm(test)
# Check that the object is no longer in the working environment
ls()
## [1] "a" "b" "country" "countrycode" "d"
## [6] "date" "date0" "e" "f" "name"
## X x
## 1 1 1
## 2 2 2
## 3 3 3
## 4 4 4
## 5 5 5
## 6 6 6, 7
## 7 7 8, 9, 10
Writing a function
It is hard to unleash the full potential of R without writing your own functions. Luckily it’s very easy to do. Here are some trivial examples:
## [1] 5
An example of setting the default argument values for your function:
## [1] 6
## [1] 7
That’s about all there is to it. The function will generally return the result of the last line that was evaluated.
Question 5: How do you write a function that returns
x
andz
?
Now, let’s declare a new object, a new function, newfunc
(this is just a name and if you like you can give this function another
name). Appearing in the first set of brackets is an argument list that
specifies (in this case) two names. The value of the function appears
within the second set of brackets where the process applied to the named
objects from the argument list is defined.
Next, a new object a2b
is created which contains the
result of applying newfunc
to the two objects you have
defined earlier. The second R command below prints this new object to
the console.
## [1] 8 2 4
Finally, you can now remove the objects you have created to make room for the next exercise by selecting and running the last line of the code.
Good scripting/programming habits in R
In previous tutorials we have gone over good programming habits in R and Python. Here’s a refresher, with some additional points specific to R:
- Comment your code.
- Write functions for code you need
more than once:
- Make your functions generic and flexible, using control flow.
- Document your functions.
- Follow a R style
guide. This will make your code more readable! Most important are:
- Meaningful and consistent naming of files, functions, variables…
- Indentation (like in Python: use spaces or tab to indent code in functions or loops etc.).
- Consistent use of the assignment operator: either
<-
or=
in all your code. The former is used by core R and allows assigning in function calls, the latter is shorter and consistent with most other programming languages. - Consistent placement of curly braces.
- Never place
rm(list = ls())
anywhere in your script. If you do so, you may find your computer set on fire. - Set your working directory relative to where you saved your R script: go to Session in the RStudio menu bar → Set Working Directory → To Source File Location
Warning: The use of setwd()
is not a good
practice in scripts without a very, very good reason. The users are
aware of the environment and working directories and scripts should not
try to outsmart the users. Never set either of the following in your
script! This would make the script not reproducible by others!
setwd(“/some/path/on/your/computer”)
setwd(“../some/path/outside/your/repository”)
Tip: You can use relative paths to load your data. For example by setting the path to a file “my-table.csv” in the directory “data”, relative to your script file: datafile <- file.path(“data”, “my-table.csv”) Then you can for example read in the file using: mytable <- read.csv(datafile) # Do something with that file name
Note that R IDEs like RStudio make a lot of these good practices a lot easier and you should try to take maximum advantage of them.
Below is an example of a function written with good practices and without. First the good example:
ageCalculator <- function(x) {
# Function to calculate age from birth year
# x (numeric) is the year you were born
if(!is.numeric(x)) {
stop("x must be of class numeric")
} else { # x is numeric
# Get today's date
date <- Sys.Date()
# extract year from date and subtract
year <- as.numeric(format(date, "%Y"))
if(year <= x) {
stop("You aren't born yet")
}
age <- year - x
}
return(age)
}
ageCalculator(1985)
## [1] 39
What a beautiful age for learning geoscripting!
Then the bad example:
# DON'T DO THIS, BAD EXAMPLE!!!
funTest_4 <- function(x) {
if( !is.numeric(x))
{
stop("x must be of class numeric" )
}
else {
a = Sys.Date()
b<- as.numeric( format( a,"%Y"))
b-x
}
}
funTest_4(1985)
## [1] 39
Note that this also does work. But which of the two is the easiest to read, understand, and modify if needed? … Exactly, the first one. So let’s look back at the examples and identify some differences:
- Function name: Not very self descriptive in the second example.
- Function description: Missing in the second example.
- Arguments description: Missing in the second example.
- Comments: The second example has none (okay, the first one really has a lot, but that’s for the example).
- Variables naming: use of
a
andb
not very self descriptive in second example. - Indentation: Missing in the second example.
- Control flow: Second example does not check for implausible dates.
- Consistency: Second example uses spaces, assignment operators and curly braces inconsistently.
You haven’t fully understood what control flow is or you are not fully comfortable with function writing yet? We’ll see more of that in the following sections.
Function writing
A function is a sequence of program instructions that perform a specific task, packaged as a unit. This unit can then be used in programs wherever that particular task should be performed. -Wikipedia
The objective of this section is to provide some help on effective function writing. That is functions that are:
- simple
- generic
- flexible
They should integrate well in a processing/analysis chain and be easily be re-used in a slightly different chain if needed. More flexibility in your function can be achieved through some easy control flow tricks. The following section develops this concept and provides examples.
Control flow
Control flow refers to the use of conditions in your code that redirect the flow to different directions depending on variables values or class. Make use of that in your code, as this will make your functions more flexible and generic.
Object classes and Control flow
You have seen in a previous tutorial already that every variable in your R working environment belongs to a class. You can take advantage of that, using control flow, to make your functions more flexible. First, let’s introduce a new class.
# Check for the terra package and install if missing
if(!"terra" %in% installed.packages()){install.packages("terra")}
library(terra)
c <- rast(ncol = 10, nrow = 10)
class(c)
## [1] "SpatRaster"
## attr(,"package")
## [1] "terra"
Here we used the function rast
from the terra
package to create an object of class SpatRaster
.
SpatRaster
is a class for rasters (or matrices) that
comprise the necessary elements for spatial data. With terra
and SpatRaster
, complex spatial data will be processed more
efficiently by using specifically tailored functions for
SpatRaster
. This will be studied further in another tutorial.
Controlling the class of input variables of a function
One way of making functions more auto-adaptive is by adding checks of the input variables. Using object class can greatly simplify this task. For example let’s imagine that you just wrote a simple Hello World function.
HelloWorld <- function (x) {
hello <- sprintf('Hello %s', x)
return(hello)
}
# Let's test it
HelloWorld('john')
## [1] "Hello john"
## [1] "Hello 2.5"
## [1] "Hello Devis" "Hello Martin"
## character(0)
## [1] "Hello 1:10"
## [2] "Hello c(\"Name\", \"Name\", \"Name\", \"Name\", \"Name\", \"Name\", \"Name\", \"Name\", \"Name\", \"Name\")"
Surprisingly enough, R is smart enough to give intuitive output in
most of the tested cases, since the sprintf
function
automatically casts non-character variables into character ones, and is
also vectorised to produce output when the input is a vector. However,
in the last two cases, the output is not intuitive. We may want to only
allow passing character vectors to this function. We can do this with a
small change:
HelloWorld <- function (x) {
if (!is.character(x))
stop('Object of class "character" expected for x')
hello <- sprintf('Hello %s', x)
return(hello)
}
HelloWorld(21)
## Error in HelloWorld(21): Object of class "character" expected for x
The function now throws an informative error when something not
supported is requested. These function argument “sanity checks” are
useful to avoid lengthy processing, when we know that the output of the
function is going to be wrong anyway because of invalid arguments.
Alternatively to stop()
, the function could throw a
warning()
but still return some value.
Question 6: In which cases should you use
stop()
,warning()
,message()
, and when should you return a string?
Note that most common object classes have their own logical function
(that returns TRUE
or FALSE
) to check what
class it is. For example:
## [1] TRUE
## [1] TRUE
## [1] FALSE
## [1] TRUE
You should always try to take maximum advantage of these small utilities and check for classes and properties of your objects. This is important in some cases that you might not think of in advance, for instance, consider an object with more than one class:
a = list(a = 1:10, b = "b")
# We can also set a class (only do that if you make your own class!)
class(a) = c("myclass", "list")
## Error in if (class(a) == "list") {: the condition has length > 1
## [1] "a is a list"
Also note that is.character(32) == TRUE
is equivalent to
is.character(32)
. Therefore when checking logical
arguments, you don’t need to use the == TRUE
. As an
example, a function may have an argument (say, plot
) that,
if set to TRUE
will generate a plot, and if set to
FALSE
does not generate a plot. It means that the function
certainly contains an if statement. if(plot)
in that case
is equivalent to if(plot == TRUE)
, it’s just shorter (and
very slightly faster).
An example, with a function that subtracts 2 SpatRasters, with the option to plot the resulting SpatRaster, or not.
library(terra)
# Function to subtract 2 SpatRasters
minusRaster <- function(x, y, plot=FALSE) {
z <- x - y
if (plot) {
plot(z, 1) # Plots the first layer of the resulting SpatRaster
}
return(z)
}
# Let's generate 2 SpatRasters. The first one is the R logo raster
# converted to the terra package file format
r <- rast(system.file("ex/logo.tif", package = "terra"))
# The second SpatRaster is derived from the initial SpatRaster in order
# to avoid issues of non matching extent or resolution, etc
r2 <- r
# Now we fill the second SpatRaster with new values
# The /10 simply makes the result more spectacular
r2[] <- (1:ncell(r2)) / 10
# Simply performs the calculation
r3 <- minusRaster(r, r2)
# Now performs the calculation and plots the resulting SpatRaster
r4 <- minusRaster(r, r2, plot=TRUE)
Vectorised functions
A lot of core R functions are vectorised, i.e. they are capable of taking vectors, rather than individual values, as input. This allows very simple and powerful syntax without needing to use loops, for instance:
NumVec = 1:10
# We do not need to run the function on each element individually
as.character(NumVec)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
## [1] 5.5
## [1] 1 10
## [1] 101 102 103 104 105 106 107 108 109 110
## [1] 1 4 9 16 25 36 49 64 81 100
Because most base functions are already vectorised, it is easy to write new functions that are themselves vectorised. For the most part, all you need to do is test your functions not just on single values, but also vectors of values, and think whether the result makes sense.
Vectorisation also allows us to write short and versatile code. For instance, to check whether the input is a positive number:
## [1] TRUE
## [1] FALSE TRUE NA
## [1] FALSE FALSE
Question 7: Why does the following function call return different and counter-intuitive results?
## [1] TRUE FALSE TRUE
Type consistency
We just became familiar with functions for checking the types
(classes) of variables. You may have noticed that there is one way in
which they are all consistent: no matter the input, they return a
logical
value (TRUE
, FALSE
, or
also NA
). This is called type consistency and is a useful
property. Consider code like this:
AddNewSubject = function(NewBirthYear) {
# Birth years of previous subjects
BirthYears = c(1980, 1985, 1987, 1990, 1993, 1994, 1998, 2000)
return(c(BirthYears, NewBirthYear))
}
# Works as expected
NewSubjects = AddNewSubject(c(1995, 1998, 1999, 2000))
sum(NewSubjects >= 2000) # How many subjects are born on or after 2000
## [1] 2
## [1] 3
Question 8: What happened, and what would happen if we didn’t include the Roman numerals?
AddNewSubject
is not type consistent: depending on the
input, the output class changes. This is sometimes convenient, but in
the case above, the function succeeds but gives an output that is
completely misleading. This has gone horribly right: a lot of code in R
is flexible and can handle multiple input types, and so you may only
notice that your output is wrong when you inspect it yourself. If the
function AddNewSubject
checked the input and always
returned integers, we would be sure that comparing its output against
numbers and summing them is safe. Similarly, the
is.numeric()
etc. functions are type consistent, and thus
it is safe to use them in if
statements which always
require a logical input. This allows us to not need extra type checking
after running such a function. In addition, the function name is hinting
towards the type consistency: is.
makes us realise that it
will give a yes/no answer.
try()
and debugging
Use of try()
for error handling
The try()
function may help you writing functions that
do not stop with a cryptic error whenever they encounter an unknown of
any kind. Anything (sub-function, piece of code) that is wrapped into
try()
will not interrupt the bigger function that contains
try()
. So for instance, this is useful if you want to apply
a function sequentially but independently over a large set of raster
files, and you already know that some of the files are corrupted and
might return an error. By wrapping your function into try()
you allow the overall process to continue until its end, regardless of
the success of individual layers. So try()
is a perfect way
to deal with heterogeneous/unpredictable input data.
Also try()
returns an object of different class when it
fails. You can take advantage of that at a later stage of your
processing chain to make your function more adaptive. See the example
below that illustrate the use of try()
for sequentially
calculating frequency on a list of auto-generated SpatRasters.
library(terra)
# Create a SpatRaster and fill it with "randomly" generated integer values
a <- rast(nrow = 50, ncol = 50)
a[] <- floor(rnorm(n = ncell(a)))
# The freq() function returns the frequency of a certain value in a SpatRaster
# We want to know how many times the value -2 is present in the SpatRaster
freq(a, value = -2)$count
## [1] 319
# Let's imagine that you want to run this function over a whole list of SpatRaster
# but some elements of the list are impredictibly corrupted, so the list looks as follows
b <- a
c <- NA
rasterList <- list(a, b, c)
# Now, b and a are SpatRasters, and c is ''corrupted''
# Running freq(c) would return an error and stop the whole process
out <- list()
for(i in 1:length(rasterList)) {
out[i] <- freq(rasterList[[i]], value = -2)$count
}
## Error: unable to find an inherited method for function 'freq' for signature 'x = "logical"'
# If you wrap the call in a try(), you still get an error, but it's non-fatal
out <- list()
for(i in 1:length(rasterList)) {
out[i] <- try(freq(rasterList[[i]], value = -2)$count)
}
## Error : unable to find an inherited method for function 'freq' for signature 'x = "logical"'
## [[1]]
## [1] 319
##
## [[2]]
## [1] 319
##
## [[3]]
## [1] "Error : unable to find an inherited method for function 'freq' for signature 'x = \"logical\"'\n"
# By building a function that includes a try() we are able to catch the error
# without having it printed, allowing the process to handle the error gracefully.
fun <- function(x, value) {
tr <- try(freq(x = x, value = value)$count, silent=TRUE)
if (class(tr) == 'try-error') {
return('This object returned an error')
} else {
return(tr)
}
}
# Let's try to run the loop again
out <- list()
for(i in 1:length(rasterList)) {
out[i] <- fun(rasterList[[i]], value = -2)
}
out
## [[1]]
## [1] 319
##
## [[2]]
## [1] 319
##
## [[3]]
## [1] "This object returned an error"
# Note that using a function of the apply family would be a more
# elegant/shorter way to obtain the same result
(out <- sapply(X = rasterList, FUN = fun, value = -2))
## [1] "319" "319"
## [3] "This object returned an error"
Function debugging
Debugging a single line of code is usually relatively easy; simply double checking the classes of all input arguments often gives good pointers to why the line crashes. But when writing more complicated functions where objects created within the function are reused later on in that same function or in a nested function, it is easy to lose track of what is happening, and debugging can then become a nightmare. A few tricks can be used to make that process less painful.
traceback()
and debugonce()
Here are the manual commands, which also work with RStudio and other IDEs:
- The first thing to investigate right after an error occurs is to run
the
traceback()
function; just like that without arguments. - Carefully reading the return of that function will tell you where exactly in your function the error occurred.
foo <- function(x) {
x <- x + 2
print(x)
bar(2)
}
bar <- function(x) {
x <- x + a.variable.which.does.not.exist
print(x)
}
foo(2)
# gives an error
traceback()
## 2: bar(2) at #1
## 1: foo(2)
# Ah, bar() is the problem
# Debug it by declaring what to debug and running it
debugonce(bar)
foo(2)
Depending on the IDE you are using, you may be presented with tools for stepping through the function line by line, as well as a Browse console, which allows you to query the state of the variables involved so that you can identify exactly what is going on in the function call. For instance, in RKWard, the Debugging Frames pane on the right shows which line you are stepping through.
RStudio
RStudio has integration with the debugging tools in R, so you can use a point-and-click interface. However, some parts of it are specific to the RStudio IDE.
- To force them to catch every error, select Debug - On Error - Break in Code in the main menu.
- Run again
foo(2)
. - RStudio will stop the execution where the error happened. The traceback appears in a separate pane on the right.
- You can and use the little green “Next” button to go line by line through the code, or the red Stop button to leave the debugging mode.
- Reset the On Error behaviour to Error Inspector. In this default setting, RStudio will try to decide whether the error is complex enough for debugging, and then offer the options to “traceback” or “rerun the code with debugging” with two buttons in the console.
Finally, solve the problem:
Refer to the reference section of this document for further information on function debugging.
Creating a map within R - a simple demo
Here is an example of how you can create a map in R. We will make use of a subset of the Global Adminstrative Areas database (GADM):
# Create data and output directories and download data from URL
data_URL <- "https://github.com/GeoScripting-WUR/Scripting4GeoIntro/releases/download/gadm-data/gadm41_PHL_2.zip"
data_dir <- "data"
if (!dir.exists(data_dir)) {
dir.create(data_dir)
}
if (!file.exists('data/data.zip')) {
download.file(url = data_URL, destfile = file.path(data_dir, "data.zip"))
unzip('data/data.zip', exdir = 'data')
}
We load the administrative boundaries of the Philippines, using the
sf
package, to which you will be introduced more in-depth
in a later tutorial:
# Check for the sf package and install if missing
if(!"sf" %in% installed.packages()){install.packages("sf")}
library(sf)
# Load the data and subset its geometry
adm <- st_read(file.path(data_dir, "gadm41_PHL_2.json"), quiet = TRUE)
adm_geom <- st_geometry(adm)
# Create and example plot
plot(adm_geom[adm$NAME_1 == "Tarlac"])
Try to understand the code below, and let us know if you have questions. Feel free to use this code as an example for exercise 5.
mar <- adm_geom[adm$NAME_1 == "Marinduque"]
plot(mar, bg = "dodgerblue", axes = TRUE)
plot(mar, lwd = 10, border = "skyblue", add = TRUE)
plot(mar, col = "green4", add = TRUE)
grid()
box()
invisible(text(st_coordinates(st_centroid(mar)),
labels = as.character(adm$NAME_2[adm$NAME_1 == "Marinduque"]), cex = 1.1, col = "white", font = 2))
mtext(side = 3, line = 1, "Provincial Map of Marinduque", cex = 2)
mtext(side = 1, "Longitude", line = 2.5, cex = 1.1)
mtext(side = 2, "Latitude", line = 2.5, cex = 1.1)
mtext(side = 1, line = -2,
"Projection: Geographic\n
Coordinate System: WGS 1984 \n
Data Source: GADM.org ", adj = 1, cex = 0.5, col = "grey20")