Week 1, Lesson 3: Carrying out your R project

WUR Geoscripting

Good morning! Here is what you will do today:

Time                   | Activity
Morning                | Self-study: go through the following tutorial
13:30 to 14:30         | Presentation and discussion
Rest of the afternoon  | Do/finalise the exercise

Introduction

During the previous lecture, you saw some general aspects of the R language, such as an introduction to the syntax, object classes, reading of external data and function writing.

Today it's about carrying out a geoscripting project. This tutorial is about R, but a lot of it can be applied to other languages!

Scripting means that you often go beyond easy things and therefore face challenges. It is normal to have to look for help. This lesson will guide you through ways of finding help. It continues with a couple of "good practices" for scripting, debugging and geoscripting projects. This includes using version control and project management.

Learning objectives

At the end of the lecture, you should be able to

  • Use version control to develop, maintain, and share your code with others
  • Find help for R related issues
  • Produce a reproducible example
  • Adopt some good scripting/programming habits
  • Use control flow for efficient function writing

Version control

Important note: you need to have git installed and properly configured on your computer to do the following. Visit the system setup page for more details. Git is preinstalled in the PC lab and on virtual machines already.

What is version control?

Have you ever worked on a project and ended up having so many versions of your work that you didn't know which one was the latest, and what were the differences between the versions? Does the image below look familiar to you? Then you need to use version control (also called revision control). You will quickly understand that although it is designed primarily for big software development projects, being able to work with version control can be very helpful for scientists as well.

file name

The video below explains some basic concepts of version control and what the benefits of using it are.

What is VCS? (Git-SCM) • Git Basics #1 from GitHub on Vimeo.

So to sum up, version control allows you to keep track of:

  • When you made changes to your files
  • Why you made these changes
  • What you changed

Additionally, version control:

  • Facilitates collaboration with others
  • Allows you to keep your code archived in a safe place (the cloud)
  • Allows you to go back to a previous version of your code
  • Allows you to find out what changes broke your code
  • Allows you to have experimental branches without breaking your code
  • Allows you to keep different versions of your code without having to worry about file names and archiving organization

The three most popular version control systems are Git, Mercurial (abbreviated as hg) and Subversion (abbreviated as svn). Git is by far the most modern and popular one, so we will only use Git in this course.

Git

What git does

Git keeps track of changes in a local repository you set up on your computer. Typically that is a folder that contains all your code and optionally the data your code needs in order to run. The local repository contains all your files, but also (in a hidden folder) all the changes to the files you have made. It does not keep track of all files automatically: you need to tell git which files to track and which not. Therefore a repository contains your current tracked files (workspace), an index of files that are being tracked, and the version history.

Every time you make significant changes to the files in your workspace, you have to add the changed files to the index, which selects the files whose changes you want to save, and commit them, which means saving the changes to the history tracking of your local repository.

Often you also set up a remote repository, stored on an online platform like GitHub or a similar service. It is simply a remotely-hosted mirror of your local repository and allows you to have your work stored in a safe place and accessible from your other computers and to potential collaborators. Once in a while (at the end of the day, or after every new commit if you want) you can push your commits, which means sending them to the remote repository so it stays in sync with your local one. When you want to update your local repository based on the content of a remote repository, you have to pull the commits from the remote repository.

Summary of git semantics

  • add: Tell git that you want a file or changes to be tracked. These files/changes are not yet saved in the repository! They are listed as "staged" in the index or staging area for the next commit.
  • commit: Save the staged changes to your local repository. This is like putting a milestone or taking a snapshot of your project at that moment. A commit describes what has been changed, why and when. In the future you can always revert all tracked files to the state they were at when you created the commit.
  • push: Send previous changes you committed to the local repository to the remote repository.
  • pull: Update your local repository (and your workspace) with all new stuff from the remote repository. This command is simple, but it changes the files in your workspace in one step by merging in the commits from the remote server, which may lead to conflicts with your own changes. Hence it is not available in the Git GUI.
    • fetch: Get information about the latest commits from the remote repository, but do not apply them to your local repository automatically. This is always safe as it does not change your workspace.
    • merge: Merges two versions (branches) into one, applying the result to the workspace. This includes merging commits from the remote repository with the commits of the local repository. In effect, a fetch followed by a merge is the same as a pull, but it allows you more fine-grained control and is available through the Git GUI.
  • clone: Copy the content of a remote repository locally for the first time.
  • more advanced:
    • branch: Create a branch (a parallel version of the code in the repository)
    • checkout: Load the status of a branch into your workspace
git flows

Setting up Git

Using GitHub and Git GUI

The easiest way to start with Git is to let a git repository host create a repository for you. To do that, we will use GitHub.

  1. Create a GitHub account

Go to GitHub and create an account if you don't have one yet (it's free).

  2. Create remote repository

In GitHub, press the Create new... button ("+" at the top right corner) and select New repository. Give it a descriptive name and a short description, check the Initialize this repository with a README box and then finalize the creation of that repository.

Note: You are also asked to provide a license for your code. Note that code in a public GitHub repository is visible to everyone, so a license is crucial to let others know what you allow them to do with your code. Code without a license is copyrighted by default, and thus nobody is allowed to make use of your code or contribute to it. For real projects, you will want to set a more permissive license so that others can make use of your code. See Choose a License for a quick overview of what licenses are available.

  3. Launch Git GUI

Launch a program called Git GUI from your start menu (or command line). Git GUI is a graphical interface to Git that comes with Git itself, and is thus cross-platform and always available. When launched, it looks something like this:

Main screen of Git GUI

  4. Create an SSH key

This step is crucial for both security and the use of virtual machines in the next lesson. Do not skip this!

In order for GitHub (and other services) to verify that the machine connecting to it is indeed owned by you, there are two options: using a password, or using an SSH key. SSH keys are much more secure than passwords, and they do not require you to enter a password every time you try to do something. Therefore we will use SSH keys.

You can generate a new SSH key by going to Help → Show SSH Key and pressing the Generate Key button. It will ask you for a passphrase. This is optional: it is useful in case your SSH key is stolen, for instance by a thief stealing your laptop or by a virus, but since SSH keys are specific to a machine, it is most of the time completely fine to leave the passphrase empty, so you do not need to enter it every time.

Once done, you will see your new key:

SSH public key generated

  5. Enrol the key to your user account

The SSH key is used to identify that you own the machine. In the case of the PC lab, the keys are stored on your M: drive, so they will follow you to any computer in the PC lab. Now you need to tell GitHub about your new key.

To do that, copy the public key from the dialog, then in GitHub click on your avatar in the top right, go to Settings → SSH and GPG keys and press New SSH key. Give it a title describing your machine, paste the key in the box, and press Add SSH key. You might need to confirm the key by email for added security.

This only has to be done once (per machine/OS you use GitHub on).

  6. Clone your new repository

Go back to the repository you just created, and copy its SSH address by clicking on the Clone or Download button and copying the URL in the Clone with SSH box. If you don't see it, you might need to click on the Use SSH button.

Clone or Download → Clone with SSH

Go back to Git GUI, and press Clone Existing Repository. Paste the URL you just copied into the Source Location field, and in the Target Directory field choose a folder in which you want to store your code. Note: the Target Directory must not already exist! Git GUI will create it for you.

Once you click Clone, you will get a question about whether you trust the remote machine (if you ran git gui from a command line, the question will appear there). You need to answer this with yes. This puts the GitHub server into a list of trusted servers, to guard against potential impostor servers.

You will end up in an empty Git GUI window:

Git GUI in an empty directory

But if you check the folder you cloned the repository in, it will show the README.md file.

  7. Make changes

To see Git in action, you need to make some changes in your repository. Try it: edit the README.md file, and create a new R script file in the repository.

Once you are done, go back to Git GUI. If you closed the window, you can get back to your repository by launching Git GUI and clicking on its path in the Open Recent Repository list. You will see some changes:

Changes pending in Git GUI

In the Unstaged Changes panel at the top left corner, you can see all the files that changed in your workspace. If you click on the name of a file, the main panel will show you what changed since the last commit, unless it is a non-text (data) file, in which case it will just note that something has changed. Note: Git is very efficient at storing text files: successive versions are compressed internally so that mostly the differences between them take up space. However, it does not deal efficiently with non-text files, and thus you should limit the number and size of such files as much as possible.

If you click on the icon of the file in the Unstaged Changes panel, the file changes will be staged and appear in the Staged Changes (Will Commit) panel. These are the file changes you want to save. You don't have to stage all files for each commit, only those you actually want to be tracked by git. You can safely ignore some files such as manual backups, temporary files, and the like, and they will remain untracked by git as long as you never stage them. If you do want to stage everything, you can press the Stage Changed button. If you staged more than you wanted to, you can click on the file icon in the Staged Changes panel to unstage it.

  8. Commit changes

Once you have staged the files that you want to commit, you need to fill out the commit message. This is a brief description of what changes you made between the last commit and the one you are about to create, and why you made them. The first line you enter is the name of the commit; keep that one short. Subsequent lines are the description. You may notice that the Commit message box does not have a horizontal scrollbar: that is intentional, because your commit message should fit within that box without the need for scrolling. Use new lines to break the text.

If it is the first time you use Git GUI to make a commit, it might complain about it not knowing who you are. You should go to Edit → Options... and fill out the Global (All Repositories) options User Name and Email Address. These will be displayed on GitHub.

Next press the Commit button and your commit will be saved locally. You will always be able to go back to the state that this commit was in.

In case you made a mistake (a typo in the message, forgot to stage something, etc.), you can press the Amend Last Commit button and get right back to where you were when you made the last commit; but use this functionality very sparingly, as it does not work with changes already sent to GitHub.

  9. Push changes to the server

Press the Push button, and confirm the push, to send all your changes to your GitHub repository. You can now refresh the GitHub page to see your changes. Well done!

  10. Pull changes from the server

One of the major uses of Git is collaboration and the ability to synchronise changes across different devices. Multiple users can make changes in the same Git repository (as long as you change the repository settings in GitHub to allow another user to do that), and you can work on the same code on different devices yourself. In both cases, it is important to keep all local repositories in sync with the remote repository. That is done via Git GUI by using Fetch and Merge. If you like, you can test it by cloning the same repository in another folder, making changes and pushing them to the server, then using fetch in the other copy.

If there are any changes on GitHub that are not in your local copy yet, in Git GUI go to Remote → Fetch from → origin to download all changes. This will not apply them yet, however.

To attempt to apply the changes, go to Merge → Local Merge.... If all goes well, the changes will be applied.

There may be cases where files go out of sync in incompatible ways, however, like two people editing one file at the same time. In that case you may hit a merge conflict. It is best to try to avoid them. In case it happens, you need to go through the conflicting files in a text editor and edit them by hand, keeping the parts of the files you need. The conflicting parts are delimited by lines of <<<<<<<, ======= and >>>>>>> symbols. Once you have removed the parts you don't need (including the separators), you can resolve the conflict by committing the changes.

Other Git GUI functionality

Git GUI not only provides a way to make, push and pull commits, but also to visualise the commit history of your repository in a tree graph. Go to Repository → Visualise Master's History to see it. For larger and more complex projects with lots of contributors and merges, it might look like some sort of a subway map:

Git GUI history (gitk)

You might run into a situation where you have made changes in tracked files, but do not want to keep some of the changes. You can revert one file by selecting it in Git GUI, then clicking Commit → Revert Changes.

Project structure

Try to keep a consistent structure across your projects, so that it is easier for you to switch from one project to the other and immediately understand how things work. You may use the following structure:

  • A main.R script at the root of the project. This script performs step by step the different operations of your project. It is the only non-generic part of your project (it contains paths, already set variables, etc).
  • An R/ subdirectory: This directory should contain the functions you have defined as part of your project. These functions should be as generic as possible and are sourced and called by the main.R script.
  • A data/ subdirectory: This directory contains data sets of the project. Since Git is not as efficient with non-text files, and GitHub has storage limits, you should only put small data sets in that directory (<2-3 MB). These can be shapefiles, small rasters, csv files, but perhaps even better is to use R's own archive formats. R offers two archive formats for storing objects from your environment: .rda (one or more objects, saved together with their names) and .rds (a single object).
  • An output/ subdirectory (when applicable).
project structure
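
The layout and the two archive formats described above can be tried out with a few lines of base R. This is only a minimal sketch: the project name myProject and the variable speed_limit are made up for illustration, and the project is created in a temporary folder so that nothing on your machine is touched.

```r
# Build the suggested layout inside a temporary folder
# ("myProject" is a hypothetical project name)
project <- file.path(tempdir(), "myProject")
for (d in c("R", "data", "output")) {
    dir.create(file.path(project, d), recursive = TRUE, showWarnings = FALSE)
}
file.create(file.path(project, "main.R"))

# A small built-in data set to store
data(cars)

# .rds stores a single object; you choose its name when reading it back
rds_file <- file.path(project, "data", "cars.rds")
saveRDS(cars, rds_file)
cars_copy <- readRDS(rds_file)

# .rda stores one or more objects together with their names
rda_file <- file.path(project, "data", "input_model.rda")
speed_limit <- 20  # hypothetical extra variable
save(cars, speed_limit, file = rda_file)
rm(speed_limit)
restored <- load(rda_file)  # load() returns the names of the restored objects
restored

# Show the resulting project content
list.files(project, recursive = TRUE)
```

Because load() silently recreates objects under their saved names, .rds is often the safer choice when you only need to store one object.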

Example main.R file

Typically the header of your main script will look like this.

# John Doe
# January 2017
# Import packages
library(raster)
library(sp)
# Source functions
source('R/function1.R')
source('R/function2.R')
# Load datasets 
load('data/input_model.rda')

# Then the actual commands

Bigger data

The data/ directory of your project should indeed only contain relatively small data sets. Bigger remote sensing data sets should stay outside the project, in the location where you store the rest of your data.

Example

  • Create 3 files in your R/ directory (ageCalculator.R, HelloWorld.R and minusRaster.R) in which you will copy paste the respective functions.
  • Create a main.R script at the root of your project and add some code to it. The content of the main.R in that case could be something as below.
# Name
# Date
library(raster)

source('R/ageCalculator.R')
source('R/HelloWorld.R')
source('R/minusRaster.R')


HelloWorld('john')
ageCalculator(2009)

# import dataset
r <- raster(system.file("external/rlogo.grd", package="raster")) 
r2 <- r 
# Filling the rasterLayer with new values.
r2[] <- (1:ncell(r2)) / 10
# Performs the calculation
r3 <- minusRaster(r, r2) 

RStudio projects

RStudio has a functionality called projects that allows organising your files a bit better. You may have learned that one of the first things to do when opening an R session is to set your working directory, using the setwd() command. When creating a new project, the working directory is automatically set to the root of the project, where your main.R is located. When working with RStudio projects you should not change the working directory. If you want to access things stored in your data/ subdirectory, simply prepend data/ to the name of the file you want to load.

Note: RStudio projects are not compatible with base R or with other R IDEs. Their use is optional.

RStudio itself has integration with Git, and when creating a project there is an option to make it a git repository as well. However, in this lesson we will not be using this method, since it is specific to RStudio and does not work for Python or in other R IDEs. Git GUI, in contrast, is language-agnostic and standalone. So if you do create a project with RStudio, create it inside your cloned GitHub repository, and do not select the option to create a new git repository. Then use Git GUI to handle all the changes you make in the repository. This will save you the confusion about how to handle it when we come to the Python lessons.

Finding help

Sources for help

The most important helper is the R documentation. In the R console, just enter ?function or help(function) to get the manual page of the function you are interested in.
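
As a small sketch of how this looks in practice, using only functions from base R and the utils package: the manual page is assigned to a variable here so that the snippet also runs as a script, whereas at the console you would simply type ?mean or help(mean).

```r
# Look up the manual page of a function.
# At the console, ?mean and help(mean) are equivalent and open the page directly.
page <- help("mean")

# Not sure about the exact function name? List objects matching a pattern
matches <- apropos("^read\\.csv")
matches

# Inspect the arguments of a function without opening its manual page
args(read.csv)
```

Typing example(mean) at the console additionally runs the examples listed at the bottom of a manual page.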

There are many places where help can be found on the internet. So in case the function or package documentation is not sufficient for what you are trying to achieve, a search engine like Google is your best friend. Most likely by searching the right key words relating to your problem, the search engine will direct you to the archive of the R mailing list, or to some discussions on Stack Exchange. These two are reliable sources of information, and it is quite likely that the problem you are trying to figure out has already been answered before.

However, it may also happen that you discover a bug or something that you would qualify as abnormal behavior, or that you really have a question that no one has ever asked (corollary: has never been answered). In that case, you may submit a question to one of the R mailing lists. For general R questions there is a general R mailing list, while the spatial domain has its own mailing list (R SIG GEO). Geo related questions should be posted to this latter mailing list.

Note: these mailing lists have heavy mail traffic. Use your mail client efficiently and set filters, otherwise it will quickly bother you.

These mailing lists have a few rules, and it's important to respect them in order to ensure that:

  • no one gets offended by your question,
  • people who are able to answer the question are actually willing to do so,
  • you get the best quality answer.

So, when posting to the mailing list:

  • Be courteous.
  • Provide a brief description of the problem and why you are trying to do that.
  • Provide a reproducible example that illustrates the problem and reproduces the error, if any.
  • Sign with your name and your affiliation.
  • Do not expect an immediate answer (although well presented questions often get answered fairly quickly).

Reproducible examples

Indispensable when asking a question to the online community, being able to write a reproducible example has many advantages:

  • It may ensure that when you present a problem, people are able to answer your question without guessing what you are trying to do.
  • Reproducible examples are not only to ask questions; they may help you in your thinking, developing or debugging process when writing your own functions.
    • For instance, when developing a function to do a certain type of raster calculation, start by testing it on a small auto-generated RasterLayer object, and not directly on your actual data that might be covering the whole world.

Example of a reproducible example

Well, one could define a reproducible example by:

  • A piece of code that can be executed by anyone who has R, independently of the data present on their machine or any preloaded variables.
  • The computation time should not exceed a few seconds and if the code automatically downloads data, the data volume should be as small as possible.

So basically, if you can quickly start an R session on your neighbour's computer while they are on a break, copy-paste the code without making any adjustments and see almost immediately what you want to demonstrate: congratulations, you have created a reproducible example.

Let's illustrate this by an example. I want to perform value replacements of one raster layer, based on the values of another raster layer. (We haven't covered raster analysis in R as part of the course yet, but you will quickly understand that for certain operations rasters are analogous to vectors of values.)

## Create two RasterLayer objects of similar extent
library(raster)
## Loading required package: sp
r <- s <- raster(ncol=50, nrow=50)
## Fill the raster with values
r[] <- 1:ncell(r)
s[] <- 2 * (1:ncell(s))
s[200:400] <- 150
s[50:150] <- 151
## Perform the replacement
r[s %in% c(150, 151)] <- NA
## Visualise the result
plot(r)

Useful to know when writing a reproducible example: instead of generating your own small data sets (vectors or RasterLayers, etc.) as part of your reproducible example, use some of R's built-in data sets. They are part of the main R packages. Some popular data sets are: cars, meuse.grid_ll, Rlogo, iris, etc. The auto-completion menu of the data() function will give you an overview of the data sets available. (In most script editing environments, including the R console and RStudio, auto-completion can be invoked by pressing the tab key; use it without moderation.)

## Import the variable "cars" in the working environment
data(cars)
class(cars)
## [1] "data.frame"
## Visualise the first six rows of the variable
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
# The plot function on this type of dataset (class = data.frame, 2 column)
# automatically generates a scatterplot
plot(cars)
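
Besides auto-completion, the overview of available data sets can also be obtained programmatically. A minimal sketch, assuming only the datasets package that ships with R (the variable name available is made up for illustration):

```r
# List the data sets shipped with the 'datasets' package
available <- data(package = "datasets")$results

# Each row describes one data set; show the first few names and titles
head(available[, c("Item", "Title")])

# Check whether a given data set exists before using it in an example
"cars" %in% available[, "Item"]

# Get a compact overview of the structure of a data set
str(cars)
```

Calling data() without arguments at the console lists the data sets of all currently attached packages.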

Another famous data set is the meuse data set, providing all sorts of spatial variables spread across a part of the Meuse watershed. The following example is compiled from the help pages of the sp package.

## Example using built-in dataset from the sp package
library(sp)
## Load required datasets
data(meuse)
# The meuse dataset is not by default a spatial object
# but its x, y coordinates are part of the data.frame
class(meuse)
## [1] "data.frame"
coordinates(meuse) <- c("x", "y")
class(meuse)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"

Now that the object belongs to a spatial class, we can plot it using one of the vector plotting functions of the sp package. See the result in the figure below.

bubble(meuse, "zinc", maxsize = 2.5,
       main = "zinc concentrations (ppm)", key.entries = 2^(-1:4))

The sp package help page contains multiple examples of how to explore its meuse built-in data set. Another example of multiple plots using meuse.grid is given in the figure below.

## Load meuse.riv dataset
data(meuse.riv)
## Create an object of class SpatialPolygons from meuse.riv
meuse.sr <- SpatialPolygons(list(Polygons(list(Polygon(meuse.riv)),"meuse.riv")))
## Load the meuse.grid dataset
data(meuse.grid)
## Assign coordinates to the dataset and make it a grid
coordinates(meuse.grid) = c("x", "y")
gridded(meuse.grid) = TRUE
## Plot all variables of the meuse.grid dataset in a multiple window spplot
spplot(meuse.grid, col.regions=bpy.colors(), main = "meuse.grid",
       sp.layout=list(
           list("sp.polygons", meuse.sr),
           list("sp.points", meuse, pch="+", col="black")
           )
       )

Good scripting/programming habits

Increasing your scripting/programming efficiency requires adopting good scripting habits. Following a couple of guidelines will ensure that your work:

  • Can be understood and used by others.
  • Can be understood and reused by you in the future.
  • Can be debugged with minimal effort.
  • Can be re-used across different projects.
  • Is easily accessible by others.

In order to achieve these objectives, you should try to follow a few good practices. The list below is not exhaustive, but already constitutes a good basis that will help you become more efficient now and in the future when working on R projects.

  • Comment your code.
  • Write functions for code you need more than once:
    • Make your functions generic and flexible, using control flow.
    • Document your functions.
  • Follow an R style guide. This will make your code more readable! Most important are:
    • Meaningful and consistent naming of files, functions, variables...
    • Indentation (like in Python: use spaces or tab to indent code in functions or loops etc.).
    • Consistent use of the assignment operator: either <- or = in all your code. The former is used by core R and allows assigning in function calls, the latter is shorter and consistent with most other programming languages.
    • Consistent placement of curly braces.
  • Make your own packages.
  • Work with projects.
  • Keep a similar directory structure across your projects.
  • Use version control to develop/maintain your projects and packages.

Note that R IDEs like RStudio make a lot of these good practices a lot easier and you should try to take maximum advantage of them. Take a moment to explore the menus of the RStudio session that should already be open on your machine. Particular emphasis is given elsewhere in this tutorial to projects, project structure and use of version control.

Below is an example of a function written with good practices and without. First the good example:

ageCalculator <- function(x) {
    # Function to calculate age from birth year
    # x (numeric) is the year you were born
    if(!is.numeric(x)) {
        stop("x must be of class numeric")
    } else { # x is numeric
        # Get today's date
        date <- Sys.Date()
        # extract year from date and subtract
        year <- as.numeric(format(date, "%Y"))
        if(year <= x) {
            stop("You aren't born yet")
        }
        age <- year - x
    }
    return(age)
}

ageCalculator(1985)
## [1] 32

32, what a beautiful age for learning geo-scripting.

Then the bad example:

## DON'T DO THAT, BAD EXAMPLE!!!
funTest_4 <- function(x) {
if( !is.numeric(x))
{
stop("x must be of class numeric"  )
 }
else {
a = Sys.Date()
b<- as.numeric( format( a,"%Y"))
b-x
}
}

funTest_4(1985)
## [1] 32

It also works, but which of the two is the easiest to read, understand, and modify if needed? ... Exactly, the first one. So let's look back at the examples and identify some differences:

  • Function name: Not very self descriptive in the second example.
  • Function description: Missing in the second example.
  • Arguments description: Missing in the second example.
  • Comments: The second example has none (okay, the first one really has a lot, but that's for the example).
  • Variables naming: use of a and b not very self descriptive in second example.
  • Indentation: Missing in the second example.
  • Control flow: Second example does not check for implausible dates.
  • Consistency: Second example uses spaces, assignment operators and curly braces inconsistently.

You haven't fully understood what control flow is or you are not fully comfortable with function writing yet? We'll see more of that in the following sections.

Function writing

A function is a sequence of program instructions that perform a specific task, packaged as a unit. This unit can then be used in programs wherever that particular task should be performed. -Wikipedia

The objective of this section is to provide some help on effective function writing. That is, functions that are:

  • simple,
  • generic, and
  • flexible.

They should integrate well in a processing/analysis chain and be easily re-used in a slightly different chain if needed. More flexibility in your function can be achieved through some easy control flow tricks. The following section develops this concept and provides examples.

Control flow

Control flow refers to the use of conditions in your code that redirect the flow in different directions depending on the values or classes of variables. Make use of that in your code, as this will make your functions more flexible and generic.
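
As a minimal sketch of the idea, the hypothetical function describe() below (not part of the lesson's exercises) takes a different branch depending on the class of its input, and stops with an informative message for anything else.

```r
# Hypothetical helper: the branch taken depends on the class of x
describe <- function(x) {
    if (is.numeric(x)) {
        # Numeric input: report the mean
        out <- sprintf("numeric, mean = %.1f", mean(x))
    } else if (is.character(x)) {
        # Character input: report the number of elements
        out <- sprintf("character, %d element(s)", length(x))
    } else {
        # Anything else is not supported
        stop("x must be numeric or character")
    }
    return(out)
}

describe(c(1, 2, 3))
## [1] "numeric, mean = 2.0"
describe(c("a", "b"))
## [1] "character, 2 element(s)"
```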

Object classes and Control flow

You have seen in a previous lesson already that every variable in your R working environment belongs to a class. You can take advantage of that, using control flow, to make your functions more flexible.

A quick reminder on classes:

# 5 different objects belonging to 5 different classes
a <- 12
class(a)
## [1] "numeric"
b <- "I have a class too"
class(b)
## [1] "character"
library(raster)
c <- raster(ncol=10, nrow=10)
class(c)
## [1] "RasterLayer"
## attr(,"package")
## [1] "raster"
d <- stack(c, c)
class(d)
## [1] "RasterStack"
## attr(,"package")
## [1] "raster"
e <- brick(d)
class(e)
## [1] "RasterBrick"
## attr(,"package")
## [1] "raster"

Controlling the class of input variables of a function

One way of making functions more auto-adaptive is by adding checks of the input variables. Using object class can greatly simplify this task. For example let's imagine that you just wrote a simple Hello World function.

HelloWorld <- function (x) {
    hello <- sprintf('Hello %s', x)
    return(hello)
}

# Let's test it
HelloWorld('john')
## [1] "Hello john"

Obviously, the user is expected to pass an object of class character to x. Otherwise the function may fail or silently return a meaningless greeting. But you can make it handle such cases gracefully and print an informative message by controlling the class of the input variable. For example:

HelloWorld <- function (x) {
    if (is.character(x)) {
      hello <- sprintf('Hello %s', x)
    } else {
      hello <- warning('Object of class character expected for x')
    }
    return(hello)
}

HelloWorld(21)
## Warning in HelloWorld(21): Object of class character expected for x
## [1] "Object of class character expected for x"

The function now handles non-character input explicitly: it issues a warning and returns the warning message instead of a greeting.
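If you would rather have the function halt on invalid input than continue with a warning, stop() can be used in the same place. The sketch below is one possible variant, not part of the original function; the name HelloWorldStrict is made up for this illustration.

```r
# Variant of HelloWorld (hypothetical name) that raises an error
# instead of a warning when x is not a character vector
HelloWorldStrict <- function (x) {
    if (!is.character(x)) {
        stop('Object of class character expected for x')
    }
    hello <- sprintf('Hello %s', x)
    return(hello)
}

HelloWorldStrict('john')
## [1] "Hello john"

# Calling HelloWorldStrict(21) directly would stop with an error.
# Wrapping the call in try() lets us inspect the failure without halting the script.
res <- try(HelloWorldStrict(21), silent=TRUE)
class(res)
## [1] "try-error"
```

Whether a warning or an error is more appropriate depends on whether the rest of your processing chain can sensibly continue without a valid result.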

Note that most common object classes have their own logical function that returns TRUE or FALSE. For example:

is.character('john')
## [1] TRUE
# is equivalent to 
class('john') == 'character'
## [1] TRUE
is.character(32)
## [1] FALSE
is.numeric(32)
## [1] TRUE

You should always try to take maximum advantage of these small utilities and check for classes and properties of your objects.

Also note that is.character(32) == TRUE is equivalent to is.character(32). Therefore when checking logical arguments, you don't need to use the == TRUE. As an example, a function may have an argument (say, plot) that, if set to TRUE will generate a plot, and if set to FALSE does not generate a plot. It means that the function certainly contains an if statement. if(plot) in that case is equivalent to if(plot == TRUE), it's just shorter (and very slightly faster).

Here is an example with a function that subtracts two RasterLayers, with the option to plot the resulting RasterLayer or not.

library(raster)
## Function to subtract 2 rasterLayers
minusRaster <- function(x, y, plot=FALSE) { 
    z <- x - y
    if (plot) {
        plot(z)
    }
    return(z)
}

# Let's generate 2 rasters
# The first one is the R logo raster
# converted to the raster package file format.
r <- raster(system.file("external/rlogo.grd", package="raster")) 
# The second RasterLayer is derived from the initial RasterLayer in order
# to avoid issues of non matching extent or resolution, etc
r2 <- r
## Filling the rasterLayer with new values
# The /10 simply makes the result more spectacular
r2[] <- (1:ncell(r2)) / 10
## Simply performs the calculation
r3 <- minusRaster(r, r2) 
## Now performs the calculation and plots the resulting RasterLayer
r4 <- minusRaster(r, r2, plot=TRUE) 

try and debugging

Use of try for error handling

The try() function may help you write functions that do not stop with a cryptic error whenever they encounter unexpected input of any kind. Anything (sub-function, piece of code) that is wrapped in try() will not interrupt the bigger function that contains try(). For instance, this is useful if you want to apply a function sequentially but independently over a large set of raster files, and you already know that some of the files are corrupted and might return an error. By wrapping your function in try() you allow the overall process to continue until its end, regardless of the success of individual layers. So try() is a convenient way to deal with heterogeneous/unpredictable input data.

Also, try() returns an object of a different class (try-error) when it fails. You can take advantage of that at a later stage of your processing chain to make your function more adaptive. See the example below, which illustrates the use of try() for sequentially calculating frequency on a list of auto-generated RasterLayers.

library(raster)

## Create a raster layer and fill it with "randomly" generated integer values
a <- raster(nrow=50, ncol=50)
a[] <- floor(rnorm(n=ncell(a)))

# The freq() function returns the frequency of a certain value in a RasterLayer
# We want to know how many times the value -2 is present in the RasterLayer
freq(a, value=-2)
## [1] 315
# Let's imagine that you want to run this function over a whole list of RasterLayers
# but some elements of the list are unpredictably corrupted
# so the list looks as follows
b <- a
c <- NA
list <- c(a,b,c)
# In that case, b and a are raster layers, c is ''corrupted''
## Running freq(c) would return an error and stop the whole process
out <- list()
for(i in 1:length(list)) {
    out[i] <- freq(list[[i]], value=-2)
}
# Therefore by building a function that includes a try()
# we are able to catch the error,
# allowing the process to continue despite missing/corrupted data.
fun <- function(x, value) {
    tr <- try(freq(x=x, value=value), silent=TRUE)
    if (inherits(tr, 'try-error')) {
        return('This object returned an error')
    } else {
        return(tr)
    }
}

## Let's try to run the loop again
out <- list()
for(i in 1:length(list)) {
    out[i] <- fun(list[[i]], value=-2)
}
out
## [[1]]
## [1] 315
## 
## [[2]]
## [1] 315
## 
## [[3]]
## [1] "This object returned an error"
# Note that using a function of the apply family would be a more
# elegant/shorter way to obtain the same result
(out <- sapply(X=list, FUN=fun, value=-2))
## [1] "315"                           "315"                          
## [3] "This object returned an error"
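The same pattern can be tried without the raster package. The sketch below uses only base R; the function safeSqrt and the input list are invented for this illustration. It tests for the try-error class with inherits(), which keeps working when an object has more than one class, and it returns NA on failure so that the result stays numeric.

```r
# Hypothetical example: take the square root of each element of a list,
# where one element is "corrupted" (a character string instead of a number)
inputs <- list(4, 9, 'corrupted')

safeSqrt <- function(x) {
    tr <- try(sqrt(x), silent=TRUE)
    if (inherits(tr, 'try-error')) {
        # sqrt() failed: return NA instead of stopping
        return(NA_real_)
    } else {
        return(tr)
    }
}

# sapply() runs over all elements; the failing one yields NA
(res <- sapply(X=inputs, FUN=safeSqrt))
## [1]  2  3 NA
```

Returning NA rather than a message string is a design choice: it avoids the coercion of all results to character that happens when sapply() combines numbers and strings.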

Function debugging

Debugging a single line of code is usually relatively easy; simply double checking the classes of all input arguments often gives good pointers to why the line crashes. But when writing more complicated functions where objects created within the function are reused later on in that same function or in a nested function, it is easy to lose track of what is happening, and debugging can then become a nightmare. A few tricks can be used to make that process less painful.

traceback()

Here are the manual commands, which also work with RStudio and other IDEs:

  • The first thing to investigate right after an error occurs is to run the traceback() function; just like that without arguments.
  • Carefully reading the return of that function will tell you where exactly in your function the error occurred.
foo <- function(x) {
    x <- x + 2
    print(x)
    bar(2) 
}

bar <- function(x) { 
    x <- x + a.variable.which.does.not.exist 
    print(x)
}

foo(2) 
## gives an error

traceback()
## 2: bar(2) at #1
## 1: foo(2)
# Ah, bar() is the problem

For another example see: rfunction.com.

RStudio tools

RStudio has great debugging tools. However, they are specific to the RStudio IDE.

  • To force them to catch every error, select Debug - On Error - Break in Code in the main menu.
  • Run foo(2) again.
  • RStudio will stop the execution where the error happened. The traceback appears in a separate pane on the right.
  • You can use the little green "Next" button to go line by line through the code, or the red Stop button to leave the debugging mode.
  • Reset the On Error behaviour to Error Inspector. In this default setting, RStudio will try to decide whether the error is complex enough for debugging, and then offer the options to "traceback" or "rerun the code with debugging" with two buttons in the console.
  • Finally solve the problem:
    ## redefine bar
    bar <- function(x) {
        x + 5
    }
    foo(2)
    ## [1] 4
    ## [1] 7

The debugging mode can also be reached by calling debug(). This is not covered in this lesson; feel free to explore this debugging functionality by yourself. Refer to the reference section of this document for further information on function debugging.

(optional) Writing packages

The next step towards re-usable code is packaging it, so others can simply install and use it. If you followed the steps up to here, this step is not very big anymore! For this course, it is optional. Find instructions here and in the references below.

Exercise

Your task

Create an RStudio project with git version control. The project should contain a simple function to calculate whether or not a year is a leap year. Use control flow, and provide some examples of how the function works in main.R. The function should behave as follows:

> is.leap(2000)
[1] TRUE

> is.leap(1581)
[1] "1581 is out of the valid range"

> is.leap(2002)
[1] FALSE

> is.leap('john') #should return an error 
Error: argument of class numeric expected

Useful resources

Assessment

Assessment will consider whether the function works as intended, but also its readability and completeness (try as much as possible to use all good practices mentioned in this lecture). The structure of your project and the appropriate use of git will also be assessed.

How to submit?

Put your project on your GitHub account and post the clone URL in the code review forum on Blackboard before the end of the day (feedback before 10 am the next day).

To review other teams' answers:

  • clone the repo you have to review to your computer and test it
  • you can either post your remarks in the review forum on Blackboard, or try out the GitHub issues feature

References