WUR Geoscripting

Week 1, Lesson 3: Carrying out your R project

Good morning! Here is what you will do today:

Time                   Activity
Until 11:00            Review the other team’s answer to yesterday’s exercise
Morning                Self-study: go through the following tutorial
14:00 to 15:00         Presentation and discussion
Rest of the afternoon  Do/finalise the exercise

Introduction

During the previous lecture, you saw some general aspects of the R language, such as an introduction to the syntax, object classes, reading of external data and function writing.

Today it’s about carrying out a geoscripting project. This tutorial is about R, but a lot of it can be applied to other languages!

Scripting means that you often go beyond easy things and therefore face challenges. It is normal that you will have to look for help. This lesson will guide you through ways of finding help. It continues with a couple of “good practices” for scripting, debugging and geoscripting projects. This includes using version control and project management.

Learning objectives

At the end of the lecture, you should be able to

  • Use version control to develop, maintain, and share your code with others
  • Find help for R related issues
  • Produce a reproducible example
  • Adopt some good scripting/programming habits
  • Use control flow for efficient function writing

Version control

Important note: you need to have git installed and properly configured on your computer to do the following. Visit the system setup page for more details. Git is preinstalled in the PC lab and on virtual machines already.

What is version control?

Have you ever worked on a project and ended up having so many versions of your work that you didn’t know which one was the latest, and what were the differences between the versions? Does the image below look familiar to you? Then you need to use version control (also called revision control). You will quickly understand that although it is designed primarily for big software development projects, being able to work with version control can be very helpful for scientists as well.

file name

The video below explains some basic concepts of version control and what the benefits of using it are.

What is VCS? (Git-SCM) • Git Basics #1 from GitHub on Vimeo.

So to sum up, version control allows you to keep track of:

  • When you made changes to your files
  • Why you made these changes
  • What you changed

Additionally, version control:

  • Facilitates collaboration with others
  • Allows you to keep your code archived in a safe place (the cloud)
  • Allows you to go back to previous versions of your code
  • Allows you to find out what changes broke your code
  • Allows you to have experimental branches without breaking your code
  • Allows you to keep different versions of your code without having to worry about file names and archiving organization

The three most popular version control systems are Git, Mercurial (abbreviated as hg) and Subversion (abbreviated as svn). Git is by far the most modern and popular one, so we will only use Git in this course.

Git

What git does

Git keeps track of changes in a local repository you set up on your computer. Typically that is a folder that contains all your code and optionally the data your code needs in order to run. The local repository contains all your files, but also (in a hidden folder) all the changes to the files you have made. It does not keep track of all files automatically: you need to tell git which files to track and which not. Therefore a repository contains your current tracked files (workspace), an index of files that are being tracked, and the version history.

Every time you make significant changes to the files in your workspace, you have to add the changed files to the index, which selects the files whose changes you want to save, and commit them, which means saving the changes to the history tracking of your local repository.

Often you also set up a remote repository, stored on an online platform like GitHub, GitLab or others. It is simply a remotely-hosted mirror of your local repository and allows you to have your work stored in a safe place and accessible from your other computers and to potential collaborators. Once in a while (at the end of the day, or after every new commit if you want) you can push your commits, which means sending them to the remote repository so it stays in sync with your local one. When you want to update your local repository based on the content of a remote repository, you have to pull the commits from the remote repository.

Summary of git semantics

  • add: Tell git that you want a file or changes to be tracked. These files/changes are not yet saved in the repository! They are listed as “staged” in the index or staging area for the next commit.
  • commit: Save the staged changes to your local repository. This is like putting a milestone or taking a snapshot of your project at that moment. A commit describes what has been changed, why and when. In the future you can always revert all tracked files to the state they were at when you created the commit.
  • push: Send previous changes you committed to the local repository to the remote repository.
  • pull: Update your local repository (and your workspace) with all new commits from the remote repository. This command is simple, but it modifies your workspace in a single step by merging the remote changes into your files, which can result in merge conflicts. It is not available in the Git GUI.
    • fetch: Get information about the latest commits from the remote repository, but do not apply them to your local repository automatically. This is always safe as it does not change your workspace.
    • merge: Merges two versions (branches) into one, applying the result to the workspace. This includes merging commits from the remote repository with the commits of the local repository. In effect, a fetch followed by a merge is the same as a pull, but it allows you more fine-grained control and is available through the Git GUI.
  • clone: Copy the content of a remote repository locally for the first time.
  • more advanced:
    • branch: Create a branch (a parallel version of the code in the repository)
    • checkout: Load the status of a branch into your workspace
git flows
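
Protip: The vocabulary above can be tried out safely from the command line. The following is only a minimal sketch: it assumes a temporary directory created with mktemp, uses a local bare repository as a stand-in for the remote GitLab project, and sets a made-up identity (Example Student) through environment variables so that your real configuration is left alone. The file name hello.R is hypothetical as well.

```shell
# Throwaway identity, so the sketch does not touch your real git configuration.
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# A bare repository in a temporary directory stands in for the GitLab project.
demo=$(mktemp -d)
git init -q --bare "$demo/remote.git"

# clone: copy the (still empty) remote repository locally.
git clone -q "$demo/remote.git" "$demo/work"
cd "$demo/work"

# add: stage a new file; commit: save the staged change to the local history.
echo 'print("Hello")' > hello.R
git add hello.R
git commit -q -m "Add hello script"

# push: send the new commit to the remote repository.
git push -q origin HEAD

# fetch: ask the remote for new commits (there are none, so nothing changes).
git fetch -q origin

# The history of the local repository now contains one commit.
git log --oneline
```

With a real GitLab project only the clone address changes; the add, commit, push and fetch steps stay the same.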

Setting up a Git project

Effective use of git includes two components: local software to manage the files on your computer (git client) and an online git hosting service to make them centrally accessible. While git is a single system, there is a variety of clients and a variety of hosts.

In this course, we will primarily use Git GUI as the client. It is a simple client that is included with Git itself, and is language-agnostic. There are more graphical clients as well, including one integrated into RStudio itself, but these clients are outside the scope of this course.

Protip: For those who are comfortable with working from the terminal, the command line client is often the most efficient choice. Knowing how to use git from the command line is also useful when working on cloud virtual machines/servers for big data processing. So in protip boxes like this you will find command line equivalents to the GUI actions we will perform; you are not required to know them.

GitHub is the most popular host, but in order to facilitate the assignment submission process, we will use the GitLab instance of Wageningen University throughout this course. Note that the university instance is the same as what you can find on the GitLab website, except that it is managed by the university’s IT department (so you do not need to register) and you may choose to make projects visible only to others from the university.

Note: If you are taking the course externally and do not have access to the WUR GitLab instance, you can use the public GitLab instance or GitHub instead.

Client setup

  1. Launch Git GUI

Launch a program called Git GUI from your start menu. Git GUI is a graphical interface to Git that comes with Git itself, and is thus cross-platform and always available. When launched, it looks something like this:

Main screen of Git GUI

Note: External users should download git and install it to obtain Git GUI.

Protip: You can launch Git GUI from the terminal with git gui.

  2. Create an SSH key pair

In order for GitLab (and other services) to identify that the machine connecting to it is indeed owned by you, there are two options: using a password, or using an SSH key. SSH keys are much more secure than passwords, and they do not require you to enter a password every time you try to communicate with the server. Therefore throughout the course we will use SSH keys.

You can generate a new SSH key pair in Git GUI by going to Help → Show SSH Key and pressing the Generate Key button. It will ask you for a passphrase. This is not a password and is completely optional: it is useful in case your SSH key is stolen, for instance by a thief stealing your laptop or by a virus; however, the private key is specific to each machine and is never sent over the network, so most of the time it is completely fine to leave the passphrase empty. If you keep it empty, you will not need to enter it every time you try to push your changes, yet the connection will still be more secure than when using a password.

Once done, you will see your new public key:

SSH public key generated

Protip: From the command line you generate a key pair by running ssh-keygen -t rsa -b 4096. In both cases, by default the public key is stored in the file ~/.ssh/id_rsa.pub (where ~ indicates the user’s home directory).

Account setup

Next, we will link our client with a Git host so that we can download and upload repositories.

  3. Log into GitLab

Go to WUR GitLab and log in. The username is your WUR email address and the password is your WUR account password.

  4. Enrol the public key to your user account

The SSH key pair is used to identify that you own the machine. On WUR Windows PCs, the keys are stored on your M: drive, so they will follow you on any computer in the university. On WUR Linux virtual desktop instances (VDIs) via MyWorkspace/VMware Horizon, the keys are stored on your personal VDI, so you can also access them from anywhere.

Now you need to tell GitLab about your new key. To do that, copy the public key from the dialog, then in GitLab click on your avatar in the top right and go to Settings → SSH keys. Give it a title describing your machine, paste the public key in the box, and press Add key. You might need to confirm the key by email for added security.

This only has to be done once (per machine/OS you use GitLab on).

Creating a new project

  5. Create remote repository

Now we are ready to start making new repositories for our projects! In GitLab, press the New… button (“⊞” button at the top, to the right) and select New project. Give it a descriptive name and a short description, choose the visibility of the project and check Initialize repository with a README.

New project creation on GitLab

If you were to do this on GitHub, you would also be asked to provide a license for your code. That is a good idea in general, as choosing a license is crucial to let others know what you allow them to do with your code. Code without a license is under copyright by default, and thus nobody is allowed to make use of your code or contribute to it. For real projects, you will want to set a more permissive license so that others can make use of your code. See Choose a License for a quick overview of what licenses are available.

  6. Configure project settings

Explore your new blank project a bit. On the left sidebar, you can find that the project can have issues and merge requests assigned to it. Issues are what will be used to review your work, and what you will need to use to review the work of others, so try to make a few issues and close them.

Next, check out the project settings. Under the Members tab of Settings, you can invite other people to collaborate on your project. Go ahead and invite your team member (and give them the Maintainer role).

Example issue on GitLab

  7. Get the URL of your new repository

Now that you have a remote repository, it’s time to create a local repository that links to it! Open the main page of your new project, click the blue Clone button at the top right of the page, and copy the Clone with SSH address of your new repository.

Blank GitLab repository

  8. Clone your repository

Go back to Git GUI, and press Clone Existing Repository. Paste the URL you just copied into the Source Location field, and in the Target Directory field choose the folder in which you want to store your code. Note: the Target Directory must not already exist! Git GUI will create it for you.

Once you click Clone, you will get a question about whether you trust the remote machine (if you ran git gui from a command line, it will appear stuck, but actually the question will appear in the terminal and you need to answer it there). You need to answer this with yes (the full word). This puts the GitLab server into a list of trusted servers, to guard against potential impostor servers.

You will end up in an empty Git GUI window:

Git GUI in an empty directory

Protip: From the command line, cd into the directory you want to clone into, and run git clone <url>, e.g. git clone git@git.wur.nl:you001/example.git. The repository will be cloned into a subdirectory with a matching name.

  9. Tell Git who you are

Before you start using Git, you should tell it what your name and email address are. You need to do that only once per Git installation. Go to Edit → Options… and fill out the Global (All Repositories) options User Name and Email Address. These will be displayed on GitLab.

Protip: To set your user name and email from the command line:

git config --global user.name "Your Name"
git config --global user.email you@example.com

Working with Git GUI

  10. Make changes

To see Git in action, you need to make some changes in your repository. Try it by creating a new R script file in the directory where you cloned your new project.

Once you are done, go back to Git GUI. If you closed the window, you can get back to your repository by launching Git GUI and clicking on its path in the Open Recent Repository list. If you did not close it, click the Rescan button. You will see some changes:

Changes pending in Git GUI

Protip: To see a list of files with pending changes from the command line, use git status while in a git repository. To see what exactly changed in each of these files, use git diff.

At the top left corner, in the Unstaged Changes panel, you can see all the files that changed in your workspace. If you click on the name of a file, the main panel will show you what changed since the last commit, unless it is a non-text (data) file, in which case it will just note that something has changed. Note: Git is very efficient at storing changes in text files, because successive versions compress well and can be packed as small differences. However, it does not deal efficiently with non-text files, and thus you should limit the amount and size of such files as much as possible.

If you click on the icon of the file in the Unstaged Changes panel, the file changes will be staged and appear in the Staged Changes (Will Commit) panel. These are the file changes you want to save and send to GitLab. You don’t have to stage all files for each commit, only those you actually want to be tracked by git. You can safely ignore some files such as manual backups, temporary files, and the like and they will remain untracked by git, as long as you never stage them. If you do want to stage everything, you can press the Stage Changed button. If you staged more than you wanted to, you can click on the file icon in the Staged Changes panel to unstage it.

Remember: clicking the name of the file shows the changes you made, clicking the icon of the file stages or unstages the change!

Protip: To stage a change from the command line, use git add path/to/file.ext where path/to/file.ext is the file you want to stage. To unstage, use git reset HEAD path/to/file.ext.
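
Protip: Staging and unstaging can be sketched in a throwaway repository as follows. The temporary directory, the file names (analysis.R, notes.tmp) and the identity are made up for illustration. An initial commit is created first, because git reset HEAD needs an existing commit to refer to.

```shell
# Throwaway identity, so the sketch does not touch your real git configuration.
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# New repository with one initial commit.
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "# Example project" > README.md
git add README.md
git commit -q -m "Initial commit"

# Two new files: a script we want to track and a scratch file we do not.
echo 'x <- 1' > analysis.R
echo 'scratch notes' > notes.tmp

# Stage both files, then change our mind and unstage the scratch file.
git add analysis.R notes.tmp
git reset -q HEAD notes.tmp

# analysis.R is listed as staged (A), notes.tmp as untracked (??).
git status --short
```

Only analysis.R would be part of the next commit; notes.tmp stays in the folder but remains untracked.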

  11. Commit changes

Once you have staged the files that you want to commit, you need to fill out the commit message. This is a brief description of the changes you made between the last commit and the one you are about to create. The first line you enter is the title of the commit; keep it short. Subsequent lines are the description. You may notice that the Commit message box does not have a horizontal scrollbar: that is intentional, because your commit message should fit within that box without the need for scrolling. Use new lines to break the text.

If it is the first time you use Git GUI to make a commit, and you haven’t filled out your user name and email, it might complain about it not knowing who you are. In that case go back to step 9.

Next press the Commit button and your commit will be saved locally. A commit is like a saved state: you are always able to roll back the contents of your tracked files to the state they were in when you committed the changes.

Protip: To commit a change from the command line, use git commit. If you want to stage all tracked and changed files and commit, use git commit -a.

In the case you made a mistake (a mistake in the message, forgot to stage something, etc.), you can press the Amend Last Commit button and get right back to where you were when you made the last commit; but use this functionality very sparingly, as it does not work with changes that have already been pushed to GitLab.

Protip: To amend the last commit from the command line, use git commit --amend.
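
Protip: A commit with a title and a description, followed by an amend, could look like the sketch below. It runs in a temporary repository with a made-up identity and a hypothetical file analysis.R; the first -m gives the title and the second -m the description.

```shell
# Throwaway identity, so the sketch does not touch your real git configuration.
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

repo=$(mktemp -d)
cd "$repo"
git init -q

# Stage a file and commit it; note the typo in the title.
echo 'x <- 1' > analysis.R
git add analysis.R
git commit -q -m "Add analyis script" -m "Defines the starting value x used in later steps."

# The commit has not been pushed yet, so it is safe to amend it.
git commit -q --amend -m "Add analysis script" -m "Defines the starting value x used in later steps."

# There is still a single commit, now with the corrected title.
git log --format='%s'
```

Amending replaces the last commit instead of adding a new one, which is why it should only be used before pushing.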

  12. Push changes to the server

Press the Push button, and confirm the push, to send all your changes to your GitLab repository. You can now refresh the GitLab page to see your changes. Well done!

GitLab repository with content

Protip: To push changes from the command line, use git push.

  13. Pull changes from the server

One of the major uses of Git is collaboration and the ability to synchronise changes across different devices. Multiple users can make changes in the same Git repository (as long as you change the repository settings in GitLab to allow another user to do that), and you can work on the same code on different devices yourself. In both cases, it is important to keep all local repositories in sync with the remote repository. That is done via Git GUI by using Fetch and Merge. If you like, you can test it by cloning the same repository in another folder, making changes and pushing them to the server, then using fetch in the other copy.

If there are any changes on GitLab that are not in your local copy yet, in Git GUI go to Remote → Fetch from → origin to download all changes. This will not apply them yet, however.

To attempt to apply the changes, go to Merge → Local Merge…. If all goes well, the changes will be applied.

Protip: To do a fetch and merge together from the command line, use git pull.
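
Protip: The test suggested above (two copies of the same repository) can be sketched from the command line. This is an illustration under assumptions: a local bare repository in a temporary directory plays the role of the server, the copies are called copy1 and copy2, and the file names and identity are made up.

```shell
# Throwaway identity, so the sketch does not touch your real git configuration.
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# A bare repository stands in for the server.
demo=$(mktemp -d)
git init -q --bare "$demo/remote.git"

# First copy: make an initial commit and push it.
git clone -q "$demo/remote.git" "$demo/copy1"
cd "$demo/copy1"
echo 'x <- 1' > main.R
git add main.R
git commit -q -m "Add main script"
git push -q origin HEAD

# Second copy: cloned after the first push, so it starts with that commit.
git clone -q "$demo/remote.git" "$demo/copy2"

# A new commit is pushed from the first copy.
echo 'y <- 2' > helper.R
git add helper.R
git commit -q -m "Add helper script"
git push -q origin HEAD

# In the second copy, fetch downloads the commit without touching the workspace.
cd "$demo/copy2"
git fetch -q origin
ls    # helper.R is not there yet

# Merge applies the fetched commit to the current branch.
branch=$(git rev-parse --abbrev-ref HEAD)
git merge -q "origin/$branch"
ls    # helper.R has now appeared
```

Here the merge is a simple fast-forward, because the second copy had no commits of its own.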

There may be cases where files go out of sync in incompatible ways, however, like two people editing one file at the same time. In that case you may hit a merge conflict. It is best to try to avoid them. If it happens, you need to go through the conflicting files in a text editor and edit them by hand, keeping the parts of the files you need. The conflicting parts are delimited by marker lines starting with <<<<<<<, ======= and >>>>>>>. Once you remove the parts you don’t need (including the marker lines), you can resolve the conflict by staging and committing the changes.
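
Protip: To see what a merge conflict looks like without involving a second person, you can provoke one on purpose. In the sketch below, a branch called experiment (a hypothetical name) stands in for a collaborator who edited the same line of main.R; the repository lives in a temporary directory and the identity is made up.

```shell
# Throwaway identity, so the sketch does not touch your real git configuration.
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

repo=$(mktemp -d)
cd "$repo"
git init -q
echo 'x <- 1' > main.R
git add main.R
git commit -q -m "Add main script"
main=$(git rev-parse --abbrev-ref HEAD)   # name of the default branch

# The "collaborator" changes the line on a separate branch.
git checkout -q -b experiment
echo 'x <- 2' > main.R
git commit -q -a -m "Set x to 2"

# Meanwhile the same line is changed differently on the default branch.
git checkout -q "$main"
echo 'x <- 3' > main.R
git commit -q -a -m "Set x to 3"

# The merge stops with a conflict; the file now contains both versions
# between marker lines starting with <<<<<<<, ======= and >>>>>>>.
git merge experiment || true
cat main.R
markers=$(grep -c '^<<<<<<<' main.R)

# Resolve by hand: keep one version, drop the markers, stage and commit.
echo 'x <- 3' > main.R
git add main.R
git commit -q -m "Merge experiment, keeping x <- 3"
```

In a real conflict you would edit the file in a text editor rather than overwrite it with echo.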

Other Git GUI functionality

You might run into a situation when you have made changes in tracked files, but do not want to keep some of the changes. You can revert one file by selecting it in Git GUI, then clicking Commit → Revert changes.

Protip: The command line equivalent is git checkout -- path/to/file.ext, or if you want to reset all changed files, git reset --hard.
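
Protip: Discarding an unwanted change in a single file could look like this. The sketch uses a temporary repository, a hypothetical file main.R and a made-up identity.

```shell
# Throwaway identity, so the sketch does not touch your real git configuration.
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

repo=$(mktemp -d)
cd "$repo"
git init -q
echo 'x <- 1' > main.R
git add main.R
git commit -q -m "Add main script"

# An edit we regret.
echo 'x <- 999' > main.R
git status --short       # main.R is listed as modified (M)

# Throw the edit away and go back to the committed version of the file.
git checkout -- main.R
cat main.R               # x <- 1
```

Be careful: the discarded edit was never committed, so it cannot be recovered through git afterwards.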

Git GUI not only provides a way to make, push and pull commits, but also lets you visualise the commit history of your repository as a tree graph. Go to Repository → Visualise Master’s History to see it. For larger and more complex projects with lots of contributors and merges, it might look like some sort of a subway map:

Git GUI history (gitk)

Protip: The command line equivalent is git log.

The history view also allows you to reset the state of the repository to any previous commit by using the context menu. Note, however, that you can only push if you are on the latest commit. So the easiest way to revert changes is to reset to the old commit, copy the files you need to a temporary directory outside of the repository, reset back to the latest commit, and move the files back into your repository.

Protip: A few more options are available from the command line. git revert <commit> will undo changes from a given commit, where <commit> is the commit ID (you can get commit IDs from git log, they look like a long string of letters and numbers). git checkout <commit> -- path/to/file.ext will reset a single file to the state it was at the given commit.
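
Protip: The two commands above can be tried in a small throwaway history. This is a sketch under assumptions: the repository lives in a temporary directory, and the file names, commit messages and identity are made up. The option --no-edit makes git revert use its default commit message instead of opening an editor.

```shell
# Throwaway identity, so the sketch does not touch your real git configuration.
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

repo=$(mktemp -d)
cd "$repo"
git init -q

# Build a history of three commits.
echo 'x <- 1' > main.R
git add main.R
git commit -q -m "Add main script"
first=$(git rev-parse HEAD)          # remember the ID of the first commit

echo 'x <- 2' > main.R
git commit -q -a -m "Change x to 2"

echo 'y <- 10' > extra.R
git add extra.R
git commit -q -m "Add extra script"

# Undo the latest commit: a new commit is added that removes extra.R again.
git revert --no-edit HEAD

# Bring main.R back to the state it had in the first commit, then commit that.
git checkout "$first" -- main.R
git commit -q -m "Restore main.R to its first version"

# Nothing was erased: the history now has five commits.
git log --oneline
```

Both approaches add new commits on top of the history, so they remain safe to use after pushing.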

You can also browse the history of a repository from your Git hosting service, and GitLab/GitHub even allow editing files from a web interface.

Question: How do you find commit history and old versions of your files on GitHub/GitLab?

Project structure

Try to keep a consistent structure across your projects, so that it is easier for you to switch from one project to the other and immediately understand how things work. You may use the following structure:

  • A main.R script at the root of the project. This script performs step by step the different operations of your project. It is the only non-generic part of your project (it contains paths, already set variables, etc).
  • An R/ subdirectory: This directory should contain the functions you have defined as part of your project. These functions should be as generic as possible and are sourced and called by the main.R script.
  • A data/ subdirectory: This directory contains data sets of the project. Since Git is not as efficient with non-text files, and GitHub has storage limits, you should only put small data sets in that directory (<2-3 MB). These can be shapefiles, small rasters, csv files, but perhaps even better is to use the R archives. R offers two types of archives to store the important variables of the environments, .rda and .rds.
  • An output/ subdirectory (when applicable).
project structure
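
Protip: The structure above can be created from the command line in one go. In this sketch the project is placed in a temporary directory and called myProject; both are assumptions for illustration, so replace them with the folder of your cloned repository.

```shell
# Hypothetical project location; use your cloned repository instead.
project="$(mktemp -d)/myProject"

# Create the three subdirectories and an empty main script.
mkdir -p "$project/R" "$project/data" "$project/output"
touch "$project/main.R"

# List the result.
ls -F "$project"
```

Keep in mind that git only tracks files, so empty directories will not show up in Git GUI until they contain at least one file.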

Example main.R file

Typically the header of your main script will look like this:

# John Doe
# January 2017
# Import packages
library(raster)
library(sp)
# Source functions
source('R/function1.R')
source('R/function2.R')
# Load datasets 
load('data/input_model.rda')

# Then the actual commands

Bigger data

The data/ directory of your project should indeed only contain relatively small data sets. Bigger remote sensing data sets should stay outside the project, in the place where you store the rest of your data.

Example

  • Create 3 files in your R/ directory (ageCalculator.R, HelloWorld.R and minusRaster.R) in which you will copy paste the respective functions.
  • Create a main.R script at the root of your project and add some code to it. The content of the main.R in that case could be something as below.
# Name
# Date
library(raster)

source('R/ageCalculator.R')
source('R/HelloWorld.R')
source('R/minusRaster.R')


HelloWorld('john')
ageCalculator(2009)

# import dataset
r <- raster(system.file("external/rlogo.grd", package="raster")) 
r2 <- r 
# Filling the rasterLayer with new values.
r2[] <- (1:ncell(r2)) / 10
# Performs the calculation
r3 <- minusRaster(r, r2) 

RStudio projects

RStudio has a functionality called projects that allows you to organise your files a bit better. You may have learned that one of the first things to do when opening an R session is to set your working directory, using the setwd() command. When creating a new project, the working directory is automatically set to the root of the project, where your main.R is located. When working with RStudio projects you should not change the working directory. If you want to access files stored in your data/ subdirectory, simply prepend data/ to the name of the file you want to load.

Note: RStudio projects are specific to RStudio and are not usable with base R or with other R IDEs. The use of RStudio projects is optional and is merely for convenience. When using other IDEs you can assume the user will set the working directory to where the script is located.

RStudio itself has integration with Git, and when creating a project there is an option to make it a git repository as well. However, in this lesson we will not be using this method, since it is specific to RStudio and does not work for Python or in other R IDEs. Git GUI, in contrast, is language-agnostic and standalone. So if you do create a project with RStudio, create it inside your cloned GitLab repository, and do not select the option to create a new git repository; then use Git GUI to handle all the changes you make in the repository. This will save you confusion about how to handle it when we come to the Python lessons.

Finding help

Sources for help

The most important helper is the R documentation. In the R console, just enter ?function or help(function) to get the manual page of the function you are interested in.

There are many places where help can be found on the internet. So in case the function or package documentation is not sufficient for what you are trying to achieve, a search engine like Google is your best friend. Most likely by searching the right key words relating to your problem, the search engine will direct you to the archive of the R mailing list, or to some discussions on Stack Exchange. These two are reliable sources of information, and it is quite likely that the problem you are trying to figure out has already been answered before.

However, it may also happen that you discover a bug or something that you would qualify as abnormal behavior, or that you really have a question that no one has ever asked (corollary: has never been answered). In that case, you may submit a question to one of the R mailing lists. For general R questions there is a general R mailing list, while the spatial domain has its own mailing list (R SIG GEO). Geo related questions should be posted to this latter mailing list.

Note: these mailing lists have heavy mail traffic; use your mail client efficiently and set filters, otherwise the volume will quickly become a nuisance.

These mailing lists have a few rules, and it’s important to respect them in order to ensure that:

  • no one gets offended by your question,
  • people who are able to answer the question are actually willing to do so,
  • you get the best quality answer.

So, when posting to the mailing list:

  • Be courteous.
  • Provide a brief description of the problem and why you are trying to do that.
  • Provide a reproducible example that illustrates the problem and reproduces the error, if there is one.
  • Sign with your name and your affiliation.
  • Do not expect an immediate answer (although well presented questions often get answered fairly quickly).

Reproducible examples

Indispensable when asking a question to the online community, being able to write a reproducible example has many advantages:

  • It may ensure that when you present a problem, people are able to answer your question without guessing what you are trying to do.
  • Reproducible examples are not only to ask questions; they may help you in your thinking, developing or debugging process when writing your own functions.
    • For instance, when developing a function to do a certain type of raster calculation, start by testing it on a small auto-generated RasterLayer object, and not directly on your actual data that might be covering the whole world.

Example of a reproducible example

Well, one could define a reproducible example by:

  • A piece of code that can be executed by anyone who has R, independently of the data present on their machine or any preloaded variables.
  • The computation time should not exceed a few seconds and if the code automatically downloads data, the data volume should be as small as possible.

So basically, if you can quickly start an R session on your neighbour’s computer while they are on a break, copy-paste the code without making any adjustments and see almost immediately what you want to demonstrate: congratulations, you have created a reproducible example.

Let’s illustrate this by an example. I want to perform value replacements of one raster layer, based on the values of another raster layer. (We haven’t covered raster analysis in R as part of the course yet, but you will quickly understand that for certain operations rasters are analogous to vectors of values.)

## Create two RasterLayer objects of similar extent
library(raster)
## Loading required package: sp
r <- s <- raster(ncol=50, nrow=50)
## Fill the raster with values
r[] <- 1:ncell(r)
s[] <- 2 * (1:ncell(s))
s[200:400] <- 150
s[50:150] <- 151
## Perform the replacement
r[s %in% c(150, 151)] <- NA
## Visualise the result
plot(r)

Useful to know when writing a reproducible example: instead of generating your own small data sets (vectors or RasterLayers, etc.) as part of your reproducible example, you can use some of R’s built-in data sets. They are part of the main R packages. Some popular data sets are: cars, meuse.grid_ll, Rlogo, iris, etc. The auto-completion menu of the data() function will give you an overview of the data sets available.

Protip: In most script editing environments, including the R console and RStudio, auto-completion can be invoked by pressing the tab key; use it without moderation.

## Import the variable "cars" in the working environment
data(cars)
class(cars)
## [1] "data.frame"
## Visualise the first six rows of the variable
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
# The plot function on this type of dataset (class = data.frame, 2 column)
# automatically generates a scatterplot
plot(cars)

Another famous data set is the meuse data set, providing all sorts of spatial variables spread across a part of the Meuse watershed. The following example is compiled from the help pages of the sp package.

## Example using built-in dataset from the sp package
library(sp)
## Load required datasets
data(meuse)
# The meuse dataset is not by default a spatial object
# but its x, y coordinates are part of the data.frame
class(meuse)
## [1] "data.frame"
coordinates(meuse) <- c("x", "y")
class(meuse)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"

Now that the object belongs to a spatial class, we can plot it using one of the vector plotting functions of the sp package. See the result in the figure below.

bubble(meuse, "zinc", maxsize = 2.5,
       main = "zinc concentrations (ppm)", key.entries = 2^(-1:4))

The sp package help page contains multiple examples of how to explore its meuse built-in data set. Another example of multiple plots using meuse.grid is given in the figure below.

## Load meuse.riv dataset
data(meuse.riv)
## Create an object of class SpatialPolygons from meuse.riv
meuse.sr <- SpatialPolygons(list(Polygons(list(Polygon(meuse.riv)),"meuse.riv")))
## Load the meuse.grid dataset
data(meuse.grid)
## Assign coordinates to the dataset and make it a grid
coordinates(meuse.grid) = c("x", "y")
gridded(meuse.grid) = TRUE
## Plot all variables of the meuse.grid dataset in a multiple window spplot
spplot(meuse.grid, col.regions=bpy.colors(), main = "meuse.grid",
       sp.layout=list(
           list("sp.polygons", meuse.sr),
           list("sp.points", meuse, pch="+", col="black")
           )
       )

Good scripting/programming habits

Increasing your scripting/programming efficiency comes from adopting good scripting habits. Following a couple of guidelines will ensure that your work:

  • Can be understood and used by others.
  • Can be understood and reused by you in the future.
  • Can be debugged with minimal effort.
  • Can be re-used across different projects.
  • Is easily accessible by others.

In order to achieve these objectives, you should try to follow a few good practices. The list below is not exhaustive, but already constitutes a good basis that will help you become more efficient now and in the future when working on R projects.

  • Comment your code.
  • Write functions for code you need more than once:
    • Make your functions generic and flexible, using control flow.
    • Document your functions.
  • Follow an R style guide. This will make your code more readable! Most important are:
    • Meaningful and consistent naming of files, functions, variables…
    • Indentation (like in Python: use spaces or tab to indent code in functions or loops etc.).
    • Consistent use of the assignment operator: either <- or = in all your code. The former is used by core R and allows assigning in function calls, the latter is shorter and consistent with most other programming languages.
    • Consistent placement of curly braces.
  • Make your own packages.
  • Keep a similar directory structure across your projects.
  • Use version control to develop/maintain your projects and packages.

Note that R IDEs like RStudio make many of these good practices much easier, and you should try to take maximum advantage of them. Take a moment to explore the menus of the RStudio session that should already be open on your machine. Particular emphasis will be given later in this tutorial to projects, project structure and use of version control.

Below is an example of a function written with good practices and without. First the good example:

ageCalculator <- function(x) {
    # Function to calculate age from birth year
    # x (numeric) is the year you were born
    if(!is.numeric(x)) {
        stop("x must be of class numeric")
    } else { # x is numeric
        # Get today's date
        date <- Sys.Date()
        # extract year from date and subtract
        year <- as.numeric(format(date, "%Y"))
        if(year <= x) {
            stop("You aren't born yet")
        }
        age <- year - x
    }
    return(age)
}

ageCalculator(1985)
## [1] 34

What a beautiful age for learning geo-scripting!

Then the bad example:

## DON'T DO THAT, BAD EXAMPLE!!!
funTest_4 <- function(x) {
if( !is.numeric(x))
{
stop("x must be of class numeric"  )
 }
else {
a = Sys.Date()
b<- as.numeric( format( a,"%Y"))
b-x
}
}

funTest_4(1985)
## [1] 34

It also works, but which of the two is the easiest to read, understand, and modify if needed? … Exactly, the first one. So let’s look back at the examples and identify some differences:

  • Function name: Not very self descriptive in the second example.
  • Function description: Missing in the second example.
  • Arguments description: Missing in the second example.
  • Comments: The second example has none (okay, the first one really has a lot, but that’s for the example).
  • Variables naming: use of a and b not very self descriptive in second example.
  • Indentation: Missing in the second example.
  • Control flow: Second example does not check for implausible dates.
  • Consistency: Second example uses spaces, assignment operators and curly braces inconsistently.

You haven’t fully understood what control flow is or you are not fully comfortable with function writing yet? We’ll see more of that in the following sections.

Function writing

A function is a sequence of program instructions that perform a specific task, packaged as a unit. This unit can then be used in programs wherever that particular task should be performed. -Wikipedia

The objective of this section is to provide some help on effective function writing. That is, functions that are:

  • simple,
  • generic, and
  • flexible.

They should integrate well in a processing/analysis chain and be easily re-used in a slightly different chain if needed. More flexibility in your function can be achieved through some easy control flow tricks. The following section develops this concept and provides examples.

Control flow

Control flow refers to the use of conditions in your code that redirect the flow in different directions depending on the values or classes of variables. Make use of that in your code, as this will make your functions more flexible and generic.
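
As a minimal sketch of this idea, the function below takes a different branch depending on the class of its input. The name describeInput and its messages are hypothetical and only serve as illustration.

```r
# Hypothetical helper: branches on the class of its input
describeInput <- function(x) {
    if (is.numeric(x)) {
        # Numeric input: report the mean
        result <- sprintf("numeric with mean %.1f", mean(x))
    } else if (is.character(x)) {
        # Character input: report the number of elements
        result <- sprintf("character with %d element(s)", length(x))
    } else {
        # Anything else is not supported
        stop("x must be numeric or character")
    }
    return(result)
}

describeInput(c(1, 2, 3))
## [1] "numeric with mean 2.0"
describeInput(c("a", "b"))
## [1] "character with 2 element(s)"
```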

Object classes and Control flow

You have seen in a previous lesson already that every variable in your R working environment belongs to a class. You can take advantage of that, using control flow, to make your functions more flexible.

A quick reminder on classes:

# 5 different objects belonging to 5 different classes
a <- 12
class(a)
## [1] "numeric"
b <- "I have a class too"
class(b)
## [1] "character"
library(raster)
c <- raster(ncol=10, nrow=10)
class(c)
## [1] "RasterLayer"
## attr(,"package")
## [1] "raster"
d <- stack(c, c)
class(d)
## [1] "RasterStack"
## attr(,"package")
## [1] "raster"
e <- brick(d)
class(e)
## [1] "RasterBrick"
## attr(,"package")
## [1] "raster"

Controlling the class of input variables of a function

One way of making functions more auto-adaptive is by adding checks of the input variables. Using object class can greatly simplify this task. For example let’s imagine that you just wrote a simple Hello World function.

HelloWorld <- function (x) {
    hello <- sprintf('Hello %s', x)
    return(hello)
}

# Let's test it
HelloWorld('john')
## [1] "Hello john"
HelloWorld(2.5)
## [1] "Hello 2.5"
HelloWorld(c("Devis", "Martin"))
## [1] "Hello Devis"  "Hello Martin"
HelloWorld(NULL)
## character(0)
HelloWorld(data.frame(a=1:10, b=rep("Name", 10)))
## [1] "Hello 1:10"                           
## [2] "Hello c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)"

Surprisingly enough, R is smart enough to give intuitive output in most of the tested cases, since the sprintf function automatically casts non-character variables into character ones, and is also vectorised to produce output when the input is a vector. However, in the last two cases, the output is not intuitive. We may want to only allow passing character vectors to this function. We can do this with a small change:

HelloWorld <- function (x) {
    if (!is.character(x))
      stop('Object of class "character" expected for x')

    hello <- sprintf('Hello %s', x)
    return(hello)
}

HelloWorld(21)
## Error in HelloWorld(21): Object of class "character" expected for x

The function now throws an informative error when something not supported is requested. These function argument “sanity checks” are useful to avoid lengthy processing, when we know that the output of the function is going to be wrong anyway because of invalid arguments. Alternatively to stop(), the function could throw a warning() but still return some value.
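
To illustrate the warning() alternative, here is a sketch of a more lenient variant. The name HelloWorldLenient and the wording of the warning are assumptions made for this example; whether converting the input is acceptable depends on your use case.

```r
# Hypothetical lenient variant: warn and convert instead of stopping
HelloWorldLenient <- function(x) {
    if (!is.character(x)) {
        warning("x is not of class character, converting it with as.character()")
        x <- as.character(x)
    }
    hello <- sprintf("Hello %s", x)
    return(hello)
}

# Gives a warning, but still returns a value
HelloWorldLenient(21)
## [1] "Hello 21"
```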

Question: In which cases should you use stop(), warning(), message(), and when should you return a string?

Note that most common object classes have their own logical function (that returns TRUE or FALSE) to check what class it is. For example:

is.character('john')
## [1] TRUE
# is similar to
class('john') == 'character'
## [1] TRUE
is.character(32)
## [1] FALSE
is.numeric(32)
## [1] TRUE

You should always try to take maximum advantage of these small utilities and check for classes and properties of your objects. This matters in cases that you might not think of in advance. For instance, consider an object with more than one class:

a = list(a=1:10, b="b")
# We can also set a class (only do that if you make your own class!)
class(a) = c("myclass", "list")

# Wrong
if (class(a) == "list") {
  print("a is a list")
} else {
  print("a is not a list")
}
## Warning in if (class(a) == "list") {: the condition has length > 1 and only
## the first element will be used
## [1] "a is not a list"
# Right
if (is.list(a)) {
  print("a is a list")
} else {
  print("a is not a list")
}
## [1] "a is a list"
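
Note that in R 4.2.0 and later, an if condition of length greater than one is an error rather than a warning, so the "wrong" variant above stops completely. When no dedicated is.*() function exists for a class, the base R function inherits() is a general alternative that always returns a single logical value. A short sketch, reusing the object from above:

```r
a <- list(a = 1:10, b = "b")
class(a) <- c("myclass", "list")

# inherits() returns a single TRUE/FALSE even if the object has several classes
inherits(a, "list")
## [1] TRUE
inherits(a, "myclass")
## [1] TRUE
inherits(a, "data.frame")
## [1] FALSE
```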

Also note that is.character(32) == TRUE is equivalent to is.character(32). Therefore when checking logical arguments, you don’t need to use the == TRUE. As an example, a function may have an argument (say, plot) that, if set to TRUE will generate a plot, and if set to FALSE does not generate a plot. It means that the function certainly contains an if statement. if(plot) in that case is equivalent to if(plot == TRUE), it’s just shorter (and very slightly faster).

As an example, consider a function that subtracts 2 RasterLayers, with the option to plot the resulting RasterLayer or not.

library(raster)
## Function to subtract 2 rasterLayers
minusRaster <- function(x, y, plot=FALSE) { 
    z <- x - y
    if (plot) {
        plot(z)
    }
    return(z)
}

# Let's generate 2 rasters 
# The first one is the R logo raster
# converted to the raster package file format.
r <- raster(system.file("external/rlogo.grd", package="raster")) 
# The second RasterLayer is derived from the initial RasterLayer in order
# to avoid issues of non matching extent or resolution, etc
r2 <- r
## Filling the rasterLayer with new values
# The /10 simply makes the result more spectacular
r2[] <- (1:ncell(r2)) / 10
## Simply performs the calculation
r3 <- minusRaster(r, r2) 
## Now performs the calculation and plots the resulting RasterLayer
r4 <- minusRaster(r, r2, plot=TRUE) 

Vectorised functions

A lot of core R functions are vectorised, i.e. they are capable of taking vectors, rather than individual values, as input. This allows very simple and powerful syntax without needing to use loops, for instance:

NumVec = 1:10
# We do not need to run the function on each element individually
as.character(NumVec)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
# Functions that reduce input into fewer numbers are also vectorised
mean(NumVec)
## [1] 5.5
range(NumVec)
## [1]  1 10
# All math operators are vectorised
NumVec + 100
##  [1] 101 102 103 104 105 106 107 108 109 110
NumVec^2
##  [1]   1   4   9  16  25  36  49  64  81 100

Because most base functions are already vectorised, it is easy to write new functions that are themselves vectorised. For the most part, all you need to do is test your functions not just on single values, but also vectors of values, and think whether the result makes sense.
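
One thing to watch out for is that if() itself is not vectorised: it expects a single TRUE or FALSE. The sketch below, with the hypothetical function name signLabel, uses the vectorised ifelse() instead, so the same function works on single values and on vectors.

```r
# Hypothetical example: label numbers by their sign
# ifelse() evaluates the condition element by element
signLabel <- function(x) {
    ifelse(x > 0, "positive", ifelse(x < 0, "negative", "zero"))
}

# Test on a single value and on a vector
signLabel(5)
## [1] "positive"
signLabel(c(-3, 0, 8))
## [1] "negative" "zero"     "positive"
```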

Vectorisation also allows us to write short and versatile code. For instance, to check whether the input is a positive number:

is.positive.number = function(x)
{
  is.numeric(x) & x > 0
}

is.positive.number(100)
## [1] TRUE
is.positive.number(c(-100, 10, NA))
## [1] FALSE  TRUE    NA
is.positive.number(c(TRUE, FALSE))
## [1] FALSE FALSE

Question: Why does the following function call return different and counter-intuitive results?

is.positive.number(c(TRUE, -5, 10))
## [1]  TRUE FALSE  TRUE

Type consistency

We just became familiar with functions for checking the types (classes) of variables. You may have noticed that there is one way in which they are all consistent: no matter the input, they return a logical value (TRUE, FALSE, or also NA). This is called type consistency and is a useful property. Consider code like this:

AddNewSubject = function(NewBirthYear)
{
  # Birth years of previous subjects
  BirthYears = c(1980, 1985, 1987, 1990, 1993, 1994, 1998, 2000)
  return(c(BirthYears, NewBirthYear))
}

# Works as expected
NewSubjects = AddNewSubject(c(1995, 1998, 1999, 2000))
sum(NewSubjects >= 2000) # How many subjects are born on or after 2000
## [1] 2
# Whoops!
NewSubjects = AddNewSubject(c("MCMXCV", "1998", "1999", "2000"))
sum(NewSubjects >= 2000)
## [1] 3

Question: What happened, and what would happen if we didn’t include the Roman numerals?

AddNewSubject is not type consistent: depending on the input, the output class changes. This is sometimes convenient, but in the case above, the function succeeds but gives an output that is completely misleading. This has gone horribly right: a lot of code in R is flexible and can handle multiple input types, and so you may only notice that your output is wrong when you inspect it yourself. If the function AddNewSubject checked the input and always returned integers, we would be sure that comparing its output against numbers and summing them is safe. Similarly, the is.numeric() etc. functions are type consistent, and thus it is safe to use them in if statements which always require a logical input. This allows us to not need extra type checking after running such a function. In addition, the function name is hinting towards the type consistency: is. makes us realise that it will give a yes/no answer.
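
A minimal sketch of what such an input check could look like is given below. The name AddNewSubjectChecked is hypothetical, and stopping on non-numeric input is only one possible design; the point is that the output is always a numeric vector.

```r
# Hypothetical type-consistent variant of AddNewSubject
AddNewSubjectChecked <- function(NewBirthYear) {
    # Refuse anything that is not numeric instead of silently changing class
    if (!is.numeric(NewBirthYear)) {
        stop("NewBirthYear must be of class numeric")
    }
    # Birth years of previous subjects
    BirthYears <- c(1980, 1985, 1987, 1990, 1993, 1994, 1998, 2000)
    return(c(BirthYears, NewBirthYear))
}

NewSubjects <- AddNewSubjectChecked(c(1995, 1998, 1999, 2000))
sum(NewSubjects >= 2000)
## [1] 2
# Calling it with c("MCMXCV", "1998", "1999", "2000") now stops with an
# error instead of returning a misleading count
```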

try and debugging

Use of try for error handling

The try() function may help you write functions that do not stop with a cryptic error whenever they encounter unexpected input. Anything (sub-function, piece of code) that is wrapped into try() will not interrupt the bigger function that contains try(). So for instance, this is useful if you want to apply a function sequentially but independently over a large set of raster files, and you already know that some of the files are corrupted and might return an error. By wrapping your function into try() you allow the overall process to continue until its end, regardless of the success of individual layers. So try() is a perfect way to deal with heterogeneous/unpredictable input data.

Also, try() returns an object of a different class when it fails. You can take advantage of that at a later stage of your processing chain to make your function more adaptive. See the example below, which illustrates the use of try() for sequentially calculating frequency on a list of auto-generated RasterLayers.

library(raster)

## Create a raster layer and fill it with "randomly" generated integer values
a <- raster(nrow=50, ncol=50)
a[] <- floor(rnorm(n=ncell(a)))

# The freq() function returns the frequency of a certain value in a RasterLayer
# We want to know how many times the value -2 is present in the RasterLayer
freq(a, value=-2)
## [1] 340
# Let's imagine that you want to run this function over a whole list of RasterLayer
# but some elements of the list are unpredictably corrupted
# so the list looks as follows
b <- a
c <- NA
list <- c(a,b,c)
# In that case, b and a are raster layers, c is ''corrupted''
## Running freq(c) would return an error and stop the whole process
out <- list()
for(i in 1:length(list)) {
    out[i] <- freq(list[[i]], value=-2)
}
## Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'freq' for signature '"logical"'
## If you wrap the call in a try(), you still get an error, but it's non-fatal
out <- list()
for(i in 1:length(list)) {
    out[i] <- try(freq(list[[i]], value=-2))
}
out
## [[1]]
## [1] 340
## 
## [[2]]
## [1] 340
## 
## [[3]]
## [1] "Error in (function (classes, fdef, mtable)  : \n  unable to find an inherited method for function 'freq' for signature '\"logical\"'\n"
# By building a function that includes a try()
# we are able to catch the error without having it printed,
# allowing the process to handle the error gracefully.
fun <- function(x, value) {
    tr <- try(freq(x=x, value=value), silent=TRUE)
    if (inherits(tr, 'try-error')) {
        return('This object returned an error')
    } else {
        return(tr)
    }
}

## Let's try to run the loop again
out <- list()
for(i in 1:length(list)) {
    out[i] <- fun(list[[i]], value=-2)
}
out
## [[1]]
## [1] 340
## 
## [[2]]
## [1] 340
## 
## [[3]]
## [1] "This object returned an error"
# Note that using a function of the apply family would be a more
# elegant/shorter way to obtain the same result
(out <- sapply(X=list, FUN=fun, value=-2))
## [1] "340"                           "340"                          
## [3] "This object returned an error"
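
Note that the sapply() result above is a character vector, because the error message is a string. As a sketch of a more type-consistent alternative, the base R function tryCatch() can replace a failed result by NA. The example below uses sqrt() on hypothetical inputs instead of freq() on RasterLayers, so that it runs without the raster package; safeSqrt is a name made up for this illustration.

```r
# Hypothetical inputs: two valid numbers and one "corrupted" element
inputs <- list(4, 9, "corrupted")

# sqrt() fails on character input; tryCatch() turns that error into NA
safeSqrt <- function(x) {
    tryCatch(sqrt(x), error = function(e) NA_real_)
}

# vapply() additionally enforces that every result is a single numeric value
vapply(inputs, safeSqrt, numeric(1))
## [1]  2  3 NA
```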

Function debugging

Debugging a single line of code is usually relatively easy; simply double checking the classes of all input arguments often gives good pointers to why the line crashes. But when writing more complicated functions where objects created within the function are reused later on in that same function or in a nested function, it is easy to lose track of what is happening, and debugging can then become a nightmare. A few tricks can be used to make that process less painful.

traceback() and debugonce()

Here are the manual commands, which also work with RStudio and other IDEs:

  • The first thing to investigate right after an error occurs is to run the traceback() function; just like that without arguments.
  • Carefully reading the return of that function will tell you where exactly in your function the error occurred.

foo <- function(x) {
    x <- x + 2
    print(x)
    bar(2) 
}

bar <- function(x) { 
    x <- x + a.variable.which.does.not.exist 
    print(x)
}

foo(2) 
## gives an error

traceback()
## 2: bar(2) at #1
## 1: foo(2)
# Ah, bar() is the problem

# Debug it by declaring what to debug and running it
debugonce(bar)
foo(2)

Depending on the IDE you are using, you may be presented with tools for stepping through the function line by line, as well as a Browse console, which allows you to query the state of the variables involved so that you can identify exactly what is going on in the function call. For instance, in RKWard, the Debugging Frames pane on the right shows which line you are stepping through.

For another example see: rfunction.com.

RStudio

RStudio has integration with the debugging tools in R, so you can use a point-and-click interface. However, some parts of it are specific to the RStudio IDE.

  • To force RStudio to break on every error, select Debug - On Error - Break in Code in the main menu.
  • Run foo(2) again.
  • RStudio will stop the execution where the error happened. The traceback appears in a separate pane on the right.
  • You can use the little green “Next” button to go line by line through the code, or the red Stop button to leave the debugging mode.
  • Reset the On Error behaviour to Error Inspector. In this default setting, RStudio will try to decide whether the error is complex enough for debugging, and then offer the options to “traceback” or “rerun the code with debugging” with two buttons in the console.

Finally, solve the problem:

## redefine bar
bar <- function(x) {
    x + 5
}
foo(2)
## [1] 4
## [1] 7

Refer to the reference section of this document for further information on function debugging.

(optional) Writing packages

The next step in writing re-usable code is packaging it, so others can simply install and use it. If you have followed the steps up to here, this is no longer a big step! For this course, it is optional. Find instructions here and in the references below.

Exercise 3

Your task

Create a GitLab project on the WUR instance. The project should contain a simple function to calculate whether or not a year is a leap year. Use control flow, and provide some examples of how the function works in the main.R. The function should behave as follows:

> is.leap(2000)
[1] TRUE

> is.leap(1580)
Warning message:
In is.leap(year): 1580 was before the Gregorian calendar was in use, using proleptic Gregorian calendar
[1] TRUE

> is.leap(2002)
[1] FALSE

> is.leap('john') #should throw an error 
Error: argument of class numeric expected

Useful resources

Assessment

When doing assessment, the following points are to be considered:

  1. Whether the function works as intended
  2. Whether the function handles a variety of input, throws meaningful errors and warnings when appropriate.
  3. Whether the function is readable, complete and type consistent.

The structure of your project and the appropriate use of git will also be assessed. For bonus points, try to make the function as short as possible: can you come up with a solution without using any if statements?

How to submit?

Important: Please carefully read the submission instructions on Blackboard! You may not get scored if you submit your exercises incorrectly.

As a summary: create a private GitLab project with the name Geoscripting-Exercise<id>-<teamname>, where <id> is the number of the exercise (3 in today’s case) and <teamname> is the name of your team. In the Members section of the project, add the staff members responsible for checking your exercises as members of the project and grant them “Maintainer” privileges. Finish the exercise by the deadline today. The staff will check it and publish your answers on the student group Geoscripting<year> on GitLab after the deadline.

You will need to give the team you are reviewing feedback on their exercise solution the next day (use the review team generator Shiny app to know who you are reviewing and who you are reviewed by). Answers from other groups will be available on the Geoscripting<year> group on GitLab. For reviewing other teams’ answers:

  • Clone the repository of the team you have to review to your computer and test it.
  • Add an issue to their project and write out your review. Make sure to mention your team name in the review.

Once you receive issues on your repository with feedback, do not close them! They will be checked by staff as part of the assessment.

This is the way the exercises will need to be submitted and reviewed from this lesson on.

References