Dainius Masiliūnas, Loïc Dutrieux, Jan Verbesselt, Johannes Eberenz

2024-08-29

WUR Geoscripting

Learning objectives

  • Learn about graphical interfaces for programming in R and Python
  • Understand the structure of a package in R and Python
  • Adopt some good scripting/programming habits
  • Learn how to find help with solving programming problems

Graphical user interfaces (GUIs) and integrated development environments (IDEs)

It’s time for us to actually delve into R and Python programming! But how do we do that efficiently? We could write scripts in Notepad and run them, but that is not particularly efficient, as there are many graphical user interfaces (GUIs) that can help us write code faster. Comprehensive GUIs that help with writing, debugging and packaging code are called integrated development environments (IDEs).

R

There are multiple IDEs for R. The most popular one is RStudio, as it is developed by a company that is very active in contributing to R (Hadley Wickham and others). RStudio is cross-platform and open-source. It comes in two types: RStudio Desktop is a regular desktop app, and RStudio Server is meant for running on a remote server, to which you can connect through your web browser. Interestingly enough, the desktop version is in fact just RStudio Server running locally, with an integrated web browser based on Google Chrome.

Even though RStudio is very popular, it is not the only IDE for R. RKWard is an alternative IDE, aimed at easier learning of R for people coming from other statistical software, such as SPSS. It includes menus for common statistical analysis algorithms, such as getting descriptive statistics, running statistical tests, and making plots. It also includes an editor for data tables. RKWard is an open-source native desktop app that is developed on Linux, but has also been ported to other platforms, including Windows.

Furthermore, there are cross-language GUIs/IDEs, such as Jupyter, whose name is a portmanteau of Julia, Python and R. Jupyter can run different language interpreters, what it calls kernels, for each script that is open, therefore scripts in multiple languages can be edited at the same time in the same interface.

One thing to keep in mind is that there is a distinction between a programming language, such as R, and its IDEs, such as RStudio. Which IDE you use is up to your own personal preference. Which IDE you used to develop code does not matter, because in the end you just have an R script that can be run on any of the IDEs, or directly from the command line. What does matter is that the script is written in the R language, as the users will need to have the R interpreter installed in order to run the script. Likewise, the packages that you use in your script are also important, as the users will need to install them before they can run your script. Therefore, whenever you refer to scripts and packages, for instance when writing a thesis, you need to specify what language the scripts are written in, and what packages (and ideally what versions) you used. You should not mention which IDE you used, as that is irrelevant to the readers. For example, you should not write “the scripts are written in RStudio”, but rather, you should write “the scripts are written in R, using package lattice version 0.22”. In fact, you should also cite the authors of the language and the packages you used. R includes a handy function citation() to help you cite:

citation()
## To cite R in publications use:
## 
##   R Core Team (2024). _R: A Language and Environment for Statistical
##   Computing_. R Foundation for Statistical Computing, Vienna, Austria.
##   <https://www.R-project.org/>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {R: A Language and Environment for Statistical Computing},
##     author = {{R Core Team}},
##     organization = {R Foundation for Statistical Computing},
##     address = {Vienna, Austria},
##     year = {2024},
##     url = {https://www.R-project.org/},
##   }
## 
## We have invested a lot of time and effort in creating R, please cite it
## when using it for data analysis. See also 'citation("pkgname")' for
## citing R packages.

It also works on packages:

citation("lattice")
## To cite package 'lattice' in publications use:
## 
##   Sarkar D (2008). _Lattice: Multivariate Data Visualization with R_.
##   Springer, New York. ISBN 978-0-387-75968-5,
##   <http://lmdvr.r-forge.r-project.org>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Book{,
##     title = {Lattice: Multivariate Data Visualization with R},
##     author = {Deepayan Sarkar},
##     year = {2008},
##     publisher = {Springer},
##     address = {New York},
##     isbn = {978-0-387-75968-5},
##     url = {http://lmdvr.r-forge.r-project.org},
##   }

There are more IDEs, such as R Commander, but we will not be talking about them within the scope of this course. Nevertheless, feel free to use whichever IDE is your favourite!

As a sidenote, IDEs can change how certain commands in R work. For instance, there are two commands in R that load packages: library() and require(). Most IDEs make no difference between the two; require() just gives a warning if it can’t load a package, whereas library() stops with an error. However, in RKWard, running require() on a package that does not exist will result in its package management dialogue opening with the required packages preselected for installation, and closing the dialogue will result in a successful loading of the package (if it was installed successfully). Therefore, do not use require(“package”) to check if package is installed, otherwise it will be installed twice under RKward. Instead, you can use “package” %in% installed.packages().

Next, let’s look a bit more in detail at the IDEs described above and their configuration options.

RStudio

In the virtual machines provided in the course, RStudio is already installed for you. In case you are working on your own computer and would like to know how to install R and RStudio, see the RStudio website.

If you are new to RStudio take a look at the following summary YouTube video about how to use RStudio: Intro to RStudio (6 min). In it you will learn how to navigate the RStudio environment and even run some code. This will be helpful for later in the course.

Additionally, the first time you open RStudio it’s helpful to change some global settings. To do this go to ToolsGlobal Options… in the top menu bar of RStudio. The following box should appear. Set the settings to match that of the screenshot, namely, Never save the workspace to .RData file on exit and uncheck the boxes under History and Restore .RData. The effect is that RStudio will stop prompting you on exit about whether you want to save the workspace and reload it on next start, which will save you time answering prompts and save time opening RStudio. Most importantly, it will make sure that restarting RStudio leaves you with a clean environment, equal to what another user of your code would start with, so you can properly test your code, and restart RStudio in case something goes wrong to get back to a clean state.

Global Options
Global Options

To make a new script, click on File -> New File -> R Script, and a new editor pane will open, allowing you to write code. When you write a line, you can run each individual line that your text cursor is on by hitting Ctrl+Enter. To install packages, aside from using the R console below (install.packages() function), you can install them through the graphical Packages pane at the lower right corner of the screen, where you can search for and install packages and their updates.

RKWard

RKWard is also preinstalled on the virtual machine. On your own computer, you can get it from the RKWard website.

When you launch RKWard, it will start a first-start setup wizard, that suggests installing some additional packages. Feel free to simply dismiss it by clicking Cancel. To start a new file, click Create -> Script File. Typically you can also use Ctrl+Enter to run each line that your cursor is on. RKWard sometimes gives too many autocompletion tooltips, you can disable them by going to Settings -> Configure Script Editor and unchecking Function call tip.

You can install packages, aside from using the install.packages() function in the R console below, through Settings -> Manage R packages and plugins…. You will be prompted to select a mirror, just use the top one (0-Cloud) to automatically select the best one. The tab Install / Update / Remove R packages lists all packages that are available on CRAN, packages that are installed and are the latest version, and packages that are installed but can be updated. To install packages, click the checkbox next to their names, and to remove packages, uncheck them. The changes will only be applied once you click Apply.

Jupyter

Jupyter comes in two types as well: Jupyter Notebook and Jupyter Lab.

Jupyter Notebook is an older interface that is based on the notebook concept, where you combine text with code in a single (.ipynb) file. Jupyter Notebooks are extremely popular in the Python community, as they are easy for writing documentation and tutorials (vignettes). They are displayed automatically in GitLab and GitHub. However, Jupyter Notebooks are less popular in R, because RMarkdown provides a similar solution for R (and recently Quarto extends RMarkdown beyond R, therefore directly competing with Jupyter Notebooks). An issue with Jupyter Notebooks is that, due to mixing of text and code, you cannot run a notebook from the terminal, or in fact use any IDE other than Jupyter Notebook. They are also not well compatible with Git, as multiple developers cannot edit the same file without causing spurious conflicts, albeit there are workarounds.

Jupyter Lab is a newer interface that allows directly editing R and Python files, in addition to opening Jupyter Notebooks. Its advantage is that it can edit multiple scripts in multiple languages at the same time, but its disadvantage is that it cannot provide functionality that is specific to any one language, therefore it is relatively barebones.

Like RStudio Server, Jupyter is mainly designed to be run on a server. It can be run locally by launching a server and then connecting to it through a web browser. However, since Jupyter is written in Python, it requires specific Python setup to install. We will learn more about how to run Jupyter in the Python part of the course.

The fact that Jupyter is open source and designed to work as a server has created opportunities for Google to create something of a Google Docs version of Jupyter, that is called Google Collaboratory, or Google Colab for short. Google provides every user of Google Colab with computer resources to run Python or R. It is therefore the easiest way to run Python without changing anything on your own computer. If anything goes wrong, you can always create a new Colab notebook and start over.

First, go to the Google Colab website and click on + New notebook to create a Colab notebook. The default kernel type in Google Colab is Python, but you can change it to R by going to Runtime -> Change runtime type and selecting R. Click Connect at the top right to start the kernel, so that you can run code blocks.

Note that Google Colab has a special character, the exclamation point !, which allows you to run Bash commands within Python code blocks. This is very useful and important for installing Python packages, as they are typically installed using the pip command from Bash. You can try running a code block like this to install and load the rasterio package:

!pip install rasterio
import rasterio

Python

Aside from Jupyter, you can use other IDEs for Python. A simple to install IDE is called Spyder, as it is its own Python package. You can also use more complex IDEs for Python scripting, such as PyCharm and Microsoft Visual Studio Code, but they are more complicated to install.

You will learn more about Python IDEs later in the course. For now, we will stick with using Google Colab to run Python code.

The overall system architecture

System overview System Overview IDE Integrated DevelopmentEnvironment(IDE) Engine Engine Packages Packages Bindings Bindings Libraries Libraries R IDEs RStudio,RKWard,R GUI, ... Jupyter Python IDEs Spyder,PyCharm, ... Webbrowser Web browser(Firefox, Chrome, ...) R engine Python engine Earth Engine Earth Engine R packages terra, sf, ranger, ... Python packages rasterio, geopandas, ... pyproj pyproj earthengine-api earthengine-api GEOS GEOS GDAL PROJ PROJ GitLab git Local Data Local data Cloud storage Cloud storage(Google Drive,Google CloudStorage, ...)
System architecture graph

To recap, the overall system architecture is comprised of integrated development environments (IDE), engines, packages, bindings, and libraries. Let’s start with the engine, the core program that executes the foundation or crucial tasks of the programming language. The most relevant engines in this course are Python and R. To interact with the engine, we can either code in the command line directly or we can use an IDE, which provides many tools and features for working with the engine integrated in a single environment. An IDE allows the developer to write code, test it, and debug it all in a single software application.

We then install packages in the IDE, these are collections of related code files, libraries, and resources that are related and compatible with each other. Libraries are pre-written code that provide specific functions and services. The idea of libraries and packages is to make coding more efficient by the reuse of common functions.

What are packages?

Packages extend the functionality of a programming language by providing new functions (and/or datasets). There is a huge variety of packages in both R and Python. But how are packages made?

In essence, packages are nothing more than bundles of scripts! Typically, these scripts provide functions: small pieces of scripts that do one particular task, typically with a name and input arguments. By loading a package, we simply place these functions in what is called the global environment: a part of your computer memory that holds your variables, function definitions, datasets etc. for the currently running session of R or Python.

Packages are formalised by a set of rules that all developers have to abide by. One important rule is that the action of merely loading a package should not cause harm to the user (or their environment): functions can be loaded into the global environment, but should not overwrite what is already there, or run any code (which could potentially delete files etc.). Therefore, packages typically consist of script files with nothing other than function definitions. Running those functions is left up to the users in their own scripts. How to run them is documented using examples, either in comments next to the functions, or in an external help file, but never directly as executable code.

In addition to the scripts (and their documentation), packages typically include some metadata, such as the name of the author, package version, what other packages are needed to run this package, etc. In R, this metadata is mandatory, whereas in Python it’s to some extent optional.

Packages also typically follow a particular structure, i.e. where files are located and how. Let’s take a closer look at the package structure in both Python and R.

Package structure

In Python

A simple package (module) in Python is simply a directory with a Python file in it. In Colab, this code will create a directory and a Python file containing a Hello World example:

! mkdir -p MyPackage
! echo "def helloworld():" > MyPackage/helloworld.py
! echo "    print('hello world')" >> MyPackage/helloworld.py

Now we can directly make use of our new helloworld() function in Python:

import MyPackage.helloworld as mphw

mphw.helloworld()

In the import name, MyPackage is the name of the directory, and helloworld is the name of the file. We rename MyPackage.helloworld to mphw for brevity, but it’s optional.

Well, that was easy!

Though our package is technically not a complete package yet, as it can only be run if we copy the file to every code project that wants to use our function. A true package can be installed and then run from any place on your computer. Thankfully that’s also easy to do! All we need is to provide a bit more structure and metadata.

The metadata file in Python is called pyproject.toml, and it has to be in a subdirectory with the name of the package, as the name will appear in the list of packages. The actual functions need to be in a subdirectory of that subdirectory, with the name of the package, as the package will appear in Python. Typically, both of these names should be the same, otherwise it will result in user confusion. Lastly, if we want to be able to just import the package to get our functions, without having to specify the name of the file that holds the functions we want to import, we should call the script with our functions __init__.py.

In summary, the package structure looks like this:

MyPackage
├── MyPackage
│   └── __init__.py
└── pyproject.toml

Let’s make our files match this structure! First, make the subdirectory MyPackage/MyPackage, and then rename our helloworld.py to MyPackage/MyPackage/__init__.py:

! mkdir MyPackage/MyPackage
! mv MyPackage/helloworld.py MyPackage/MyPackage/__init__.py

Next, let’s add the project metadata. We only need to specify the name of the package and its version. To prevent confusion, use the same name as the name of the two directories.

! echo '[project]' > MyPackage/pyproject.toml
! echo 'name = "MyPackage"' >> MyPackage/pyproject.toml
! echo 'version = "0.1"' >> MyPackage/pyproject.toml

Bam, we now have a true package! We can install it with pip:

! pip install ./MyPackage

And now we can use our function from any directory simply as:

import MyPackage

MyPackage.helloworld()

In R

Packages in R are also nothing more than an organised collection of R scripts. Compared to Python, there are a few more files necessary to make an R package, but thankfully, we also have tools that help to create one!

Once again, let’s make a Hello World function and put it into a file:

echo 'HelloWorld <- function() print("Hello world")' > HelloWorld.r

Let’s make it into a package! We use the function package.skeleton() to autocreate a package structure:

package.skeleton(name = "MyPackage", code_files = "HelloWorld.r")
## Creating directories ...
## Creating DESCRIPTION ...
## Creating NAMESPACE ...
## Creating Read-and-delete-me ...
## Copying code files ...
## Making help files ...
## Done.
## Further steps are described in './MyPackage/Read-and-delete-me'.

Let’s follow the instructions to read the file MyPackage/Read-and-delete-me. And then delete it:

rm MyPackage/Read-and-delete-me

Let’s see what is the R package structure:

tree -n MyPackage
## MyPackage
## ├── DESCRIPTION
## ├── man
## │   ├── HelloWorld.Rd
## │   └── MyPackage-package.Rd
## ├── NAMESPACE
## └── R
##     └── HelloWorld.r
## 
## 3 directories, 5 files

It’s a little bit more complicated than Python, but not by much. The script files are stored in a directory called R (as R packages can also include C++ code etc.), and our script file is already there. Next we have the DESCRIPTION file, which, like the pyproject.toml file we saw for Python, includes metadata about the package. You can open it and edit it. Lastly, we have the NAMESPACE file, which states which functions will be accessible to package users (as opposed to only internal to the package), and a man directory, which includes the documentation manual entries. We have two entries: HelloWorld.Rd, which is the description of our HelloWorld function, and MyPackage-package.Rd, which is a description of the package itself. These documentation files are written in an R variant of LaTeX.

Let’s install our package! First we need to build it, from the terminal:

R CMD build MyPackage
## * checking for file ‘MyPackage/DESCRIPTION’ ... OK
## * preparing ‘MyPackage’:
## * checking DESCRIPTION meta-information ... OK
## * installing the package to process help pages
## * saving partial Rd database
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## * building ‘MyPackage_1.0.tar.gz’

As you can see, the package is just a zip of our package files, plus some metadata cache. Now we can use R itself to install our new package:

install.packages("./MyPackage_1.0.tar.gz")
## Installing package into '/home/osboxes/R/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
## inferring 'repos = NULL' from 'pkgs'

Perfect, we can immediately load it and use it:

library(MyPackage)

HelloWorld()
## [1] "Hello world"

As a bonus, we can also see the documentation we had in the man files:

?HelloWorld

That was also pretty easy!

To summarise, we have learned that packages in both Python and R are nothing more than scripts that contain functions, in a particular directory structure. If we start developing our projects with that structure in mind, we can make packages out of our scripts very easily!

Tip: Once you have a package, you can also have it uploaded to a package repository, so that others (or you from another computer) can easily download and run the code in the package. However, it does require you to have your code in good working order and be well documented, since, after all, it’s going to be public for everyone to see!

The package repository for Python is called PyPI, The Python Package Index. PyPI is developer-friendly, which means that you can easily upload your packages on PyPI without much hassle, as long as your project is in a reasonable shape and declares all its dependencies properly. See the PyPI documentation to learn more about how to submit packages to PyPI.

The package repository for R is called CRAN, the Comprehensive R Archive Network. Unlike PyPI, CRAN is end-user-friendly. That means that submitting packages to CRAN is not nearly as easy, because it has much higher quality requirements for any (new) packages. All packages submitted to CRAN have to pass through both automatic tests (using R CMD check command), as well as human review. Your package will be rejected if it depends on a package version that is not currently on CRAN, if you fail to document any function parameter, make any typo anywhere or even fail to use quotes correctly. Therefore, the package submission process for CRAN is more akin to peer review for publishing a scientific paper. However, it ensures that packages always work together with other packages, which immensely simplifies package dependency management and allows users to simply use install.packages() and expect the new package to just work. Packages are also regularly removed from CRAN if they stop working due to updates in their dependencies, making sure that all packages on CRAN stay compatible with each other. See CRAN documentation for more information about package submission requirements.

Good programming habits

Project structure

Making packages is a great way to stay organized, keep track of what you are doing and be able to use it quickly and properly at any time. Packages are:

  • Easy to share with others
  • Dependencies are automatically imported and functions are sourced (reduces the risk of having broken dependencies)
  • Documentation is attached to the functions and cannot be lost (or forgotten)

For these reasons, if you build a package, next year you will still be able to run the functions you wrote yesterday. Which is often not the case for stand alone functions that are poorly documented and may depend on many other functions … that you cannot find any more. So to summarize, packages are not only a way to extend functionality, they are also the standard way to archive and save functions.

To make it easy for you to make a package out of your code, you need to stick with a project structure that packages have. Another benefit of maintaining a consistent project structure is that it will make it easier for you to switch from one project to the other and immediately understand how things work.

To practice keeping a consistent project structure, in this course we will be following the structure below:

Project Structure Schema
Project Structure Schema
  • A main script at the root of the project. This script performs step by step the different operations of your project. It is the only non-generic part of your project (it contains executable code outside functions, including paths, already set variables, etc.). The file extension of this file will depend on what language you are using for your project. We typically call these files main.r and main.py, but it could also be e.g. task1.r.
  • As we will be working with multiple languages throughout this course we will keep things organized by placing the scripts into their respective language sub-directories (R/, Python/, and Bash/). These directories should contain the functions you have defined as part of your project. These functions should be as generic as possible and are sourced and called by the main script. The way this is done depends on the language used by the main script. For example, in R you would write source("R/myfunction.R"). Whereas in Python, you would use import Python.myfunction.
    • Each file in the R and Python directory should ideally consist of a single function with the same name as the file itself, to make it easy to find. In Python it is common to combine multiple functions in one file, because typically you refer to the file (module) name in addition to the function name (through import MyPackage), but it is still good practice to keep each function in its own file (so you don’t get confused where each function comes from even if you do from MyPackage import *).
  • A data/ subdirectory: This directory contains data sets of the project. Since Git is not as efficient with non-text files, and GitLab has storage limits, you should only put small data sets in that directory (<2-3 MB). Typically, you do not include it in your git repository at all; rather, this directory is created from your main script, and is used to store data downloaded from the internet. It can be safely removed after the script is finished running.
  • An output/ subdirectory (when applicable), where you place the final result of running your script. This should also not be tracked by git: your scripts create the output, so there is no need to store it.
  • A README.md file should be included, this file should contain a description of your project, along with a description of what other packages your script needs to function correctly, and any other instructions needed to correctly run your code.
  • Finally, you should include a LICENSE.txt file with the software licence which you would like your code to have.

Warning: Never upload large (raster) files to git! If your repository exceeds 100 MiB in size, it will no longer be loaded by CodeGrade. Deleting the files will have no effect, because Git stores the history forever. Fixing a repository like that is extremely complicated and time-consuming!

Note: Without a LICENSE file, your code is copyright, which means that nobody has the right to even read it, let alone run it! Make sure to always specify a license that would at least allow that.

Example main file

Typically the header of your main script will look like the following:

In R (main.R)

# Team Teamname (John Doe and Jane Smith)
# January 2020
# Import packages
library(terra)
library(sf)
# Source functions
source('R/download_data.R')
source('R/function2.R')
# Use sourced function to download data to the directory "data"
download_data("data")
# Load datasets 
postboxes <- st_read('data/postbox_locations.gpkg')
# Then the actual commands

In Python (main.py)

# Team Teamname (John Doe and Jane Smith)
# January 2020
# Import packages
import geopandas as gpd
import matplotlib.pyplot at plt

# Import functions
import Python.download as download

# Use imported function to download data to the directory "data"
download.download_data("data")

# Load datasets 
postboxes = gpd.read_file('data/postbox_locations.gpkg')
# Then the actual commands

Working directory, relative and absolute file paths

At the end of the following section you should be able to explain the difference between the following:

  • relative path
  • absolute path
  • working directory,
  • And the following special directories:
    • .
    • ..
    • / or the root directory

In the R and Python examples above we load the datasets by indicating the file location from the data directory "data/postbox_locations.gpkg". However, you may have many data folders on your computer, for all types of different projects. So how does the system know to look in the correct one? Moreover, if you share your script with a friend the location of their project and data folders will be different than that of your setup. It would be a nuisance if they had to change all references to these files in their script. To deal with these issues, we use relative file paths.

In relative file paths, we don’t include the location of the project (working) directory itself, these paths are relative to the working directory (the “Project_Structure” folder). In the example above, the relative file path for the post box locations file would be data/postbox_locations.gpkg, whereas the absolute file path would be "/home/osboxes/Geoscripting/Project_Structure/data/postbox_locations.gpkg". The absolute file path refers to a file from the root of the entire file system. On Linux (and other UNIX-like systems like macOS), absolute file paths always start with /.

Note: on Windows, you might see a backslash (\) being used as a path separator instead of a slash (/). Don’t do this! In many languages, including R, a backslash denotes an escape sequence. In addition, a backslash is not a valid path separator on non-Windows platforms, whereas both a slash and a backslash are valid on Windows. So save yourself the trouble and always use a slash as a path separator!

So what is the working directory? By convention, the working directory is the same as the location of the script which you are working on. This means that you can simply assume that whoever runs your script, will run it from the directory that your script is located.

Note: this also means that when you test others’ code, you should also make sure to run it from the directory that the script is located, unless stated otherwise!

To refer to directories or files that are within the working directory, we simply use their names. So to refer to a directory called R in our working directory, we type R. To refer to a file or directory within another directory, we type the name of the directory, a slash, and then the name of the file/directory, for instance, R/function1.R.

If we want to refer to a file or directory that is above the indicated directory, we use the special directory ... For instance, if our main.R is not in our project root, but located in the sub-directory demo (therefore our working directory is demo), we would refer to our function1.R file as ../R/function1.R. Another special directory is ., which refers to the indicated directory itself.

When making Git repository, you want to make sure that all your code is portable and self-contained, i.e. you can run it from any computer, and ideally using any operating system. That means that as a rule of thumb you should always use relative file paths in your scripts.

Question 1: what would be the location of this file: ././R/./.././R/./././function2.R? How about /./R/./.././R/./././function2.R? What would be the meaning of C:/Windows/cmd.exe on Linux? Is it a relative or absolute file path?

Documentation

As we have seen in the package examples, documentation of code is extremely important to understand what your code is doing. It’s not only for others reading your code, but also for yourself, when you have to revisit code you haven’t touched in a while. If you follow the good scripting habits mentioned above, you will already have your code divided into reusable functions, that are in their own function files. Each function should have enough comments to understand each step the function is performing. Comments are any words after the # character, in either language. Same goes for the main script, it should also have each step described. But you don’t need to make it redundant: if it’s already clear from the function name what it does, it’s useless to repeat it in a comment. Rather, you would describe sections of code at once, e.g. what a loop does by itself, rather than its indvidual steps. In addition, if you use code from someone else, make sure to obey the license of the code you are using, and give credit to the original author using comments.

In addition to comments of your code, each function should also have a description and an explanation for all of its arguments. In both R and Python, there is a special way to add this information in comments next to the function definition, so that help files can be generated out of them automatically. The systems have different conventions, but effectively achieve the same purpose.

In Python, the system is called Docstrings. It consists of strings using triple quotes, that effectively acts as a multi-line comment. Here’s an example of a Python function documented with a Dosctring using the official reStructuredText (Sphinx) format:

def hello(name):
  """Greet a person.
  
  Give a customised "Hello" greeting for the input name.
  
  :param name: A string containing the name of the person to greet.
  :return: A greeting string.
  """
  return("Hello " + name)

The Docstring will be automatically printed when you run help(hello), and will be converted into a nice webpage using tools such as Sphinx.

In R, the system is called roxygen2 and is provided by the roxygen2 package. Here’s an equivalent example in R, using roxygen2:

#' Greet a person
#' 
#' Give a customised "Hello" greeting for the input name.
#' 
#' @param name A string containing the name of the person to greet.
#' @returns A greeting string.
hello <- function(name) {
  return(paste("Hello", name))
}

The roxygen2 package was created to make it easier to write documentation in R, so that documentation does not have to be separate from the code. The roxygenize() function generates the documentation files from the comments automatically, and then the documentation appears when running ?hello in our example.

Summary

Increasing your scripting/programming efficiency goes through adopting good scripting habits. Following the guidelines above will ensure that your work:

  • Can be understood and used by others.
  • Can be understood and reused by you in the future.
  • Can be debugged with minimal effort.
  • Can be re-used across different projects.
  • Is easily accessible by others.

The summary of good practice guidelines below is not exhaustive, but already constitutes a good basis that will help you getting more efficient now and in the future when working on scripting projects:

  • Comment your code.
  • Write functions for code you need more than once:
  • Document your functions.
  • Keep consistent style.
  • Make your own packages, or at least keep a similar directory structure across your projects.
  • Use version control to develop/maintain your projects and packages.