R and Python basics

Throughout the course we will be learning R and Python. Both are programming languages, and both can be used for handling geospatial data. So what is the difference between them? Why would you choose one over the other? And what does equivalent code look like in each language? We will go over these questions in this tutorial.

Learning objectives of the day

In this tutorial, you will learn to:

describe the differences between R and Python;
apply R and Python for basic handling of data types in each language;
create new object classes and use inheritance.

What are R and Python?

If you ask an average person what Python is, they will tell you that it’s a type of snake. If you ask an average person what R is, they will tell you that it’s a letter of the alphabet. Today you will go beyond that and learn the details of the R and Python programming languages!

The R and Python programming languages were created at around the same time: Python was created in 1991 in the Netherlands, and R in 1993. However, R actually has a much longer history, because it is an extension of the S programming language that was created all the way back in 1976 at Bell Labs. R was created specifically as a statistical package, similar to SPSS or Stata. However, it has since evolved into a general-purpose programming language. Due to its focus on statistics, R is mainly used in academia (schools and universities).

Python was created as a general-purpose programming language. It was designed with readability in mind, and it has become one of the most popular programming languages in the world. Python is therefore widely used in industry, from startups to large corporations. Python is especially strong in the field of deep learning, since packages such as Google’s TensorFlow offer Python as their primary interface. Additionally, because of its popularity, it is very easy to find examples online, and it is often used to communicate with other software.

In terms of spatial data handling, both R and Python can perform the same tasks. Both are relatively easy to integrate with other software as well, which allows extending the capabilities of each language. For example, the open source GIS software QGIS can be scripted in Python, and most of its plugins are written in Python. R scripts can be run directly from the QGIS interface as processing tools. R has a package reticulate that can run Python scripts, and Python has a package rpy2 that can run R scripts.

Both R and Python have extremely active communities maintaining a wealth of packages. The package ecosystems differ significantly between the two languages, however. Python packages are hosted on PyPI, the Python Package Index, which is a very easily accessible place where anyone can upload the Python packages they create. There is minimal oversight over them and the package authors are free to do with their packages what they please. The R package repository is called CRAN, the Comprehensive R Archive Network. CRAN is curated: any package submitted to CRAN goes through a review process, and has to pass a suite of tests before it gets accepted. This ensures that, among other things, every argument in every function in every package is documented, and that each package keeps working with the other CRAN packages that depend on it. The result is that it is more difficult to publish R packages, but the quality of R packages is generally higher than that of Python packages. It also means that package management in R is very easy, as conflicts between packages are rare. Packages are also more inclined to depend on other packages, as it is more certain that their dependencies will stay maintained. In contrast, Python has a lot more packages, but they are often poorly documented and often do not interoperate with other packages as well. Package management is a big issue in Python, as many packages work only with certain versions of other packages.

In practice, this means that while all geospatial data handling tasks can be done in R or Python, R is better integrated for this. In this course you will learn about the packages terra for raster handling and sf for vector handling. Both are integrated with each other and offer a full suite of processing tools, and the other R packages that use vectors or rasters generally also make use of sf and terra objects. In Python, raster reading is done using the rasterio package, but processing needs to be done using other packages, many of which do not support rasterio objects. Vector handling is done in, for example, geopandas, which has spatial processing tools, but once again these are not always integrated with raster object support. So the user often needs to convert rasters into number matrices for processing, and then convert the results back into raster objects.

Both R and Python can plot geospatial data and have several frameworks and packages for that. R has a built-in plot() function that can visualise any type of data quickly, and all packages make use of it. For more advanced custom visualisation, ggplot2 can be used. In Python, the standard plotting package is matplotlib. Spatial packages often implement their own plotting functions for their own objects, so putting multiple objects into one plot can often be more challenging than in R.

Ultimately, the choice of language to use is often decided by interoperability needs (e.g. if you work in a company that uses Python, you will be expected to also write in Python, so that your script can be used in a pipeline) and personal preference.

Running Python and R from Bash

Both the R and the Python interpreters can be run from Bash. Here is an example of code that is correct in both Python and R. You can use this example to execute a script written in either language:

# Create a script file "script" (no extension)
echo "print('Hello world!')" > script
# Run the script with R
Rscript ./script
# Run the script with Python
python3 ./script

## [1] "Hello world!"
## Hello world!

This allows you to run script files even without having any graphical user interface installed or running, which is the fastest way to run any script from top to bottom. You can use this method to try out the code examples below, or use your favourite integrated development environment (IDE), such as RStudio or RKWard for R and Spyder or Jupyter for Python.

Data types

Both Python and R provide a set of primitive data types. The ones they have in common are integers, floating-point numbers, logical boolean values, character strings, associative arrays (dictionaries) and lists. To find out the type of a variable, use type() in Python and class() in R.

A whole number is called an integer, and a number with decimals (real number) is called a float (floating point number). In Python, if you enter a number without decimal points, it will be an integer, otherwise it will be a float.

type(10)

## <class 'int'>

type(10.1)

## <class 'float'>

type(10.)

## <class 'float'>

type(int(10.))

## <class 'int'>
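
As a small sketch of how the two number types interact in Python (the variable names below are only illustrative): converting with int() cuts off the decimals rather than rounding, and the two division operators return different types.

```python
# Illustrative values; these names are not used elsewhere in the tutorial
truncated = int(10.9)      # int() drops the decimals instead of rounding
rounded = round(10.9)      # round() without a second argument returns an int
true_division = 7 / 2      # "/" always returns a float
floor_division = 7 // 2    # "//" between two integers returns an integer

print(truncated, type(truncated))
print(rounded, type(rounded))
print(true_division, type(true_division))
print(floor_division, type(floor_division))
```

If you need a whole number from a calculation, it is therefore worth deciding explicitly whether you want truncation, rounding or floor division.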

In R, any number will be a float (called “numeric”) by default, and integers are obtained by explicitly casting to an integer using as.integer():

class(10)

## [1] "numeric"

class(10.1)

## [1] "numeric"

class(as.integer(10))

## [1] "integer"

Question 1: What is the type/class of the sum 3+5 and of 3.0+5 in R and in Python? Write a sum of two numbers that returns an integer in R and in Python.

Boolean, or logical, types can only have two values: true or false. When cast to an integer, true is represented by 1 and false is represented by 0. Both R and Python are case-sensitive, and use different cases to represent booleans (True in Python and TRUE in R).

type(True)

## <class 'bool'>

type(True + 2)

## <class 'int'>

True + 2

## 3

class(TRUE)

## [1] "logical"

class(TRUE + 2)

## [1] "numeric"

TRUE + 2

## [1] 3
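
Because true counts as 1 and false as 0, summing booleans is a common way to count how many values meet a condition. A minimal Python sketch, using made-up elevation values purely for illustration:

```python
# Hypothetical example values (elevations in metres), not real data
elevations = [12, 48, 7, 103]

above_ten = [value > 10 for value in elevations]  # a list of booleans
count_above_ten = sum(above_ten)                  # each True counts as 1

print(above_ten)
print(count_above_ten)
```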

Character strings are letters and words. They can also include numbers, but the numbers do not have a mathematical meaning, and therefore you cannot do arithmetic on strings. Strings are marked with quotes, either single or double:

"Hello world #1!"

## 'Hello world #1!'

5 + '6'

## unsupported operand type(s) for +: 'int' and 'str'

type('6')

## <class 'str'>

"Hello world #1!"

## [1] "Hello world #1!"

5 + '6'

## Error in 5 + "6": non-numeric argument to binary operator

class('6')

## [1] "character"

You can combine strings to produce longer strings. This is done with the function paste() in R and using the + operator in Python (as long as all parts are strings).

WorldNr = 2
"Hello world #" + str(WorldNr) + "!"

## 'Hello world #2!'

WorldNr = 2
paste("Hello world #", WorldNr, "!", sep="") # The 'sep' argument avoids adding spaces in between

## [1] "Hello world #2!"

Note that Python uses = to define a new variable or to give it a new value, as was done above for the variable WorldNr. On the other hand, R typically uses <- to assign a value to a variable. However R also accepts =. In this tutorial, we use = for R to keep it simple, but you will find later tutorials using <-. Keep in mind that either option is valid as long as it is kept consistent throughout your code!

Additionally, Python supports string formatting, which is useful in, for example, a print() call. To use it, we put an f in front of the opening quote of a string and place variable names inside curly braces.

WorldNr = 2
print(f"Hello world # {WorldNr} !")

## Hello world # 2 !
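
Formatted strings can also control how a value is displayed, by adding a format specification after a colon inside the curly braces. The sketch below uses example coordinates and illustrative variable names; it shows two commonly used specifications, not the full set.

```python
# Example coordinates; the variable names are only illustrative
x = 5.665
y = 51.987

label = f"x: {x:.1f}, y: {y:.1f}"   # ".1f": one digit after the decimal point
padded = f"Building {7:03d}"        # "03d": pad an integer with zeros to width 3

print(label)
print(padded)
```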

A variable can also hold multiple values of a particular data type, or even mix data types. Python supports associative arrays, called dictionaries, that are created using curly braces:

WUR = {"name": "Wageningen University", "x": 5.7, "y": 52}
type(WUR)

## <class 'dict'>

WUR["x"]

## 5.7

type(WUR["x"])

## <class 'float'>

The dictionary type allows naming its elements and accessing them by name. The elements can be of any type.
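
A short sketch of a few other things a dictionary can do: adding an element, replacing a value, reading a key that may be missing, and looping over all elements. The "country" and "elevation" keys are added here only for illustration.

```python
WUR = {"name": "Wageningen University", "x": 5.7, "y": 52}

WUR["country"] = "Netherlands"   # assigning to a new key adds an element
WUR["y"] = 52.0                  # assigning to an existing key replaces its value

# .get() returns a fallback value instead of raising KeyError for a missing key
elevation = WUR.get("elevation", "unknown")

# .items() yields (key, value) pairs in insertion order
for key, value in WUR.items():
    print(key, "->", value)
```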

Both R and Python support lists, which allow combining any data types into one variable. In Python they are created using square brackets, and in R using list():

WUR = ["Wageningen University", 5.7, 52]

WUR = list("Wageningen University", 5.7, 52)

Elements of a list are sequential and can be accessed using a number (index) in square brackets. An important difference between the two languages is that R starts counting from 1, but Python starts counting from 0! This is similar to how floors of buildings are counted starting from 1 or from 0 in different countries. The Netherlands is a “Python” style country where the ground floor is number 0, whereas Canada is an “R” style country where the ground floor is number 1. The R style has the advantage of the index matching the element number, i.e. [2] will give you the second element, whereas in Python the second element is [1]:

WUR[1]

## 5.7

WUR[2]

## [[1]]
## [1] 5.7

There is also a difference in what happens if you use a negative index. Python uses negative indices to wrap around, so [-1] means “last element”, whereas in R it stands for exclusion, so [-1] means “all elements except for the first one”:

WUR[-1]

## 52

WUR[-1]

## [[1]]
## [1] 5.7
## 
## [[2]]
## [1] 52
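
Python has no exclusion index, but the effect of R’s negative index can be reproduced with a slice. A minimal sketch using the same list:

```python
WUR = ["Wageningen University", 5.7, 52]

last_element = WUR[-1]     # Python meaning of -1: the last element
all_but_first = WUR[1:]    # same elements as WUR[-1] in R
all_but_last = WUR[:-1]    # same elements as WUR[-length(WUR)] in R

print(last_element)
print(all_but_first)
print(all_but_last)
```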

What is the equivalent of a dictionary in R? It’s a named list! When creating a list, you can specify a name of each element, and then slice the list using names. Unlike the colon : used in Python for dictionaries, R simply uses the equal sign =:

WUR = list("name"="Wageningen University", "x"=5.7, "y"=52)
class(WUR)

## [1] "list"

WUR["x"]

## $x
## [1] 5.7

class(WUR["x"])

## [1] "list"

Here we can see another difference in how R and Python deal with lists. In R, the single square bracket operator always returns a list, even when it selects only one element: the result is a shorter list that contains the selected element. To obtain the value itself, we need to access it using the double square bracket operator:

WUR[["x"]]

## [1] 5.7

class(WUR[["x"]])

## [1] "numeric"

There are several data types that are specific to R, though there are Python packages that implement equivalent functionality as well. The most basic type is a vector, which can hold multiple values of the same type. An extension of a vector is a matrix, which has two dimensions. Adding further dimensions, we get an array. A matrix is a special case of an array (two-dimensional), as is a vector (one-dimensional).

In R, almost everything appears as a vector. That is why in the R print output above you can see [1] next to most output, indicating that the value is just the first in a 1-length vector. To make longer vectors, the function c() (for concatenate) is used:

WURcoords = c(5.7, 52)
WURcoords

## [1]  5.7 52.0

class(WURcoords)

## [1] "numeric"

Matrices are made using the function matrix by combining multiple vectors:

# `nrow` specifies how many rows the matrix will have; the values are filled in column by column
LocMat = matrix(c(WURcoords, WURcoords + 1), nrow = 2)
LocMat

##      [,1] [,2]
## [1,]  5.7  6.7
## [2,] 52.0 53.0

class(LocMat)

## [1] "matrix" "array"

Question 2: Using the c() and matrix() functions, build a tic-tac-toe board with X’s and O’s in R.

And arrays of higher order are likewise created using array():

# Here "dim" defines the shape, i.e. number of rows, columns, layers, etc.
LocArray = array(c(LocMat, LocMat+1), dim=c(2,2,2))
LocArray

## , , 1
## 
##      [,1] [,2]
## [1,]  5.7  6.7
## [2,] 52.0 53.0
## 
## , , 2
## 
##      [,1] [,2]
## [1,]  6.7  7.7
## [2,] 53.0 54.0

class(LocArray)

## [1] "array"

Base Python can only achieve a similar effect using nested lists:

WURcoords = [5.7, 52]
WURcoords

## [5.7, 52]

LocMat = [WURcoords, WURcoords]
LocMat

## [[5.7, 52], [5.7, 52]]

LocArray = [LocMat, LocMat]
LocArray

## [[[5.7, 52], [5.7, 52]], [[5.7, 52], [5.7, 52]]]

Here we can also notice that R can perform vectorised arithmetic: WURcoords + 1 added 1 to each value of the vector WURcoords, and LocMat + 1 added 1 to each value of the matrix LocMat. Core Python does not allow doing so without writing a loop. However, since arrays and vectorised arithmetic are very useful, both have been implemented in the package NumPy, which is now considered to be an essential package in Python:

import numpy

npWURcoords = numpy.array(WURcoords)
npWURcoords

## array([ 5.7, 52. ])

type(npWURcoords)

## <class 'numpy.ndarray'>

npLocMat = numpy.array([npWURcoords, npWURcoords + 1])
npLocMat

## array([[ 5.7, 52. ],
##        [ 6.7, 53. ]])

type(npLocMat)

## <class 'numpy.ndarray'>

npLocArray = numpy.array([npLocMat, npLocMat + 1])
npLocArray

## array([[[ 5.7, 52. ],
##         [ 6.7, 53. ]],
## 
##        [[ 6.7, 53. ],
##         [ 7.7, 54. ]]])

type(npLocArray)

## <class 'numpy.ndarray'>
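
For comparison, here is a sketch of what adding 1 to every value looks like in core Python, without NumPy. A plain list does not support WURcoords + 1 (Python raises a TypeError), so every level of nesting needs its own loop, written here as list comprehensions. The names coords_plus_one and mat_plus_one are only illustrative.

```python
WURcoords = [5.7, 52]
LocMat = [WURcoords, WURcoords]

# One level of nesting needs one loop
coords_plus_one = [value + 1 for value in WURcoords]

# Two levels of nesting need two loops
mat_plus_one = [[value + 1 for value in row] for row in LocMat]

print(coords_plus_one)
print(mat_plus_one)
```

This works, but the amount of code grows with every extra dimension, which is one reason why NumPy arrays are preferred for numeric data.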

Question 3: Make the same tic-tac-toe board as in question 2 in Python. Hint: You may wish to use numpy.

Another useful data type in R is the data frame. A data frame is a table, similar to a matrix, but with one key difference: while a matrix requires all values to be of the same type, a data.frame only requires each column to have a consistent type. The data frame concept comes from R’s statistical background, where it is useful to have the variables that are being studied as columns, and the records or observations of those variables as rows. For example:

WURbuildings = data.frame(name = c("Gaia", "Aurora"), x = c(5.665, 5.657), y = c(51.987, 51.982))
WURbuildings

##     name     x      y
## 1   Gaia 5.665 51.987
## 2 Aurora 5.657 51.982

class(WURbuildings)

## [1] "data.frame"

To know the type of each column, we can use the function str (structure):

str(WURbuildings)

## 'data.frame':    2 obs. of  3 variables:
##  $ name: chr  "Gaia" "Aurora"
##  $ x   : num  5.67 5.66
##  $ y   : num  52 52

We see here that the name column is made of character strings, whereas x and y columns are floating point numbers. The Data Frame gives more structure than a plain list does, ensuring that the data has rows and columns and that the column types are consistent.

In Python, Data Frames are implemented in the package pandas:

import pandas
WURbuildings = pandas.DataFrame({"name": ["Gaia", "Aurora"],  "x": [5.665, 5.657], "y": [51.987, 51.982]})
WURbuildings

##      name      x       y
## 0    Gaia  5.665  51.987
## 1  Aurora  5.657  51.982

type(WURbuildings)

## <class 'pandas.core.frame.DataFrame'>

WURbuildings.dtypes

## name     object
## x       float64
## y       float64
## dtype: object

The .dtypes accessor is equivalent to str() in R, though you can see one difference: the strings are reported as “objects”.

Slicing and accessors

When dealing with variables that hold multiple values, we often need to select some smaller subset of those values. This is called slicing an array. It is done using operators that are called accessors. We used three of them in the examples above: [] (both languages), [[]] (R) and . (Python). These are the main accessors, though there are a few others that the languages support.

The [] accessor accepts indices of values to select. As we saw earlier, the indices can also be negative. In Python, positive and negative indices can be combined in one slice, whereas R does not allow mixing positive and negative indices in a single accessor. Both R and Python have a way to quickly create ranges for slicing arrays, using the : operator, but there are some differences in how the ranges behave:

Buildings = c("Gaia", "Lumen", "Forum", "Orion", "Aurora", "Atlas")
Buildings[2:4] # We get the second, third and fourth elements

## [1] "Lumen" "Forum" "Orion"

Buildings = ["Gaia", "Lumen", "Forum", "Orion", "Aurora", "Atlas"]
Buildings[2:4] # We get the third and fourth elements

## ['Forum', 'Orion']

As we can see, slicing in R is inclusive, i.e. any number you mention will be included in the resulting slice. In Python, it is one-sided: the first number of the range will be included, but the last number will be excluded. And, of course, Python counts from 0 whereas R counts from 1.

Let’s try to slice the first half and the second half of the string array in both languages:

Buildings[1:3]

## [1] "Gaia"  "Lumen" "Forum"

Buildings[4:6]

## [1] "Orion"  "Aurora" "Atlas"

Buildings[:3] # The fourth element is not included

## ['Gaia', 'Lumen', 'Forum']

Buildings[3:] # It is included here

## ['Orion', 'Aurora', 'Atlas']

Question 4: Create a subset of Buildings with only Gaia and Aurora, using index slicing in R and Python. Hint: You may wish to use the functions c() and numpy.array().

As you can see, Python interprets a missing number as the start or the end of the array, depending on which side of the colon it is missing from. R does not, and requires you to enter 1 for the first element and length(YourArray) (e.g. length(Buildings)) for the last one.
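
The sketch below combines omitted bounds with two other features of Python slices: negative bounds, which count from the end, and an optional third number, which sets the step. The variable names are only illustrative.

```python
Buildings = ["Gaia", "Lumen", "Forum", "Orion", "Aurora", "Atlas"]

half = len(Buildings) // 2      # integer division gives a valid index
first_half = Buildings[:half]   # start omitted: begin at the first element
second_half = Buildings[half:]  # end omitted: continue to the last element
last_two = Buildings[-2:]       # negative bounds count from the end
every_second = Buildings[::2]   # a third number sets the step

print(first_half)
print(second_half)
print(last_two)
print(every_second)
```

Computing the split point from len() keeps the code valid when the list changes length.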

When slicing two-dimensional arrays (matrices or data frames), two indices are used, separated by a comma, in the order [row, column]:

LocMat

##      [,1] [,2]
## [1,]  5.7  6.7
## [2,] 52.0 53.0

LocMat[2, 1] # Get y coordinate of WUR campus

## [1] 52

npLocMat

## array([[ 5.7, 52. ],
##        [ 6.7, 53. ]])

npLocMat[0, 1] # Get y coordinate of WUR campus

## 52.0

If you want to select the whole row/column, without manually specifying the size of your array, in Python you can use :, and in R you can omit the index:

LocMat[, 1] # Get both coordinates of WUR campus

## [1]  5.7 52.0

npLocMat[0, :] # Get both coordinates of WUR campus

## array([ 5.7, 52. ])

Likewise, if we have an array with more dimensions, we can specify more indices (e.g. a three-dimensional array will accept three indices, x, y, z).
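
Note that the comma-separated form is a feature of NumPy arrays. With the nested lists of core Python, each accessor takes a single index, so the accessors are chained instead. A minimal sketch using the nested-list version of the array; the names y_coordinate and comma_index_works are only illustrative:

```python
# The nested-list version of the three-dimensional array
WURcoords = [5.7, 52]
LocMat = [WURcoords, WURcoords]
LocArray = [LocMat, LocMat]

# A plain list takes one index per accessor, so the accessors are chained
y_coordinate = LocArray[0][1][1]   # first layer, second row, second value

# A comma-separated index fails on a plain list
try:
    LocArray[0, 1, 1]
    comma_index_works = True
except TypeError:
    comma_index_works = False

print(y_coordinate)
print(comma_index_works)
```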

Previously we saw another accessor in R, namely, the [[]] accessor. It is a list accessor, and generally is used to directly access a value, treating the variable as a list. The accessor accepts numbers and character strings as input. Another accessor popular in R is the $ accessor, which allows accessing values by name. The $ accessor is generally equivalent to the [[]] accessor with a string input. Here’s an example of how these work with data frames:

WURbuildings

##     name     x      y
## 1   Gaia 5.665 51.987
## 2 Aurora 5.657 51.982

WURbuildings[1] # When accessed using a single number, it selects a column

##     name
## 1   Gaia
## 2 Aurora

class(WURbuildings[1]) # Single-column data.frame

## [1] "data.frame"

WURbuildings[[1]] # Also selects a column

## [1] "Gaia"   "Aurora"

class(WURbuildings[[1]]) # But now the result is a vector!

## [1] "character"

WURbuildings[["name"]] # Column by name

## [1] "Gaia"   "Aurora"

WURbuildings$name # Same

## [1] "Gaia"   "Aurora"

Generally the $ accessor is not recommended, because it does not allow the use of variables: it only works with literal names.

Note that in Python, the operator [[]] has a very different meaning: the outer [] is the accessor, and the inner [] is a list constructor, in other words, it’s equivalent to the R [c()]:

npLocArray = numpy.array([npLocMat, npLocMat + 1, npLocMat + 2, npLocMat + 3])
npLocArray

## array([[[ 5.7, 52. ],
##         [ 6.7, 53. ]],
## 
##        [[ 6.7, 53. ],
##         [ 7.7, 54. ]],
## 
##        [[ 7.7, 54. ],
##         [ 8.7, 55. ]],
## 
##        [[ 8.7, 55. ],
##         [ 9.7, 56. ]]])

npLocArray[0,1] # Meaning "first layer, second row, all columns"

## array([ 6.7, 53. ])

npLocArray[[0,2]] # Meaning "first and third layers, all rows and all columns"

## array([[[ 5.7, 52. ],
##         [ 6.7, 53. ]],
## 
##        [[ 7.7, 54. ],
##         [ 8.7, 55. ]]])

Similar to the R $ accessor, Python has the accessor . to select by literal string:

WURbuildings

##      name      x       y
## 0    Gaia  5.665  51.987
## 1  Aurora  5.657  51.982

WURbuildings["name"]

## 0      Gaia
## 1    Aurora
## Name: name, dtype: object

WURbuildings.name

## 0      Gaia
## 1    Aurora
## Name: name, dtype: object
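
Like $ in R, the . accessor needs the name to be written out literally. When the name is stored in a variable, square brackets or the built-in function getattr() can be used instead. The sketch below uses a stand-in object from types.SimpleNamespace and a dictionary rather than a pandas data frame, purely to keep the example free of dependencies; all names in it are illustrative.

```python
from types import SimpleNamespace

# Stand-in objects with the same two fields (not pandas data frames)
gaia = SimpleNamespace(name="Gaia", x=5.665)
gaia_dict = {"name": "Gaia", "x": 5.665}

column = "x"                        # the name we want is held in a variable

by_dot = gaia.x                     # literal name only; gaia.column would look for "column"
by_getattr = getattr(gaia, column)  # attribute access using a variable
by_brackets = gaia_dict[column]     # key access using a variable

print(by_dot, by_getattr, by_brackets)
```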

Instead of supplying numbers or strings, we can also slice by using a boolean array of the same dimensions as what we are trying to slice. This is very handy, as we can make use of this to select by a rule:

LocMat

##      [,1] [,2]
## [1,]  5.7  6.7
## [2,] 52.0 53.0

LocMat[LocMat > 50] # Get only the values above 50

## [1] 52 53

What happens here is that the inner part of the accessor generates a boolean array, which is subsequently used for slicing, as if it was a mask:

LocMat > 50

##       [,1]  [,2]
## [1,] FALSE FALSE
## [2,]  TRUE  TRUE

The same applies in NumPy:

npLocMat

## array([[ 5.7, 52. ],
##        [ 6.7, 53. ]])

npLocMat[npLocMat > 50]

## array([52., 53.])
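
Core Python lists do not accept a boolean mask inside the accessor, but the same rule-based selection can be written as a list comprehension with a condition. A minimal sketch on a flat list holding the same four values; the variable names are only illustrative:

```python
# The four values of the location matrix as a flat list
values = [5.7, 52, 6.7, 53]

# Step 1: build the mask, one boolean per value
mask = [value > 50 for value in values]

# Step 2: keep the values whose mask entry is True
selected = [value for value, keep in zip(values, mask) if keep]

# The two steps are usually merged into a single comprehension
selected_direct = [value for value in values if value > 50]

print(mask)
print(selected)
```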

Question 5: In R and Python, slice LocMat to select all values below 6 and above 52. Hint: You can combine conditions by using & (AND) or | (OR).

Objects and inheritance

Both R and Python are object-oriented languages, and in this tutorial we have already worked with many objects. For example, data frames, lists and matrices are objects, and the class() or type() function prints what type of object it is, in other words, the class of the object. In R, str() allows investigating the structure of an object. Python does not have a unified function for this and different packages implement this functionality differently, if at all.

WUR = list("name"="Wageningen University", "x"=5.7, "y"=52)
str(WUR) # Show the structure of a list

## List of 3
##  $ name: chr "Wageningen University"
##  $ x   : num 5.7
##  $ y   : num 52

WURbuildings

##     name     x      y
## 1   Gaia 5.665 51.987
## 2 Aurora 5.657 51.982

str(WURbuildings) # Show the structure of a data frame (which is also a type of list)

## 'data.frame':    2 obs. of  3 variables:
##  $ name: chr  "Gaia" "Aurora"
##  $ x   : num  5.67 5.66
##  $ y   : num  52 52

WURbuildings # pandas data frame

##      name      x       y
## 0    Gaia  5.665  51.987
## 1  Aurora  5.657  51.982

WURbuildings.dtypes # Shows data types

## name     object
## x       float64
## y       float64
## dtype: object

vars(WURbuildings) # More generic, but does not work with lists and dicts

## {'_is_copy': None, '_mgr': BlockManager
## Items: Index(['name', 'x', 'y'], dtype='object')
## Axis 1: RangeIndex(start=0, stop=2, step=1)
## NumericBlock: slice(1, 3, 1), 2 x 2, dtype: float64
## ObjectBlock: slice(0, 1, 1), 1 x 2, dtype: object, '_item_cache': {'name': 0      Gaia
## 1    Aurora
## Name: name, dtype: object}, '_attrs': {}, '_flags': <Flags(allows_duplicate_labels=True)>}

dir(WURbuildings) # All of the methods contained in the object

## ['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_REVERSED', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__array_wrap__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '__xor__', '_accessors', '_accum_func', '_add_numeric_operations', '_agg_by_level', '_agg_examples_doc', '_agg_summary_and_see_also_doc', '_align_frame', '_align_series', '_arith_method', '_as_manager', '_attrs', '_box_col_values', '_can_fast_transpose', '_check_inplace_and_allows_duplicate_labels', '_check_inplace_setting', '_check_is_chained_assignment_possible', '_check_label_or_level_ambiguity', '_check_setitem_copy', '_clear_item_cache', '_clip_with_one_bound', '_clip_with_scalar', '_cmp_method', '_combine_frame', '_consolidate', '_consolidate_inplace', '_construct_axes_dict', '_construct_axes_from_arguments', '_construct_result', '_constructor', '_constructor_sliced', '_convert', '_count_level', '_data', '_dir_additions', '_dir_deletions', '_dispatch_frame_op', 
'_drop_axis', '_drop_labels_or_levels', '_ensure_valid_index', '_find_valid_index', '_flags', '_from_arrays', '_from_mgr', '_get_agg_axis', '_get_axis', '_get_axis_name', '_get_axis_number', '_get_axis_resolvers', '_get_block_manager_axis', '_get_bool_data', '_get_cleaned_column_resolvers', '_get_column_array', '_get_index_resolvers', '_get_item_cache', '_get_label_or_level_values', '_get_numeric_data', '_get_value', '_getitem_bool_array', '_getitem_multilevel', '_gotitem', '_hidden_attrs', '_indexed_same', '_info_axis', '_info_axis_name', '_info_axis_number', '_info_repr', '_init_mgr', '_inplace_method', '_internal_names', '_internal_names_set', '_is_copy', '_is_homogeneous_type', '_is_label_or_level_reference', '_is_label_reference', '_is_level_reference', '_is_mixed_type', '_is_view', '_iset_item', '_iset_item_mgr', '_iset_not_inplace', '_item_cache', '_iter_column_arrays', '_ixs', '_join_compat', '_logical_func', '_logical_method', '_maybe_cache_changed', '_maybe_update_cacher', '_metadata', '_mgr', '_min_count_stat_function', '_needs_reindex_multi', '_protect_consolidate', '_reduce', '_reindex_axes', '_reindex_columns', '_reindex_index', '_reindex_multi', '_reindex_with_indexers', '_replace_columnwise', '_repr_data_resource_', '_repr_fits_horizontal_', '_repr_fits_vertical_', '_repr_html_', '_repr_latex_', '_reset_cache', '_reset_cacher', '_sanitize_column', '_series', '_set_axis', '_set_axis_name', '_set_axis_nocheck', '_set_is_copy', '_set_item', '_set_item_frame_value', '_set_item_mgr', '_set_value', '_setitem_array', '_setitem_frame', '_setitem_slice', '_slice', '_stat_axis', '_stat_axis_name', '_stat_axis_number', '_stat_function', '_stat_function_ddof', '_take_with_is_copy', '_to_dict_of_blocks', '_typ', '_update_inplace', '_validate_dtype', '_values', '_where', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'append', 'apply', 'applymap', 'asfreq', 'asof', 'assign', 'astype', 'at', 'at_time', 'attrs', 'axes', 
'backfill', 'between_time', 'bfill', 'bool', 'boxplot', 'clip', 'columns', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'corrwith', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dtypes', 'duplicated', 'empty', 'eq', 'equals', 'eval', 'ewm', 'expanding', 'explode', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'flags', 'floordiv', 'from_dict', 'from_records', 'ge', 'get', 'groupby', 'gt', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'index', 'infer_objects', 'info', 'insert', 'interpolate', 'isin', 'isna', 'isnull', 'items', 'iteritems', 'iterrows', 'itertuples', 'join', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lookup', 'lt', 'mad', 'mask', 'max', 'mean', 'median', 'melt', 'memory_usage', 'merge', 'min', 'mod', 'mode', 'mul', 'multiply', 'name', 'ndim', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'pivot', 'pivot_table', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'query', 'radd', 'rank', 'rdiv', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'select_dtypes', 'sem', 'set_axis', 'set_flags', 'set_index', 'shape', 'shift', 'size', 'skew', 'slice_shift', 'sort_index', 'sort_values', 'squeeze', 'stack', 'std', 'style', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_feather', 'to_gbq', 'to_hdf', 'to_html', 'to_json', 'to_latex', 'to_markdown', 'to_numpy', 'to_parquet', 'to_period', 'to_pickle', 'to_records', 'to_sql', 'to_stata', 'to_string', 'to_timestamp', 'to_xarray', 'to_xml', 'transform', 'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unstack', 'update', 'value_counts', 'values', 'var', 'where', 
'x', 'xs', 'y']

In the last example we can see that objects can contain other objects and/or functions. A class is a definition of an object, and a function contained inside an object is traditionally called a method.
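
Since the output of dir() can be very long, it often helps to filter it. The sketch below is one possible way to do that, shown on a small dictionary instead of a data frame: it drops the names that start with an underscore (internal by convention) and then keeps only the callable ones, i.e. the methods.

```python
example = {"name": "Gaia"}   # any object works; a dict keeps the example small

# Names starting with an underscore are internal by convention
public_names = [name for name in dir(example) if not name.startswith("_")]

# Keep only the names that refer to something callable, i.e. the methods
public_methods = [name for name in public_names
                  if callable(getattr(example, name))]

print(public_methods)
```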

R and Python both support classes and objects, but their philosophy regarding them differs significantly. R is very lax and allows the users to freely (re)define object classes as they see fit. Python is a lot more formal and requires a formal class definition to define an object class. Python prefers self-contained objects that contain all the methods that can interact with the object embedded inside the object. In contrast, the R philosophy is to define global functions whose behaviour is different depending on the class the function is run on.

As an example, we might want to define a class building which, when instantiated as an object, will contain a name, and a vector coordinates. We also want to have a function that prints these attributes in a readable way. This is how it would be done in R:

Gaia = list("name"="Gaia", "coordinates"=c("x"=5.665, "y"=51.987))
class(Gaia) # It is a list

## [1] "list"

class(Gaia) = "building" # And now it's a `building`!
class(Gaia)

## [1] "building"

Aurora = list("name"="Aurora", "coordinates"=c("x"=5.657, "y"=51.982))
class(Aurora) = "building"
class(Aurora)

## [1] "building"

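# Define a function to print buildings
# We use the "paste" function to format strings.
# The "..." argument matches the signature of the generic print(x, ...)
print.building = function(x, ...)
{
  print(paste(x[["name"]], "is a building that is located at x:",
    x[["coordinates"]]["x"], "y:", x[["coordinates"]]["y"]))
  invisible(x) # print methods conventionally return their input invisibly
}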
# Now we simply use print() and get our custom output:
print(Gaia)

## [1] "Gaia is a building that is located at x: 5.665 y: 51.987"

print(Aurora)

## [1] "Aurora is a building that is located at x: 5.657 y: 51.982"

This works because print() in R is a generic function that dispatches on the class of its argument (this mechanism is known as S3). When running print(), R first checks the class of the object you are calling the function on, and then checks whether a function called print.class is defined (in our case print.building). On a match, it runs that function instead of the print.default function.
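
Python’s standard library offers a loosely comparable mechanism in functools.singledispatch: a global function whose behaviour depends on the type of its first argument. The sketch below is only meant to illustrate the idea of dispatching on class; describe is a hypothetical function name, and built-in types are used instead of a custom class.

```python
from functools import singledispatch

@singledispatch
def describe(value):
    # Fallback for any other type, comparable to print.default in R
    return "something else"

@describe.register
def _(value: int):
    # Chosen when the argument is an integer
    return "a whole number"

@describe.register
def _(value: str):
    # Chosen when the argument is a character string
    return "a character string"

results = [describe(10), describe("ten"), describe(10.5)]
print(results)
```

This style is much less common in Python than in R.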

In Python, we need to formally define a class and then instantiate it:

class building:
  def __init__(self, name, coordinates):
    self.name = name
    self.coordinates = coordinates

  def print(self):
    # Note that we use "string formatting". If an "f" is put before a string quote,
    # any expression inside curly brackets is evaluated as Python and inserted into the string
    print(f'{self.name} is a building that is located at x: {self.coordinates["x"]}, y: {self.coordinates["y"]}')

# Instantiate the class by calling it as if it was its __init__ function
Gaia = building(name="Gaia", coordinates={"x": 5.665, "y": 51.987})
Aurora = building(name="Aurora", coordinates={"x": 5.657, "y": 51.982})
type(Gaia)

## <class '__main__.building'>

type(Aurora)

## <class '__main__.building'>

Gaia.print()

## Gaia is a building that is located at x: 5.665, y: 51.987

Aurora.print()

## Aurora is a building that is located at x: 5.657, y: 51.982

Note that we called the method print() that is inside our object, not the global function print(), which would give a different output:

print(Gaia)

## <__main__.building object at 0x7fdee5dd59c0>

print(Aurora)

## <__main__.building object at 0x7fdee5dd7e20>
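If we would rather have the global print() produce readable output, similar to what print.building does in R, a Python class can define the special method __str__. The following is a minimal sketch that is not part of the example above: the class name ReadableBuilding and the object GaiaReadable are hypothetical, chosen so that they do not clash with building and Gaia.

```python
class ReadableBuilding:
  def __init__(self, name, coordinates):
    self.name = name
    self.coordinates = coordinates

  def __str__(self):
    # str() and the global print() call this method to get readable text
    return (f'{self.name} is a building that is located at '
            f'x: {self.coordinates["x"]}, y: {self.coordinates["y"]}')

GaiaReadable = ReadableBuilding(name="Gaia", coordinates={"x": 5.665, "y": 51.987})
print(GaiaReadable)  # Gaia is a building that is located at x: 5.665, y: 51.987
```

With this approach the object still carries its own method, in line with the Python philosophy, but users can call the familiar global print() on it.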

A key concept in object-oriented programming is inheritance: a class can inherit properties and methods of its parent class. This allows us to make extensions of existing classes without having to duplicate a lot of work. Let’s say we want to extend our building class with an attribute purpose, calling the new child class purposeBuilding. In R, inheritance works by making an object part of multiple classes:

GaiaPurpose = list("name"="Gaia", "coordinates"=c("x"=5.665, "y"=51.987), purpose="office")
class(GaiaPurpose) = c("purposeBuilding", "building")
print(GaiaPurpose) # We reuse the parent class `print()`

## [1] "Gaia is a building that is located at x: 5.665 y: 51.987"

AuroraPurpose = list("name"="Aurora", "coordinates"=c("x"=5.657, "y"=51.982), purpose="education")
class(AuroraPurpose) = c("purposeBuilding", "building")
print(AuroraPurpose)

## [1] "Aurora is a building that is located at x: 5.657 y: 51.982"

# Make a custom print function for purposeBuilding
print.purposeBuilding = function(x)
{
  print(paste(x[["name"]], "is an", x[["purpose"]], "building that is located at x:",
    x[["coordinates"]]["x"], "y:", x[["coordinates"]]["y"]))
}

print(GaiaPurpose)

## [1] "Gaia is an office building that is located at x: 5.665 y: 51.987"

print(AuroraPurpose)

## [1] "Aurora is an education building that is located at x: 5.657 y: 51.982"

In Python, inheritance is also formally declared in the definition of the new class:

class purposeBuilding(building):
  def __init__(self, name, coordinates, purpose):
    super().__init__(name, coordinates) # Let the parent class handle these
    self.purpose = purpose

Gaia = purposeBuilding(name="Gaia", coordinates={"x": 5.665, "y": 51.987}, purpose="office")
Aurora = purposeBuilding(name="Aurora", coordinates={"x": 5.657, "y": 51.982}, purpose="education")
Gaia.print()

## Gaia is a building that is located at x: 5.665, y: 51.987

Aurora.print()

## Aurora is a building that is located at x: 5.657, y: 51.982
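In R we could inspect class(GaiaPurpose) to see all classes an object belongs to. A rough Python counterpart is sketched below, using the built-in isinstance() function and the __mro__ attribute of a class. The class names Structure and Office are hypothetical and only serve to keep the sketch self-contained.

```python
class Structure:
  def describe(self):
    return "I am a structure"

class Office(Structure):  # Office inherits everything from Structure
  pass

office = Office()
print(office.describe())                     # I am a structure
print(isinstance(office, Office))            # True
print(isinstance(office, Structure))         # True: it also counts as its parent class
print([c.__name__ for c in Office.__mro__])  # ['Office', 'Structure', 'object']
```

The order of the classes in __mro__ is the order in which Python looks for a method, much like the order of the class vector in R.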

We can override methods by redeclaring them in the child class. Note that when we redefine a class, objects that were already instantiated keep the old definition, so they have to be reinstantiated for the changes to apply:

# Let's also override the print now:
class purposeBuilding(building):
  def __init__(self, name, coordinates, purpose):
    super().__init__(name, coordinates) # Let the parent class handle these
    self.purpose = purpose

  def print(self):
    print(f'{self.name} is an {self.purpose} building that is located at x: {self.coordinates["x"]}, y: {self.coordinates["y"]}')

# If we don't reinstantiate, the old definition applies:
Gaia.print()

## Gaia is a building that is located at x: 5.665, y: 51.987

Aurora.print()

## Aurora is a building that is located at x: 5.657, y: 51.982

Gaia = purposeBuilding(name="Gaia", coordinates={"x": 5.665, "y": 51.987}, purpose="office")
Aurora = purposeBuilding(name="Aurora", coordinates={"x": 5.657, "y": 51.982}, purpose="education")
Gaia.print()

## Gaia is an office building that is located at x: 5.665, y: 51.987

Aurora.print()

## Aurora is an education building that is located at x: 5.657, y: 51.982
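The reason that existing objects keep the old behaviour is that redefining a class creates a brand new class object and only rebinds the name to it. Objects created earlier still point to the previous class object. A minimal sketch of this, using a hypothetical class Shelter:

```python
class Shelter:
  def describe(self):
    return "old definition"

first = Shelter()  # Created from the first version of the class

# Redefining the class binds the name Shelter to a new class object
class Shelter:
  def describe(self):
    return "new definition"

second = Shelter()  # Created from the second version of the class

print(first.describe())         # old definition
print(second.describe())        # new definition
print(type(first) is Shelter)   # False: first still belongs to the old class
print(type(second) is Shelter)  # True
```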

Note that R also has a more formalised type of classes, called S4 classes, which behave more like Python classes. However, using S4 classes is generally not recommended, as they are further removed from the R philosophy.

Scope and side effects

Another difference between R and Python that you may have noticed in the examples above is how scope is denoted: R uses curly braces {}, whereas Python uses a colon : and enforces indentation. As an example:

def scope():
  print("I am inside the scope of the function scope()!")
  print("I am also inside the scope of the function scope()!")
print("I am outside the scope of the function scope()!")

## I am outside the scope of the function scope()!

scope()

## I am inside the scope of the function scope()!
## I am also inside the scope of the function scope()!

scope = function() {
  print("I am inside the scope of the function scope()!")
  print("I am also inside the scope of the function scope()!")
}
  print("I am outside the scope of the function scope()!")

## [1] "I am outside the scope of the function scope()!"

scope()

## [1] "I am inside the scope of the function scope()!"
## [1] "I am also inside the scope of the function scope()!"

In addition, in R, typically anything that happens inside of a function scope stays in the function, i.e. it does not alter the global state.

location = "Gaia"
move = function(where)
{
  location = where
  print(paste("Moved to", location))
}

move("Aurora")

## [1] "Moved to Aurora"

print(location)

## [1] "Gaia"

In Python this is also often true, but the list of exceptions is much longer.

location = "Gaia"
def move(where):
  location = where
  print("Moved to " + location)

move("Aurora")

## Moved to Aurora

print(location)

## Gaia

For instance, a list that is defined outside of a function can still be modified from inside the function, because lists are mutable:

locations = ["Gaia"]
def addLocation(where):
  locations.append(where)
  print(locations)

addLocation("Aurora")

## ['Gaia', 'Aurora']

print(locations)

## ['Gaia', 'Aurora']

A change like this, which reaches outside of the function scope, is called a side effect. Side effects are best avoided as much as possible, as they confuse users (and may even destroy their work)! Normally, you expect that if you as a user run a function, it will process your arguments and return a new, processed object, but will not change your global environment or the object that you passed to the function.

To avoid such side effects, in Python we need to explicitly copy mutable objects:

locations = ["Gaia"]
def addLocation(where):
  result = locations.copy()
  result.append(where)
  print(result)

addLocation("Aurora")

## ['Gaia', 'Aurora']

print(locations)

## ['Gaia']
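The same care is needed when a mutable object is passed to a function as an argument. One possible approach, sketched below, is to build and return a new list rather than appending to the one that was passed in. The function name addLocationPure and the variable names are hypothetical.

```python
def addLocationPure(existing, where):
  # The + operator creates a new list, so the argument itself stays untouched
  return existing + [where]

campus = ["Gaia"]
extended = addLocationPure(campus, "Aurora")

print(extended)  # ['Gaia', 'Aurora']
print(campus)    # ['Gaia']
```

This way the function behaves as users expect: the result is returned as a new object, and the original list only changes if the user explicitly assigns the result back to it.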

Summary

We have looked at some similarities and differences between R and Python. As you can see, the same or similar functionality is available in both languages, but the philosophies of the languages are sometimes different, so it is important to be aware of them. We will make extensive use of the basics in the upcoming tutorials, and will build further upon them to specifically handle spatial data. We will first go through the specifics of R, then the specifics of Python, but it’s good to keep in mind throughout the course that both languages can do what we want, just in a slightly different way. You can use this tutorial as a reference for the basics if you get stuck in future tutorials. Calling R and Python from Bash can also be a very useful technique to combine them during the project at the end of the course.

Dainius Masiliūnas

11 September, 2023