Week 2, Lesson 6: Intro to vector with R

WUR Geoscripting WUR logo

Week 2, Lesson 6: Intro to vector with R

To Do in the morning

Go through the whole lesson and try to answer the questions below. We will address the questions in the lesson and discuss them.

Objective of today

Learn how to handle vector data

Learning outcomes of today:

In today's lecture, we will explore the basics of handling spatial vector data in R. There are several R packages for this purpose but we will focus on using sf and some related packages. At the end of the lecture, you should be able to:

  • create point, line and polygon objects from scratch;
  • explore the structure of sf classes for spatial vector data;
  • plot spatial vector data;
  • transform between datums and map projections;
  • write and read spatial vector formats (e.g. KML, GML, GeoJSON, shapefile);
  • apply basic operations on vector data, such as buffering, intersection and area calculation;

Vector R basics

Some packages for working with spatial vector data in R

In this tutorial we will use the sf package. It is a spatial vector data package, that builds on top of sp, rgdal and rgeos packages.

The sf package focuses solely on vector data. It provides a standardized encoding of vector data and uses GDAL to read and write data, GEOS for geometrical operations and Proj.4 for projection conversions and datum transformations.

Another package that commonly is used, is the sp package. It provides classes for importing, manipulating and exporting spatial data in R, and methods for doing so. It can handle both vectors and rasters. It is often the foundation for other spatial packages, such as raster.

The rgdal package includes bindings to parts of the OGR Simple Feature Library which provides access to a variety of vector file formats such as GML, GeoJSON, ESRI Shapefiles and KML. The OGR library is part of the widely used Geospatial Data Abstraction Library (GDAL). The GDAL library is the most useful freely-available library for reading and writing geospatial data. The GDAL library is well-documented (http://gdal.org/), but with a catch for R and Python programmers. The GDAL (and associated OGR) library and command line tools are all written in C and C++. Bindings are available that allow access from a variety of other languages including R and Python but the documentation is all written for the C++ version of the libraries. This can make reading the documentation rather challenging. Fortunately, the sf package, providing GDAL bindings in R, is also well documented with lots of examples. The same is valid for the Python libaries.

Similarly, rgeos is an interface to the powerful Geometry Engine Open Source (GEOS) library for all kinds of operations on geometries (buffering, overlaying, area calculations, etc.).

Thus, functionality that you commonly find in expensive GIS software is also available within R, using free but very powerful software libaries.

The possiblities are huge. In this course we can only scratch the surface with some essentials, which hopefully invite you to experiment further and use them in your research. Details can be found in the book Applied Spatial Data Analysis with R and several vignettes authored by Roger Bivand, Edzer Pebesma and Virgilio Gomez-Rubio.

Creating and manipulating geometries

The package sf has geometry types, such as point (for one point), linestring (for one line) and polygon (for one polygon). For sets of spatial classes it has the multipoint (multiple points), multilinestring, multipolygon and geometrycollection. The geometry collection data class allows to have mixed geometries in one object. The package sf has even more special geometry types (check here). The most common geometry types for spatial vector data are shown below:

Overview of sf package geometry types.
Geometry type description
point POINT single point ; zero-dimensional geometry
points MULTIPOINT set of points
line LINESTRING sequence of points connected by straight, non-self intersecting line pieces; one-dimensional geometry
lines MULTILINESTRING set of linestrings
rings POLYGON sequence of points form a closed, non-self intersecting ring; the first ring denotes the exterior ring, zero or more subsequent rings denote holes in this exterior ring; geometry with a positive area (two-dimensional);
rings MULTIPOLYGON set of polygons
mix GEOMETRYCOLLECTION set of geometries of any type except GEOMETRYCOLLECTION

Simple feature objects are stored in data classes. Three data classes are used in sf to represent features:

  • simple feature geometry (sfg): the feature geometry of an individual simple feature;
  • simple feature collections (sfc): a list or collection of simple feature geometries;
  • simple features (sf): simple feature collections linked to a data.frame with feature attributes;

We will go through a few examples of creating geometries from scratch to familiarize yourself with these geometry types and data classes.

First, go to Google Maps on your computer, click on a location (e.g. a building or road), look at the bottom of the banner that pops up and copy the longitude and latitude of two points in Wageningen that are relevant to you. Use a decimal degree notation with at least 4 digits after the decimal point.

Question 1: In what coordinate reference system [CRS] are your points from Google Maps? What is the EPSG belonging to that CRS?

Points

The example below shows how you can create spatial point objects from these coordinates. Type ?function name (e.g. ?st_point) for finding help on the functions used.

## Load the sf package
library(sf)
## Linking to GEOS 3.5.0, GDAL 2.1.3, proj.4 4.9.2, lwgeom 2.2.1 r14555
## Create two simple feature geometry points
## Get coordinates of two points identified in Google Maps, for example:
point1_sfg <- st_point(c(5.664190, 51.993843))  # enter your own lat/long
point2_sfg <- st_point(c(5.673310, 51.982752))   # enter your own lat/long

## Combine point features into simple feature collection with CRS: mypoints_sfc
mypoints_sfc <- st_sfc(point1_sfg, point2_sfg, crs = 4326) 
mypoints_sfc_wgs <- st_sfc(point1_sfg, point2_sfg, crs = "+proj=longlat +datum=WGS84 +no_defs")
# These two functions do the same. Which one do you like more?
## Are they identical?
identical(mypoints_sfc, mypoints_sfc_wgs)
## Inspect the geometry types and data class of your objects
point1_sfg
class(point1_sfg)

mypoints_sfc
class(mypoints_sfc)
## Create and display some attribute data and store in a data frame
mydata <- data.frame(cbind(
    id = c(1,2), 
    name = c("my first point","my second point")))
mydata
##   id            name
## 1  1  my first point
## 2  2 my second point
## Combine simple feature collection and dataframe to create simple feature: mypoints_sf
mypoints_sf <- st_set_geometry(mydata, mypoints_sfc)
mypoints_sf
class(mypoints_sf) # notice how simple features are two classes
class(mypoints_sf) == "data.frame"

Question 2: What is the the difference between the objects mypoints_sfc and mypoints_sf?

## When plotting a simple feature it makes a plot for every column
## Plot only the name column
plot(mypoints_sf[,"name"], col = c("orange","blue"))

Lines

Now let us connect the two points by a straight line. Consult how to make a line: ?st_linestring.

## Now make the line by creating a matrix and linestring
points_matrix <- rbind(point1_sfg, point2_sfg) # Create matrix
myline_sfg <- st_linestring(points_matrix) # Create sfg
myline_sfc <- st_sfc(myline_sfg, crs = 4326) # Create sfc
## Check the class and structure of the objects again.
class(points_matrix)
class(myline_sfg)
class(myline_sfc)
myline_sfc
## Add feature attributes by giving the line an id and a name
myline_df <- data.frame(id = 1, name = "Straight line") # Create data.frame
myline_sf <- st_set_geometry(myline_df, myline_sfc) #Create sf
## Check class and geometry type
class(myline_df)
class(myline_sf)
myline_sf
plot(myline_sf[1], main = myline_sf$name, col = "red", lwd = 3, lty = 5, graticule = TRUE, xlab = "latitude", ylab = "longitude", axes = TRUE)

Try to understand the above code and its results by studying help: ?plot_sf

Try to add the points together with the lines on the same map.

Writing and reading spatial vector data

What now follows is a brief intermezzo before we continue with the classes for polygons. We will use the OGR functionality of GDAL available through the package sf.

## Create subdirectory data within current working directory
#getwd() #if needed
#setwd() 
dir.create("data", showWarnings = FALSE) 

## Write to KML
st_write(obj = mypoints_sf, dsn = "./data/mypoints.kml")
st_write(obj = myline_sf, dsn = "./data/myline.kml")

The function st_write requires entries for the arguments obj (object of class sf) and dsn (data source name). The function detects what driver it needs based on the file extension of the dsn. Please study details in the help file: ?st_write. Similarly the function st_read allows reading OGR compatible data into a suitable simple feature object.

Check (in Google Maps) whether the attribute data were written correctly to the KML output (use My Maps from Google). You need to be logged in on your google account, go to My Maps, click get started, click the + button to create a new digital map. On the untitled layer click import, import the mypoints.kml, add a new layer and import myline.kml. Cool! You just created a webmap with your point and line features.

Now let us see if we can create a line in the webmap and import it to R. Digitize a route (e.g. a walking route) between two points of interest in Wageningen: click on the draw a line symbol (between the place marker and the direction symbol), click Add walking route, draw your route in Wageningen and name the new layer myroute.

Click on the main menu in the left panel (top ... symbol), click export to KML, check box export to .KML file, select your route layer and download the kml (check here how to export). Save the KML in the data folder within the working directory under the name route.kml. We will read this file into a simple features object.

## Read the KML as a simple feature
myroute <- st_read(dsn = "./data/myroute.kml") 
## Reading layer `route' from data source `/home/david/Documents/Geoscripting/IntroToVector/data/myroute.kml' using driver `LIBKML'
## Simple feature collection with 3 features and 11 fields
## geometry type:  GEOMETRY
## dimension:      XYZ
## bbox:           xmin: 5.66538 ymin: 51.97307 xmax: 5.67242 ymax: 51.9777
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
myroute #What is the geometry type of this object? 
## Simple feature collection with 3 features and 11 fields
## geometry type:  GEOMETRY
## dimension:      XYZ
## bbox:           xmin: 5.66538 ymin: 51.97307 xmax: 5.67242 ymax: 51.9777
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##                                                                                         Name
## 1 Directions from Churchillweg 141-185, 6708 Wageningen, Netherlands to Bloemhof, Wageningen
## 2                                         Churchillweg 141-185, 6708 Wageningen, Netherlands
## 3                                                                       Bloemhof, Wageningen
##   description timestamp begin  end altitudeMode tessellate extrude
## 1                                    1       0
## 2                                   -1       0
## 3                                   -1       0
##   visibility drawOrder icon                       geometry
## 1         -1        NA  LINESTRING Z (5.67109 51.97...
## 2         -1        NA  POINT Z (5.6710926 51.97769...
## 3         -1        NA  POINT Z (5.6653834 51.97358...
names(myroute)
##  [1] "Name"         "description"  "timestamp"    "begin"       
##  [5] "end"          "altitudeMode" "tessellate"   "extrude"     
##  [9] "visibility"   "drawOrder"    "icon"         "geometry"
myroute = myroute[,"Name"] # Select name column only
plot(myroute) # What geometry types do you see?

Now we have the data, but some of our friends want this exact data too. One friend of ours is a software engineer and he wants a GeoJSON. Another friend is a GIS-analyst in QGIS and as a backup he wants the file in two formats: Geographic Markup Language (GML) and a shapefile. These fileformats (KML, GeoJSON, GML and shapefile) are commonly used in spatial analysis. Let's try to give them the files in those formats!

## Try to export the simple feature to a GeoJSON. What happens?
try(st_write(myroute, dsn = "./data/route.geojson"))

Okay success! Let us continue with the GML.

## Try to export the simple feature to a GeoJSON. What happens?
try(st_write(myroute, dsn = "./data/route.gml"))

Works! Now transform to a shapefile.

## Again export to a shapefile. What happens now?
try(st_write(myroute, dsn = "./data/route.shp"))

Writing the myroute object to a shapefile worked only for the linestring, but did not work for the points. Shapefiles do not handle multiple geometry types well. We can solve this! Have a look at ?st_is.

## Select linestring route
myroute_line <- myroute[st_is(myroute, type = "LINESTRING"),]

## Select start and end point route
myroute_points <- myroute[st_is(myroute, type = "POINT"),]
## Let us try to write the simple features to a shapefile
try(st_write(myroute_line, dsn = "./data/route_line.shp"))
try(st_write(myroute_points, dsn = "./data/route_points.shp"))

We run into another issue; shapefiles do not support 3D line strings and points. We have fixed the geometry type, but now need to work on the dimensions of the data. Let us look into dimensions.

Dimensions

All geometries are composed of points. Points are coordinates in a 2-, 3- or 4-dimensional space. All points in a geometry have the same dimensionality. In addition to X and Y coordinates, there are two optional additional dimensions:

  • a Z coordinate, denoting altitude
  • a M coordinate (rarely used), denoting a measure that is associated with the point; examples could be time of measurement, or measurement error of the coordinates

The four possible cases are:

  1. two-dimensional points XY
  2. three-dimensional points XYZ
  3. three-dimensional points XYM
  4. four-dimensional points XYZM

So we need to find a function that changes dimensions from XYZ to XY. Feel free to script your own function that can change the dimensions. Otherwise look into ?st_zm.

## Convert XYZ dimension to XY dimension
myroute_points_xy <- st_zm(myroute_points)
myroute_line_xy <- st_zm(myroute_line)

## Check your geometry type and dimension. Are they point/linestring and XY?
myroute_points_xy
## Simple feature collection with 2 features and 1 field
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 5.665383 ymin: 51.97358 xmax: 5.671093 ymax: 51.9777
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##                                                 Name
## 2 Churchillweg 141-185, 6708 Wageningen, Netherlands
## 3                               Bloemhof, Wageningen
##                       geometry
## 2 POINT (5.6710926 51.9776967)
## 3 POINT (5.6653834 51.9735844)
myroute_line_xy
## Simple feature collection with 1 feature and 1 field
## geometry type:  LINESTRING
## dimension:      XY
## bbox:           xmin: 5.66538 ymin: 51.97307 xmax: 5.67242 ymax: 51.9777
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##                                                                                         Name
## 1 Directions from Churchillweg 141-185, 6708 Wageningen, Netherlands to Bloemhof, Wageningen
##                         geometry
## 1 LINESTRING (5.67109 51.9777...
## Try to export the simple feature to a shapefile 
try(st_write(myroute_points_xy, dsn = "./data/route_points.shp"))
#Overwrite your previous shapefile by adding delete_layer = TRUE
try((st_write(myroute_line_xy, dsn = "./data/route.shp", delete_layer = TRUE))) 

Hooray, it worked! Now our friends have the data they wanted. Try to understand the above code and results. Feel free to display the data in QGIS.

Transformation of coordinate system

Transformations between coordinate systems are crucial to many GIS applications. The Keyhole Markup Language (KML) used by Google Earth uses latitude and longitude in a polar WGS84 coordinate system (i.e. geographic coordinates). However, in some of the examples below we will use metric distances (i.e. projected coordinates).There are two types of coordinate systems that you need to recognise: projected coordinate systems and geographic coordinate systems.

One of the challenges of working with geo-spatial data is that geodetic locations (points on the Earth surface) are mapped into a two-dimensional cartesian plane using a cartographic projection. Projected coordinates are coordinates that refer to a point on a two-dimensional map that represents the surface of the Earth (i.e. projected coordinate system). Latitude and Longitude values are an example of an unprojected coordinate system. These are coordinates that directly refer to a point on the Earth's surface. One way to deal with this is by transforming the data to a planar coordinate system. In R this can be achieved via bindings to the PROJ.4 - Cartographic Projections Library (http://trac.osgeo.org/proj/), which are available in sf.

Central to spatial data in the sf package is that they have a coordinate reference system, which is coded as a numeric epsg code and a character string describing the projection by the PROJ.4 projection library. Operations on different spatial data sets need to have compatible coordinate reference systems (i.e., identical). An interface to the PROJ.4 library is available through the sf package.

We will transform our spatial data to the Dutch grid (Rijksdriehoekstelsel), often referred to as RD or RD_New. Please note that:

## Define CRS object for RD projection
# See how the transformation from unprojected WGS84 is defined to a projected coordinate system
prj_string_RD <- st_crs("+proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.2369,50.0087,465.658,-0.406857330322398,0.350732676542563,-1.8703473836068,4.0812 +units=m +no_defs")

## Perform the coordinate transformation from WGS84 to RD
myroute_rd_string <- st_transform(myroute_line_xy, crs = prj_string_RD)

# Instead of having to define the projection ourselves, we can make use of the projection library by using an EPSG code
myroute_rd_num <- st_transform(myroute_line_xy, crs = 28992)
myroute_points_rd <- st_transform(myroute_points_xy, crs = 28992)

# Check if your transformed routes are the same
identical(myroute_rd_string, myroute_rd_num)
## [1] FALSE

Question 3: Why are your routes not identical? Hint: inspect the data

After transforming the line, plot the line using the basic plot command.

plot(myroute_rd_num , main = myroute_rd_num$name, col = "red")
box()

Now that the geometries are projected to a planar coordinate system, the length can be computed.

## Let us use different methods to calculate the distance between the two points

## Before we calculate distances we need the package geosphere
try(install.packages("geosphere"))
## Installing package into '/home/david/R/x86_64-pc-linux-gnu-library/3.4'
## (as 'lib' is unspecified)
library(geosphere)

## Network distance: WGS84, RD_New
(dist_netw_wgs = st_length(myroute_line_xy)) # line with WGS84 CRS
## 984.3845 m
(dist_netw_rd = st_length(myroute_rd_num)) # line with RD_new CRS
## 984.3013 m
## Euclidean distance: WGS84, RD_New
# Get data from matrix position [1, 2]
(dist_eucl_wgs = st_distance(myroute_points_xy)[1,2])
## 602.7191 m
(dist_eucl_rd = st_distance(myroute_points_rd)[1,2]) 
## 602.6681 m

Okay as expected the network distance is bigger than a straight line between the two points, but a question remains:

Question 4: Why does the distance slightly differ between objects with CRS WGS84 and RD_New?

Now we know the network and euclidean distance between the two points. As you might have noticed the st_length() and st_distance() functions gave a unit behind the number. It was a meter in our case, but maybe we want to know how many feet or inch the distance was. The sf package allows unit conversions by bindings to the units package.

units::set_units(dist_netw_rd, km)
## 0.9843013 km
units::set_units(dist_netw_rd, feet)
## 3229.335 feet
units::set_units(dist_netw_rd, inch)
## 38752.02 inch

Calculating units can also be performed for surfaces or volumes. Later you can give a try to calculate from square meters to hectares for a surface area.

Polygons

Polygons with the sf package

After the intermezzo, we now continue with sf classes for polygon objects. The idea is to illustrate the classes; the data are meaningless. Let us create overlapping circles around our points.

## Create sfc with geometry in RD New
## We need to have projected (RD_New) data to be able to measure features in 2D (e.g. meters)
point1_sfg_rd <- st_point(c(174023.833719082, 445086.995949553))
point2_sfg_rd <- st_point(c(174655.068279838, 443855.475880421))

## Combine point features into simple feature collection with CRS
mypoints_sfc_rd <- st_sfc(point1_sfg_rd, point2_sfg_rd, crs = 28992)

## Create individual points
pnt1_rd <- mypoints_sfc_rd[[1]]
pnt2_rd <- mypoints_sfc_rd[[2]]

## Calculate distance between points
dist_points_rd <- st_distance(mypoints_sfc_rd)[1,2]
## Make circles around points, with radius equal to distance between points
## Define a series of angles going from 0 to 2pi
ang <- pi*0:200/100
# distance needs to be converted from units m class to double class
circle1x <- pnt1_rd[1]+ cos(ang) * as.double(dist_points_rd) 
circle1y <- pnt1_rd[2] + sin(ang) * as.double(dist_points_rd)
circle2x <- pnt2_rd[1]+ cos(ang) * as.double(dist_points_rd)
circle2y <- pnt2_rd[2] + sin(ang) * as.double(dist_points_rd)
c1 <- cbind(circle1x, circle1y)
c2 <- cbind(circle2x, circle2y)

You can plot these points using basic R plot commands:

plot(c1, pch = 19, cex = 0.2, col = "red", ylim = range(circle1y, circle2y), xlim = range(circle1x, circle2x))
points(c2, pch = 19, cex = 0.2, col = "blue")
plot(mypoints_sfc_rd, pch = 3, col= "darkgreen", add = TRUE)

See how we calculated a circle around every point. Although it looks like a closed ring, the visualized circle is not a polygon. Let's try to create a circular polygon. As before we step through the process of creating a simple feature:

  • Simple feature geometry (sfg)
  • Simple feature collections (sfc)
  • Dataframe (df)
  • Simple features (sf)
## Bind columns together into matrix and create a list of matrix
# ?cbind combines columns and ?rbind combines rows
circle1_sfg <- st_polygon(list(cbind(circle1x, circle1y)))
circle2_sfg <- st_polygon(list(cbind(circle2x, circle2y)))

circles_sfc <- st_sfc(circle1_sfg, circle2_sfg, crs = 28992)

circles_df <- data.frame(name = c("circle1", "circle2"), row.names=c("1", "2"))
circles_sf <- st_set_geometry(circles_df, circles_sfc)
## Plot the circle
plot(circles_sf)

Polygon operations with sf (buffer, intersect, difference)

Just now we calculated the points of the circle to obtain a perfect circle and then made it into a polygon, but we can also create a circular polygon by using the function st_buffer().

The sf package can do more spatial operations by spatially selecting certain features and changing the geometry. Examples of spatial selections are st_intersects, st_contains, st_within, st_covers etc.

The package can change geometries of features with st_difference (clip), st_intersection (intersect), st_union, st_sym_difference, st_centroid, st_segmentize, st_convex_hull and st_buffer. More methods can be found with the command: methods(class = "sf").

The final results of st_buffer can be plotted using basic R plotting commands:

## Buffer and plot
point_buff <- st_buffer(mypoints_sfc_rd, dist = dist_points_rd)

plot(point_buff, col = c("#2b8cbe", "#31a354"), lwd = 2)
plot(mypoints_sfc_rd, add = TRUE, col="red", pch=19, cex=1.5)
## Buffer again with the nquadsegs option. See the difference?
circle_quadsegs <- st_buffer(mypoints_sfc_rd[1],  dist=dist_points_rd, nQuadSegs=2)
circles_diff <- st_difference(circles_sfc[1], circle_quadsegs)
plot(circles_sfc[1], col = "red")
plot(circle_quadsegs, add = TRUE, lty = 3, lwd = 2, col = "grey")

Question 5: What happens if you change nQuadSegs to a higher number?

## Check areal difference
st_area(circles_diff) 
## 598753.3 m^2
# What is the difference in hectares?
units::set_units(st_area(circles_diff),hectare) 
## 59.87533 hectare
circles_intersect <- st_intersection(circles_sfc[1], circles_sfc[2])
plot(circles_sfc, col = "grey")
plot(circles_intersect, col="red", add = TRUE)
# What percentage of the two circles overlaps?
(paste(round(100 * st_area(circles_intersect) / st_area(circles_sfc[1]),2), "%"))
## [1] "39.1 %"

Question 6: Do you understand the script? What is the difference between st_difference and st_intersection ?

Today's summary

We learned about:

  • The geometry types of the sf package: points, linestrings, polygons etc.
  • The data classes of the sf package: sfg, sfc and sf
  • How to read/write data and change vector data format (KML, GML, shapefile and GeoJSON)
  • Visualize spatial vector data in R and in Google Maps
  • How to perform simple spatial operations on Geometries in R using the sf package

Exercise of today: Lesson06

First, download and unzip the following shape files manually from this website:

  • Places
  • Railways

Second, create a clear and documented script that:

  • Selects the "industrial" (type == "industrial") railways
  • Buffers the "industrial" railways with a buffer of 1000m (hint: st_buffer)
  • Find the place (i.e. a city) that intersects with this buffer.
  • Create a plot that shows the buffer, the points, and the name of the city
  • Write down the name of the city and the population of that city as one comment at the end of the script.

Think about project structure and use of git!

Bonus (optional!): if you can also download and unzip this within the script.

More info

About projections and code