Unit 12 GIS Analysis Using R-I
Unit 12 GIS Analysis Using R-I
12.3 Basics of spatial operations using 12.4 Spatial Interactions of Vector and Raster
different functions Data
Spatial Join 12.5 Cropping to the Area of Interest
Attribute Aggregation 12.6 Summary
Spatial Aggregation 12.7 Glossary
Interpolation
Unions
12.1 INTRODUCTION
This course will introduce the use of R for script-based geographic information systems (GIS)
overlay analysis workflows. By the end of the course, you will be able to script simple GIS analyses
and present results in interactive, self-documented web pages.
Whereas desktop GIS packages are great applications for exploratory spatial data analysis and
generation of map graphics, with typical work flows they are limited for scientific research. If you
can’t remember, reproduce, or clearly communicate your methodology, then what you are doing
cannot be technically called “research.” Although both ArcGIS Desktop and QGIS, two leading
desktop GIS applications, employ the use of Python for scripting geoprocessing analyses and user
interfaces, learning Python can be difficult and requires a strong commitment and a large amount of
time to master.
12.2 R IN GIS ANALYSIS PLATFORM
R is an option to ArcPy or the QGIS Python API, being highly used, extensible,
free, and script-based. If you have been taking courses in social science,
epidemiology, or statistics at UW, you probably already know enough R to get
started using it as a GIS analysis platform
(https://csde.washington.edu/workshop/reproducible-gis-analysis-with-r/). R is
also becoming more commonly used for various programming tasks beyond its
initial development as a statistical programming environment, particularly due to
the proliferation of packages that can be added to extend R functionality. One
of those packages, sf (Simple Features for R), extends the R analytic
framework to include a plethora of spatial data management and analysis tasks
and takes advantage of the tibble data frame to include representation of
geometric objects. Additionally, the leaflet package can be used for rendering
“slippy” (real-time scale-able, pan-able) maps embedded in web pages created
by the RMarkdown package. Combining these functionalities with R’s relatively
easy programming language can help you perform GIS analyses, generate data
summaries from geoprocessing operations, render high-quality tables and
graphics, and include maps in your reports.
Function Features
# aggregation operations
continent_pop=group_by(world, continent)%>%
summarise(pop=sum(pop, na.rm=T))
# glance at new data
continent_pop
## Simple feature collection with 8 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## # A tibble: 8 x 3
## continent pop geom
## <chr><dbl><MULTIPOLYGON [arc_degree]>
## 1 Africa 1154946633 (((11.91496 -5.037987, 11.09377 -
3.978827,~
## 2 Antarctica 0 (((-180 -89.9, 180 -89.9, 180 -84.71338, 1~
## 3 Asia 4311408059 (((120.295 -10.25865, 118.9678 -9.557969,
~
## 4 Europe 669036256 (((-53.77852 2.376703, -54.08806
2.105557,~
## 5 North America 565028684 (((-80.9473 8.858504, -80.5219
9.111072, -~
## 6 Oceania 37757833 (((171.9487 -41.51442, 172.0972 -40.9561,
~
## 7 Seven seas (open ocean) 0 (((68.935 -48.625, 68.8675 -48.83, 68.72
-~
## 8 South America 412060811 (((-57.75 -51.55, -58.05 -51.9, -59.4 -
52.~
# map after aggregation
ggplot(continent_pop)+
geom_sf()+
theme(panel.background=element_blank())
subregion_info=group_by(world, subregion)%>%
summarise(
pop=sum(pop, na.rm=T), # use function sum() to calculate total
population
lifeExp=mean(lifeExp, na.rm=T), # use function mean() to calculate
average life expectancy
num_cou=n() # use function n() to calculate total number in
each group
)%>%
arrange(desc(lifeExp)) # use arrange to order data, and desc() means
descending order
# glance at new data
subregion_info
## Simple feature collection with 22 features and 4 fields
## Geometry type: GEOMETRY
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## # A tibble: 22 x 5
## subregion pop lifeExp num_cou geom
## <chr><dbl><dbl><int><GEOMETRY [arc_degree]>
## 1 Australia and New Zealand 2.80e7 81.9 2 MULTIPOLYGON
(((171.1251~
## 2 Western Europe 1.26e8 81.8 7 MULTIPOLYGON (((-
52.9396~
## 3 Northern America 3.54e8 80.4 3 MULTIPOLYGON (((-
155.542~
## 4 Northern Europe 9.66e7 79.5 10 MULTIPOLYGON (((-
9.97708~
## 5 Southern Europe 1.53e8 78.4 12 MULTIPOLYGON
(((23.51498~
## 6 Eastern Asia 1.57e9 76.3 6 MULTIPOLYGON
(((110.2116~
## 7 Central America 1.70e8 74.7 8 POLYGON ((-81.80857
8.95~
## 8 Western Asia 2.52e8 74.5 18 MULTIPOLYGON
(((51.17252~
## 9 Eastern Europe 2.93e8 74.5 10 MULTIPOLYGON
(((27.6739 ~
## 10 Caribbean 4.06e7 73.8 7 MULTIPOLYGON (((-60.935 ~
## # ... with 12 more rows
Function sum() obtains the total value; mean() obtains the average value; n()
obtains the total number in the group. Parameter na.rm=T means that the
calculation would skip the NA (not available). If we do not add this parameter,
the result would be NA as well. Parameter desc() means arrange the data in
descending order, or it would be in ascending order by default. As we can see
in the result printed above, the first three highest life expectancy are Australia
and New Zealand, Western Europe, and Northern America respectively.
In addition to the application of dplyr, we can use function aggregate() provided
by package sf to obtain the same result. The first parameter in the function is
the data we want to summarize. Note the data should be remain the sf features.
Thus, do not use the form like world$pop, whose type is “numeric.” Instead, we
should write in the form as world["pop"], whose type is “sf” and “dataframe.”
Parameter by= should be filled in with the columns we want to group by. Note
that the type of data should be a list. Parameter FUN is the spatial operation
such as sum or average. Last, if the data contains NA, we should add the
parameter na.rm=T, in order to skip NA. Again, take data “world” for example,
and group by the data by continent, summarize the total population of each
continent. The code and result are as follows.
continent_pop_ag=aggregate(world["pop"], by=list(world$continent), FUN=sum,
na.rm=T)
continent_pop_ag
## Simple feature collection with 8 features and 2 fields
## Attribute-geometry relationship: 0 constant, 1 aggregate, 1 identity
## Geometry type: GEOMETRY
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## Group.1 pop geometry
## 1 Africa 1154946633 MULTIPOLYGON (((11.91496 -5...
## 2 Antarctica 0 MULTIPOLYGON (((-61.13898 -...
## 3 Asia 4311408059 MULTIPOLYGON (((120.295 -10...
## 4 Europe 669036256 MULTIPOLYGON (((-53.77852 2...
## 5 North America 565028684 MULTIPOLYGON (((-80.9473 8....
## 6 Oceania 37757833 MULTIPOLYGON (((171.9487 -4...
## 7 Seven seas (open ocean) 0 POLYGON ((68.935 -48.625, 6...
## 8 South America 412060811 MULTIPOLYGON (((-57.75 -51....
12.3.3 Spatial Aggregation
Similar to attribute data aggregation introduced in the previous section, spatial
aggregation can be done by function aggregate(). In attribute aggregation, the
first parameter should be in class sf, while the second parameter by= is a list
with non-spatial data. There is a little difference on the data imported in
parameter by=, which the data should also contain geometry in spatial
aggregation. Briefly to say, in function aggregate(x, by=y), spatial aggregation is
the geometry of the source (y) that defines how values in the target object (x)
are grouped.
Take data nz and nz_height for example. We want to know the average height
in each province of New Zealand. The code is shown below.
aggregate(x=nz_height, by=nz, FUN=mean)
## Simple feature collection with 16 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 1090144 ymin: 4748537 xmax: 2089533 ymax:
6191874
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000
## First 10 features:
## t50_fid elevation geometry
## 1 NA NA MULTIPOLYGON (((1745493 600...
## 2 NA NA MULTIPOLYGON (((1803822 590...
## 3 2408405 2734.333 MULTIPOLYGON (((1860345 585...
## 4 NA NA MULTIPOLYGON (((2049387 583...
## 5 NA NA MULTIPOLYGON (((2024489 567...
## 6 NA NA MULTIPOLYGON (((2024489 567...
## 7 NA NA MULTIPOLYGON (((1740438 571...
## 8 2408395 2777.000 MULTIPOLYGON (((1866732 566...
## 9 NA NA MULTIPOLYGON (((1881590 548...
## 10 2368390 2889.455 MULTIPOLYGON (((1557042 531...
As the result shows, all of attribute features of data nz_height (first parameter)
are average value of each province. And the geometry is based on the spatial
data in second parameter by=. You may find that the data we obtain are not
perfect, since it does not contain the features in second parameter by=. Also, it
does not make sense to calculate the average of all attribute features. For
instance, variable “t50_fid” is the identity number of the highest points, it does
not make sense to derive its mean.
In addition to function aggregate(), the same result can also be generated by
function group_by() %>% summarise() in dplyr The code and result are shown
below.
ve_height=st_join(nz, nz_height)%>%
group_by(Name)%>%
summarise(elevation=mean(elevation, na.rm=T))
In the figure above, the study area contains three districts whose population
have been surveyed. It is covered by 40% of Area 1; 25% of Area 2; and 40%
of Area 3. Population is the attribute assumed to be spatially extensive, and the
sum is preserved. Hence, we can simply derive the total population of the study
area based on the weight of each area: 1000* 40% + 800* 25% + 900* 40% =
960.
For the study area, there are 40% of it is Area 1; 33.3% of it is Area 2; 26.7% of
it is Area 3. Density is spatially intensive instead, and the mean is preserved.
Hence, we can simply derive the mean density of the study area based on the
area weight of study area: 200/3* 40% + 40* 33.3% + 90* 26.7% = 64.
Package sf provides a function st_interpolation_aw() to conduct the
interpolation. There are three parameters required in the function
st_interpolate_aw(x, to, extensive). The first parameter (x) is the object of
simple features, for which we want to aggregate attributes. The second
parameter (to) is the object of simple features with the target geometries.
Parameter extensive= is the operation of data. If it is True, the attribute
variables are assumed to be spatially extensive (like population) and the sum is
preserved, otherwise, spatially intensive (like population density) and the mean
is preserved.
Here, we use the data incongruent and aggregating_zones provided by
package spData. The former one is a specific region separated into 9 districts in
UK, while the latter is in the same region, but separated into 2 major area (as
the figure shown below). We need to interpolate the value in data incongruent
to the geometry data aggregating_zones, and derive the sum of value in two
areas. The code and result are shown below.
12.3.7 Boundary
Function st_boundary() creates a polygon that encompasses the full extent of
the geometry. Boundary and union are often applied together to emphasize the
border of the study area. The example below shows the us_state and makes
the border of country thicker.
# union us_states and then derive the boundary
us_boundary=st_boundary(st_union(us_states))
## although coordinates are longitude/latitude, st_union assumes that they are
planar
# plot the map
ggplot(us_states)+
Figure 9:polygon that encompasses the full extent of the geometry
geom_sf()+
geom_sf(data=us_boundary, size=1)+
theme(axis.text=element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
panel.background=element_blank())
12.3.8 Clipping
Clipping is a critical spatial operation in GIS. It extracts the geometry (point,
line, polygon) that lies within the area outlined by the clip polygon. For instance,
we have polygon of Taipei City and points of schools in Taiwan, and suppose
we want to extract the schools located in Taipei City. We need to use the
polygon to clip the point data, removing all the points outside the polygon.
Though there is no “clip” function provided by package sf in fact, function
st_intersection() help us do so. But please note that there is a difference
between clipping and intersection in an accurate GIS definition. Clipping and
intersection both extracts the geometry in a specific polygon; nonetheless,
clipping does not retain the attributes of the clip polygon, while intersection
creates output features which possess the attributes of all input features. The
concept is illustrated below.
Figure 10: Clipping and intersection
The slight difference may be a fundamental concept of GIS, but it is not really
vital in terms of spatial analysis in R (after all, there is no clip function provided
in R). In st_intersection(x, y), it is free to place the spatial data in two
parameters. But please note data x would be on the left-hand side of new data
frame, while y is on the right-hand side. Here we use the example of nz (map of
New Zealand) and nz_height (101 highest point in New Zealand) again.
Suppose we want to extract the points located in North Island, the code and
result are shown below.
# filter North Island in data nz
nz_north=filter(nz, Island=="North")
# st_intersection(point, polygon)
st_intersection(nz_height, nz_north)
## Warning: attribute variables are assumed to be spatially constant throughout
all
## geometries
## Simple feature collection with 5 features and 8 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 1820643 ymin: 5647971 xmax: 1822492 ymax:
5650492
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000
## t50_fid elevation Name Island Land_area Population
## 99 2408397 2751 Waikato North 23900.04 460100
## 100 2408406 2720 Waikato North 23900.04 460100
## 101 2408411 2732 Waikato North 23900.04 460100
## 97 2408394 2797 Manawatu-Wanganui North 22220.61 234500
## 98 2408395 2757 Manawatu-Wanganui North 22220.61 234500
## Median_income Sex_ratio geometry
## 99 27900 0.9520500 POINT (1820660 5649488)
## 100 27900 0.9520500 POINT (1822263 5650429)
## 101 27900 0.9520500 POINT (1822492 5650492)
## 97 25000 0.9387734 POINT (1821014 5647971)
## 98 25000 0.9387734 POINT (1820643 5648331)
Remember the spatial join operation? Actually, we can use function st_join() to
do the same thing. The code is shown below.
# union two data first
nz_height_join=st_join(nz_height, nz)
# filter the highest point located in North Island
filter(nz_height_join, Island=="North")
## Simple feature collection with 5 features and 8 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 1820643 ymin: 5647971 xmax: 1822492 ymax:
5650492
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000
## t50_fid elevation Name Island Land_area Population
Median_income
## 1 2408394 2797 Manawatu-Wanganui North 22220.61 234500
25000
## 2 2408395 2757 Manawatu-Wanganui North 22220.61 234500
25000
## 3 2408397 2751 Waikato North 23900.04 460100 27900
## 4 2408406 2720 Waikato North 23900.04 460100 27900
## 5 2408411 2732 Waikato North 23900.04 460100 27900
## Sex_ratio geometry
## 1 0.9387734 POINT (1821014 5647971)
## 2 0.9387734 POINT (1820643 5648331)
## 3 0.9520500 POINT (1820660 5649488)
## 4 0.9520500 POINT (1822263 5650429)
## 5 0.9520500 POINT (1822492 5650492)
In this example, we find that function st_join() and st_intersection() do the
similar operation. This is because that st_join() sets the operation to
“intersection” by default, that is, the parameter is st_join(...,
join="st_intersects"). Other operations are listed in the table of
geos_binary_pred.
In addition to st_intersection, there are some functions to do subsetting and
clipping spatial data. To illustrate the concept and operations of these functions,
we use a simple example: two overlapping circles (X and Y), with a center point
one unit away from each other and a radius of one. The code and results are
shown below.
X=st_buffer(st_point(c(1,1)), dist=1)
Y=st_buffer(st_point(c(2,1)), dist=1)
st_intersection(X, Y)
st_difference(X, Y)
st_difference(Y, X)
st_union(X, Y)
st_sym_difference(X, Y)
12.6 SUMMARY
This topic will understand the use of R for script-based geographic information
systems (GIS) overlay analysis workflows and simple script GIS analyses and
present results in interactive, self-documented web pages.
The use of R for script-based workflows for GIS overlay analysis is covered in
this course. Users will be able to script basic GIS analyses and display the
findings in self-documented, interactive web pages by the end of the course. R
is an alternate to ArcPy or the QGIS Python API, being heavily utilized,
extendable, free, and script-based. Beyond its original conception as a
statistical programming environment, it is also increasingly frequently utilized for
other programming tasks. R capabilities can be expanded by using packages
like the leaflet package and sf (Simple Features for R). These features, along
with R's rather simple programming language, make it easier to do GIS
analysis, produce data summaries from geoprocessing procedures, produce
excellent tables and graphics, and include maps into reports.
12.7 GLOSSARY
A shapefile: is a file-based data format native to ArcGIS software (a much
older version of ArcGIS Pro). Conceptually, a shapefile is a feature class–it
stores a collection of features that have the same geometry type (point, line, or
polygon), the same attributes, and a common spatial extent.
GeoPackage: This is a relatively new data format that follows open format
standards (i.e. it is non-proprietary). It’s built on top of SQLite (a self-contained
relational database). Its one big advantage over many other vector formats is its
compactness–coordinate value, metadata, attribute table, projection
information, etc…, are all stored in a single file which facilitates portability. Its
filename usually ends in .gpkg. Applications such as QGIS (2.12 and up), R and
ArcGIS will recognize this format.
Raster Data File Formats: Rasters are in part defined by their pixel depth.
Pixel depth defines the range of distinct values the raster can store. For
example, a 1-bit raster can only store 2 distinct values: 0 and 1.
There is a wide range of raster file formats used in the GIS world. Some of the
most popular ones are listed below.
https://journal.r-project.org/articles/RJ-2017-067/