Prerequisites

Please locate your vsp_bigdata folder under “My Documents” and navigate to group-session. Create an 08-lecture folder under the group-session folder.

  1. For this group session, we will use the Gapminder database.
    Please download this CSV file and save it under the group session folder: Gapminder data 2016

  2. Now, you need to download the gapminder geographic data (coordinates): Gapminder geographic coordinates



Instruction

1. Synopsis

The purpose of this group session is to get you familiar with visualization and web mapping. We will go through three parts.

First, we will join the gapminder data with the geographic coordinates.

Second, we will create a few plots showing the relationship between income and life expectancy.

Third, we will create a map showing each of the variables. Then we will export the map as an interactive web map that can be shared with other people.



2. Part one - Joining data

First, we are going to join the gapminder data with the geographic coordinates data. Let’s load the libraries first.

# Set CRAN repository source
options(repos="https://cran.rstudio.com")

# Install packages (only needed once; R prints the download location after each install)
install.packages("leaflet")
install.packages("dplyr")
install.packages("magrittr")
install.packages("ggplot2")
install.packages("plotly")

# Load packages
library(leaflet)
library(dplyr)
library(magrittr)
library(ggplot2)
library(plotly)
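
Running install.packages() every time re-downloads packages that are already on your machine. A minimal sketch of installing only what is missing follows. The helper name missing_packages is our own and not part of any package, and the interactive() guard is an assumption we add so that a non-interactive script never starts a download.

```r
# Packages used in this session
needed <- c("leaflet", "dplyr", "magrittr", "ggplot2", "plotly")

# Hypothetical helper: return the packages in `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(needed)

# Install only in an interactive session, and only if something is missing
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

If to_install is empty, nothing is downloaded and you can go straight to the library() calls.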

Let’s load the data.

# Pick each file in the dialog: first the Gapminder CSV, then the geographic coordinates CSV
gapminder = read.csv(file.choose())
geo = read.csv(file.choose())

Let’s examine the data with the head() function.

head(gapminder)
##                  name   region income lifeExp
## 1         Afghanistan     asia   1740    58.0
## 2             Albania   europe  11400    77.7
## 3             Algeria   africa  14000    77.4
## 4             Andorra   europe  48200    82.5
## 5              Angola   africa   6030    64.7
## 6 Antigua and Barbuda americas  20800    77.3
head(geo)
##                  name       lat      long population
## 1         Afghanistan  33.00000  66.00000   34700000
## 2             Albania  41.00000  20.00000    2930000
## 3             Algeria  28.00000   3.00000   40600000
## 4             Andorra  42.50779   1.52109      77300
## 5              Angola -12.50000  18.50000   28800000
## 6 Antigua and Barbuda  17.05000 -61.80000     101000

We can also view the data in a more familiar tabular format.

View(gapminder)
View(geo)

Another way to see the “structure” of the dataset is to run the str() function.

str(gapminder)
## 'data.frame':    187 obs. of  4 variables:
##  $ name   : Factor w/ 187 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ region : Factor w/ 4 levels "africa","americas",..: 3 4 1 4 1 2 2 4 3 4 ...
##  $ income : int  1740 11400 14000 48200 6030 20800 18500 8170 44400 44100 ...
##  $ lifeExp: num  58 77.7 77.4 82.5 64.7 77.3 76.7 75.7 82.5 81.5 ...
str(geo)
## 'data.frame':    187 obs. of  4 variables:
##  $ name      : Factor w/ 187 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ lat       : num  33 41 28 42.5 -12.5 ...
##  $ long      : num  66 20 3 1.52 18.5 ...
##  $ population: int  34700000 2930000 40600000 77300 28800000 101000 43800000 2920000 24100000 8710000 ...

You will notice that the first two columns, “name” and “region”, are both “Factor” variables. This means that they hold text, or more precisely, categorical values. “income” and “lifeExp” are “int” (integer) and “num” (numeric) variables, respectively.
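
Whether text columns load as “Factor” depends on your R version: since R 4.0.0, read.csv() keeps text as character (“chr”) unless you ask for factors. The sketch below shows both behaviors on three rows copied from the head() output. The inline csv_text string is only a stand-in for the real file.

```r
# Three rows copied from head(gapminder); a stand-in for the downloaded CSV
csv_text <- "name,region,income,lifeExp
Afghanistan,asia,1740,58.0
Albania,europe,11400,77.7
Algeria,africa,14000,77.4"

# Ask for factors explicitly (matches the str() output in this handout)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Keep text as character (the default since R 4.0.0)
as_char <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$region)  # "factor"
class(as_char$region)    # "character"
```

Either form works for the rest of this session. If your str() output shows “chr” instead of “Factor”, that is the reason.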

The geo dataset contains name as well as lat, long, and population columns. The “lat” and “long” columns are “num” (numeric) variables, and population is an “int” (integer) variable.

The built-in summary() function in base R gives simple summary statistics for every variable in a dataset. Since this dataset has only 4 variables, we can call summary(gapminder) to see the summary statistics for all of them.

summary(gapminder)
##                   name          region       income          lifeExp     
##  Afghanistan        :  1   africa  :54   Min.   :   625   Min.   :50.30  
##  Albania            :  1   americas:34   1st Qu.:  3325   1st Qu.:66.65  
##  Algeria            :  1   asia    :54   Median : 10800   Median :73.50  
##  Andorra            :  1   europe  :45   Mean   : 17351   Mean   :72.21  
##  Angola             :  1                 3rd Qu.: 23850   3rd Qu.:77.65  
##  Antigua and Barbuda:  1                 Max.   :118000   Max.   :83.90  
##  (Other)            :181

The column name appears in both data sets, so we can use it as the join key. Now, let’s join the data together to prepare for mapping later. We are going to use inner_join so that we keep only the countries that appear in both data sets, that is, the countries that have geographic coordinates.

# Join gapminder data and the geographic coordinates
gapminder = gapminder %>% inner_join(geo, by="name")

# Check the joined data
head(gapminder)
##                  name   region income lifeExp       lat      long
## 1         Afghanistan     asia   1740    58.0  33.00000  66.00000
## 2             Albania   europe  11400    77.7  41.00000  20.00000
## 3             Algeria   africa  14000    77.4  28.00000   3.00000
## 4             Andorra   europe  48200    82.5  42.50779   1.52109
## 5              Angola   africa   6030    64.7 -12.50000  18.50000
## 6 Antigua and Barbuda americas  20800    77.3  17.05000 -61.80000
##   population
## 1   34700000
## 2    2930000
## 3   40600000
## 4      77300
## 5   28800000
## 6     101000
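
inner_join silently drops rows that have no match, so it is worth checking what was lost. The sketch below uses two tiny made-up tables and anti_join to list the rows that inner_join would discard. The country “Atlantis” and its income are hypothetical and exist only to force a mismatch.

```r
library(dplyr)

# Made-up example tables; "Atlantis" has no coordinates on purpose
gap_small <- data.frame(
  name   = c("Albania", "Algeria", "Atlantis"),
  income = c(11400, 14000, 5000),
  stringsAsFactors = FALSE
)
geo_small <- data.frame(
  name = c("Albania", "Algeria"),
  lat  = c(41, 28),
  long = c(20, 3),
  stringsAsFactors = FALSE
)

# Rows present in both tables
joined <- inner_join(gap_small, geo_small, by = "name")

# Rows of gap_small with no partner in geo_small
dropped <- anti_join(gap_small, geo_small, by = "name")
dropped$name
```

On the real data, anti_join(gapminder, geo, by = "name") run before the join should return zero rows if every country has coordinates.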



3. Part two - Creating plots

Now, let’s create some plots for exploratory data analysis. We will first create a simple graph showing life expectancy grouped by continent.

ggplot(gapminder, aes(x = region, y = lifeExp)) +
  geom_boxplot()

Let’s add all the data points and add color for the outliers.

ggplot(gapminder, aes(x = region, y = lifeExp)) +
  geom_boxplot(outlier.colour = "hotpink") +
  geom_jitter()

Let’s follow Tufte’s rule 7 of separating layers by changing the opacity of the points.

ggplot(gapminder, aes(x = region, y = lifeExp)) +
  geom_boxplot(outlier.colour = "hotpink") +
  geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1/4)

Next, let’s explore the relationship between income and life expectancy. What relationship do we expect to see?

ggplot(gapminder, aes(x = income, y = lifeExp)) +
  geom_point()

We can be a little fancy by adding a smooth trend line.

ggplot(gapminder, aes(x = income, y = lifeExp)) +
  geom_point() +
  geom_smooth()
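
Because income ranges over more than two orders of magnitude, most points crowd against the left edge of a linear axis. One common option, not used in the plots in this handout, is a logarithmic x-axis via scale_x_log10(). Here is a sketch on six rows taken from the head() output (sample_rows is our own name):

```r
library(ggplot2)

# Six rows copied from head(gapminder)
sample_rows <- data.frame(
  name    = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  region  = c("asia", "europe", "africa", "europe", "africa", "americas"),
  income  = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

# Same scatter plot, with income on a log10 axis
p_log <- ggplot(sample_rows, aes(x = income, y = lifeExp)) +
  geom_point(aes(color = region)) +
  scale_x_log10()
```

Type p_log in the console to draw it. With the full gapminder data frame in place of sample_rows, the relationship usually looks much closer to a straight line.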

We can also color the points by continent.

ggplot(gapminder, aes(x = income, y = lifeExp)) +
  geom_point(aes(color = region)) +
  geom_smooth()

We can color the trend lines as well.

ggplot(gapminder, aes(x = income, y = lifeExp)) +
  geom_point(aes(color = region)) +
  geom_smooth(aes(color = region))

Or we can split the data and show each continent in its own panel using the facet_grid(rows ~ columns) option.

ggplot(gapminder, aes(x = income, y = lifeExp)) +
  geom_point(aes(color = region)) +
  geom_smooth() + 
  facet_grid(.~region)

The x-axis labels are hard to read. Let’s rotate them using the theme option.

ggplot(gapminder, aes(x = income, y = lifeExp)) +
  geom_point(aes(color = region)) +
  geom_smooth() + 
  facet_grid(.~region) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
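
In facet_grid(rows ~ columns), a dot means “do not split on this side”. Swapping the formula to region ~ . stacks the panels vertically, so every continent shares one income axis and the labels no longer collide. Here is a sketch on six rows taken from the head() output (sample_rows and p_rows are our own names):

```r
library(ggplot2)

# Six rows copied from head(gapminder)
sample_rows <- data.frame(
  region  = c("asia", "europe", "africa", "europe", "africa", "americas"),
  income  = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

# One row of panels per region, a single shared x-axis
p_rows <- ggplot(sample_rows, aes(x = income, y = lifeExp)) +
  geom_point(aes(color = region)) +
  facet_grid(region ~ .)
```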

Lastly, we can also make the plot interactive, so that we can see which dot represents which country. Note that we added a text option in geom_point to include country names.

p = ggplot(gapminder, aes(x = income, y = lifeExp)) +
  geom_point(aes(text = paste("Country:", name), color = region)) +
  geom_smooth() + 
  facet_grid(.~region) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplotly(p)
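
By default, ggplotly() lists every mapped aesthetic in the tooltip. If you only want the country name, the tooltip argument can restrict it to the text aesthetic. Here is a sketch on six rows from the head() output. ggplot2 may warn that text is an unknown aesthetic, which is expected, because only plotly uses it.

```r
library(ggplot2)
library(plotly)

# Six rows copied from head(gapminder)
sample_rows <- data.frame(
  name    = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  region  = c("asia", "europe", "africa", "europe", "africa", "americas"),
  income  = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

p_small <- ggplot(sample_rows, aes(x = income, y = lifeExp)) +
  geom_point(aes(text = paste("Country:", name), color = region))

# Show only the "text" aesthetic when hovering over a point
widget <- ggplotly(p_small, tooltip = "text")
```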



4. Part three - Creating an interactive map

In part three, we will create an interactive web map using the gapminder data. Earlier we joined the coordinate information to the gapminder dataset, so we can use the coordinates to show countries on the map. We will use the powerful leaflet package to accomplish this task. First, we will plot the points on the map, then use the addCircleMarkers function to visualize variables on the map.

Let’s initiate leaflet and add the empty map tiles. We can use different map tiles available here: http://leaflet-extras.github.io/leaflet-providers/preview/

leaflet(gapminder) %>% addTiles()
# leaflet(gapminder) %>% addProviderTiles(provider = "Stamen.TonerLite")
# leaflet(gapminder) %>% addProviderTiles(provider = "Stamen.Toner")
# leaflet(gapminder) %>% addProviderTiles(provider = "Esri.WorldImagery")

Let’s see what coordinate information we can use. We can see that the variables to use are: long and lat.

head(gapminder)
##                  name   region income lifeExp       lat      long
## 1         Afghanistan     asia   1740    58.0  33.00000  66.00000
## 2             Albania   europe  11400    77.7  41.00000  20.00000
## 3             Algeria   africa  14000    77.4  28.00000   3.00000
## 4             Andorra   europe  48200    82.5  42.50779   1.52109
## 5              Angola   africa   6030    64.7 -12.50000  18.50000
## 6 Antigua and Barbuda americas  20800    77.3  17.05000 -61.80000
##   population
## 1   34700000
## 2    2930000
## 3   40600000
## 4      77300
## 5   28800000
## 6     101000

Now, let’s add the latitude and longitude points on the map. Note that we put a tilde (~) in front of a column name so that leaflet looks the column up in the data we passed to leaflet(), without writing gapminder$ each time.

leaflet(gapminder) %>% addTiles() %>% addCircleMarkers(~long, ~lat)

Let’s use the variable income to visualize income levels on the map.

leaflet(gapminder) %>% addTiles() %>% addCircleMarkers(~long, ~lat, radius=~income)

What do you see on the screen? Why is it all blue?
The income data ranges from $625 to $118,000, and the radius is measured in pixels, so every circle is far larger than the screen.

gapminder %>% summarise(min=min(income), max=max(income))
##   min    max
## 1 625 118000

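We need to rescale the data to visualize it on the map. We will first divide the income by 1000 and then take the square root. The square root compresses the large values, and because a circle’s area grows with the square of its radius, each circle’s area ends up proportional to income.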

leaflet(gapminder) %>% addTiles() %>% addCircleMarkers(~long, ~lat, radius=~sqrt(income/1000))
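
To see what the rescaling does, compare the radii at the two extremes of the income range reported by summarise(). This is a small worked calculation in base R (the variable names are our own):

```r
# Minimum and maximum income from the summarise() output
income_range <- c(min = 625, max = 118000)

# Rescaled radius in pixels, as in the map call
scaled_radius <- sqrt(income_range / 1000)
round(scaled_radius, 2)  # min 0.79, max 10.86

# Circle area is proportional to radius^2, so the ratio of areas
# equals the ratio of incomes
area_ratio   <- scaled_radius[["max"]]^2 / scaled_radius[["min"]]^2
income_ratio <- income_range[["max"]] / income_range[["min"]]
```

The largest circle is now about 11 pixels in radius instead of 118,000, and a country with twice the income gets a circle with twice the area.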

Congratulations! You’ve created your first interactive map.

Now, it’s a lot better, but we don’t know which country is which, and nothing happens when we hover over a circle. Let’s label each point with the country name.

leaflet(gapminder) %>% addTiles() %>% addCircleMarkers(~long, ~lat, radius=~sqrt(income/1000), label=~name)

We can see the country name when we move the mouse over each circle. To tidy this up, we can create one column that holds our variable of interest on an appropriate scale, and another column that holds an appropriate label.

Note that we can create several columns in a single mutate() call, and a later column can use one defined earlier in the same call.

gapminder = gapminder %>% 
  mutate(variable = income/1000, label = paste(name, "- Income: ", variable, "k"))
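
paste() inserts a space between every piece, so the label for the first row comes out as "Afghanistan - Income:  1.74 k", with a doubled space and a gap before the k. If you prefer tighter labels, sprintf() gives full control over spacing and rounding. Here is a sketch on three countries from the data (make_label is our own helper name, not part of dplyr or leaflet):

```r
# Hypothetical helper: build "Country - Income: 1.7k" style labels
make_label <- function(name, income) {
  # as.character() guards against `name` being a factor
  sprintf("%s - Income: %.1fk", as.character(name), income / 1000)
}

labels <- make_label(c("Afghanistan", "Albania", "Andorra"),
                     c(1740, 11400, 48200))
labels
# "Afghanistan - Income: 1.7k" "Albania - Income: 11.4k" "Andorra - Income: 48.2k"
```

Inside mutate() this would read label = make_label(name, income).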

Now, let’s see the final map with an appropriate scale and label.

leaflet(gapminder) %>% addTiles() %>% addCircleMarkers(~long, ~lat, radius=~sqrt(variable), label=~label, weight=2)

For fun, we can color each circle according to its continent and look for any spatial patterns.

pal = colorFactor(rainbow(4), gapminder$region)

leaflet(gapminder) %>% 
  addTiles() %>% 
  addCircleMarkers(~long, ~lat, radius=~sqrt(variable), label=~label, weight=2, color=~pal(region))
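
The synopsis mentions exporting the map as a web page that can be shared. One way to do this, sketched here under the assumption that the htmlwidgets package is installed (it is a dependency of leaflet), is saveWidget(). The sample data are six rows from the head() output, and the file name gapminder_map.html is our own choice.

```r
library(leaflet)
library(htmlwidgets)

# Six rows copied from the joined head(gapminder) output
sample_points <- data.frame(
  name   = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
             "Antigua and Barbuda"),
  lat    = c(33, 41, 28, 42.50779, -12.5, 17.05),
  long   = c(66, 20, 3, 1.52109, 18.5, -61.8),
  income = c(1740, 11400, 14000, 48200, 6030, 20800),
  stringsAsFactors = FALSE
)

web_map <- leaflet(sample_points) %>%
  addTiles() %>%
  addCircleMarkers(~long, ~lat, radius = ~sqrt(income / 1000), label = ~name)

# Write the map to an HTML file; here it goes to a temporary folder
out_file <- file.path(tempdir(), "gapminder_map.html")

# selfcontained = FALSE writes the page plus a folder of supporting files.
# Setting it to TRUE bundles everything into one file, which needs pandoc.
saveWidget(web_map, file = out_file, selfcontained = FALSE)
```

Open the resulting HTML file in any browser to check it. For sharing by email, a single self-contained file is usually easier to send.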



5. [Advanced Material] Combining multiple plots

You can assemble multiple plots using the grid.arrange() function from the gridExtra package. We will name the boxplot p1 and the scatter plot p2, and combine them in one figure.

Let’s install and load the gridExtra package first.

install.packages("gridExtra")
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

Now, we will create two ggplot objects.

# Box plot
p1 = ggplot(gapminder, aes(x = region, y = lifeExp)) +
  geom_boxplot(outlier.colour = "hotpink") +
  geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1/4) 
  
# Scatter plot
p2 = ggplot(gapminder, aes(x = income, y = lifeExp)) +
  geom_point(aes(color = region)) +
  geom_smooth() + 
  facet_grid(.~region) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Now, we are going to assemble them together.

grid.arrange(p1, p2, nrow = 1)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We can also assemble vertically.

grid.arrange(p1, p2, ncol = 1)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
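
grid.arrange() draws the combined figure on screen. To save it to a file, one option is to build the layout with arrangeGrob() (also from gridExtra) and pass it to ggsave(). Here is a sketch on six rows from the head() output. The object names and the file name combined_plots.pdf are our own.

```r
library(ggplot2)
library(gridExtra)

# Six rows copied from head(gapminder)
sample_rows <- data.frame(
  region  = c("asia", "europe", "africa", "europe", "africa", "americas"),
  income  = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

box <- ggplot(sample_rows, aes(x = region, y = lifeExp)) +
  geom_boxplot()
dots <- ggplot(sample_rows, aes(x = income, y = lifeExp)) +
  geom_point(aes(color = region))

# Build the side-by-side layout without drawing it
combined <- arrangeGrob(box, dots, nrow = 1)

# Save to a PDF in a temporary folder; width and height are in inches
out_file <- file.path(tempdir(), "combined_plots.pdf")
ggsave(out_file, plot = combined, width = 10, height = 4)
```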

There are other ways to customize your plots. For further information, check out this vignette.



Export and send the plot and the map

  1. Create a box plot comparing income levels of the four regions.

  2. Create a scatter plot showing the relationship between life expectancy and income of the four regions. Show life expectancy on the x-axis and income on the y-axis.

  3. Create an interactive map showing the life expectancy of each country. In the label, include country name and life expectancy.

  4. Create a short report that includes the plot and the map, and describe each of your graphics.

  5. For advanced students, use the grid.arrange() function to combine two plots: a box plot showing income levels and a scatter plot showing the relationship between income and life expectancy. Please add a title to each plot using the ggtitle() function. Extra 2 points: Along with your assignment 4, add the combined graphs with titles and descriptions.

  6. Send your report to the course email (urbanbigdata2019@gmail.com).