Please locate your vsp_bigdata folder under “My Documents” and navigate to group-session. Create a 08-lecture folder under the group-session folder.
For this group session, we will use the Gapminder database.
Please download this CSV file and save it under the group session folder: Gapminder data 2016
Now, you need to download the gapminder geographic data (coordinates): Gapminder geographic coordinates
The purpose of this group session is to get you familiar with visualization and web mapping. We will go through three parts.
First, we will join the gapminder data with the geographic coordinates.
Second, we will create a few plots showing the relationship between income and life expectancy.
Third, we will create a map showing each of the variables. Then we will export the map as an interactive web map that can be shared with other people.
First, we are going to join the gapminder data with the geographic coordinates data. Let’s load the libraries first.
# Set CRAN repository source
options(repos="https://cran.rstudio.com")
# Install packages (only needed once per computer)
install.packages(c("leaflet", "dplyr", "magrittr", "ggplot2", "plotly"))
# Load packages
library(leaflet)
library(dplyr)
library(magrittr)
library(ggplot2)
library(plotly)
Let’s load the data.
# file.choose() opens a file dialog: pick the Gapminder CSV first, then the coordinates CSV
gapminder = read.csv(file.choose())
geo = read.csv(file.choose())
Let’s examine the data with the head() function.
head(gapminder)
## name region income lifeExp
## 1 Afghanistan asia 1740 58.0
## 2 Albania europe 11400 77.7
## 3 Algeria africa 14000 77.4
## 4 Andorra europe 48200 82.5
## 5 Angola africa 6030 64.7
## 6 Antigua and Barbuda americas 20800 77.3
head(geo)
## name lat long population
## 1 Afghanistan 33.00000 66.00000 34700000
## 2 Albania 41.00000 20.00000 2930000
## 3 Algeria 28.00000 3.00000 40600000
## 4 Andorra 42.50779 1.52109 77300
## 5 Angola -12.50000 18.50000 28800000
## 6 Antigua and Barbuda 17.05000 -61.80000 101000
We can also view the data in a more familiar tabular format.
View(gapminder)
View(geo)
Another way to see the “structure” of the dataset is to run the str() function.
str(gapminder)
## 'data.frame': 187 obs. of 4 variables:
## $ name : Factor w/ 187 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ region : Factor w/ 4 levels "africa","americas",..: 3 4 1 4 1 2 2 4 3 4 ...
## $ income : int 1740 11400 14000 48200 6030 20800 18500 8170 44400 44100 ...
## $ lifeExp: num 58 77.7 77.4 82.5 64.7 77.3 76.7 75.7 82.5 81.5 ...
str(geo)
## 'data.frame': 187 obs. of 4 variables:
## $ name : Factor w/ 187 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ lat : num 33 41 28 42.5 -12.5 ...
## $ long : num 66 20 3 1.52 18.5 ...
## $ population: int 34700000 2930000 40600000 77300 28800000 101000 43800000 2920000 24100000 8710000 ...
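You will notice that the first 2 columns/variables “name” and “region” are both “Factor” type variables. This means that they hold text, or more precisely, categorical values. “income” and “lifeExp” are “int” (integer) and “num” (numeric) type variables. If you are running R 4.0 or later, read.csv() reads text columns as “chr” (character) by default, so your str() output may show “chr” instead of “Factor”.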
The geo dataset contains name as well as lat, long, and population columns. The “lat” and “long” columns are “num” (numeric) type variables, and population is an “int” (integer) type variable.
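Whether text columns arrive as “Factor” or “chr” depends on how read.csv() is called. The following is a minimal sketch, assuming three rows copied from the head(gapminder) output and supplied as inline text rather than a file; the object names csv_text, as_chr, and as_fct are made up for this illustration.

```r
# Three rows copied from the head(gapminder) output, supplied as inline text
csv_text <- "name,region,income,lifeExp
Afghanistan,asia,1740,58.0
Albania,europe,11400,77.7
Algeria,africa,14000,77.4"

# Keep text columns as plain character strings
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Ask explicitly for factors (categorical variables)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Compare the column types reported for name and region
str(as_chr)
str(as_fct)
```

Setting stringsAsFactors explicitly is one way to get the same column types regardless of the R version installed on your computer.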
The built-in summary() function in base R produces simple summary statistics for every variable in the dataset provided. Since this dataset only has 4 variables, we can simply call summary(gapminder), which will give us the summary statistics for all 4 variables.
summary(gapminder)
## name region income lifeExp
## Afghanistan : 1 africa :54 Min. : 625 Min. :50.30
## Albania : 1 americas:34 1st Qu.: 3325 1st Qu.:66.65
## Algeria : 1 asia :54 Median : 10800 Median :73.50
## Andorra : 1 europe :45 Mean : 17351 Mean :72.21
## Angola : 1 3rd Qu.: 23850 3rd Qu.:77.65
## Antigua and Barbuda: 1 Max. :118000 Max. :83.90
## (Other) :181
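For a dataset with many more columns, it may be more practical to summarise only the columns of interest. Here is a small sketch under that assumption, using a toy data frame built from the six rows printed by head(gapminder); the name toy is hypothetical.

```r
# Toy data: the six rows shown by head(gapminder)
toy <- data.frame(
  name    = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  region  = c("asia", "europe", "africa", "europe", "africa", "americas"),
  income  = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3),
  stringsAsFactors = TRUE
)

# Summarise only the two numeric columns
summary(toy[, c("income", "lifeExp")])

# Summarise a single column
summary(toy$lifeExp)

# Individual statistics are also available as separate functions
income_median <- median(toy$income)
income_median
```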
It looks like the column name is common across the two data sets. Now, let’s join the data together to prepare for mapping later. We are going to use inner_join so that we keep only the countries that appear in both data sets, that is, countries with complete geographic data.
# Join gapminder data and the geographic coordinates
gapminder = gapminder %>% inner_join(geo, by="name")
# Check the joined data
head(gapminder)
## name region income lifeExp lat long
## 1 Afghanistan asia 1740 58.0 33.00000 66.00000
## 2 Albania europe 11400 77.7 41.00000 20.00000
## 3 Algeria africa 14000 77.4 28.00000 3.00000
## 4 Andorra europe 48200 82.5 42.50779 1.52109
## 5 Angola africa 6030 64.7 -12.50000 18.50000
## 6 Antigua and Barbuda americas 20800 77.3 17.05000 -61.80000
## population
## 1 34700000
## 2 2930000
## 3 40600000
## 4 77300
## 5 28800000
## 6 101000
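To see what inner_join() keeps and drops, here is a minimal sketch on toy data. It assumes a coordinates table that is deliberately missing one country; the data frames stats and coords are made up for illustration, with values taken from the output above. anti_join() is used as one possible way to list the rows that found no match.

```r
library(dplyr)

# Toy statistics table with three countries
stats <- data.frame(
  name   = c("Afghanistan", "Albania", "Algeria"),
  income = c(1740, 11400, 14000),
  stringsAsFactors = FALSE
)

# Toy coordinates table that deliberately leaves out Algeria
coords <- data.frame(
  name = c("Afghanistan", "Albania"),
  lat  = c(33, 41),
  long = c(66, 20),
  stringsAsFactors = FALSE
)

# inner_join() keeps only the rows whose name appears in both tables
joined <- inner_join(stats, coords, by = "name")
joined

# anti_join() lists the rows of stats that have no match in coords
dropped <- anti_join(stats, coords, by = "name")
dropped
```

Running a check like this before mapping can reveal countries that silently disappear because their names are spelled differently in the two files.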
Now, let’s create some plots for exploratory data analysis. We will first create a simple graph showing life expectancy grouped by region.
ggplot(gapminder, aes(x = region, y = lifeExp)) +
geom_boxplot()
Let’s add all the data points and color the outliers.
ggplot(gapminder, aes(x = region, y = lifeExp)) +
geom_boxplot(outlier.colour = "hotpink") +
geom_jitter()
Let’s follow Tufte’s rule 7, separating layers, by lowering the opacity of the points and narrowing the jitter.
ggplot(gapminder, aes(x = region, y = lifeExp)) +
geom_boxplot(outlier.colour = "hotpink") +
geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1/4)
Next, let’s explore the relationship between income and life expectancy. What relationship do we expect to see?
ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point()
We can be a little fancy by adding a smooth trend line.
ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point() +
geom_smooth()
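Because income spans a wide range, most points may crowd together at the low end of the x-axis. One option, not part of the original walkthrough, is a logarithmic x-axis via scale_x_log10(). This sketch assumes a toy data frame built from the six rows printed by head(gapminder); with the full dataset you could keep geom_smooth() as well.

```r
library(ggplot2)

# Toy data: the six rows shown by head(gapminder)
toy <- data.frame(
  name    = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  income  = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

# Same scatter plot as before, but with income on a log10 scale
p_log <- ggplot(toy, aes(x = income, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

print(p_log)
```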
We can also color different continents.
ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point(aes(color = region)) +
geom_smooth()
We can color the trend lines as well.
ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point(aes(color = region)) +
geom_smooth(aes(color = region))
Or we can just cut the data and show each continent separately using the facet_grid(row ~ column) option.
ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point(aes(color = region)) +
geom_smooth() +
facet_grid(.~region)
The x-axis is hard to read. Let’s rotate the text using the theme option.
ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point(aes(color = region)) +
geom_smooth() +
facet_grid(.~region) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Lastly, we can also make the plot interactive, so that we can see which dot represents which country. Note that we added a text option in geom_point to include country names.
p = ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point(aes(text = paste("Country:", name), color = region)) +
geom_smooth() +
facet_grid(.~region) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(p)
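By default the hover box may list every mapped aesthetic, and ggplot2 may warn that text is an unknown aesthetic; the warning is harmless here because plotly reads that field. If you only want the country name in the hover box, one option is the tooltip argument of ggplotly(). This is a sketch on a toy data frame (the six rows from head(gapminder)); the names toy, p_toy, and w are hypothetical.

```r
library(ggplot2)
library(plotly)

# Toy data: the six rows shown by head(gapminder)
toy <- data.frame(
  name    = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  region  = c("asia", "europe", "africa", "europe", "africa", "americas"),
  income  = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

# The text aesthetic is not drawn by ggplot2; plotly uses it for the hover box
p_toy <- ggplot(toy, aes(x = income, y = lifeExp)) +
  geom_point(aes(text = paste("Country:", name), color = region))

# Show only the text field when hovering over a point
w <- ggplotly(p_toy, tooltip = "text")
w
```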
In part three, we will create an interactive web map using the gapminder data. Earlier we joined the coordinate information to the gapminder dataset, so we can use the coordinates to show countries on the map. We will use the powerful leaflet package to accomplish this task. First, we will just plot the points on the map, and then use the addCircleMarkers function to visualize variables on the map.
Let’s initiate leaflet and add the empty map tiles. We can use different map tiles available here: http://leaflet-extras.github.io/leaflet-providers/preview/
leaflet(gapminder) %>% addTiles()
# leaflet(gapminder) %>% addProviderTiles(provider = "Stamen.TonerLite")
# leaflet(gapminder) %>% addProviderTiles(provider = "Stamen.Toner")
# leaflet(gapminder) %>% addProviderTiles(provider = "Esri.WorldImagery")
Let’s see what coordinate information we can use. We can see that the variables to use are: long and lat.
head(gapminder)
## name region income lifeExp lat long
## 1 Afghanistan asia 1740 58.0 33.00000 66.00000
## 2 Albania europe 11400 77.7 41.00000 20.00000
## 3 Algeria africa 14000 77.4 28.00000 3.00000
## 4 Andorra europe 48200 82.5 42.50779 1.52109
## 5 Angola africa 6030 64.7 -12.50000 18.50000
## 6 Antigua and Barbuda americas 20800 77.3 17.05000 -61.80000
## population
## 1 34700000
## 2 2930000
## 3 40600000
## 4 77300
## 5 28800000
## 6 101000
Now, let’s add the latitude and longitude points on the map. Note that we use the tilde (~) sign to refer to column names without repeating the name of the data frame.
leaflet(gapminder) %>% addTiles() %>% addCircleMarkers(~long, ~lat)
Let’s use the variable income to visualize income levels on the map.
leaflet(gapminder) %>% addTiles() %>% addCircleMarkers(~long, ~lat, radius=~income)
What do you see on the screen? Why is it all blue?
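The income data ranges from $625 to $118,000, and addCircleMarkers interprets radius in pixels, so every circle is large enough to cover the whole map.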
gapminder %>% summarise(min=min(income), max=max(income))
## min max
## 1 625 118000
We need to rescale the data to visualize it on the map. We will first divide the income by 1000 and then take the square root, which compresses the range so that the largest circles do not overwhelm the smallest ones.
leaflet(gapminder) %>% addTiles() %>% addCircleMarkers(~long, ~lat, radius=~sqrt(income/1000))
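To see what the rescaling does to the numbers, here is a small worked example using only the minimum and maximum income reported above. The variable names are made up for illustration.

```r
# Minimum and maximum income taken from the summarise() output
income_range <- c(min = 625, max = 118000)

# Used directly as a radius in pixels, these values are far too large
raw_ratio <- income_range[["max"]] / income_range[["min"]]

# Divide by 1000, then take the square root
scaled_radius <- sqrt(income_range / 1000)

# Roughly 0.79 pixels for the poorest country and 10.86 for the richest
round(scaled_radius, 2)

# The richest-to-poorest ratio shrinks from about 189 to about 13.7
scaled_ratio <- scaled_radius[["max"]] / scaled_radius[["min"]]
c(raw = raw_ratio, scaled = scaled_ratio)
```

Because the area of a circle grows with the square of its radius, taking the square root also means that circle area, rather than radius, is proportional to income.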
Congratulations! You’ve created your first interactive map.
Now, it’s a lot better, but we don’t know which country is which, and it doesn’t do anything if we hover over them. Let’s label the point with the country name.
leaflet(gapminder) %>% addTiles() %>% addCircleMarkers(~long, ~lat, radius=~sqrt(income/1000), label=~name)
We can see the country name when we move the mouse over each circle. To tidy this up, we can create a variable column that rescales our variable of interest, and a label column that holds a descriptive label. Note that we can create multiple columns in a single call to the mutate function.
gapminder = gapminder %>%
mutate(variable = income/1000, label = paste(name, "- Income: ", variable, "k"))
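paste() inserts a space between each piece, so the label above ends up with extra spaces around the number. If you prefer tighter labels, one alternative is paste0() combined with round(). This is a sketch on two toy rows rather than the full dataset; the data frame toy is hypothetical.

```r
library(dplyr)

# Two toy rows taken from the head(gapminder) output
toy <- data.frame(
  name   = c("Afghanistan", "Albania"),
  income = c(1740, 11400),
  stringsAsFactors = FALSE
)

# A column created inside mutate() can be reused later in the same call
toy <- toy %>%
  mutate(
    variable = income / 1000,
    label    = paste0(name, " - Income: ", round(variable, 1), "k")
  )

toy$label
```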
Now, let’s see the final map with an appropriate scale and label.
leaflet(gapminder) %>% addTiles() %>% addCircleMarkers(~long, ~lat, radius=~sqrt(variable), label=~label, weight=2)
For fun, we can color each circle according to its continent and look for any spatial patterns.
pal = colorFactor(rainbow(4), gapminder$region)
leaflet(gapminder) %>%
addTiles() %>%
addCircleMarkers(~long, ~lat, radius=~sqrt(variable), label=~label, weight=2, color=~pal(region))
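To share the map with other people, it can be written to an HTML file. The sketch below assumes the htmlwidgets package is installed (it is normally installed together with leaflet) and uses a toy data frame built from the six rows printed by head(gapminder). The output file name gapminder_map.html and the temporary folder are arbitrary choices for illustration, and the legend is an optional extra.

```r
library(leaflet)
library(magrittr)
library(htmlwidgets)

# Toy data: the six rows shown by head(gapminder) after the join
toy <- data.frame(
  name   = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
             "Antigua and Barbuda"),
  region = c("asia", "europe", "africa", "europe", "africa", "americas"),
  income = c(1740, 11400, 14000, 48200, 6030, 20800),
  lat    = c(33, 41, 28, 42.50779, -12.5, 17.05),
  long   = c(66, 20, 3, 1.52109, 18.5, -61.8)
)

# One color per region, as in the map above
pal <- colorFactor(rainbow(4), toy$region)

# Build the map and add a legend that explains the colors
m <- leaflet(toy) %>%
  addTiles() %>%
  addCircleMarkers(~long, ~lat, radius = ~sqrt(income / 1000),
                   label = ~name, weight = 2, color = ~pal(region)) %>%
  addLegend("bottomright", pal = pal, values = ~region, title = "Region")

# Write the map to an HTML file (hypothetical file name, temporary folder)
out_file <- file.path(tempdir(), "gapminder_map.html")
saveWidget(m, file = out_file, selfcontained = FALSE)
out_file
```

With selfcontained = FALSE the HTML file is saved next to a folder of supporting files, and both need to be shared together. Setting selfcontained = TRUE bundles everything into a single file, which is easier to send by email but requires pandoc, which RStudio ships with.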
You can assemble multiple plots using the grid.arrange() function from the gridExtra package. We will name the boxplot p1 and the scatter plot p2, and combine them in one figure.
Let’s install and load the gridExtra package first.
install.packages("gridExtra")
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
Now, we will create two ggplot objects.
# Box plot
p1 = ggplot(gapminder, aes(x = region, y = lifeExp)) +
geom_boxplot(outlier.colour = "hotpink") +
geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1/4)
# Scatter plot
p2 = ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point(aes(color = region)) +
geom_smooth() +
facet_grid(.~region) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Now, we are going to assemble them together.
grid.arrange(p1, p2, nrow = 1)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
We can also assemble vertically.
grid.arrange(p1, p2, ncol = 1)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
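When the two plots sit side by side, the faceted scatter plot usually needs more room than the boxplot. One way to handle this, shown here as a sketch on toy data, is the widths argument, which sets the relative width of each column. arrangeGrob() builds the combined figure without drawing it, so it can be stored in an object first; the names toy and combined are hypothetical.

```r
library(ggplot2)
library(gridExtra)

# Toy data: the six rows shown by head(gapminder)
toy <- data.frame(
  region  = c("asia", "europe", "africa", "europe", "africa", "americas"),
  income  = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

# Box plot
p1 <- ggplot(toy, aes(x = region, y = lifeExp)) +
  geom_boxplot()

# Scatter plot
p2 <- ggplot(toy, aes(x = income, y = lifeExp)) +
  geom_point(aes(color = region))

# Give the scatter plot twice the width of the boxplot
combined <- arrangeGrob(p1, p2, nrow = 1, widths = c(1, 2))

# Draw the stored figure
grid::grid.newpage()
grid::grid.draw(combined)
```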
There are other ways to customize your plots. For further information, check out this vignette.
Create a box plot comparing income levels of the four regions.
Create a scatter plot showing the relationship between life expectancy and income of the four regions. Show life expectancy on the x-axis and income on the y-axis.
Create an interactive map showing the life expectancy of each country. In the label, include country name and life expectancy.
Create a short report that includes the plot and the map, and describe each of your graphics.
For advanced students, use the grid.arrange() function to combine two plots: a box plot showing income levels, and a scatter plot showing the relationship between income and life expectancy. Please add a title to each plot using the ggtitle function. Extra 2 points: Along with your assignment 4, add the combined graphs with titles and descriptions.
VSP BigData [lecture number] - [group number] - [presenter name]
VSP BigData Lecture 8 - Group 1 - Bill Gates