Instructions

1. Synopsis

The purpose of this assignment is to get you familiar with data visualization in R. By now, you should be quite familiar with the Gapminder data from your previous assignments. For this assignment, you will work with the Gapminder data for 2016 and the coordinate data available for each country in the Gapminder data. This assignment will walk you through how to load the data and create basic data visualizations.

2. Basic setup

Install required packages

For Assignment 4, you will need to install the five packages listed below. These packages provide the core functions for data visualization and web mapping. You will also need to download the Gapminder datasets and load them into RStudio.

install.packages("dplyr")
install.packages("magrittr")
install.packages("ggplot2")
install.packages("plotly")
install.packages("leaflet")

Load the packages

library(dplyr)
library(magrittr)
library(ggplot2)
library(plotly)
library(leaflet)

Gapminder dataset

Now, let’s load the Gapminder files into the RStudio environment.
Make sure to replace [your user name] with your own user name in the path.

# For Windows (R accepts forward slashes in Windows paths)
gapminder = read.csv("C:/Users/[your user name]/Documents/vsp_bigdata/assignments/gapminder_data_2016.csv")
geo = read.csv("C:/Users/[your user name]/Documents/vsp_bigdata/assignments/gapminder_geo.csv")
  
# For Mac
gapminder = read.csv("/Users/[your user name]/Documents/vsp_bigdata/assignments/gapminder_data_2016.csv")
geo = read.csv("/Users/[your user name]/Documents/vsp_bigdata/assignments/gapminder_geo.csv")

The Gapminder 2016 dataset has 4 variables and 187 observations. Let’s take a look at the first 6 rows of the data.

head(gapminder)

##                  name   region income lifeExp
## 1         Afghanistan     asia   1740    58.0
## 2             Albania   europe  11400    77.7
## 3             Algeria   africa  14000    77.4
## 4             Andorra   europe  48200    82.5
## 5              Angola   africa   6030    64.7
## 6 Antigua and Barbuda americas  20800    77.3
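
Before plotting, it can help to confirm that the file loaded as expected. The sketch below rebuilds only the six rows printed above as a stand-in data frame (the name gapminder_head is made up for this example) so that it runs without the CSV file. On the full data, dim(gapminder) should report 187 rows and 4 columns.

```r
# Stand-in for the loaded data: the six rows printed by head(gapminder)
gapminder_head <- data.frame(
  name    = c("Afghanistan", "Albania", "Algeria",
              "Andorra", "Angola", "Antigua and Barbuda"),
  region  = c("asia", "europe", "africa", "europe", "africa", "americas"),
  income  = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

dim(gapminder_head)             # rows and columns: 6 and 4 for this stand-in
str(gapminder_head)             # column names and types
summary(gapminder_head$income)  # range and quartiles of income
```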

Now let’s take a look at the geographic coordinate data.

head(geo)

##                  name       lat      long population
## 1         Afghanistan  33.00000  66.00000   34700000
## 2             Albania  41.00000  20.00000    2930000
## 3             Algeria  28.00000   3.00000   40600000
## 4             Andorra  42.50779   1.52109      77300
## 5              Angola -12.50000  18.50000   28800000
## 6 Antigua and Barbuda  17.05000 -61.80000     101000

3. Data joining in R

We can use the name column, which appears in both datasets, to join them in preparation for mapping later. We are going to use inner_join so that we keep only the countries that have geographic data.

# Join gapminder data and the geographic coordinates
gapminder = gapminder %>% inner_join(geo, by="name")

# Check the joined data
head(gapminder)

##                  name   region income lifeExp       lat      long
## 1         Afghanistan     asia   1740    58.0  33.00000  66.00000
## 2             Albania   europe  11400    77.7  41.00000  20.00000
## 3             Algeria   africa  14000    77.4  28.00000   3.00000
## 4             Andorra   europe  48200    82.5  42.50779   1.52109
## 5              Angola   africa   6030    64.7 -12.50000  18.50000
## 6 Antigua and Barbuda americas  20800    77.3  17.05000 -61.80000
##   population
## 1   34700000
## 2    2930000
## 3   40600000
## 4      77300
## 5   28800000
## 6     101000
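
Because inner_join keeps only the names found in both tables, it is worth checking whether any country was dropped by the join. The following is a minimal sketch using two small stand-in tables built from rows shown above (gap_small and geo_small are made-up names). Andorra is deliberately left out of the coordinate table here purely for illustration; it is present in the real file. anti_join() from dplyr returns the rows of the first table that have no match in the second.

```r
library(dplyr)

# Stand-in tables; Andorra is omitted from geo_small on purpose
gap_small <- data.frame(
  name    = c("Afghanistan", "Albania", "Algeria", "Andorra"),
  lifeExp = c(58.0, 77.7, 77.4, 82.5)
)
geo_small <- data.frame(
  name = c("Afghanistan", "Albania", "Algeria"),
  lat  = c(33, 41, 28),
  long = c(66, 20, 3)
)

# Rows with a match in both tables
joined <- gap_small %>% inner_join(geo_small, by = "name")

# Rows of gap_small that inner_join would drop
dropped <- gap_small %>% anti_join(geo_small, by = "name")

nrow(joined)                 # 3
as.character(dropped$name)   # "Andorra"
```

Running the same anti_join on the full gapminder and geo data frames (before overwriting gapminder with the joined result) would list any countries that lack coordinates.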

Key functions for data visualization in R

We will use the ggplot2 package, which allows you to use the following data visualization functions.

geom_histogram() # histogram
geom_point() # scatter plot
geom_smooth() # trend line
geom_bar() # bar plot
geom_line() # line plot

For this assignment, we will use three of these functions: geom_histogram(), geom_point(), and geom_smooth().

Initialize ggplot

To initialize a plot, call the ggplot function with the data and declare the x and y axes inside aes. aes is short for “aesthetic”, and what goes inside this parameter are the variables to be pulled from the data.

ggplot(gapminder, aes(x = income))

You will see a blank plot without any data in it, but that’s okay.

geom_histogram( ) function:

This function allows you to create a histogram using the x-axis variable inside the aes parameter.

ggplot(gapminder, aes(x = income)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
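
The message above is ggplot2 noting that it fell back to 30 bins. Setting bins or binwidth explicitly silences the message and gives you control over the resolution of the histogram. Here is a minimal sketch using the six incomes printed earlier as stand-in data (income_head is a made-up name); the bin width of 10000 is an arbitrary choice for illustration, not a recommended value.

```r
library(ggplot2)

# Stand-in data: incomes from the six rows printed by head(gapminder)
income_head <- data.frame(income = c(1740, 11400, 14000, 48200, 6030, 20800))

# binwidth sets the width of each bar in income units;
# boundary = 0 makes the bin edges fall on multiples of the width
p <- ggplot(income_head, aes(x = income)) +
  geom_histogram(binwidth = 10000, boundary = 0)
p
```

With the full data, trying a few different widths is a reasonable way to see how sensitive the shape of the distribution is to the binning.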

geom_point( ) function:

This function allows you to create a scatter plot using x and y axes inside the aes parameter.

ggplot(gapminder, aes(x = income, y=lifeExp)) + geom_point()

The relationship looks non-linear, so we can either take the log of income or use a log scale for the x-axis.

# Create a log-transformed income variable
gapminder = gapminder %>% mutate(income_log = log(income))

# Plot again with the log-transformed income
ggplot(gapminder, aes(x = income_log, y=lifeExp)) + geom_point()

Or we can apply a log scale only for plotting.

# Plot again with a log scale applied to the x-axis
ggplot(gapminder, aes(x = income, y=lifeExp)) + geom_point() + scale_x_log10()
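
Note that log() in R is the natural logarithm, while scale_x_log10() uses base 10. The two differ only by a constant factor, so the shape of the scatter is the same in both plots; what changes is the axis labeling, since scale_x_log10() keeps the labels in the original income units. A quick check on the six incomes shown earlier:

```r
income <- c(1740, 11400, 14000, 48200, 6030, 20800)

natural <- log(income)    # what mutate(income_log = log(income)) stores
base10  <- log10(income)  # the transformation scale_x_log10() applies

# The ratio is the same constant, log(10), for every value
natural / base10

# TRUE: the two versions differ only by that constant factor
all.equal(natural, base10 * log(10))
```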

geom_boxplot( ) function:

This function allows you to create a boxplot. The x-axis should be a categorical variable, and the y-axis a continuous numeric variable. Note that the geom_jitter function spreads the points out so that they do not overlap.

ggplot(gapminder, aes(x = region, y = lifeExp)) +
  geom_boxplot(outlier.colour = "hotpink") +
  geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1/4)

The geom_jitter layer in the plot above is optional, but it adds depth to the visualization by overlaying the individual observations on each box.
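
To connect the boxes to actual numbers, the per-region median and interquartile range can be computed with dplyr. This is a minimal sketch on the six rows shown earlier, used as stand-in data; the column names n, median_lifeExp and iqr_lifeExp are made up for this example. The line inside each box corresponds to the median, and the height of the box to the interquartile range.

```r
library(dplyr)

# Stand-in data: region and lifeExp from the six rows printed by head(gapminder)
gapminder_head <- data.frame(
  region  = c("asia", "europe", "africa", "europe", "africa", "americas"),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

region_summary <- gapminder_head %>%
  group_by(region) %>%
  summarise(
    n              = n(),              # observations per region
    median_lifeExp = median(lifeExp),  # middle line of the box
    iqr_lifeExp    = IQR(lifeExp)      # height of the box
  )
region_summary
```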

geom_smooth( ) function:

This function creates a trend line using a model such as a linear regression, a local regression, or a polynomial fit.

# Plot a trend line
ggplot(gapminder, aes(x = income, y=lifeExp)) + geom_point() + geom_smooth() + scale_x_log10()
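
When no method is given, geom_smooth() chooses one automatically (a local regression, loess, for data sets with fewer than 1,000 observations) and prints a message saying which one it used. To request a straight-line fit instead, pass method = "lm". The following is a minimal sketch on the six rows shown earlier, used as stand-in data; se = FALSE, which hides the confidence band, is an optional choice made here only because six points are too few for a meaningful band.

```r
library(ggplot2)

# Stand-in data: income and lifeExp from the six rows printed by head(gapminder)
gapminder_head <- data.frame(
  income  = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

# Because of scale_x_log10(), the line is fitted against log10(income)
p <- ggplot(gapminder_head, aes(x = income, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_x_log10()
p
```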

Now, we will split the data into the four continents using the facet_grid function. It is a very useful function for slicing the data in different ways and comparing the patterns in each group.

ggplot(gapminder, aes(x = income, y = lifeExp)) +
  geom_point(aes(color = region)) +
  geom_smooth(aes(color = region)) + 
  facet_grid(.~region) + 
  scale_x_log10()
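
facet_grid(. ~ region) places all four panels in a single row, which can look cramped in a narrow window. As an alternative that is not part of the assignment code above, facet_wrap() from ggplot2 lays the panels out over several rows. Here is a minimal sketch on the six rows shown earlier, used as stand-in data; the trend line is left out because the stand-in has only one or two points per region.

```r
library(ggplot2)

# Stand-in data: the six rows printed by head(gapminder)
gapminder_head <- data.frame(
  region  = c("asia", "europe", "africa", "europe", "africa", "americas"),
  income  = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

# One panel per region, arranged in two rows
p <- ggplot(gapminder_head, aes(x = income, y = lifeExp)) +
  geom_point(aes(color = region)) +
  facet_wrap(~ region, nrow = 2) +
  scale_x_log10()
p
```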

Task 1

Modify the R code chunk above to create one histogram for population, one histogram for income, and one histogram for life expectancy (lifeExp). Describe your graphs (e.g., mean and distribution).
Hint: you may use geom_histogram().

Task 2

Modify the R code chunk above to create a scatter plot and a trend line showing the relationship between income and life expectancy by continent. Please transform the variable accordingly. Describe your graph.
Hint: you may use geom_point() and geom_smooth().

Task 3

For advanced students: use the grid.arrange() function from the gridExtra package to combine two plots: a box plot showing income levels and a scatter plot showing the relationship between income and life expectancy. Add a title to each plot using the ggtitle function. Also use the ggplotGrob function from the ggplot2 package to convert the plots to gtable objects and align them by setting their widths. Extra 2 points: along with your Assignment 4, submit the combined graphs with titles and descriptions. See this vignette for aligning plot panels.

Task 4

Please email the document to the course email (urbanbigdata2019@gmail.com).
[IMPORTANT] Please use the following email title format:
VSP BigData [assignment number] - [your name]
e.g., VSP BigData Assignment 4 - Bill Gates
Assignment 4 is due this Wednesday (Jul 31, 5:00 PM)

Urban Big Data Analytics - Assignment 4

Andy Hong

July 29, 2019

Prerequisites