First, create a folder named assignments under the vsp_bigdata folder in your Documents folder.
IMPORTANT - PLEASE FOLLOW THE EXACT FOLDER NAME. OTHERWISE, YOU WILL GET LOST.
Now, let’s download the Gapminder data set from this link: Gapminder 2016 CSV file.
Now, download the Gapminder geographic coordinate file from this link: Gapminder geographic coordinate file.
Once you have downloaded the files, move or copy them to the assignments folder you just created.
Next, please download a clean R template file here: Assignment 4 template R file. Make sure to save this R template file in the assignments folder you just created.
Finally, let’s start RStudio. Once RStudio has loaded, open the Assignment 4 template R file. The template should be a blank script apart from the header description below.
###########################################
#
# VSP Urban Big Data Assignment 4
#
# Title: Data visualization with Gapminder data
# Group number: <your input>
# Name: <your input>
# Date: <your input>
#
###########################################
The purpose of this assignment is to get you familiar with data visualization in R. By now, you should be quite familiar with the Gapminder data from your previous assignments. For this assignment, you will work with the Gapminder data for 2016 and the coordinate data available for each country in the Gapminder data. This assignment will walk you through how to load the data and create basic visualizations.
For Assignment 4, you will need to install the five packages listed below. These packages provide the core functions for data visualization and web mapping. You will also need to download the Gapminder datasets and load the data into RStudio.
install.packages("dplyr")
install.packages("magrittr")
install.packages("ggplot2")
install.packages("plotly")
install.packages("leaflet")
library(dplyr)
library(magrittr)
library(ggplot2)
library(plotly)
library(leaflet)
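Running install.packages() again re-downloads packages you already have. A minimal sketch, assuming you only want to install what is missing; missing_packages is a hypothetical helper name used only here, not part of any package:

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed yet.
missing_packages = function(pkgs) {
  installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed = c("dplyr", "magrittr", "ggplot2", "plotly", "leaflet")
to_install = missing_packages(needed)
print(to_install)  # an empty result means everything is already installed

# Uncomment to install only the missing packages:
# if (length(to_install) > 0) install.packages(to_install)
```

You still need the library() calls in every new R session; installing is only needed once per computer.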
Now, let’s load the Gapminder files into the RStudio environment.
Make sure to replace [your user name] in the path with your own user name, and run only the lines for your operating system.
# For Windows (R accepts forward slashes in Windows paths)
gapminder = read.csv("C:/Users/[your user name]/Documents/vsp_bigdata/assignments/gapminder_data_2016.csv")
geo = read.csv("C:/Users/[your user name]/Documents/vsp_bigdata/assignments/gapminder_geo.csv")
# For Mac
gapminder = read.csv("/Users/[your user name]/Documents/vsp_bigdata/assignments/gapminder_data_2016.csv")
geo = read.csv("/Users/[your user name]/Documents/vsp_bigdata/assignments/gapminder_geo.csv")
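A mistyped path is a common reason for read.csv to fail at this step. A small sketch, assuming you would like a clearer error message; read_assignment_csv is a hypothetical helper, and the demo uses a temporary folder with a tiny two-row file rather than your real assignments folder:

```r
# Hypothetical helper: build the path with file.path() and check it before reading.
read_assignment_csv = function(folder, filename) {
  path = file.path(folder, filename)
  if (!file.exists(path)) {
    stop("File not found: ", path, " (check the folder name and your user name)")
  }
  read.csv(path)
}

# Demo: write a tiny CSV file to a temporary folder, then read it back.
demo_folder = tempdir()
writeLines(c("name,lat,long", "Afghanistan,33,66", "Albania,41,20"),
           file.path(demo_folder, "demo_geo.csv"))
demo = read_assignment_csv(demo_folder, "demo_geo.csv")
print(demo)
```

With your real files, the first argument would be the full path of your assignments folder and the second the file name, for example "gapminder_geo.csv".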
The Gapminder 2016 dataset has 4 variables and 187 observations. Let’s take a look at the first 6 rows of the data.
head(gapminder)
## name region income lifeExp
## 1 Afghanistan asia 1740 58.0
## 2 Albania europe 11400 77.7
## 3 Algeria africa 14000 77.4
## 4 Andorra europe 48200 82.5
## 5 Angola africa 6030 64.7
## 6 Antigua and Barbuda americas 20800 77.3
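Besides head(), a few base R functions can confirm the size and column types of a data frame. The sketch below runs them on a small stand-in built from the six rows printed above (gap_head is a name used only here); on the full file you would call the same functions on gapminder:

```r
# Stand-in data frame holding only the six rows shown by head(gapminder)
gap_head = data.frame(
  name    = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  region  = c("asia", "europe", "africa", "europe", "africa", "americas"),
  income  = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

print(dim(gap_head))              # number of rows and number of columns
str(gap_head)                     # column names and types
print(summary(gap_head$lifeExp))  # minimum, quartiles, mean, and maximum
```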
Now let’s take a look at the geographic coordinate data.
head(geo)
## name lat long population
## 1 Afghanistan 33.00000 66.00000 34700000
## 2 Albania 41.00000 20.00000 2930000
## 3 Algeria 28.00000 3.00000 40600000
## 4 Andorra 42.50779 1.52109 77300
## 5 Angola -12.50000 18.50000 28800000
## 6 Antigua and Barbuda 17.05000 -61.80000 101000
We can use the column called name to join the two data sets together to prepare for mapping later. We are going to use inner_join, which keeps only the countries that appear in both data sets, so that every remaining country has complete geographic data.
# Join gapminder data and the geographic coordinates
gapminder = gapminder %>% inner_join(geo, by="name")
# Check the joined data
head(gapminder)
## name region income lifeExp lat long
## 1 Afghanistan asia 1740 58.0 33.00000 66.00000
## 2 Albania europe 11400 77.7 41.00000 20.00000
## 3 Algeria africa 14000 77.4 28.00000 3.00000
## 4 Andorra europe 48200 82.5 42.50779 1.52109
## 5 Angola africa 6030 64.7 -12.50000 18.50000
## 6 Antigua and Barbuda americas 20800 77.3 17.05000 -61.80000
## population
## 1 34700000
## 2 2930000
## 3 40600000
## 4 77300
## 5 28800000
## 6 101000
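Because inner_join silently drops rows without a match, it can help to see which rows were dropped. A sketch with two toy tables, using anti_join from dplyr to list the unmatched rows; the gaps in these toy tables are made up for illustration and say nothing about the real files:

```r
library(dplyr)

# Toy tables: here Algeria has no coordinates and Andorra has no income value.
# These gaps are invented for the example.
gap_toy = data.frame(name = c("Afghanistan", "Albania", "Algeria"),
                     income = c(1740, 11400, 14000))
geo_toy = data.frame(name = c("Afghanistan", "Albania", "Andorra"),
                     lat = c(33, 41, 42.50779),
                     long = c(66, 20, 1.52109))

joined  = inner_join(gap_toy, geo_toy, by = "name")  # rows found in both tables
dropped = anti_join(gap_toy, geo_toy, by = "name")   # rows of gap_toy with no match

print(joined)
print(dropped)
```

Running anti_join on the original gapminder and geo tables, before gapminder is overwritten by the join, would list any countries that the join removes.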
We will use the ggplot2 package, which allows you to use the following data visualization functions.
geom_histogram() # histogram
geom_point() # scatter plot
geom_smooth() # trend line
geom_bar() # bar plot
geom_line() # line plot
For this assignment, we will use three of these visualization functions: geom_histogram(), geom_point(), and geom_smooth().
To start a plot, call the ggplot function with the data and declare the x and y axes inside aes. aes is short for “aesthetic”, and what goes inside this parameter are the variables pulled from the data.
ggplot(gapminder, aes(x = income))
You will see a blank plot without any data in it, but that’s okay: no geom layer has been added yet.
The geom_histogram() function creates a histogram using the x-axis variable inside the aes parameter.
ggplot(gapminder, aes(x = income)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
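The message above means ggplot2 fell back to 30 bins. A sketch of choosing the bins yourself; the binwidth of 5000 and the 20 bins are arbitrary example values, and the stand-in data frame holds only the six incomes printed earlier:

```r
library(ggplot2)

# Stand-in data: the six income values shown by head(gapminder)
income_toy = data.frame(income = c(1740, 11400, 14000, 48200, 6030, 20800))

# Option 1: fix the width of each bin (here 5000 income units per bar)
p_width = ggplot(income_toy, aes(x = income)) + geom_histogram(binwidth = 5000)

# Option 2: fix the number of bins instead
p_bins = ggplot(income_toy, aes(x = income)) + geom_histogram(bins = 20)

# print(p_width) or print(p_bins) draws the plot
```

Setting either argument also makes the message go away.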
The geom_point() function creates a scatter plot using the x and y variables inside the aes parameter.
ggplot(gapminder, aes(x = income, y=lifeExp)) + geom_point()
The relationship looks non-linear, so we can take the log of income or use a log scale for the x-axis.
# Create a log-transformed income variable
gapminder = gapminder %>% mutate(income_log = log(income))
# Plot again with the log-transformed income
ggplot(gapminder, aes(x = income_log, y=lifeExp)) + geom_point()
Or we can apply a log scale only for plotting.
# Plot again with a log scale applied to the x-axis
ggplot(gapminder, aes(x = income, y=lifeExp)) + geom_point() + scale_x_log10()
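Note that log() in R is the natural logarithm, while scale_x_log10() uses base 10. The two differ only by a constant factor, so the shape of the scatter plot is the same in both versions and only the axis labels differ. A quick check with the six incomes printed earlier:

```r
income = c(1740, 11400, 14000, 48200, 6030, 20800)

natural = log(income)    # natural log, as in mutate(income_log = log(income))
base10  = log10(income)  # base-10 log, as used by scale_x_log10()

# ln(x) = log10(x) * ln(10), so the ratio is the same constant for every value
print(natural / base10)
print(log(10))
```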
The geom_boxplot() function creates a boxplot. The x-axis should be a categorical variable, and the y-axis a continuous numeric variable. Note that the geom_jitter function spreads the points out sideways so that they do not sit on top of each other.
ggplot(gapminder, aes(x = region, y = lifeExp)) +
geom_boxplot(outlier.colour = "hotpink") +
geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1/4)
The geom_jitter layer is not necessary, but it adds more depth to our visualization.
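Because geom_jitter moves the points by a random amount, the plot changes slightly each time it is drawn. A sketch, assuming you want the same picture every time, that uses the seed argument of position_jitter (the value 1 is arbitrary) on a stand-in data frame of the six rows printed earlier:

```r
library(ggplot2)

# Stand-in data: region and life expectancy for the six rows shown by head(gapminder)
gap_toy = data.frame(
  region  = c("asia", "europe", "africa", "europe", "africa", "americas"),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

p_box = ggplot(gap_toy, aes(x = region, y = lifeExp)) +
  geom_boxplot(outlier.colour = "hotpink") +
  geom_jitter(position = position_jitter(width = 0.1, height = 0, seed = 1),
              alpha = 1/4)

# print(p_box) draws the plot; the jittered points land in the same place each time
```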
The geom_smooth() function creates a trend line using a model such as a linear regression, a local regression, or a polynomial function.
# Plot a trend line
ggplot(gapminder, aes(x = income, y=lifeExp)) + geom_point() + geom_smooth() + scale_x_log10()
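By default, geom_smooth picks the model for you, which is a local regression for a data set of this size. A sketch of requesting a linear regression instead with method = "lm"; the stand-in data frame holds only the six rows printed earlier:

```r
library(ggplot2)

# Stand-in data: income and life expectancy for the six rows shown by head(gapminder)
gap_toy = data.frame(
  income  = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

p_lm = ggplot(gap_toy, aes(x = income, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +  # linear regression trend line
  scale_x_log10()

# print(p_lm) draws the plot
```

The scale transformation is applied before the model is fitted, so the line is straight on the log-scaled axis.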
Now, we will split the data into the four continents using the facet_grid function. It is a very useful function for slicing the data in different ways and seeing the patterns in each slice.
ggplot(gapminder, aes(x = income, y = lifeExp)) +
geom_point(aes(color = region)) +
geom_smooth(aes(color = region)) +
facet_grid(.~region) +
scale_x_log10()
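In facet_grid, the formula reads rows ~ columns, so . ~ region places the continents side by side. A sketch of both orientations on a stand-in data frame of the six rows printed earlier; region ~ . stacks the panels vertically instead:

```r
library(ggplot2)

# Stand-in data: the six rows shown by head(gapminder)
gap_toy = data.frame(
  region  = c("asia", "europe", "africa", "europe", "africa", "americas"),
  income  = c(1740, 11400, 14000, 48200, 6030, 20800),
  lifeExp = c(58.0, 77.7, 77.4, 82.5, 64.7, 77.3)
)

base_plot = ggplot(gap_toy, aes(x = income, y = lifeExp)) +
  geom_point(aes(color = region)) +
  scale_x_log10()

p_cols = base_plot + facet_grid(. ~ region)  # one column per region
p_rows = base_plot + facet_grid(region ~ .)  # one row per region

# print(p_cols) or print(p_rows) draws the plot
```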
Modify the R code chunk above to create one histogram for population, one histogram for income, and one histogram for life expectancy (lifeExp). Describe your graphs (e.g. mean and distribution).
Hint: you may use geom_histogram().
Modify the R code chunk above to create a scatter plot and a trend line showing the relationship between income and life expectancy by continent. Please transform the variable accordingly. Describe your graph.
Hint: you may use geom_point() and geom_smooth().
For advanced students, use the grid.arrange() function from the gridExtra package to combine two plots: a box plot showing income levels and a scatter plot showing the relationship between income and life expectancy. Add a title to each plot using the ggtitle function. Also use the ggplotGrob function, which converts a plot into a gtable object, to align the graphs by changing their widths. Extra 2 points: Along with your assignment 4, add the combined graphs with titles and descriptions. See this vignette for aligning plot panels.
VSP BigData [assignment number] - [your name]
VSP BigData Assignment 4 - Bill Gates