Instructions

1. Synopsis

The purpose of this assignment is to get you familiar with the R environment and get your hands wet with real data. You may be familiar with the Gapminder data from your previous assignment. For this assignment, you will work with the actual data behind the gapminder tool. This assignment will walk you through how to load the data and do a basic data wrangling with them.

2. Basic setup

Install required package

For the assignment 3, you will need to install the dplyr package. This package allows you to use core functions to do a data wrangling. You will also need to download the gapminder dataset and load the data to R Studio.

# Set CRAN repository source
options(repos="https://cran.rstudio.com" )

install.packages("dplyr")

## 
## The downloaded binary packages are in
##  /var/folders/kn/rkjcf7yd4s7dlxxrpnm6dkg00000gp/T//RtmpjBUfEV/downloaded_packages

install.packages("tidyverse")

## 
## The downloaded binary packages are in
##  /var/folders/kn/rkjcf7yd4s7dlxxrpnm6dkg00000gp/T//RtmpjBUfEV/downloaded_packages

Load the package

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.0     ✔ readr   1.3.1
## ✔ tibble  2.0.1     ✔ purrr   0.3.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ ggplot2 3.2.0     ✔ forcats 0.4.0

## ── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Gapminder dataset

Now, let’s load the gapminder file to the R Studio environment.
Make sure to change <your user name> with your own user name for the path.

# For Windows
gapminder = read.csv("C:\Users\<your user name>\Documents\vsp_bigdata\assignments\gapminder_data.csv")

# For Mac
gapminder = read.csv("/Users/<your user name>/Documents/vsp_bigdata/assignments/gapminder_data.csv")

The gapminder dataset has 6 variables and 1704 observations. Let’s take look at the first 6 rows of the data.

head(gapminder)

##       country continent year lifeExp      pop gdpPercap
## 1 Afghanistan      Asia 1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia 1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia 1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia 1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia 1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia 1977  38.438 14880372  786.1134

We can list the variables in the dataset, using the names() function.

Variables (columns) in gapminder dataset

names(gapminder)

## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

Structure of the gapminder dataset

Another way to see the “structure” of the dataset is to run the str() function.

str(gapminder)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

You’ll notice that the 1st 2 columns/variables “country” and “continent” are both “Factor” type variables. This means that they are texts, or more precisely, categorical variables Variables 3 and 5 “year” and “pop” are both “int” Integer type variables. Variables 4 and 6 “lifeExp” and “gdpPercap” are both “num” Numeric type variables.

Summary statistics of the gapminder dataset

The built in function summary() in base R does a good simple summary statistics for all variables in the dataset provided. Since this dataset only has 6 variable, we can simply call summary(gapminder) which will give us the summary statistics for all 6 variables.

summary(gapminder)

##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
##

3. Data wrangling in R

%>% Operator

One of the important features in R is the %>% operator. This operator allows for chaining multiple functions together by incorporating %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. For instance a function to filter data can be written as:

# Normally, you will need to include dataset in all the functions, like this:
filter(data, variable == numeric_value)

# But with the %>% operator, you can simply chain different functions in a forward direciton, like this:
data %>% filter(variable == numeric_value)

As your function tasks get longer the %>% operator becomes more efficient and makes your code more legible. In addition, although not covered in this tutorial, the %>% operator allows you to flow from data manipulation tasks straight into vizualization functions (via ggplot and ggvis) and also into many analytic functions.

To learn more about the %>% operator and the magrittr package visit any of the following:
http://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html
http://www.r-bloggers.com/simpler-r-coding-with-pipes-the-present-and-future-of-the-magrittr-package/
http://blog.revolutionanalytics.com/2014/07/magrittr-simplifying-r-code-with-pipes.html

Key functions for data wrangling in R

We will use the dplyr package, which allows you to use the following data wrangling functions.

select()
filter()
mutate()
group_by()
summarise()
arrange()
join()

For this assignment, we will walk you through five basic functions: select(), filter(), mutate(), group_by(), summarise().

select( ) function:

Objective: Reduce dataframe size to only desired variables for current task

Description: When working with a sizable dataframe, often we desire to only assess specific variables. The select() function allows you to select and/or rename variables.

# Only select country, continent, year, lifeExp, and pop
core_data = gapminder %>% select(country, continent, year, pop, lifeExp)

head(core_data)

##       country continent year      pop lifeExp
## 1 Afghanistan      Asia 1952  8425333  28.801
## 2 Afghanistan      Asia 1957  9240934  30.332
## 3 Afghanistan      Asia 1962 10267083  31.997
## 4 Afghanistan      Asia 1967 11537966  34.020
## 5 Afghanistan      Asia 1972 13079460  36.088
## 6 Afghanistan      Asia 1977 14880372  38.438

filter( ) function:

Objective: Reduce rows/observations with matching conditions

Description: Filtering data is a common task to identify/select observations in which a particular variable matches a specific value/condition. The filter() function provides this capability.

# Filter data only for years between 1982 and 2002
core_data = gapminder %>% filter(between(year, 1982, 2002))

head(core_data)

##       country continent year lifeExp      pop gdpPercap
## 1 Afghanistan      Asia 1982  39.854 12881816  978.0114
## 2 Afghanistan      Asia 1987  40.822 13867957  852.3959
## 3 Afghanistan      Asia 1992  41.674 16317921  649.3414
## 4 Afghanistan      Asia 1997  41.763 22227415  635.3414
## 5 Afghanistan      Asia 2002  42.129 25268405  726.7341
## 6     Albania    Europe 1982  70.420  2780097 3630.8807

gapminder %>% filter(country == "United States")

##          country continent year lifeExp       pop gdpPercap
## 1  United States  Americas 1952  68.440 157553000  13990.48
## 2  United States  Americas 1957  69.490 171984000  14847.13
## 3  United States  Americas 1962  70.210 186538000  16173.15
## 4  United States  Americas 1967  70.760 198712000  19530.37
## 5  United States  Americas 1972  71.340 209896000  21806.04
## 6  United States  Americas 1977  73.380 220239000  24072.63
## 7  United States  Americas 1982  74.650 232187835  25009.56
## 8  United States  Americas 1987  75.020 242803533  29884.35
## 9  United States  Americas 1992  76.090 256894189  32003.93
## 10 United States  Americas 1997  76.810 272911760  35767.43
## 11 United States  Americas 2002  77.310 287675526  39097.10
## 12 United States  Americas 2007  78.242 301139947  42951.65

Now, let’s filter data to choose only one country, USA. Noice that we use == operator to choose the country.

# Filter data to choose USA
core_data = gapminder %>% filter(country == "United States")

head(core_data)

##         country continent year lifeExp       pop gdpPercap
## 1 United States  Americas 1952   68.44 157553000  13990.48
## 2 United States  Americas 1957   69.49 171984000  14847.13
## 3 United States  Americas 1962   70.21 186538000  16173.15
## 4 United States  Americas 1967   70.76 198712000  19530.37
## 5 United States  Americas 1972   71.34 209896000  21806.04
## 6 United States  Americas 1977   73.38 220239000  24072.63

You can chain the select() and filter() functions in one line.

# Select and filter data all together
core_data = gapminder %>% 
  select(country, continent, year, pop, lifeExp) %>% 
  filter(between(year, 1982, 2002)) %>%
  filter(country == "United States")

head(core_data)

##         country continent year       pop lifeExp
## 1 United States  Americas 1982 232187835   74.65
## 2 United States  Americas 1987 242803533   75.02
## 3 United States  Americas 1992 256894189   76.09
## 4 United States  Americas 1997 272911760   76.81
## 5 United States  Americas 2002 287675526   77.31

mutate( ) function:

Objective: Creates new variables

Description: Often we want to create a new variable that is a function of the current variables in our dataframe or even just add a new variable. The mutate() function allows us to add new variables while preserving the existing variables.

# Let's add a variable that represents gross domestic product (GDP) in 2002
core_data = gapminder %>% 
  filter(year == 2002) %>%
  mutate(gdp = pop * gdpPercap)
  
head(core_data)

##       country continent year lifeExp      pop  gdpPercap          gdp
## 1 Afghanistan      Asia 2002  42.129 25268405   726.7341  18363410424
## 2     Albania    Europe 2002  75.651  3508512  4604.2117  16153932130
## 3     Algeria    Africa 2002  70.994 31287142  5288.0404 165447670333
## 4      Angola    Africa 2002  41.003 10866106  2773.2873  30134833901
## 5   Argentina  Americas 2002  74.340 38331121  8797.6407 337223430800
## 6   Australia   Oceania 2002  80.370 19546792 30687.7547 599847158654

group_by( ) function:

Objective: Group data by categorical variables

Description: Often, observations are nested within groups or categories and our goals is to perform statistical analysis both at the observation level and also at the group level. The group_by() function allows us to create these categorical groupings.

# Now let's group the data by continent
core_data = gapminder %>% group_by(continent)

head(core_data)

## # A tibble: 6 x 6
## # Groups:   continent [1]
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

The group_by() function is a silent function in which no observable manipulation of the data is performed as a result of applying the function. Rather, the only change you’ll notice is, if you print the dataframe you will notice underneath the Source information and prior to the actual dataframe, an indicator of what variable the data is grouped by will be provided. The real magic of the group_by() function comes when we perform summary statistics which we will cover shortly.

summarise( ) function:

Objective: Perform summary statistics on variables

Description: Obviously the goal of all this data wrangling is to be able to perform statistical analysis on our data. The summarise() function allows us to perform the majority of the initial summary statistics when performing exploratory data analysis.

# Lets get the mean population for each continent in 2007
core_data = gapminder %>% 
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(meanPop = mean(pop, na.rm=TRUE))
  
core_data

## # A tibble: 5 x 2
##   continent    meanPop
##   <fct>          <dbl>
## 1 Africa     17875763.
## 2 Americas   35954847.
## 3 Asia      115513752.
## 4 Europe     19536618.
## 5 Oceania    12274974.

Task 1

Modify the R code chunk above to select country and pop. Then filter the data for Canada for year 2007.
Hint: you may use select() and filter().

Task 2

Modify the R code chunk above to group by continent and calculate mean life expectancy by continent for year 2007.
Hint: you may use filter(), group_by(), and summarise().

Task 3

Please email the document to the course email (urbanbigdata2019@gmail.com).
[IMPORTANT] Please use the following email title format:
VSP BigData [assignment number] - [your name]
ex), VSP BigData Assignment 3 - Bill Gates
Assignment 3 is due this Thursday (July 25 5:00 PM)

Urban Big Data Analytics - Assignment 3

Andy Hong

July 22, 2019

Prerequisites