First, create a folder named assignments
under the vsp_bigdata
folder in your Documents folder.
IMPORTANT - PLEASE FOLLOW THE EXACT FOLDER NAME. OTHERWISE, YOU WILL GET LOST.
Now, let’s download the gapminder data set from this link: Gapminder CSV file .
Once you downloaded the file, move or copy the gapminder file to the assignments
folder you just created.
Next, please download a clean R template file here: Assignment 3 template R file .
Make sure to save this R template file under the assignments
folder you just created.
Finally, let’s run the R Studio program. Once the R Studio is loaded, you can open the assignment 3 template R file. The R template file should be a blank script with the header description below.
###########################################
#
# VSP Urban Big Data Assignment 3
#
# Title: Data wrangling with Gapminder data
# Group number: <your input>
# Name: <your input>
# Date: <your input>
#
###########################################
The purpose of this assignment is to get you familiar with the R environment and get your hands wet with real data. You may be familiar with the Gapminder data from your previous assignment. For this assignment, you will work with the actual data behind the gapminder tool. This assignment will walk you through how to load the data and do a basic data wrangling with them.
For the assignment 3, you will need to install the dplyr
package. This package allows you to use core functions to do a data wrangling. You will also need to download the gapminder dataset and load the data to R Studio.
# Set CRAN repository source
options(repos="https://cran.rstudio.com" )
install.packages("dplyr")
##
## The downloaded binary packages are in
## /var/folders/kn/rkjcf7yd4s7dlxxrpnm6dkg00000gp/T//RtmpjBUfEV/downloaded_packages
install.packages("tidyverse")
##
## The downloaded binary packages are in
## /var/folders/kn/rkjcf7yd4s7dlxxrpnm6dkg00000gp/T//RtmpjBUfEV/downloaded_packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ readr 1.3.1
## ✔ tibble 2.0.1 ✔ purrr 0.3.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ ggplot2 3.2.0 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Now, let’s load the gapminder file to the R Studio environment.
Make sure to change <your user name>
with your own user name for the path.
# For Windows
gapminder = read.csv("C:\Users\<your user name>\Documents\vsp_bigdata\assignments\gapminder_data.csv")
# For Mac
gapminder = read.csv("/Users/<your user name>/Documents/vsp_bigdata/assignments/gapminder_data.csv")
The gapminder dataset has 6 variables and 1704 observations. Let’s take look at the first 6 rows of the data.
head(gapminder)
## country continent year lifeExp pop gdpPercap
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
We can list the variables in the dataset, using the names()
function.
names(gapminder)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
Another way to see the “structure” of the dataset is to run the str()
function.
str(gapminder)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
You’ll notice that the 1st 2 columns/variables “country” and “continent” are both “Factor” type variables. This means that they are texts, or more precisely, categorical variables Variables 3 and 5 “year” and “pop” are both “int” Integer type variables. Variables 4 and 6 “lifeExp” and “gdpPercap” are both “num” Numeric type variables.
The built in function summary()
in base R does a good simple summary statistics for all variables in the dataset provided. Since this dataset only has 6 variable, we can simply call summary(gapminder)
which will give us the summary statistics for all 6 variables.
summary(gapminder)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
One of the important features in R is the %>%
operator. This operator allows for chaining multiple functions together by incorporating %>%
. This operator will forward a value, or the result of an expression, into the next function call/expression. For instance a function to filter data can be written as:
# Normally, you will need to include dataset in all the functions, like this:
filter(data, variable == numeric_value)
# But with the %>% operator, you can simply chain different functions in a forward direciton, like this:
data %>% filter(variable == numeric_value)
As your function tasks get longer the %>% operator becomes more efficient and makes your code more legible. In addition, although not covered in this tutorial, the %>% operator allows you to flow from data manipulation tasks straight into vizualization functions (via ggplot and ggvis) and also into many analytic functions.
To learn more about the %>% operator and the magrittr package visit any of the following:
http://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html
http://www.r-bloggers.com/simpler-r-coding-with-pipes-the-present-and-future-of-the-magrittr-package/
http://blog.revolutionanalytics.com/2014/07/magrittr-simplifying-r-code-with-pipes.html
We will use the dplyr
package, which allows you to use the following data wrangling functions.
select()
filter()
mutate()
group_by()
summarise()
arrange()
join()
For this assignment, we will walk you through five basic functions: select()
, filter()
, mutate()
, group_by()
, summarise()
.
Objective: Reduce dataframe size to only desired variables for current task
Description: When working with a sizable dataframe, often we desire to only assess specific variables. The select() function allows you to select and/or rename variables.
# Only select country, continent, year, lifeExp, and pop
core_data = gapminder %>% select(country, continent, year, pop, lifeExp)
head(core_data)
## country continent year pop lifeExp
## 1 Afghanistan Asia 1952 8425333 28.801
## 2 Afghanistan Asia 1957 9240934 30.332
## 3 Afghanistan Asia 1962 10267083 31.997
## 4 Afghanistan Asia 1967 11537966 34.020
## 5 Afghanistan Asia 1972 13079460 36.088
## 6 Afghanistan Asia 1977 14880372 38.438
Objective: Reduce rows/observations with matching conditions
Description: Filtering data is a common task to identify/select observations in which a particular variable matches a specific value/condition. The filter() function provides this capability.
# Filter data only for years between 1982 and 2002
core_data = gapminder %>% filter(between(year, 1982, 2002))
head(core_data)
## country continent year lifeExp pop gdpPercap
## 1 Afghanistan Asia 1982 39.854 12881816 978.0114
## 2 Afghanistan Asia 1987 40.822 13867957 852.3959
## 3 Afghanistan Asia 1992 41.674 16317921 649.3414
## 4 Afghanistan Asia 1997 41.763 22227415 635.3414
## 5 Afghanistan Asia 2002 42.129 25268405 726.7341
## 6 Albania Europe 1982 70.420 2780097 3630.8807
gapminder %>% filter(country == "United States")
## country continent year lifeExp pop gdpPercap
## 1 United States Americas 1952 68.440 157553000 13990.48
## 2 United States Americas 1957 69.490 171984000 14847.13
## 3 United States Americas 1962 70.210 186538000 16173.15
## 4 United States Americas 1967 70.760 198712000 19530.37
## 5 United States Americas 1972 71.340 209896000 21806.04
## 6 United States Americas 1977 73.380 220239000 24072.63
## 7 United States Americas 1982 74.650 232187835 25009.56
## 8 United States Americas 1987 75.020 242803533 29884.35
## 9 United States Americas 1992 76.090 256894189 32003.93
## 10 United States Americas 1997 76.810 272911760 35767.43
## 11 United States Americas 2002 77.310 287675526 39097.10
## 12 United States Americas 2007 78.242 301139947 42951.65
Now, let’s filter data to choose only one country, USA. Noice that we use ==
operator to choose the country.
# Filter data to choose USA
core_data = gapminder %>% filter(country == "United States")
head(core_data)
## country continent year lifeExp pop gdpPercap
## 1 United States Americas 1952 68.44 157553000 13990.48
## 2 United States Americas 1957 69.49 171984000 14847.13
## 3 United States Americas 1962 70.21 186538000 16173.15
## 4 United States Americas 1967 70.76 198712000 19530.37
## 5 United States Americas 1972 71.34 209896000 21806.04
## 6 United States Americas 1977 73.38 220239000 24072.63
You can chain the select()
and filter()
functions in one line.
# Select and filter data all together
core_data = gapminder %>%
select(country, continent, year, pop, lifeExp) %>%
filter(between(year, 1982, 2002)) %>%
filter(country == "United States")
head(core_data)
## country continent year pop lifeExp
## 1 United States Americas 1982 232187835 74.65
## 2 United States Americas 1987 242803533 75.02
## 3 United States Americas 1992 256894189 76.09
## 4 United States Americas 1997 272911760 76.81
## 5 United States Americas 2002 287675526 77.31
Objective: Creates new variables
Description: Often we want to create a new variable that is a function of the current variables in our dataframe or even just add a new variable. The mutate() function allows us to add new variables while preserving the existing variables.
# Let's add a variable that represents gross domestic product (GDP) in 2002
core_data = gapminder %>%
filter(year == 2002) %>%
mutate(gdp = pop * gdpPercap)
head(core_data)
## country continent year lifeExp pop gdpPercap gdp
## 1 Afghanistan Asia 2002 42.129 25268405 726.7341 18363410424
## 2 Albania Europe 2002 75.651 3508512 4604.2117 16153932130
## 3 Algeria Africa 2002 70.994 31287142 5288.0404 165447670333
## 4 Angola Africa 2002 41.003 10866106 2773.2873 30134833901
## 5 Argentina Americas 2002 74.340 38331121 8797.6407 337223430800
## 6 Australia Oceania 2002 80.370 19546792 30687.7547 599847158654
Objective: Group data by categorical variables
Description: Often, observations are nested within groups or categories and our goals is to perform statistical analysis both at the observation level and also at the group level. The group_by() function allows us to create these categorical groupings.
# Now let's group the data by continent
core_data = gapminder %>% group_by(continent)
head(core_data)
## # A tibble: 6 x 6
## # Groups: continent [1]
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
The group_by()
function is a silent function in which no observable manipulation of the data is performed as a result of applying the function. Rather, the only change you’ll notice is, if you print the dataframe you will notice underneath the Source information and prior to the actual dataframe, an indicator of what variable the data is grouped by will be provided. The real magic of the group_by()
function comes when we perform summary statistics which we will cover shortly.
Objective: Perform summary statistics on variables
Description: Obviously the goal of all this data wrangling is to be able to perform statistical analysis on our data. The summarise() function allows us to perform the majority of the initial summary statistics when performing exploratory data analysis.
# Lets get the mean population for each continent in 2007
core_data = gapminder %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarise(meanPop = mean(pop, na.rm=TRUE))
core_data
## # A tibble: 5 x 2
## continent meanPop
## <fct> <dbl>
## 1 Africa 17875763.
## 2 Americas 35954847.
## 3 Asia 115513752.
## 4 Europe 19536618.
## 5 Oceania 12274974.
Modify the R code chunk above to select country
and pop
. Then filter the data for Canada
for year 2007.
Hint: you may use select()
and filter()
.
Modify the R code chunk above to group by continent and calculate mean life expectancy by continent for year 2007.
Hint: you may use filter()
, group_by()
, and summarise()
.
VSP BigData [assignment number] - [your name]
VSP BigData Assignment 3 - Bill Gates