Instructions

1. Synopsis

The purpose of this group session is to build some basic machine learning models.

In part one, we will run the K-means clustering algorithm on the USArrests data.
In part two, we will implement decision tree, bagging, and random forest models using the Boston housing data.

2. Part one - K-means clustering

First, we will install the factoextra package, which provides the fviz_cluster() function for visualizing clustering results (the kmeans() function itself is part of base R). We will also load other core packages: ggplot2, magrittr, and dplyr.

# Set CRAN repository source
options(repos="https://cran.rstudio.com")

# Install and load packages
install.packages("factoextra")
library(factoextra)
library(ggplot2)
library(magrittr)
library(dplyr)

Let’s load the USArrests data, which is included in R by default. It contains 50 observations (one per US state) on 4 variables:

Murder - numeric Murder arrests (per 100,000)
Assault - numeric Assault arrests (per 100,000)
UrbanPop - numeric Percent urban population
Rape - numeric Rape arrests (per 100,000)

#Load data
data("USArrests")
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

Because k-means is based on Euclidean distance, variables measured on larger scales would dominate the clustering. We therefore scale the data first, and then run the k-means algorithm to cluster the states into four groups.

# Scale the data
df = scale(USArrests)

# Compute K-means clusters (k = 4); set a seed and use multiple random
# starts (nstart = 25) so the result is stable and reproducible
set.seed(1)
result = kmeans(df, centers = 4, nstart = 25)

# See cluster
result$cluster
##        Alabama         Alaska        Arizona       Arkansas     California 
##              2              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              1              3              3              1              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              4              1              3              4 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              4              2              4              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              1              4              2              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              4              4              1              4              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              2              4              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              2 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              4              2              1              3              4 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              4              4              3
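
The cluster labels are easier to interpret if we profile them. Since dplyr and magrittr are already loaded, here is a minimal sketch that averages the original (unscaled) variables within each cluster:

# Mean of each variable by cluster (on the original scale)
USArrests %>%
  mutate(cluster = result$cluster) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean))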

Let’s use the fviz_cluster function to visualize the clustering result.

# Visualize cluster
fviz_cluster(result, data = df)
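
Note that k = 4 was chosen somewhat arbitrarily. A common heuristic for picking k is the elbow method, which factoextra supports directly; a minimal sketch:

# Elbow method: total within-cluster sum of squares for k = 1..10
fviz_nbclust(df, kmeans, method = "wss")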

3. Part two - Decision trees, bagging, and random forest

# Set CRAN repository source
options(repos="https://cran.rstudio.com")

# Install packages
install.packages("MASS")
install.packages("tree")
install.packages("randomForest")
# Load packages
library(MASS)
library(tree)
library(randomForest)

In this example, we will use the Boston housing dataset, which is part of the MASS package. Our outcome of interest is medv, the median value of owner-occupied homes in $1000s. Below are all the variables in the Boston dataset.

# Variable description

# crim - per capita crime rate by town.
# zn - proportion of residential land zoned for lots over 25,000 sq.ft.
# indus - proportion of non-retail business acres per town.
# chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
# nox - nitrogen oxides concentration (parts per 10 million).
# rm - average number of rooms per dwelling.
# age - proportion of owner-occupied units built prior to 1940.
# dis - weighted mean of distances to five Boston employment centres.
# rad - index of accessibility to radial highways.
# tax - full-value property-tax rate per $10,000.
# ptratio - pupil-teacher ratio by town.
# black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
# lstat - lower status of the population (percent).
# medv - median value of owner-occupied homes in $1000s.

head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
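
Before modeling, it is worth confirming the size of the data: Boston has 506 observations of 14 variables.

# Check dimensions (506 rows, 14 columns)
dim(Boston)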

Decision tree model

First, we are going to build a decision tree model.

# Setting the seed so that we can replicate later
set.seed(1)

# Split into training and test sets (60:40)
train = sample(1:nrow(Boston), nrow(Boston) * 0.6)

# Hold out the test-set response (medv) for computing test error later
test = Boston[-train, "medv"]

# Decision tree
tree.boston = tree(medv~., data=Boston, subset=train)

# Calculate predicted y value (y hat) on the test data set
yhat.tree = predict(tree.boston, newdata = Boston[-train, ])

# Show the decision tree
plot(tree.boston)
text(tree.boston, pretty = 0)
title(main = "Unpruned Regression Tree")

# Calculate mean squared error (MSE)
mean((yhat.tree - test)^2)
## [1] 31.25277
# Cross-validation
cv.boston = cv.tree(tree.boston)
plot(cv.boston$size, cv.boston$dev/length(train), type = "b",
     xlab = "Tree Size", ylab = "CV Mean Squared Error")

# Show summary of the decision tree model
summary(tree.boston)
## 
## Regression tree:
## tree(formula = medv ~ ., data = Boston, subset = train)
## Variables actually used in tree construction:
## [1] "rm"    "lstat" "crim" 
## Number of terminal nodes:  7 
## Residual mean deviance:  14.17 = 4195 / 296 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -10.6000  -2.2950  -0.0907   0.0000   2.0840  28.2100
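
The cross-validation curve above can suggest a smaller tree than the one we grew. If it does, the tree package's prune.tree() function cuts the tree back to a chosen number of terminal nodes; a minimal sketch that reads the CV-optimal size off cv.boston:

# Prune to the size with the lowest CV deviance
best.size = cv.boston$size[which.min(cv.boston$dev)]
prune.boston = prune.tree(tree.boston, best = best.size)
plot(prune.boston)
text(prune.boston, pretty = 0)

# Test MSE of the pruned tree
yhat.prune = predict(prune.boston, newdata = Boston[-train, ])
mean((yhat.prune - test)^2)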


Bagging model

Now, we are going to build a bagging model.

# Setting the seed so that we can replicate later
set.seed(1)

# Bagging: a random forest that considers all 13 predictors at each split (mtry = 13)
bag.boston = randomForest(medv~., data=Boston, subset=train, mtry=13, importance=TRUE)

# Calculate predicted y value (y hat) on the test data set
yhat.bag = predict(bag.boston, newdata = Boston[-train, ])

# Calculate mean squared error (MSE)
mean((yhat.bag - test)^2)
## [1] 17.34218
plot(bag.boston)

# Show summary of the bagging model
bag.boston
## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE,      subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 12.77754
##                     % Var explained: 85.03
# Draw the importance plot
varImpPlot(bag.boston)
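
The numeric scores behind the plot are also available through randomForest's importance() function; a quick look:

# %IncMSE and IncNodePurity for each predictor
importance(bag.boston)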


Random forest model

Finally, this is our random forest model, which considers only a random subset of predictors at each split in order to decorrelate the trees.

# Random forest (random subset of predictors at each split, mtry = 4 ≈ sqrt(13))
set.seed(1)
rf.boston = randomForest(medv~., data=Boston, subset=train, mtry=4, importance=TRUE)

# Calculate predicted y value (y hat)
yhat.rf = predict(rf.boston, newdata = Boston[-train, ])

# Calculate mean squared error (MSE)
mean((yhat.rf - test)^2)
## [1] 13.3058
plot(rf.boston)

# Draw the importance plot
varImpPlot(rf.boston, main="Importance Score")

# Show summary of the rf model
rf.boston
## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston, mtry = 4, importance = TRUE,      subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 11.64973
##                     % Var explained: 86.35
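
On this particular split, the test MSE improves from the single tree (~31.3) to bagging (~17.3) to the random forest (~13.3). A minimal sketch that collects the three test errors in one table:

# Compare test-set MSE across the three models
data.frame(
  model = c("Decision tree", "Bagging", "Random forest"),
  test_mse = c(mean((yhat.tree - test)^2),
               mean((yhat.bag - test)^2),
               mean((yhat.rf - test)^2))
)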

4. Work on your group project

  1. Use either Exploratory or RStudio to work on your group project.

  2. Complete the result section of your report.

  3. Send your preliminary report to the course email (urbanbigdata2019@gmail.com).