The purpose of this group session is to build some basic machine learning models.
In part one, we will first run the K-means clustering algorithm using the USArrests data.
In part two, we will implement the Decision Tree and Random Forest algorithms using the Boston Housing data.
First, we will install the factoextra package, which provides the fviz_cluster() function for visualizing clustering results; the kmeans() function itself is part of base R. We will also load other core packages: ggplot2, magrittr, and dplyr.
# Set CRAN repository source
options(repos="https://cran.rstudio.com")
#Load packages
install.packages("factoextra")
library(factoextra)
library(ggplot2)
library(magrittr)
library(dplyr)
Let’s load the USArrests data, which ships with R by default. It contains 50 observations on 4 variables:
Murder - numeric Murder arrests (per 100,000)
Assault - numeric Assault arrests (per 100,000)
UrbanPop - numeric Percent urban population
Rape - numeric Rape arrests (per 100,000)
#Load data
data("USArrests")
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
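As a quick sanity check (a sketch; str() is one simple way), we can confirm the 50 observations and 4 numeric variables.

# Confirm dimensions and variable types
str(USArrests)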
Because k-means is based on distances, we first scale the data so that each variable contributes equally, and then run the k-means algorithm to cluster the states into four groups.
# Scale the data
df = scale(USArrests)
# Compute K-means cluster
result = kmeans(df, 4)
# See cluster
result$cluster
## Alabama Alaska Arizona Arkansas California
## 2 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 1 3 3 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 4 1 3 4
## Kansas Kentucky Louisiana Maine Maryland
## 3 4 2 4 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 1 4 2 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 4 4 1 4 3
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 2 4 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 2
## South Dakota Tennessee Texas Utah Vermont
## 4 2 1 3 4
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 4 4 3
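To interpret the four groups, we can attach the cluster labels back to the original (unscaled) data and average each variable within each cluster. This is a minimal sketch using dplyr (it assumes a recent dplyr with across()); the name cluster_means is just for illustration.

# Profile each cluster by the mean of the original (unscaled) variables
cluster_means = USArrests %>%
  mutate(cluster = result$cluster) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean))
cluster_means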
Let’s use the fviz_cluster() function to visualize the clustering result.
# Visualize cluster
fviz_cluster(result, data = df)
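We chose four clusters somewhat arbitrarily. As a sketch of how the number of clusters could be checked, factoextra also provides fviz_nbclust(), which plots the total within-cluster sum of squares over a range of k (the "elbow" method); method = "wss" is one common choice.

# Sketch: within-cluster sum of squares for k = 1 to 10 (elbow method)
fviz_nbclust(df, kmeans, method = "wss")

With the clustering done, we move on to part two: tree-based models on the Boston housing data.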
# Set CRAN repository source
options(repos="https://cran.rstudio.com")
# Install packages
install.packages("MASS")
install.packages("tree")
install.packages("randomForest")
# Load packages
library(MASS)
library(tree)
library(randomForest)
In this example, we will use the Boston housing dataset, which is part of the MASS package. We will predict medv, the median value of owner-occupied homes in $1000s. Below are all the variables in the Boston dataset.
# Variable description
# crim - per capita crime rate by town.
# zn - proportion of residential land zoned for lots over 25,000 sq.ft.
# indus - proportion of non-retail business acres per town.
# chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
# nox - nitrogen oxides concentration (parts per 10 million).
# rm - average number of rooms per dwelling.
# age - proportion of owner-occupied units built prior to 1940.
# dis - weighted mean of distances to five Boston employment centres.
# rad - index of accessibility to radial highways.
# tax - full-value property-tax rate per $10,000.
# ptratio - pupil-teacher ratio by town.
# black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
# lstat - lower status of the population (percent).
# medv - median value of owner-occupied homes in $1000s.
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
Decision tree model
First, we are going to build a decision tree model.
# Setting the seed so that we can replicate later
set.seed(1)
# Split into training and test sets (60% train, 40% test)
train = sample(1:nrow(Boston), nrow(Boston) * 0.6)
# True medv values for the test observations
test = Boston[-train, "medv"]
# Decision tree
tree.boston = tree(medv~., data=Boston, subset=train)
# Calculate predicted y value (y hat) on the test data set
yhat.tree = predict(tree.boston, newdata = Boston[-train, ])
# Show the decision tree
plot(tree.boston)
text(tree.boston, pretty = 0)
title(main = "Unpruned Regression Tree")
# Calculate mean squared error (MSE)
mean((yhat.tree - test)^2)
## [1] 31.25277
# Cross-validation to choose the tree size
cv.boston = cv.tree(tree.boston)
plot(cv.boston$size, cv.boston$dev/length(train), type = "b",
xlab = "Tree Size", ylab = "CV Mean Squared Error")
# Show summary of the decision tree model
summary(tree.boston)
##
## Regression tree:
## tree(formula = medv ~ ., data = Boston, subset = train)
## Variables actually used in tree construction:
## [1] "rm" "lstat" "crim"
## Number of terminal nodes: 7
## Residual mean deviance: 14.17 = 4195 / 296
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -10.6000 -2.2950 -0.0907 0.0000 2.0840 28.2100
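The CV curve above can guide pruning. Below is a minimal sketch; best = 5 is only an assumed tree size read off such a plot, not a value taken from this particular run.

# Prune the tree to an assumed best size chosen from the CV plot
prune.boston = prune.tree(tree.boston, best = 5)
plot(prune.boston)
text(prune.boston, pretty = 0)
title(main = "Pruned Regression Tree")
# Re-check the test MSE for the pruned tree
yhat.prune = predict(prune.boston, newdata = Boston[-train, ])
mean((yhat.prune - test)^2)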
Bagging model
Now, we are going to build a bagging model. Bagging is equivalent to a random forest that considers all 13 predictors at every split (mtry = 13).
# Setting the seed so that we can replicate later
set.seed(1)
# Random forest (bagging, full predictors)
bag.boston = randomForest(medv~., data=Boston, subset=train, mtry=13, importance=TRUE)
# Calculate predicted y value (y hat) on the test data set
yhat.bag = predict(bag.boston, newdata = Boston[-train, ])
# Calculate mean squared error (MSE)
mean((yhat.bag - test)^2)
## [1] 17.34218
plot(bag.boston)
# Show summary of the bagging model
bag.boston
##
## Call:
## randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 12.77754
## % Var explained: 85.03
# Draw the importance plot
varImpPlot(bag.boston)
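The numeric scores behind the plot can also be printed directly; for regression, randomForest reports %IncMSE and IncNodePurity when importance = TRUE.

# Print the numeric importance measures behind the plot
importance(bag.boston)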
Random forest model
Finally, this is our random forest model, which considers only a random subset of the predictors (mtry = 4 here) at each split.
# Random forest (partial predictors, m = sqrt(p))
set.seed(1)
rf.boston = randomForest(medv~., data=Boston, subset=train, mtry=4, importance=TRUE)
# Calculate predicted y value (y hat)
yhat.rf = predict(rf.boston, newdata = Boston[-train, ])
# Calculate mean squared error (MSE)
mean((yhat.rf - test)^2)
## [1] 13.3058
plot(rf.boston)
# Draw the importance plot
varImpPlot(rf.boston, main="Importance Score")
# Show summary of the rf model
rf.boston
##
## Call:
## randomForest(formula = medv ~ ., data = Boston, mtry = 4, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 11.64973
## % Var explained: 86.35
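As a final sketch, we can put the three test-set MSEs side by side to compare the models (the data frame name is just for illustration).

# Compare the test MSEs of the three models
mse_compare = data.frame(
  model = c("Decision tree", "Bagging", "Random forest"),
  test_mse = c(mean((yhat.tree - test)^2),
               mean((yhat.bag - test)^2),
               mean((yhat.rf - test)^2)))
mse_compare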
Use either Exploratory or RStudio to work on your group project.
Complete the results section of your report.
Follow this naming format for your report:
VSP BigData [lecture number] - [group number] - [presenter name]
Example: VSP BigData Lecture 11 - Group 1 - Bill Gates