Register for Exploratory and download/install Exploratory: https://exploratory.io/download
Download and install R: https://mirror.its.sfu.ca/mirror/CRAN
Download and install RStudio: https://www.rstudio.com/products/rstudio/download/
Download and install QGIS: https://qgis.org/
The purpose of this group project is to learn how to create a basic data science report. You will need to wrangle messy data that come in a variety of different formats. You will also need to merge different datasets and conduct an analysis to test your hypotheses or make recommendations based on your findings. We cover the first three parts of your group project in class, but your group will need to work together to complete the final report.
In part 1, you learned how to explore the crime data and applied some data wrangling techniques.
In part 2, you learned how to merge the crime data with other datasets from the open data platform.
In part 3, you will be conducting an exploratory data analysis (EDA) and applying some statistical learning techniques to extract useful information out of the dataset you created.
We will first use the Seattle crime data to show how to build a linear regression model in Exploratory. Then we will use RStudio to build the same model in code.
This group session builds on the Lecture 7 group session and uses the joined data. If you haven’t done that session, go back to it first: Lecture 7. Group Session
Exploratory
Open Exploratory and choose the Seattle crime 2013 data. After the join step, click mutate to create a new variable called crime_rate. For the new variable, use the formula total_crime / pop * 1000 to adjust for population. Crime rate is usually reported as the number of crimes per 1,000 population.
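As a quick check of the formula, here is a minimal R sketch with made-up numbers. The crime counts and populations below are hypothetical and are not taken from the Seattle data.

```r
# Hypothetical tracts: total crime counts and resident populations
total_crime <- c(250, 90, 400)
pop <- c(4000, 3000, 5000)

# Crimes per 1,000 residents, using the same formula as the mutate step
crime_rate <- total_crime / pop * 1000
print(crime_rate)
```

Dividing by population first and then multiplying by 1,000 puts tracts of different sizes on the same scale, so a small tract with few crimes can still show a high rate.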
Now go to Analytics and choose Linear Regression Analysis as the type. For the target variable, choose crime_rate, and for the predictor variable(s), choose p_seniors, p_white, p_female, p_kids, p_poverty, med_hh_inc, and some_college.
Mathematically, this can be written as a linear function below.
\(y = \alpha + \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \beta_{4}x_{4} + \beta_{5}x_{5} + \beta_{6}x_{6} + \beta_{7}x_{7} + \varepsilon\)
where \(y\) is the number of crimes per 1,000 population; \(x_{1}\) is the % of the population who are seniors; \(x_{2}\) is the % of the population who are white; \(x_{3}\) is the % of the population who are female; \(x_{4}\) is the % of the population who are children; \(x_{5}\) is the % of households under poverty; \(x_{6}\) is median household income; \(x_{7}\) is the % of the population with some college degree; and \(\varepsilon\) is a random error.
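To see what fitting such an equation means in practice, the sketch below simulates data from a known linear function with two predictors and checks that lm() recovers the coefficients approximately. The variable names, true coefficients, noise level, and sample size are all made up for illustration.

```r
set.seed(42)

# Simulate predictors and a random error term
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
eps <- rnorm(n, sd = 0.5)

# True model: alpha = 2, beta1 = 3, beta2 = -1.5
y <- 2 + 3 * x1 - 1.5 * x2 + eps
sim <- data.frame(y, x1, x2)

# Fit the linear model and extract the estimated coefficients
fit <- lm(y ~ x1 + x2, data = sim)
estimates <- coef(fit)
print(round(estimates, 2))
```

The estimates should land close to the true values, with the gap shrinking as the sample size grows or the noise gets smaller.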
RStudio
Start RStudio and load the packages.
###########################
#
# Statistical modeling
# Andy Hong
# July 31, 2019
#
###########################
# Set CRAN repository source
options(repos="https://cran.rstudio.com")
# install.packages("dplyr")
# install.packages("magrittr")
library(dplyr)
library(magrittr)
Now, let’s load the data.
seattle_crime_2013 = read.csv("/Users/andyhong/Documents/vsp_bigdata/group-session/07-lecture/seattle_crime_2013.csv")
seattle_census_2013 = read.csv("/Users/andyhong/Documents/vsp_bigdata/group-session/07-lecture/seattle_census_2013.csv")
Let’s do some data wrangling procedures.
# Format tract variable
data = seattle_crime_2013 %>%
  mutate(tract = as.numeric(gsub("\\..*", "", CensusTract2000)))
## Warning: NAs introduced by coercion
# Group by tract and generate total crime variable.
data = data %>%
  group_by(tract) %>%
  summarize(total_crime = n())
# Do a left join with the census data by tract
data = data %>%
  left_join(seattle_census_2013, by = c("tract" = "tract"))
# Generate a crime rate variable
data = data %>%
  mutate(crime_rate = total_crime / pop * 1000)
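The wrangling steps above can be tried on a tiny example first. The sketch below is only an illustration: the two data frames and their values are hypothetical stand-ins for the Seattle files, and suppressWarnings() is added here just to keep the output quiet.

```r
library(dplyr)

# Hypothetical incident records: one row per reported crime
crime <- data.frame(
  CensusTract2000 = c("100.2003", "100.1001", "200.3000", "300.1002", "unknown"),
  stringsAsFactors = FALSE
)

# Hypothetical census table; tract 300 is deliberately left out
census <- data.frame(tract = c(100, 200), pop = c(4000, 2500))

rates <- crime %>%
  # Drop everything from the first "." onward; "unknown" cannot be
  # converted to a number, so it becomes NA (the source of the warning)
  mutate(tract = suppressWarnings(as.numeric(gsub("\\..*", "", CensusTract2000)))) %>%
  # Count incidents per tract
  group_by(tract) %>%
  summarize(total_crime = n()) %>%
  # Keep every tract from the crime side, matched or not
  left_join(census, by = "tract") %>%
  # Crimes per 1,000 residents; NA wherever pop is missing
  mutate(crime_rate = total_crime / pop * 1000)

print(rates)
```

Tracts with no census match keep their crime count but get NA for pop and crime_rate. Rows like these are dropped by lm() by default, which is likely where a note such as "observations deleted due to missingness" in the model summary comes from.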
Let’s build our linear models. We will build two models. Model 1 is a basic model with four demographic predictors, and model 2 is a full model that adds poverty, income, and education.
model1 = lm(
  crime_rate ~ p_seniors + p_white + p_female + p_kids,
  data = data
)
model2 = lm(
  crime_rate ~ p_seniors + p_white + p_female + p_kids +
    p_poverty + med_hh_inc + some_college,
  data = data
)
Again, mathematically, the full model (model 2) can be written as the linear function below.
\(y = \alpha + \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \beta_{4}x_{4} + \beta_{5}x_{5} + \beta_{6}x_{6} + \beta_{7}x_{7} + \varepsilon\)
where \(y\) is the number of crimes per 1,000 population; \(x_{1}\) is the % of the population who are seniors; \(x_{2}\) is the % of the population who are white; \(x_{3}\) is the % of the population who are female; \(x_{4}\) is the % of the population who are children; \(x_{5}\) is the % of households under poverty; \(x_{6}\) is median household income; \(x_{7}\) is the % of the population with some college degree; and \(\varepsilon\) is a random error. Model 1 uses only the first four predictors, \(x_{1}\) through \(x_{4}\).
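One common way to check whether the extra predictors in a full model improve on a basic model is an F-test between the two nested fits, which base R provides through anova(). The sketch below uses the built-in mtcars data rather than the Seattle data, so the variables are stand-ins, and it assumes both models are fitted to the same rows.

```r
# Basic model: two predictors (stand-ins for the demographic variables)
basic <- lm(mpg ~ wt + hp, data = mtcars)

# Full model: the same predictors plus two more
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# F-test of the nested models: do the added predictors reduce
# the residual sum of squares more than chance would suggest?
cmp <- anova(basic, full)
print(cmp)
```

With the Seattle models, the analogous call would be anova(model1, model2). A large p-value in the second row would suggest that the added predictors contribute little beyond the basic model.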
A basic way to show the model results is the summary function.
summary(model1)
##
## Call:
## lm(formula = crime_rate ~ p_seniors + p_white + p_female + p_kids,
##     data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -190.07  -40.06   -8.83   21.57  706.36
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 757.3106   117.6148   6.439 3.10e-09 ***
## p_seniors     3.5334     1.7796   1.985  0.04953 *
## p_white      -1.0657     0.4931  -2.161  0.03283 *
## p_female    -10.9987     2.5603  -4.296 3.72e-05 ***
## p_kids       -5.3334     1.5422  -3.458  0.00077 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 94.29 on 112 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.3871, Adjusted R-squared:  0.3652
## F-statistic: 17.68 on 4 and 112 DF,  p-value: 2.818e-11
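Each estimate is read as the change in the outcome for a one-unit increase in that predictor, holding the others fixed. For example, assuming p_female is measured in percentage points, the estimate of about -11 means that a one-point increase in the female share is associated with roughly 11 fewer crimes per 1,000 population. The pieces of the summary can also be pulled out in code; the sketch below does this on the built-in mtcars data, so the model and variables are stand-ins rather than the Seattle model.

```r
# Stand-in model on a built-in dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

# Coefficient table as a matrix:
# columns are Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- s$coefficients
print(round(coef_table, 4))

# Names of the terms with p-values below 0.05
significant <- rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]
print(significant)

# Goodness-of-fit measures
r2 <- s$r.squared
adj_r2 <- s$adj.r.squared
print(c(r2 = r2, adj_r2 = adj_r2))
```

Working with the coefficient matrix directly is handy when you want to quote specific numbers in your report instead of copying them from the printed summary.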
There are other useful tools to visualize the model results. We will use the jtools package for this task. The ggstance and huxtable packages are required to use the functions we want from jtools.
# install.packages("jtools")
# install.packages("ggstance")
# install.packages("huxtable")
library(jtools)
library(ggstance)
library(huxtable)
##
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
##
## add_rownames
The plot_summs() function generates a coefficient plot that looks very much like the one we saw in Exploratory.
plot_summs(model1, model2)
You can also use the coefs option to supply your own labels.
plot_summs(model1, model2,
  coefs = c(
    "% seniors" = "p_seniors",
    "% female" = "p_female",
    "% white" = "p_white",
    "% children" = "p_kids",
    "% under poverty" = "p_poverty",
    "median household income" = "med_hh_inc",
    "% some college" = "some_college"
  ))
Lastly, the export_summs() function generates a formatted table that can be exported to a word processor.
export_summs(model1, model2)
| | Model 1 | Model 2 |
| (Intercept) | 757.31 *** | 751.32 *** |
| | (117.61) | (136.01) |
| p_seniors | 3.53 * | 3.97 * |
| | (1.78) | (1.86) |
| p_white | -1.07 * | -0.67 |
| | (0.49) | (0.82) |
| p_female | -11.00 *** | -11.27 *** |
| | (2.56) | (2.70) |
| p_kids | -5.33 *** | -4.62 * |
| | (1.54) | (1.91) |
| p_poverty | | 0.53 |
| | | (1.57) |
| med_hh_inc | | -0.00 |
| | | (0.00) |
| some_college | | -0.01 |
| | | (0.03) |
| N | 117 | 117 |
| R2 | 0.39 | 0.39 |
*** p < 0.001; ** p < 0.01; * p < 0.05.
[IMPORTANT] Please move the downloaded file to a specified project folder.
Once you have downloaded the crime data for your city, download the census data and the corresponding GeoJSON file.
Group 1 - Boston: Boston crime 2016, Boston Census, Boston GeoJSON
Group 2 - Los Angeles: LA crime 2016, LA Census, LA GeoJSON
Group 3 - Los Angeles: LA crime 2016, LA Census, LA GeoJSON
Group 4 - San Francisco: SF crime 2016, SF Census, SF GeoJSON
Group 5 - Washington DC: Washington DC crime 2016, Washington DC Census, Washington DC GeoJSON
Group 6 - New York City: NYC crime 2016, NYC census, NYC GeoJSON
Group 7 - New York City: NYC crime 2016, NYC census, NYC GeoJSON
Group 8 - Philadelphia: Philadelphia crime 2016, Philadelphia Census, Philadelphia GeoJSON
Group 9 - Detroit: Detroit crime 2016, Detroit Census, Detroit GeoJSON
Group 10 - Detroit: Detroit crime 2016, Detroit Census, Detroit GeoJSON
Now that you have learned how to build models, use your city’s crime data to develop some models and describe what the model results tell you.
Issue at hand: What are the issues? Which is the most troubling?
Supporting data: For this project, you are given the crime data. Your job is to merge the crime data with other useful data to complete your story.
Relationship: What is the relationship between X and Y? As X increases, does Y go up, go down, or stay the same?
Interpretation: Why do you think the relationship between X and Y exists? Do some research. Read newspapers, and use your common sense and judgment to try to understand the observed relationship.
Summary and conclusions: Summarize what you’ve learned and draw a conclusion.
Please format your report using the following section titles and write up the first (Problem Statement) and second (Questions and Hypotheses) sections.
VSP BigData [lecture number] - [group number] - [presenter name]
VSP BigData Lecture 9 - Group 1 - Bill Gates