Prerequisites


Synopsis

The purpose of this group project is to learn how to create a basic data science report. You will need to wrangle messy data that come in a variety of different formats. You will also need to merge different datasets and conduct an analysis to test your hypotheses or make recommendations based on your findings. We cover the first three parts of your group project in class, but your group will need to work together to complete the final report.


Statistical modeling

We will first use the Seattle crime data to show how to build a linear regression model in Exploratory. Then, we will build the same model in RStudio using code.

This group session builds on the Lecture 7 group session and uses the joined data from it. If you haven’t completed that session, go back to it first: Lecture 7. Group Session

Exploratory
Open Exploratory and choose the Seattle crime 2013 data. After the join step, click mutate to create a new variable called crime_rate. For the new variable, use the formula total_crime / pop * 1000 to adjust for population. Crime rates are usually reported as the number of crimes per 1,000 population.

Now, go to Analytics and choose Linear Regression Analysis as the type. For the target variable, choose crime_rate, and for the predictor variable(s), choose p_seniors, p_white, p_female, p_kids, p_poverty, med_hh_inc, and some_college.
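As a quick check of the arithmetic, here is the same per-1,000 calculation in R. The two tracts and their numbers are made up for illustration and are not taken from the Seattle data.

```r
# Hypothetical tracts: crime counts and populations are made up for illustration
total_crime <- c(120, 45)
pop <- c(4000, 3000)

# Crimes per 1,000 residents
crime_rate <- total_crime / pop * 1000

# 30 and 15 crimes per 1,000 residents
crime_rate
```

Dividing by population before comparing tracts keeps a large tract from looking more dangerous only because more people live there.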

Mathematically, this can be written as the linear function below.

\(y = \alpha + \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \beta_{4}x_{4} + \beta_{5}x_{5} + \beta_{6}x_{6} + \beta_{7}x_{7} + \varepsilon\)

where \(y\) is the number of crimes per 1,000 population; \(x_{1}\) is % of senior population; \(x_{2}\) is % of white population; \(x_{3}\) is % of female population; \(x_{4}\) is % of children population; \(x_{5}\) is % of households under poverty; \(x_{6}\) is median household income; \(x_{7}\) is % of population having some college degree; and \(\varepsilon\) is a random error.



RStudio
Start RStudio and load the packages.

###########################
#
# Statistical modeling
# Andy Hong
# July 31, 2019
# 
###########################

# Set CRAN repository source
options(repos="https://cran.rstudio.com")

# install.packages("dplyr")
# install.packages("magrittr")

library(dplyr)
library(magrittr)

Now, let’s load the data.

seattle_crime_2013 = read.csv("/Users/andyhong/Documents/vsp_bigdata/group-session/07-lecture/seattle_crime_2013.csv")

seattle_census_2013 = read.csv("/Users/andyhong/Documents/vsp_bigdata/group-session/07-lecture/seattle_census_2013.csv")

Let’s do some data wrangling procedures.

# Format tract variable
data = seattle_crime_2013 %>% 
  mutate(tract = as.numeric(gsub("\\..*", "", CensusTract2000))) 
## Warning: NAs introduced by coercion

# Group by tract and generate total crime variable.
data = data %>% 
  group_by(tract) %>%
  summarize(total_crime = n())

# Do a left join with the census data by tract
data = data %>%
  left_join(seattle_census_2013, by = c("tract" = "tract")) 

# Generate a crime rate variable
data = data %>%
  mutate(crime_rate = total_crime/pop * 1000)
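The coercion warning in the wrangling step comes from as.numeric() meeting a value that is not a number once the text after the first period has been removed. The sketch below reproduces this on three made-up strings. The raw values are hypothetical, and the real CensusTract2000 column may look different.

```r
# Hypothetical raw values; the real CensusTract2000 column may look different
raw <- c("7500.1013", "4100.2005", "unknown")

# Drop the first period and everything after it
stem <- gsub("\\..*", "", raw)

# Non-numeric strings become NA (this is what triggers the coercion warning)
tract <- suppressWarnings(as.numeric(stem))

# Count the rows that could not be converted
sum(is.na(tract))
```

Counting the NA values after this step tells you how many crime records cannot be matched to a census tract in the join.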

Let’s build our linear models. We will build two models. Model 1 is a basic model, and model 2 is a full model.

model1 = lm(data = data, 
            crime_rate ~ 
              p_seniors + 
              p_white +
              p_female +
              p_kids
            )

model2 = lm(data = data, 
            crime_rate ~ 
              p_seniors + 
              p_white +
              p_female +
              p_kids +
              p_poverty +
              med_hh_inc +
              some_college
            )

Again, the full model (model 2) can be written as the linear function below.

\(y = \alpha + \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \beta_{4}x_{4} + \beta_{5}x_{5} + \beta_{6}x_{6} + \beta_{7}x_{7} + \varepsilon\)

where \(y\) is the number of crimes per 1,000 population; \(x_{1}\) is % of senior population; \(x_{2}\) is % of white population; \(x_{3}\) is % of female population; \(x_{4}\) is % of children population; \(x_{5}\) is % of households under poverty; \(x_{6}\) is median household income; \(x_{7}\) is % of population having some college degree; and \(\varepsilon\) is a random error.

A basic way to show the model results is the summary() function.

summary(model1)
## 
## Call:
## lm(formula = crime_rate ~ p_seniors + p_white + p_female + p_kids, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -190.07  -40.06   -8.83   21.57  706.36 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 757.3106   117.6148   6.439 3.10e-09 ***
## p_seniors     3.5334     1.7796   1.985  0.04953 *  
## p_white      -1.0657     0.4931  -2.161  0.03283 *  
## p_female    -10.9987     2.5603  -4.296 3.72e-05 ***
## p_kids       -5.3334     1.5422  -3.458  0.00077 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 94.29 on 112 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.3871, Adjusted R-squared:  0.3652 
## F-statistic: 17.68 on 4 and 112 DF,  p-value: 2.818e-11
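The Estimate column holds the fitted values of the intercept and slopes in the linear equation. The sketch below shows how those numbers turn into a prediction. So that it runs on its own, it uses R's built-in mtcars data rather than the Seattle data, and the new observation is hypothetical.

```r
# Illustration on the built-in mtcars data (not the Seattle data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient estimates: intercept (alpha) and slopes (beta_1, beta_2)
b <- coef(fit)

# Hypothetical new observation: weight 3 (in 1000 lbs), 110 horsepower
new_obs <- data.frame(wt = 3, hp = 110)

# Prediction by hand: alpha + beta_1 * x_1 + beta_2 * x_2
by_hand <- b[["(Intercept)"]] + b[["wt"]] * new_obs$wt + b[["hp"]] * new_obs$hp

# Same prediction from predict()
from_predict <- unname(predict(fit, newdata = new_obs))

# TRUE when the two predictions agree
all.equal(by_hand, from_predict)
```

The same logic applies to model1: each estimate is the expected change in crime_rate for a one-unit increase in that predictor, holding the other predictors fixed.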

There are other useful tools to visualize the model results. We will use the new jtools package for this task. The ggstance and huxtable packages are required to use the functions we want from the jtools package.

# install.packages("jtools")
# install.packages("ggstance")
# install.packages("huxtable")

library(jtools)
library(ggstance)
library(huxtable)
## 
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
## 
##     add_rownames

The plot_summs() function generates a coefficient plot that looks very much like what we saw in Exploratory.

plot_summs(model1, model2)

You can also use the coefs option to supply your own labels.

plot_summs(model1, model2,
          coefs = c(
             "% seniors" = "p_seniors",
             "% female" = "p_female",
             "% white" = "p_white",
             "% children" = "p_kids",
             "% under poverty" = "p_poverty",
             "median household income" = "med_hh_inc",
             "% some college" = "some_college"
           ))

Lastly, the export_summs() function generates a nice table that can be exported to a word processor.

export_summs(model1, model2)

                 Model 1        Model 2
(Intercept)      757.31 ***     751.32 ***
                 (117.61)       (136.01)
p_seniors        3.53 *         3.97 *
                 (1.78)         (1.86)
p_white          -1.07 *        -0.67
                 (0.49)         (0.82)
p_female         -11.00 ***     -11.27 ***
                 (2.56)         (2.70)
p_kids           -5.33 ***      -4.62 *
                 (1.54)         (1.91)
p_poverty                       0.53
                                (1.57)
med_hh_inc                      -0.00
                                (0.00)
some_college                    -0.01
                                (0.03)
N                117            117
R2               0.39           0.39
*** p < 0.001; ** p < 0.01; * p < 0.05.
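Both models report the same R2 (0.39), which suggests the three added predictors explain little beyond the basic model. One possible way to check this formally is an F-test for nested models with anova(), alongside adjusted R-squared. The sketch below is not part of the original session. It uses the built-in mtcars data so that it runs on its own.

```r
# Illustration on the built-in mtcars data (not the Seattle data)
basic <- lm(mpg ~ wt, data = mtcars)
full  <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# F-test of the nested models: do hp and qsec add explanatory power?
cmp <- anova(basic, full)
cmp

# Adjusted R-squared penalizes extra predictors, so it is a fairer comparison
c(basic = summary(basic)$adj.r.squared, full = summary(full)$adj.r.squared)
```

For the Seattle models, the equivalent call would be anova(model1, model2). Both models must be fit to the same observations for the comparison to work.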


Each group is assigned one city to complete this group project.

[IMPORTANT] Please move the downloaded file to a specified project folder.
Once you have downloaded the crime data for your city, download the census data and the corresponding GeoJSON file.


Statistical analysis of your city’s crime data

Now that you have learned how to build models, use your city’s crime data to develop some models and describe what the model results tell you.

Remember the five elements of good storytelling

  • Issue at hand: What are the issues? What is most troubling?

  • Supporting data: For this project, you are given the crime data. Your job is to merge the crime data with some other useful data to complete your story.

  • Relationship: What is the relationship between X and Y? Does Y go up, go down, or stay the same as X increases?

  • Interpretation: Why do you think the relationship between X and Y exists? Do some research. Read newspapers, and use your common sense and judgment to try to understand the observed relationship.

  • Summary and conclusions: Summarize what you’ve learned and draw a conclusion.

Send your initial template report

Please format your report using the following section titles and write up the first (Problem Statement) and second (Questions and Hypotheses) sections.

  1. Problem Statement
  2. Questions and Hypotheses
  3. Data and Methods
  4. Results and Interpretation
  5. Conclusions