Urban Big Data Analytics - Lecture 10 Group Session: Advanced Modeling

Prerequisites

Please locate your vsp_bigdata folder under “My Documents” and navigate to group-session. Create 10-lecture folder under the group-session folder.

For this group session, we will use the 2011 Gapminder database.
Please download this CSV file and save it under the group session folder: Gapminder data 2011

Instruction

1. Synopsis

The purpose of this group session is to understand the concept of logistic regression.

First, we will build a logistic regression model using Exploratory. Then, we will build a logistic regression model using R Studio.

2. Part one - Building a logistical model in Exploratory

Let’s load the gapminder 2011 data.
Each country has a corruption perception index ranging from 0 to 100, and based on this index, we assigned 1 for countries that have the index greater than 50; and 0 for countries that have the index less than 50. The corruption variable is an arbitrary variable only for a demonstration purpose.

Go to Analytics and choose Logistic Regression Analysis for the type.
Your target variable is corruption, and your predictor variable(s) are: income, population, and democracy.

Mathematically, what we are trying to estimate is the following function.

\(logit(p) = log(\frac{p}{1-p}) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3}\)

where \(log(\frac{p}{1-p})\) is the log odds of being in corruption; \(x_{1}\)is income; \(x_{2}\)is population; and \(x_{3}\)is democracy.

When you click run, you will see the coefficient plot of the logit model results.
Take a look at the result of democracy. The odds ratio of 0.79 means that for 1-unit increase in democracy, we expect to see about 21% decrease in the odds of being in corruption.

3. Part two - Building a logistical model in R Studio

We are going to join the gapminder data with the corruption data. Let’s load the libraries first.

# Set CRAN repository source
options(repos="https://cran.rstudio.com")

# Install packages
# install.packages("dplyr")
# install.packages("magrittr")

# Load packages
library(dplyr)
library(magrittr)

Let’s load and examine the gapminder data.

gapminder = read.csv("/Users/andyhong/Documents/vsp_bigdata/group-session/10-lecture/gapminder_data_2011.csv")

head(gapminder)

##          name   region population income lifeExp democracy corruption
## 1 Afghanistan     asia   29700000   1660    56.7        NA          1
## 2     Albania   europe    2930000  10200    76.7         9          1
## 3     Algeria   africa   36800000  13000    76.7         2          1
## 4      Angola   africa   24200000   5910    60.9        -2          1
## 5   Argentina americas   41700000  19600    76.0         8          1
## 6     Armenia   europe    2880000   7020    73.8         5          1
##   corruption_index
## 1                8
## 2               33
## 3               34
## 4               22
## 5               35
## 6               34

We are going to build the basic and the full models. We will use the glm function, which is a generalized linear model function. We will use the binomial family to implement the logistic function.

model1 = glm(data = gapminder, corruption ~ income, family = binomial)
model2 = glm(data = gapminder, corruption ~ income + population + democracy, family = binomial)

Now we can see the results of the models using the summary function.

# Show the original results
summary(model1)

## 
## Call:
## glm(formula = corruption ~ income, family = binomial, data = gapminder)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4951   0.2889   0.3345   0.4841   3.2432  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.223e+00  4.378e-01   7.362 1.82e-13 ***
## income      -1.094e-04  1.787e-05  -6.120 9.37e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 180.52  on 160  degrees of freedom
## Residual deviance: 114.34  on 159  degrees of freedom
## AIC: 118.34
## 
## Number of Fisher Scoring iterations: 5

summary(model2)

## 
## Call:
## glm(formula = corruption ~ income + population + democracy, family = binomial, 
##     data = gapminder)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2691  -0.0196   0.1788   0.4261   2.3690  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.520e+00  7.187e-01   6.289 3.19e-10 ***
## income      -1.143e-04  2.037e-05  -5.612 2.01e-08 ***
## population   7.916e-09  5.854e-09   1.352 0.176304    
## democracy   -2.244e-01  5.983e-02  -3.751 0.000176 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178.788  on 157  degrees of freedom
## Residual deviance:  92.263  on 154  degrees of freedom
##   (3 observations deleted due to missingness)
## AIC: 100.26
## 
## Number of Fisher Scoring iterations: 6

For interpretation, we can exponentiate the results to get the odds ratio.

# Show the exponentiated results
exp(cbind(Odds=coef(model1), confint(model1)))

## Waiting for profiling to be done...

##                   Odds      2.5 %    97.5 %
## (Intercept) 25.0937760 11.4025724 64.308892
## income       0.9998906  0.9998524  0.999923

exp(cbind(Odds=coef(model2), confint(model2)))

## Waiting for profiling to be done...

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

##                   Odds      2.5 %      97.5 %
## (Intercept) 91.8588022 26.2262274 451.8015542
## income       0.9998857  0.9998411   0.9999217
## population   1.0000000  1.0000000   1.0000000
## democracy    0.7989896  0.7001597   0.8887006

Now, we can also generate the coefficient plot using the plot_summs function from the jtools package. To get the exponentiated results, we will use the option exp=TRUE.

# Load the jtools package
library(jtools)

# Show coefficient plot
plot_summs(model1, model2, exp=TRUE)

Send your report with the logistic model results and the interpretation

Use either Exploratory or R Studio to develop the best fit logistic model to estimate corruption
Create a short report that includes the model results and interpret your results.
Send your report to the course email (urbanbigdata2019@gmail.com).

[IMPORTANT] Please use the following email title format:
VSP BigData [lecture number] - [group number] - [presenter name]
ex), VSP BigData Lecture 11 - Group 1 - Bill Gates