Please locate your vsp_bigdata folder under “My Documents” and navigate to group-session. Create 10-lecture folder under the group-session folder.
For this group session, we will use the 2011 Gapminder database. 
 Please download this CSV file and save it under the group session folder:  Gapminder data 2011 
The purpose of this group session is to understand the concept of logistic regression. 
First, we will build a logistic regression model using Exploratory. Then, we will build a logistic regression model using R Studio. 
Let’s load the gapminder 2011 data. 
 Each country has a corruption perception index ranging from 0 to 100, and based on this index, we assigned 1 for countries that have the index greater than 50; and 0 for countries that have the index less than 50. The corruption variable is an arbitrary variable only for a demonstration purpose.
Go to Analytics and choose Logistic Regression Analysis for the type. 
 Your target variable is corruption, and your predictor variable(s) are: income, population, and democracy.
Mathematically, what we are trying to estimate is the following function.
\(logit(p) = log(\frac{p}{1-p}) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3}\)
where \(log(\frac{p}{1-p})\) is the log odds of being in corruption; \(x_{1}\)is income; \(x_{2}\)is population; and \(x_{3}\)is democracy. 
When you click run, you will see the coefficient plot of the logit model results.
 Take a look at the result of democracy. The odds ratio of 0.79 means that for 1-unit increase in democracy, we expect to see about 21% decrease in the odds of being in corruption. 
We are going to join the gapminder data with the corruption data. Let’s load the libraries first.
# Set CRAN repository source
options(repos="https://cran.rstudio.com")
# Install packages
# install.packages("dplyr")
# install.packages("magrittr")
# Load packages
library(dplyr)
library(magrittr)Let’s load and examine the gapminder data.
gapminder = read.csv("/Users/andyhong/Documents/vsp_bigdata/group-session/10-lecture/gapminder_data_2011.csv")
head(gapminder)##          name   region population income lifeExp democracy corruption
## 1 Afghanistan     asia   29700000   1660    56.7        NA          1
## 2     Albania   europe    2930000  10200    76.7         9          1
## 3     Algeria   africa   36800000  13000    76.7         2          1
## 4      Angola   africa   24200000   5910    60.9        -2          1
## 5   Argentina americas   41700000  19600    76.0         8          1
## 6     Armenia   europe    2880000   7020    73.8         5          1
##   corruption_index
## 1                8
## 2               33
## 3               34
## 4               22
## 5               35
## 6               34We are going to build the basic and the full models. We will use the glm function, which is a generalized linear model function. We will use the binomial family to implement the logistic function.
model1 = glm(data = gapminder, corruption ~ income, family = binomial)
model2 = glm(data = gapminder, corruption ~ income + population + democracy, family = binomial)Now we can see the results of the models using the summary function.
# Show the original results
summary(model1)## 
## Call:
## glm(formula = corruption ~ income, family = binomial, data = gapminder)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4951   0.2889   0.3345   0.4841   3.2432  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.223e+00  4.378e-01   7.362 1.82e-13 ***
## income      -1.094e-04  1.787e-05  -6.120 9.37e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 180.52  on 160  degrees of freedom
## Residual deviance: 114.34  on 159  degrees of freedom
## AIC: 118.34
## 
## Number of Fisher Scoring iterations: 5summary(model2)## 
## Call:
## glm(formula = corruption ~ income + population + democracy, family = binomial, 
##     data = gapminder)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2691  -0.0196   0.1788   0.4261   2.3690  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.520e+00  7.187e-01   6.289 3.19e-10 ***
## income      -1.143e-04  2.037e-05  -5.612 2.01e-08 ***
## population   7.916e-09  5.854e-09   1.352 0.176304    
## democracy   -2.244e-01  5.983e-02  -3.751 0.000176 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178.788  on 157  degrees of freedom
## Residual deviance:  92.263  on 154  degrees of freedom
##   (3 observations deleted due to missingness)
## AIC: 100.26
## 
## Number of Fisher Scoring iterations: 6For interpretation, we can exponentiate the results to get the odds ratio.
# Show the exponentiated results
exp(cbind(Odds=coef(model1), confint(model1)))## Waiting for profiling to be done...##                   Odds      2.5 %    97.5 %
## (Intercept) 25.0937760 11.4025724 64.308892
## income       0.9998906  0.9998524  0.999923exp(cbind(Odds=coef(model2), confint(model2)))## Waiting for profiling to be done...## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred##                   Odds      2.5 %      97.5 %
## (Intercept) 91.8588022 26.2262274 451.8015542
## income       0.9998857  0.9998411   0.9999217
## population   1.0000000  1.0000000   1.0000000
## democracy    0.7989896  0.7001597   0.8887006Now, we can also generate the coefficient plot using the plot_summs function from the jtools package. To get the exponentiated results, we will use the option exp=TRUE.
# Load the jtools package
library(jtools)
# Show coefficient plot
plot_summs(model1, model2, exp=TRUE)Use either Exploratory or R Studio to develop the best fit logistic model to estimate corruption
Create a short report that includes the model results and interpret your results.
VSP BigData [lecture number] - [group number] - [presenter name] VSP BigData Lecture 11 - Group 1 - Bill Gates