Please locate your vsp_bigdata
folder under “My Documents” and navigate to group-session
. Create 10-lecture
folder under the group-session
folder.
For this group session, we will use the 2011 Gapminder database.
Please download this CSV file and save it under the group session folder: Gapminder data 2011
The purpose of this group session is to understand the concept of logistic regression.
First, we will build a logistic regression model using Exploratory. Then, we will build a logistic regression model using R Studio.
Let’s load the gapminder 2011 data.
Each country has a corruption perception index ranging from 0 to 100, and based on this index, we assigned 1 for countries that have the index greater than 50; and 0 for countries that have the index less than 50. The corruption variable is an arbitrary variable only for a demonstration purpose.
Go to Analytics
and choose Logistic Regression Analysis
for the type.
Your target variable is corruption
, and your predictor variable(s) are: income
, population
, and democracy
.
Mathematically, what we are trying to estimate is the following function.
\(logit(p) = log(\frac{p}{1-p}) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3}\)
where \(log(\frac{p}{1-p})\) is the log odds of being in corruption; \(x_{1}\)is income; \(x_{2}\)is population; and \(x_{3}\)is democracy.
When you click run, you will see the coefficient plot of the logit model results.
Take a look at the result of democracy. The odds ratio of 0.79 means that for 1-unit increase in democracy, we expect to see about 21% decrease in the odds of being in corruption.
We are going to join the gapminder data with the corruption data. Let’s load the libraries first.
# Set CRAN repository source
options(repos="https://cran.rstudio.com")
# Install packages
# install.packages("dplyr")
# install.packages("magrittr")
# Load packages
library(dplyr)
library(magrittr)
Let’s load and examine the gapminder data.
gapminder = read.csv("/Users/andyhong/Documents/vsp_bigdata/group-session/10-lecture/gapminder_data_2011.csv")
head(gapminder)
## name region population income lifeExp democracy corruption
## 1 Afghanistan asia 29700000 1660 56.7 NA 1
## 2 Albania europe 2930000 10200 76.7 9 1
## 3 Algeria africa 36800000 13000 76.7 2 1
## 4 Angola africa 24200000 5910 60.9 -2 1
## 5 Argentina americas 41700000 19600 76.0 8 1
## 6 Armenia europe 2880000 7020 73.8 5 1
## corruption_index
## 1 8
## 2 33
## 3 34
## 4 22
## 5 35
## 6 34
We are going to build the basic and the full models. We will use the glm
function, which is a generalized linear model function. We will use the binomial family to implement the logistic function.
model1 = glm(data = gapminder, corruption ~ income, family = binomial)
model2 = glm(data = gapminder, corruption ~ income + population + democracy, family = binomial)
Now we can see the results of the models using the summary
function.
# Show the original results
summary(model1)
##
## Call:
## glm(formula = corruption ~ income, family = binomial, data = gapminder)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4951 0.2889 0.3345 0.4841 3.2432
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.223e+00 4.378e-01 7.362 1.82e-13 ***
## income -1.094e-04 1.787e-05 -6.120 9.37e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 180.52 on 160 degrees of freedom
## Residual deviance: 114.34 on 159 degrees of freedom
## AIC: 118.34
##
## Number of Fisher Scoring iterations: 5
summary(model2)
##
## Call:
## glm(formula = corruption ~ income + population + democracy, family = binomial,
## data = gapminder)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2691 -0.0196 0.1788 0.4261 2.3690
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.520e+00 7.187e-01 6.289 3.19e-10 ***
## income -1.143e-04 2.037e-05 -5.612 2.01e-08 ***
## population 7.916e-09 5.854e-09 1.352 0.176304
## democracy -2.244e-01 5.983e-02 -3.751 0.000176 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178.788 on 157 degrees of freedom
## Residual deviance: 92.263 on 154 degrees of freedom
## (3 observations deleted due to missingness)
## AIC: 100.26
##
## Number of Fisher Scoring iterations: 6
For interpretation, we can exponentiate the results to get the odds ratio.
# Show the exponentiated results
exp(cbind(Odds=coef(model1), confint(model1)))
## Waiting for profiling to be done...
## Odds 2.5 % 97.5 %
## (Intercept) 25.0937760 11.4025724 64.308892
## income 0.9998906 0.9998524 0.999923
exp(cbind(Odds=coef(model2), confint(model2)))
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Odds 2.5 % 97.5 %
## (Intercept) 91.8588022 26.2262274 451.8015542
## income 0.9998857 0.9998411 0.9999217
## population 1.0000000 1.0000000 1.0000000
## democracy 0.7989896 0.7001597 0.8887006
Now, we can also generate the coefficient plot using the plot_summs
function from the jtools
package. To get the exponentiated results, we will use the option exp=TRUE
.
# Load the jtools package
library(jtools)
# Show coefficient plot
plot_summs(model1, model2, exp=TRUE)
Use either Exploratory or R Studio to develop the best fit logistic model to estimate corruption
Create a short report that includes the model results and interpret your results.
VSP BigData [lecture number] - [group number] - [presenter name]
VSP BigData Lecture 11 - Group 1 - Bill Gates