Urban Big Data Analytics - Lecture 7 Group Session: Exploratory Data Analysis

Instruction

1. Synopsis

The purpose of this group session is to get you familiar with Exploratory Data Analysis (EDA). There are three steps in this group session. First, we will create a histogram for median household income. We will also create a bar plot using the crime types.

Second, we will create a scatter plot using population and total income variables. We will also create a box plot using the crime type and the total number of crimes. We will also do a log-transformation of the total crime data.

Third, we will create a correlation coefficient plot using some of the key variables in the dataset.

2. Initial preparation

First, open Exploratory and import the Seattle crime data only for the year 2013.

Now create a new column named tract using mutate and the the following calculation:

as.numeric(gsub("\\..*", "", CensusTract2000))

This calculation extracts only the tract number from the data.
Further information for advanced students: the gsub function uses regular expression to filter and replace characters. In this case the regular expression finds all the characters that go after . and replace them with a blank character. For example, 6200.1043 becomes 6200.
You can read more about this function here: http://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html

Now, we will join the crime data with the census data. To do this, let’s first group the data by census tract. Click + and choose Group_by and choose tract. And click + again and choose Group_by and choose SummarizedOffenseDescription. Group_by is not going to do anything unless we summarize the data. Let’s summarise the data and get the total number of crime by census tract. We can do this by clicking + and choose Summarize and choose # Number of rows for the Column, and give the New column name as total_crime.

Next, we will join the Seattle census data to add demographic information to the crime data. Add File Data by using the + sign and select seattle_census_2013.csv file.

Now, go back to the Seattle crime data, and click + and select Join.
In the target data frame, choose seattle_census_2013 and for both the Current Column and Target Column, choose tract. We can see that Census data are joined to the Seattle crime data.

3. First part - exploring one variable

Now, we are going to use EDA to explore one variable. Go to Chart and choose Histogram for the type. For X Axis, choose med_hh_inc. You will see a distribution of median household income in Seattle.

Now, let’s create another Chart by clicking +. This time, choose Bar for the type, and for X Axis, choose SummarizedOffenseDescription and for Y Axis, choose total_crime. You can usually leave the Y Axis as Number of Rows, but in this case we need to use the total crime. Once we are done, we can make the plot pretty by sorting the data by Y Axis and change the orientation from Vertical to Horizontal.

4. Second part - Exploring two variables

Now, let’s explore two variables. For this to happen, we will need to turn off the crime type. So, go back to the fourth step on your right panel, and remove SummarizedOffenseDescription and click OK. We will only group the data by tract.

For the scatter plot, go to Chart and choose Scatter for the type. On X Axis, choose pop and for Y Axis, choose total_crime. We can see the relationship between population density and total crime.

We can make the plot fancier by adding a trend line. Click the accordian-style menu next the Y Axis, and click Trend Line and choose Linear Regression for the type. It shows a beautiful scatter plot with a trend line showing the relationship between population and crime.

Note that the distribution of crime data is skewed to the right, meaning there are values close to 0s. We can transform the crime data to overcome this non-normality issue.

To do that, click + and choose Mutate. On the new Column name, type in total_crime_log and in the calculation field, type in log(total_crime). Hit Run and we see that a new variable called total_crime_log was created.

For the scatter plot, let’s use total_crime_log for the Y Axis and see how the plot changes.

Now, we will create anothe Chart by clicking +. For this to work, we will need to include crime type again for the grouping variable. Go back to the fourth step on the right panel, and click Add New below tract and choose SummarizedOffenseDescription.

Select the Summarize step on your right panel, and choose BoxPlot for the type. For the X Axis, choose SummarizedOffenseDescription and choose total_crime for the Y Axis. We can see that the boxplot is created. One problem is that there are too many crime types.

Let’s limit the crime types to top five crimes. We can do this by clicking + and choose Filter and select SummarizedOffenseDescription for the column and is in for the opreator. On the Values, pick CAR PROWL, BURGLARY, PROPERTY DAMAGE, WARRANT ARREST, and STOLEN PROPERTY. Go back to Chart and examine the boxplot with only the top five crime types.

5. Third part - Exploring multiple variables (only numeric)

Now that we explored the two variables, we can move onto exploring multiple variables. For this, choose Analytics, and select Correlation by Columns. For the colums, select the following columns:

total_crime
pop
pop_den
p_kids
p_seniors
p_female
p_white
some_college
med_hh_inc
p_poverty

When you are done selecting the variables, click Run. We can see a correlation plot.

Urban Big Data Analytics - Lecture 7 Group Session: Exploratory Data Analysis

Andy Hong

July 25, 2019

Prerequisites