Please locate your vsp_bigdata
folder under “My Documents” and navigate to group-session
. Create 07-lecture
folder under the group-session
folder.
For this group session, we will use the Seattle crime database.
If you don’t have the Seattle crime database, please download this CSV file and load it into Exploratory: Seattle crime data 2013
Now, you need to download the US census data to add demographic information: Seattle Census ACS data 2013
The purpose of this group session is to get you familiar with Exploratory Data Analysis (EDA). There are three steps in this group session. First, we will create a histogram for median household income. We will also create a bar plot using the crime types.
Second, we will create a scatter plot using population and total income variables. We will also create a box plot using the crime type and the total number of crimes. We will also do a log-transformation of the total crime data.
Third, we will create a correlation coefficient plot using some of the key variables in the dataset.
First, open Exploratory and import the Seattle crime data only for the year 2013.
Now create a new column named tract
using mutate
and the the following calculation:
as.numeric(gsub("\\..*", "", CensusTract2000))
This calculation extracts only the tract number from the data.
Further information for advanced students: the gsub
function uses regular expression to filter and replace characters. In this case the regular expression finds all the characters that go after . and replace them with a blank character. For example, 6200.1043 becomes 6200.
You can read more about this function here: http://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html
Now, we will join the crime data with the census data. To do this, let’s first group the data by census tract
. Click +
and choose Group_by
and choose tract
. And click +
again and choose Group_by
and choose SummarizedOffenseDescription
. Group_by
is not going to do anything unless we summarize the data. Let’s summarise the data and get the total number of crime by census tract. We can do this by clicking +
and choose Summarize
and choose # Number of rows
for the Column, and give the New column name as total_crime
.
Next, we will join the Seattle census data to add demographic information to the crime data. Add File Data by using the +
sign and select seattle_census_2013.csv
file.
Now, go back to the Seattle crime data, and click +
and select Join
.
In the target data frame, choose seattle_census_2013
and for both the Current Column and Target Column, choose tract
. We can see that Census data are joined to the Seattle crime data.
Now, we are going to use EDA to explore one variable. Go to Chart
and choose Histogram
for the type. For X Axis, choose med_hh_inc
. You will see a distribution of median household income in Seattle.
Now, let’s create another Chart
by clicking +
. This time, choose Bar
for the type, and for X Axis, choose SummarizedOffenseDescription
and for Y Axis, choose total_crime
. You can usually leave the Y Axis as Number of Rows
, but in this case we need to use the total crime. Once we are done, we can make the plot pretty by sorting the data by Y Axis and change the orientation from Vertical to Horizontal.
Now, let’s explore two variables. For this to happen, we will need to turn off the crime type. So, go back to the fourth step on your right panel, and remove SummarizedOffenseDescription
and click OK. We will only group the data by tract
.
For the scatter plot, go to Chart
and choose Scatter
for the type. On X Axis, choose pop
and for Y Axis, choose total_crime
. We can see the relationship between population density and total crime.
We can make the plot fancier by adding a trend line. Click the accordian-style menu next the Y Axis, and click Trend Line
and choose Linear Regression
for the type. It shows a beautiful scatter plot with a trend line showing the relationship between population and crime.
Note that the distribution of crime data is skewed to the right, meaning there are values close to 0s. We can transform the crime data to overcome this non-normality issue.
To do that, click +
and choose Mutate
. On the new Column name, type in total_crime_log
and in the calculation field, type in log(total_crime)
. Hit Run and we see that a new variable called total_crime_log
was created.
For the scatter plot, let’s use total_crime_log
for the Y Axis and see how the plot changes.
Now, we will create anothe Chart
by clicking +
. For this to work, we will need to include crime type again for the grouping variable. Go back to the fourth step on the right panel, and click Add New
below tract
and choose SummarizedOffenseDescription
.
Select the Summarize
step on your right panel, and choose BoxPlot
for the type. For the X Axis, choose SummarizedOffenseDescription
and choose total_crime
for the Y Axis. We can see that the boxplot is created. One problem is that there are too many crime types.
Let’s limit the crime types to top five crimes. We can do this by clicking +
and choose Filter
and select SummarizedOffenseDescription
for the column and is in
for the opreator. On the Values, pick CAR PROWL
, BURGLARY
, PROPERTY DAMAGE
, WARRANT ARREST
, and STOLEN PROPERTY
. Go back to Chart
and examine the boxplot with only the top five crime types.
Now that we explored the two variables, we can move onto exploring multiple variables. For this, choose Analytics
, and select Correlation by Columns
. For the colums, select the following columns:
When you are done selecting the variables, click Run. We can see a correlation plot.
Create a bar plot of crime types and total crime. Sort the plot by y axis.
Create a scatter plot of pop and total_crime.
Create a correlation plot of at least 5 key variables of your choice
VSP BigData [lecture number] - [group number] - [presenter name]
VSP BigData Lecture 7 - Group 1 - Bill Gates