Mac: Documents/vsp_bigdata/group_session/03-lecture
Download the Vancouver Crime CSV file
Move the CSV file to the following folder: vsp_bigdata/group_session/03-lecture
Run Exploratory
The purpose of this group session is to get you familiar with data wrangling. You will explore a crime dataset from the Vancouver Open Data Catalogue. We will first explore how crime rate has changed over time. Then, we will explore four specific variables in the dataset, namely crime type, neighborhood, season, and time of day. Crime type and neighborhood are already in the dataset, but you will need to wrangle the other two variables to complete your analysis.
Remember, the general approach to data science is to understand relationships among variables. In the Gapminder example, we exmaine the relationship between income and life expectancy. In other words, we are almost always interested in understanding how X affects Y. In this exercise, we are interested in looking at the relationship betwen some possible factors (X) and crime rates (Y).
First you need to convert character variables to numeric variables.
# Make sure to convert character to numeric
mutate(
YEAR = as.numeric(YEAR),
MONTH = as.numeric(MONTH),
DAY = as.numeric(DAY),
HOUR = as.numeric(HOUR),
MINUTE = as.numeric(MINUTE)
)
Some data are not complete. For example, there are only about 7 months for the year 2018. Also, usually quality of data for the first year of data collection is not very consistent. So we will only use data from 2004 to 2017 for the purpose of this exercise.
Let’s filter the data from to 2004 to 2017.
# Filter data from 2004 to 2017
filter(between(YEAR, 2004, 2017))
The next step to regroup the time of day varaible (HOUR).
# Regroup HOUR to four time groupings
mutate(
time_of_day =
cut(HOUR,
breaks=c(0, 6, 12, 18, 23),
labels=c("Dawn","Morning","Afternoon","Night")
)
)
Do you see anything strange? Can you tell what’s going on with the new variable?
As you can see, the new time_of_day
variable has missing data NA
. However, the original data HOUR
has no missing data. What’s going on? Well.. it turns out that the cut
function does not include the lowest break value. In order to include 0 in the break value, you will need to start the break value lower than 0.
This is the correct way of cutting the data.
# Repeat the same code above but replace the initial value with -1.
mutate(
time_of_day =
cut(HOUR,
breaks=c(-1, 6, 12, 18, 23),
labels=c("Dawn","Morning","Afternoon","Night")
)
)
The next step to regroup the month varaible (MONTH).
# Regroup MONTH to four seasons
mutate(
season =
recode(MONTH,
"1" = "Winter",
"2" = "Winter",
"3" = "Spring",
"4" = "Spring",
"5" = "Spring",
"6" = "Summer",
"7" = "Summer",
"8" = "Summer",
"9" = "Fall",
"10" = "Fall",
"11" = "Fall",
"12" = "Winter"
)
)
Once you are done with data wrangling, it’s easy to explore data by visualizing them.
In Exploratory, go to Chart, and select the plot type as Area
On the X Axis, choose # YEAR
On the Y Axis, choose # Number of rows
For Color By, choose the factor you want to explore
For example, choose TYPE
to explore the effect of crime type on crime rates
You can also use the bar chart to visualize frequency
In Exploratory, go to Viz, and select the plot type as Bar
On the X Axis, choose TYPE
On the Y Axis, choose # Number of rows
For Color By, choose other factors you want to explore
For example, choose time_of_day
to explore the effect of different times on crime rates for different types of crime
You can do exactly the same visualizations using R codes. If you are a beginner R user, this will be very complicated and unnecessarily difficult. But if you are familiar with R, this will be a piece of cake for you.
If you get lost, don’t worry. This is for more experienced R users. You will do just fine without these additional steps. The goal is to get you familiar with R programming. You are expected to get lost because R has a very steep learning curve.
Replicating the Exploratory steps in R
To replicate what we’ve just done in Exploratory, you will need to export the R script and copy the codes.
Use the base R code to implement the Exploratory steps: Replication R codes
Open the R codes in R Studio
Plug in the R scrips copied from Exploratory
In front of “exploratory::read_delim_file()”, insert “df =”. The first couple of lines should look something like this.
# Steps to produce the output
df = exploratory::read_delim_file("/Users/andyhong/Documents/vsp_bigdata/group_session/03-lecture/vancouver_crime_2003-2018.csv" , ",", quote = "\"", skip = 0 , col_names = TRUE , na = c('','NA') , locale=readr::locale(encoding = "UTF-8", decimal_mark = "."), trim_ws = TRUE , progress = FALSE)
Pick one person in your group to present your story. Each group has three minutes to present.
All group members should work on the data wrangling and exploration.
However, the presenter should have the final plot ready to share with the classroom.
Please export the chart as PNG and send the PNG file to the course email (urbanbigdata2019@gmail.com).
[IMPORTANT] Please use the following email title format:
VSP BigData [lecture number] - [group number] - [presenter name]
ex), VSP BigData Lecture 3 - Group 1 - Bill Gates