Urban Big Data Analytics - Lecture 5 Group Session: Spatial data

Instruction

1. Synopsis

The purpose of this group session is to get you familiar with data joining and spatial data. There are three steps in this group session. First, you will use the coordinate data (latitude and longitude) to do some cool mapping with Exploratory. To do that, you will filter the data and create a heatmap of crime activities.

Second, you will join the Seattle Census data to the crime data using the common field named tract. Then you will explore the data using the Chart tool in Exploratory.

Third, you will load the Census tract geojson file and and map other demographic data from the Census.

2. Initial preparation

First, open Exploratory and import the Seattle crime data only for the year 2013 using the following code.

library("RSQLite")
library("DBI")

# Define database path
filename = "/Users/[user name]/Documents/vsp_bigdata/group_session/04-lecture/seattle_crime_data.db"
#filename = "C://Users//[user name]//Documents//vsp_bigdata//group_session//04-lecture//seattle_crime_data.db"

# Connect to db
con = dbConnect(drv=RSQLite::SQLite(), dbname=filename)

# Issue a query to read the data
# my_data = dbGetQuery(con, "SELECT * FROM crime")

my_data = dbGetQuery(con, "SELECT Year, Month, SummarizedOffenseDescription, DateReported, DistrictSector, Latitude, Longitude, CensusTract2000 FROM crime WHERE Year = 2013")

Now create a new column named tract using mutate and the the following calculation.

as.numeric(gsub("\\..*", "", CensusTract2000))

This calculation extracts only the tract number from the data.
Further information for advanced students: the gsub function uses regular expression to filter and replace characters. In this case the regular expression finds all the characters that go after . and replace them with a blank character. For example, 6200.1043 becomes 6200.
You can read more about this function here: http://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html

3. First part - Mapping crime activities using points data

Now, we are going to map crime activities. Go to Chart and choose Map - Long/Lat for the type. Exploratory will automatically populate the Longitude and Latitude, but if it doesn’t, choose Longitude for Longitude and Latitude for Latitude.

You will notice that some coodinates are erronous, so we will need to restrict our data only within the Seattle area using the bounding box as shown below. The correct range of latitude is 47.484 ~ 47.767, and longitude is -122.446 ~ -122.183.

For latitude, click + and select Filter. Choose column Latitude and the opeator between to filter data between 47.484 and 47.767. For longitude, click + and select Filter. Choose column Latitude and the opeator between to filter data between -122.446 and -122.183.

Or you can add a custom filter, and copy and paste the following code.

filter(between(Latitude, 47.484, 47.767) & between(Longitude, -122.446, -122.183))

There are too many criminal activities in one year, so we will need to filter for one type of crime.

Click + and select Filter. For the column, choose SummarizedOffenseDescription and leave the operator as equal to and select BIKE THEFT for the value. You can see that the downtown and the university areas have a lot of bike theft incidents.

Now, let’s create a heatmap. Click + on your Chart, and Choose Map - Heatmap. It automatically creates the heatmap of bike theft. We can clearly see that the bike thefts are concentrated around the downtown and the university areas.

4. Second part - Data joining and exploration

For data exploration, we will need to group the data by census tract. Go back to one step before we filtered the data for Bike Theft. Click + and choose Group_by and choose tract. Group_by is not going to do anything unless we summarize the data. Let’s summarise the data and get the total number of crime by census tract. We can do this by clicking + and choose Summarize and choose # Number of rows for the Column, and give the New column name as total_crime.

Next, we will join the Seattle census data to add demographic information to the crime data. Add File Data by using the + sign and select seattle_census_2013.csv file we downloaded earlier.

Now, go back to the Seattle crime data. We will join the crime data and the census data. But before we do the join, plesae review these three dffierent types of data join.

If you are comfortable with the concept of data join, click + and select Join.

In the target data frame, choose seattle_census_2013 and for both the Current Column and Target Column, choose tract. We can see that Census data are joined to the Seattle crime data.

Once we joined the data by census tract, we can go to Chart and choose Scatter for the type. On X Axis, choose pop_den and for Y Axis, choose total_crime. We can see the relationship between population density and total crime.

We can make the plot fancier by adding a trend line. Click the accordian-style menu next the Y Axis, and click Trend Line and choose Linear Regression for the type. It shows a beautiful scatter plot with a trend line showing the relationship between population density and crime.

5. Third part - Mapping demogrphic characteristics using polygon data

Now that we joined the Census data, we can map our data by census tract. Click + on your Chart, and choose Map - Area for the type. On the Area Type, click the Setup button. Click Add New and Add from Local. From your group session folder, choose seattle_tract.geojson file. Close the dialog.

For the Area Type, now we can see that there is seattle_tract on the select menu. Choose seattle_tract. For the Key Property, scroll all the way down, and choose tract. For the Key Column, choose tract. Now we can choose any columns in the Color By to show their patterns on the map. For example, choose total_crime.

6. [Advanced material] Thinking about numerators and denominators

We used total_crime as a proxy for indicating the intensity of criminal activities in certain neighborhoods (i.e. census tract). Do you think this is an appropriate measure of criminal activity? How would you adjust for the number of people in the neighborhood? Add this extra step to calculate the new proxy variable, and compare your map using the total_crime variable vs. the map using this new variable.

[Hint] Criminal activities in certain neighborhood will be porportional to the number of people living there, so population will be your denominator.

Urban Big Data Analytics - Lecture 5 Group Session: Spatial data

Andy Hong

July 23, 2019

Prerequisites