Urban Big Data Analytics

Lecture 3
Data Wrangling

July 19, 2019

Instructor: Andy Hong, PhD
Lead Urban Health Scientist
The George Institute for Global Health
University of Oxford

Assignment 2 check in

  • Gapminder offline tool
  • If your country is not included, pick any country you like
  • If the export function doesn't work, take a screenshot instead

Data wrangling

What is data wrangling?

  • A process of turning raw data into a more appropriate format

  • Data cleaning
  • Data munging
  • Manipulation
  • Transformation
  • Janitor work

Life of data scientists

Clean data are rare

Data come in a messy form

Example - Twitter data

Principles of data wrangling

Five principles

  • Never touch the original data
  • Inspect data types
  • Check each step
  • Examine missing data
  • Remove weird data
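
The five principles above can be sketched in a few lines of base R. This is a minimal illustration, not a complete workflow: the data frame raw, its columns, and the 0 to 120 age range used to flag implausible values are all made up for the example.

```r
# Hypothetical raw data standing in for a messy source file.
raw <- data.frame(
  id   = 1:6,
  age  = c("34", "27", NA, "41", "230", "19"),  # stored as text, one value missing
  city = c("Vancouver", "Burnaby", "Vancouver", NA, "Surrey", "Richmond"),
  stringsAsFactors = FALSE
)

# 1. Never touch the original data: work on a copy and leave raw as it is.
clean <- raw

# 2. Inspect data types: str() shows that age is character, so convert it.
str(clean)
clean$age <- as.numeric(clean$age)

# 3. Check each step: the conversion should not change the number of rows.
stopifnot(nrow(clean) == nrow(raw))

# 4. Examine missing data: count NA values in each column.
missing_counts <- colSums(is.na(clean))
print(missing_counts)

# 5. Remove weird data: drop ages outside 0-120 (an assumed plausibility rule).
#    Rows with a missing age are kept so they can be examined separately.
clean <- clean[is.na(clean$age) | (clean$age >= 0 & clean$age <= 120), ]
print(nrow(clean))
```

Because every step works on clean, the untouched raw object stays available for comparison if a later step goes wrong.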

Never touch the original data

Inspect data types

Check each step

Examine missing data

Remove weird data

Data wrangling in R

Common procedures

  1. Select specific columns: select()
  2. Select specific rows: filter()
  3. Create new columns: mutate()
  4. Group data by one or more variables: group_by()
  5. Aggregate each group to change the unit of analysis: summarise()
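
The five procedures above can be chained into one dplyr pipeline. The sketch below assumes the dplyr package is installed; the crime data frame and its column names are hypothetical and are not the actual Vancouver crime data used in the demo.

```r
library(dplyr)

# Hypothetical crime records; column names and values are made up for illustration.
crime <- data.frame(
  record_id     = 1:6,
  type          = c("Theft", "Theft", "Mischief", "Theft", "Mischief", "Break and Enter"),
  year          = c(2017, 2018, 2018, 2018, 2017, 2018),
  month         = c(1, 3, 3, 7, 11, 12),
  neighbourhood = c("West End", "Kitsilano", "West End", "West End", "Kitsilano", "Kitsilano"),
  stringsAsFactors = FALSE
)

crime_summary <- crime %>%
  # 1. Select specific columns (record_id is dropped).
  select(type, year, month, neighbourhood) %>%
  # 2. Select specific rows.
  filter(year == 2018) %>%
  # 3. Create a new column.
  mutate(season = ifelse(month %in% 6:8, "Summer", "Other")) %>%
  # 4. Define the groups.
  group_by(neighbourhood) %>%
  # 5. Collapse each group to one row: the unit of analysis changes
  #    from individual incidents to neighbourhoods.
  summarise(
    n_crimes = n(),
    n_summer = sum(season == "Summer")
  )

print(crime_summary)
```

Note that group_by() on its own does not change the rows; the aggregation happens in summarise(), which returns one row per group.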

Vancouver crime data demo

  1. Create a group session folder first
       • Windows: My Documents\vsp_bigdata\group_session\03-lecture
       • Mac: Documents/vsp_bigdata/group_session/03-lecture
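
The folder can also be created from R instead of by hand. The helper below is a hypothetical sketch: make_session_folder and its base_dir argument are names introduced for this example, and base_dir should point to your own Documents folder.

```r
# Hypothetical helper: creates the group session folder under base_dir.
make_session_folder <- function(base_dir) {
  # file.path() joins the pieces with "/", which R accepts on Windows and Mac.
  folder <- file.path(base_dir, "vsp_bigdata", "group_session", "03-lecture")
  # recursive = TRUE also creates the intermediate folders;
  # showWarnings = FALSE keeps the call quiet if the folder already exists.
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
  folder
}

# Demonstrated in a temporary directory so nothing is written to your Documents.
demo_folder <- make_session_folder(tempdir())
print(dir.exists(demo_folder))
```

To use it for the class, replace tempdir() with the path to your Documents folder.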

Vancouver crime data demo (continued)

  1. Download the CSV file
  2. Move the CSV file to the following folder: vsp_bigdata/group_session/03-lecture
  3. Run Exploratory

Class 3 - Group Session

Instructions

Any questions?

For all the course materials, go to urbanbigdata.github.io