Visualizing 911 calls using #ggplot2

Introduction

911 is the emergency telephone number used in North America. In over 98% of locations in the USA and Canada, dialing “9-1-1” from any telephone links the caller to an emergency dispatch center.

 
 

Dataset

The dataset we use for this project comes from Kaggle. It describes the 911 calls made in Montgomery County, Pennsylvania. The calls fall into 3 types: EMS, Fire and Traffic.

 

What is the data made of?

We import the data set and view its summary.

emer <- read.csv('911.csv',header=T,stringsAsFactors = F)
emer <- emer[,c(1:8)]
summary(emer)
##       lat             lng             desc                zip       
##  Min.   :30.33   Min.   :-95.60   Length:97114       Min.   :17752  
##  1st Qu.:40.10   1st Qu.:-75.39   Class :character   1st Qu.:19038  
##  Median :40.15   Median :-75.30   Mode  :character   Median :19401  
##  Mean   :40.16   Mean   :-75.32                      Mean   :19238  
##  3rd Qu.:40.23   3rd Qu.:-75.21                      3rd Qu.:19446  
##  Max.   :41.17   Max.   :-75.00                      Max.   :77316  
##                                                      NA's   :12516  
##     title            timeStamp             twp           
##  Length:97114       Length:97114       Length:97114      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      addr          
##  Length:97114      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
 

We see that the dataset has 8 variables, one of which is a timestamp. We transform this timestamp variable to extract the hour, the time of day and the day of the week.

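library(lubridate)  # ymd_hms(), hour(), wday()
# Parse the timestamp once. This assumes year-month-day strings; check your copy of the file.
emer$timeStamp <- ymd_hms(emer$timeStamp, tz='GMT')
emer <- na.omit(emer)  # drops every row with an NA, including rows with a missing zip
emer$hour  <- hour(emer$timeStamp)
# Keep only the time of day (attached to today's date) so all calls share one 24-hour axis
emer$times <- as.POSIXct(strftime(emer$timeStamp, format="%H:%M:%S", tz='GMT'), format="%H:%M:%S", tz='GMT')
emer$day   <- wday(emer$timeStamp, label=TRUE)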
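
The order of the day and month fields in the format string matters a great deal here. The following is a minimal sketch in base R, not part of the original analysis: the two timestamps are made-up sample values in year-month-day order (an assumption about how the file stores them), and the variable names are chosen only for illustration.

```r
# Made-up sample timestamps, assumed to be in year-month-day order
stamps <- c("2015-12-10 17:10:52", "2015-12-13 02:05:00")

# Parse with the month before the day
parsed <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
hours  <- as.integer(format(parsed, "%H"))   # hour of the day, 0-23
wdays  <- as.POSIXlt(parsed)$wday            # 0 = Sunday ... 6 = Saturday

# Parse the same strings with day and month swapped in the format
swapped <- as.POSIXct(stamps, format = "%Y-%d-%m %H:%M:%S", tz = "GMT")

# Any timestamp whose day is greater than 12 cannot be read as a month and becomes NA
print(data.frame(stamps, parsed, swapped))
```

With a swapped format, every call made after the 12th of a month turns into NA, and a later na.omit() would silently drop those rows.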
 

Now that the data is prepared, we ask a few questions. How many cases occur per day? How are those cases split between Fire, EMS and Traffic incidents?

How many cases by day?

We see that the largest number of cases occurs on Fridays. EMS accounts for the largest share of cases on every day, followed by the other two types. We would like to see the percentage of each type of case by day. For this, we modify the above to show proportions rather than frequencies.
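
The post does not show how the per-day counts are built, so the following is only a sketch of one way to do it in base R. It assumes each title begins with its category followed by a colon (for example "EMS: ..."); the sample rows and the names calls and day_counts are made up for illustration.

```r
# Made-up sample rows standing in for the real data
calls <- data.frame(
  day   = c("Fri", "Fri", "Fri", "Sat", "Sat"),
  title = c("EMS: FALL VICTIM", "Traffic: VEHICLE ACCIDENT",
            "EMS: HEAD INJURY", "Fire: FIRE ALARM", "EMS: FALL VICTIM"),
  stringsAsFactors = FALSE
)

# Category = everything before the first colon in the title
calls$reason <- sub(":.*$", "", calls$title)

# One row per day/reason pair, with the number of calls in n
day_counts <- as.data.frame(table(day = calls$day, reason = calls$reason),
                            responseName = "n")

# Share of each reason within its day, in percent
day_counts$percent <- 100 * day_counts$n /
  ave(day_counts$n, day_counts$day, FUN = sum)

print(day_counts)
```

A data frame shaped like day_counts (columns day, reason and n) is what the percentage plot needs as input.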

library(plyr)
## Warning: package 'plyr' was built under R version 3.2.5
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:lubridate':
## 
##     here
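# temp is expected to hold one row per day and reason, with the number of calls in n
# Within each day, convert the counts into percentages of that day's total
ce <- ddply(temp, "day", transform,
            percent_composition = n / sum(n) * 100)
# Stacked bars: one bar per day, split by reason
ggplot(ce, aes(x=day, y=percent_composition, fill=reason)) +
  geom_bar(stat="identity")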
 
 

detach('package:plyr')
 

by_day

 

From the above we see that the proportion of Traffic cases is higher on weekdays, probably because more people are on the road on weekdays than on weekends. EMS cases have their lowest proportion on Wednesday. Fire cases have their highest proportion on Sunday. This could be attributed to recreational uses of fire (cooking, fireworks), which mostly take place on weekends.

How do the numbers vary by time and day?

To look at the trends, we need to dig deeper. We need to find how each type of case varies by time and day.

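library(dplyr)    # group_by(), summarize(), %>%
library(ggplot2)
library(scales)   # date_breaks(), date_format()
library(stringi)  # stri_count_fixed()
temp <- emer
# Count how often the 'EMS:' prefix occurs in each title (1 for EMS calls, 0 otherwise)
filter='EMS:'
temp$n <- vapply(temp$title, function(x) sum(stri_count_fixed(x, filter)), 1L)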
temp %>% group_by(times,day) %>% summarize(freq=sum(n)) %>%ungroup() -> temp_ems
p_ems <- ggplot(temp_ems, aes(x=times, y=freq, color=day)) +
  geom_smooth(se=FALSE, fill=NA, size=2) +
  theme_light(base_size=20) +
  xlab("Hour of the Day") +
  scale_x_datetime(breaks = date_breaks("4 hours"), labels=date_format("%I:%M %p")) + 
  ylab("Number of EMS cases") +
  scale_color_discrete("") +
  ggtitle("Number of EMS cases by day and time") +
  theme(plot.title=element_text(size=11),axis.text.x = element_text(angle=90, vjust=1))
filter = 'Fire:'
temp$m <- vapply(temp$title, function(x) sum(stri_count_fixed(x, filter)), 1L)
temp %>% group_by(times,day) %>% summarize(freq=sum(m)) %>%ungroup() -> temp_fire
p_fire <- ggplot(temp_fire, aes(x=times, y=freq, color=day)) +
  geom_smooth(se=FALSE, fill=NA, size=2) +
  theme_light(base_size=20) +
  xlab("Hour of the Day") +
  scale_x_datetime(breaks = date_breaks("4 hours"), labels=date_format("%I:%M %p")) + 
  ylab("Number of Fire related cases") +
  scale_color_discrete("") +
  ggtitle("Number of Fire cases by day and time") +
  theme(plot.title=element_text(size=11),axis.text.x = element_text(angle=90, vjust=1))
filter = 'Traffic:'
temp$t <- vapply(temp$title, function(x) sum(stri_count_fixed(x, filter)), 1L)
temp %>% group_by(times,day) %>% summarize(freq=sum(t)) %>%ungroup() -> temp_traffic
p_traffic <- ggplot(temp_traffic, aes(x=times, y=freq, color=day)) +
  geom_smooth(se=FALSE, fill=NA, size=2) +
  theme_light(base_size=20) +
  xlab("Hour of the Day") +
  scale_x_datetime(breaks = date_breaks("4 hours"), labels=date_format("%I:%M %p")) + 
  ylab("Number of Traffic related cases") +
  scale_color_discrete("") +
  ggtitle("Number of Traffic cases by day and time") +
  theme(plot.title=element_text(size=11),axis.text.x = element_text(angle=90, vjust=1))
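
The three listings above differ only in the prefix being counted, so the counting step could be wrapped in a small helper. This is a sketch rather than the code used for the plots: count_by_prefix is a hypothetical name, it uses base R's startsWith() and aggregate() in place of stri_count_fixed() and dplyr, and it assumes the category always appears at the start of the title. The sample rows are made up.

```r
# Hypothetical helper: number of calls per (times, day) whose title starts with prefix
count_by_prefix <- function(calls, prefix) {
  hit <- as.integer(startsWith(calls$title, prefix))
  aggregate(list(freq = hit),
            by = list(times = calls$times, day = calls$day),
            FUN = sum)
}

# Made-up sample rows; times is kept as text here to keep the example small
calls <- data.frame(
  title = c("EMS: FALL VICTIM", "EMS: HEAD INJURY", "Fire: FIRE ALARM",
            "Traffic: VEHICLE ACCIDENT"),
  times = c("08:00", "08:00", "08:00", "09:00"),
  day   = c("Mon", "Mon", "Mon", "Tue"),
  stringsAsFactors = FALSE
)

ems  <- count_by_prefix(calls, "EMS:")
fire <- count_by_prefix(calls, "Fire:")
print(ems)
```

Each result has the columns times, day and freq, the same shape the ggplot() calls above expect.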
 
 

How does the number of fire-related cases vary?

fire_cases_daytime.png

The number of fire cases dips in the early hours of the morning, probably because there is little human activity then. Cases peak between 8:30 AM and 12:30 PM. On Wednesdays and Fridays the numbers tend to increase after 4:30 PM.

What about EMS?

ems_daytime.png

The trend here is very similar to the plot for fire-related cases above. After 4:30 PM, the number of cases tends to be lowest on Thursdays and highest on Saturdays.

What about traffic?

traffic_daytime.png

As expected, the weekdays show similar trends. On weekends, there are more traffic-related cases during the early hours of the morning. For the weekdays, the curve becomes wiggly after 12:30 PM. After 4:30 PM, Tuesdays appear to have the highest number of cases. The dip in the trend happens between 12:30 AM and 4:30 AM.

What is the relationship between cases and time?

How does the number of cases differ by time?

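temp <- emer
# Label each call with the category named in its title
temp$reason <- ifelse(grepl('\\bEMS\\b',temp$title),'EMS','')
temp$reason <- ifelse(grepl('\\bTraffic\\b',temp$title)&temp$reason=='','Traffic',temp$reason)
# A Fire match overrides any label assigned above
temp$reason <- ifelse(grepl('\\bFire\\b',temp$title),'Fire',temp$reason)
# Count the calls for every combination of time of day and category
temp <- temp %>% group_by(times,reason) %>% summarise(n=n())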
p<- ggplot(temp, aes(x=times, y=n, color=reason)) +
  geom_smooth(se=FALSE, fill=NA, size=2) +
  theme_light(base_size=20) +
  xlab("Hour of the Day") +
  scale_x_datetime(breaks = date_breaks("4 hours"), labels=date_format("%I:%M %p")) + 
  ylab("Number of cases") +
  scale_color_discrete("") +
  ggtitle("Number of  cases by time") +
  theme(plot.title=element_text(size=18),axis.text.x = element_text(angle=90, vjust=1))
p

 

caes_time.png

The number of fire-related cases seems to remain constant after 12:30 PM. EMS accounts for the highest number of cases, and its curve follows a roughly sinusoidal trend. The trend for traffic-related cases has a local maximum at around 12:30 PM.
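
Because times is recorded to the second, many time/reason groups likely hold only one or two calls, and the shape of the curves comes mostly from the smoother. One alternative, sketched here in base R on made-up rows, is to count calls per hour instead. It assumes an hour column like the one derived earlier and a reason column like the one built above; the names calls and hourly are hypothetical.

```r
# Made-up sample rows: hour of the call (0-23) and its category
calls <- data.frame(
  hour   = c(0, 8, 8, 9, 12, 12, 12),
  reason = c("EMS", "EMS", "Traffic", "EMS", "Traffic", "EMS", "Fire"),
  stringsAsFactors = FALSE
)

# Count calls for every hour/reason pair; the factor levels keep hours with no calls
hourly <- as.data.frame(
  table(hour = factor(calls$hour, levels = 0:23), reason = calls$reason),
  responseName = "n"
)

# Turn the hour back into a number so it can be used on a continuous axis
hourly$hour <- as.integer(as.character(hourly$hour))

print(head(hourly[hourly$n > 0, ]))
```

The resulting table has one row per hour and reason, including zero counts, so it can be drawn directly as lines or bars without smoothing.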

Plotting incidents location wise

We use the ggmap package for R, which draws on Google Maps for visualization.

Map

We see that the number of emergency calls increases as we get closer to the river. Population density may be higher near a body of water, which would in turn increase the chance of a 911 call coming from these areas.
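
One way to check the impression from the map numerically is to count calls per grid cell. The following is a minimal sketch in base R and not the code behind the map: the coordinates are made up, the cell size of 0.1 degrees is arbitrary, and the names calls and grid are chosen only for illustration.

```r
# Made-up coordinates standing in for the lat and lng columns
calls <- data.frame(
  lat = c(40.12, 40.13, 40.14, 40.31),
  lng = c(-75.34, -75.33, -75.31, -75.12)
)

cell <- 0.1  # grid cell size in degrees (an arbitrary choice)

# Integer index of the cell each call falls into
calls$lat_cell <- floor(calls$lat / cell)
calls$lng_cell <- floor(calls$lng / cell)

# Number of calls per occupied cell
grid <- aggregate(list(n = rep(1L, nrow(calls))),
                  by = list(lat_cell = calls$lat_cell, lng_cell = calls$lng_cell),
                  FUN = sum)

# Busiest cells first
grid <- grid[order(-grid$n), ]
print(grid)
```

Cells with the largest n are the areas with the most calls, which can then be compared with the position of the river on the map.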

Thank you all for reading!

The dataset used in this project can be found here

Have a nice weekend!

 
