The motivation for choosing this data set to explore is straightforward: I wanted a real-world data set for which additional meta data would be easy to find, making NYC data perfect; I wanted something which did not require specialised domain knowledge, and so buses are a good choice; and I wanted a data set that might produce some interesting insights. As an additional bonus, this data set had very few downloads or kernels on Kaggle, so it seemed like mostly untrodden ground.
Data loading and initial exporation
Data are available from the NYC open data site. Linked is the main data page, which contains the bus breakdown ID number, as well as the route number, the schools serviced, the borough through which the bus travels, and so on.
The readme file included in this bundle also contains references to linking information. Specifically, the drivers and attendants, routes, transportation sites, vehicles, pre-k riders by transportation site, routes by transportation site, and pre-k vendors by transportation site. All of this data can be combined (see later) to provide route- and vendor- level data to the main delay data set. For example, to quanitfy the total number of drivers employed by each company, the total number of students or schools that they service, and so on.
Our goal is to be able to make some kind of useful prediction about how long a bus is going to be delayed, at the time its delay is called into the operations centre. So we take a look at our primary 'ii_' data set.
in_csv <- "../data/II_Bus_Breakdown_and_Delays.csv"
ii_breakdowns <- read_csv(in_csv)
ii_breakdowns %>% glimpse
## Observations: 259,637
## Variables: 21
## $ School_Year <chr> "2015-2016", "2015-2016", "201...
## $ Busbreakdown_ID <int> 1212699, 1212700, 1212701, 121...
## $ Run_Type <chr> "Special Ed AM Run", "Special ...
## $ Bus_No <chr> "48186", "2518", "235", "2102"...
## $ Route_Number <chr> "N758", "L530", "K168", "K216"...
## $ Reason <chr> "Other", "Mechanical Problem",...
## $ Schools_Serviced <chr> "75485", "21854", "18366", "21...
## $ Occurred_On <dttm> 2015-09-02 06:27:00, 2015-09-...
## $ Created_On <dttm> 2015-09-02 06:29:00, 2015-09-...
## $ Boro <chr> "Nassau County", "Brooklyn", "...
## $ Bus_Company_Name <chr> "BORO TRANSIT, INC.", "RELIANT...
## $ How_Long_Delayed <chr> "25 minutes", NA, "30MINS", "2...
## $ Number_Of_Students_On_The_Bus <int> 0, 0, 0, 1, 0, 0, 0, 9, 0, 2, ...
## $ Has_Contractor_Notified_Schools <chr> "Yes", "Yes", "Yes", "Yes", "Y...
## $ Has_Contractor_Notified_Parents <chr> "No", "Yes", "Yes", "Yes", "Ye...
## $ Have_You_Alerted_OPT <chr> "No", "Yes", "No", "No", "No",...
## $ Informed_On <dttm> 2015-09-02 06:29:00, 2015-09-...
## $ Incident_Number <chr> NA, NA, NA, NA, NA, NA, NA, NA...
## $ Last_Updated_On <dttm> 2015-09-02 06:29:16, 2015-09-...
## $ Breakdown_or_Running_Late <chr> "Running Late", "Breakdown", "...
## $ School_Age_or_PreK <chr> "School-Age", "School-Age", "S...
ii_breakdowns %>% distinct(How_Long_Delayed) %>% head(n=20)
## # A tibble: 20 x 1
## How_Long_Delayed
## <chr>
## 1 25 minutes
## 2 <NA>
## 3 30MINS
## 4 20 min
## 5 30 min
## 6 90 MIN
## 7 20 MINS
## 8 20
## 9 30 mins
## 10 15 MINS
## 11 20 MIN
## 12 15 mins
## 13 30 MINS
## 14 25
## 15 0
## 16 10 mins
## 17 25 min
## 18 1hr
## 19 20 mins
## 20 45 mins
There is an obvious problem: these delays are 'human recorded', meaning that the format is inconsistent. Sometimes we have minutes, min, mts, hours, hrs; sometimes just two numbers separated by ':', and so on. Additionally, the times are not accurately recorded, as we will verify in a moment.
Wrangling the delay times into something machine readable takes some doing. I don't doubt that there is a more general path available, but I went through sequentially, case by case, as generally as I could; then cleaned up the rest at the end.
#--sequential clean
ii_times <- ii_breakdowns %>%
mutate(units_delayed = str_extract(How_Long_Delayed, "[:alpha:]+"),
time_delayed = as.numeric(str_extract(How_Long_Delayed, "[:digit:]+")),
units_delayed = ifelse(str_detect(units_delayed, "^[mM]"), "minutes", "other"),
units_delayed = ifelse(is.na(units_delayed), "other", units_delayed)
) %>%
#remove "in" case, exclude any flagged with "[hH}"
mutate(
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "in") & !str_detect(How_Long_Delayed, "[hH]"),"minutes", units_delayed)
) %>%
#deal with "to" case with minutes by taking only the first number. There are apparently no 'to' strings with hours in them.
mutate(
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "to") & str_detect(How_Long_Delayed, "[mM]"),"minutes", units_delayed)
) %>%
#deal with the - cases which don't involve hours at all
mutate(
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-") & !str_detect(How_Long_Delayed, "[hH]"),"minutes", units_delayed)
) %>%
#deal with the double digit minute-1hr:
mutate(
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-") & str_detect(How_Long_Delayed, "^[1-5][0-9]"),"minutes", units_delayed)
) %>%
#everything left with a hyphen is a 1 hour or more delay
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-"), 60, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-"), "minutes", units_delayed)
) %>%
#-all hyphens dealt with at this point
#Now work on the "/": first take the minute timers: all those with double digits first
mutate(
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "/") & str_detect(How_Long_Delayed, "^[1-9][0-9]"), "minutes", units_delayed)
) %>%
#All those starting with 1/2 go to 30 minutes
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1/2"), 30, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1/2"), "minutes", units_delayed)
) %>%
#All those starting with 1 1/2 go to 90 minutes
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1 1/2"), 90, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1 1/2"), "minutes", units_delayed)
) %>%
#---specific cases
#"1 HOUR1/2"
mutate(
time_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1 HOUR1/2", 90, time_delayed),
units_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1 HOUR1/2", "minutes", units_delayed)
) %>%
#"@least 1/2"
mutate(
time_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "@least 1/2", 30, time_delayed),
units_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "@least 1/2", "minutes", units_delayed)
) %>%
#"1HR/20MIN"
mutate(
time_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1HR/20MIN", 80, time_delayed),
units_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1HR/20MIN", "minutes", units_delayed)
)%>%
#--end specific cases
#1hr/x case
#This is modifying things it shouldn't. Why is this even here?
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "1hr"), 60, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "1hr"), "minutes", units_delayed)
)%>%
#---cases with ":"
#remove cases which erroneously report a time instead of a duration, since these are not reliable
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, ":") & str_detect(How_Long_Delayed, "[aA][mM]"), NA, time_delayed)
)%>%
#2:hr cases
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2:[hH][rR]"), 120, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2:[hH][rR]"), "minutes", units_delayed)
)%>%
#1:45 cases
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:45"), 105, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:45"), "minutes", units_delayed)
)%>%
#1:30 cases
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:30"), 90, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:30"), "minutes", units_delayed)
)%>%
#1: cases
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:"), 60, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:"), "minutes", units_delayed)
)%>%
#--end cases with ":"
#--cases with "^2"
#2 hr cases
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2") & str_detect(How_Long_Delayed, "[hH]"), 120, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2") & str_detect(How_Long_Delayed, "[hH]"), "minutes", units_delayed)
)%>%
#double digit minute cases:
mutate(
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2[0-9]"), "minutes", units_delayed) #times are already correct
)%>%
#---"^1"
#1 hour cases
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.*[rR]$") & !(str_detect(How_Long_Delayed, "^1.5.*[rR]$")), 60, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.*[rR]$") & !(str_detect(How_Long_Delayed, "^1.5.*[rR]$")), "minutes", units_delayed)
)%>%
#1.5 hour cases
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.5"), 90, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.5"), "minutes", units_delayed)
)%>%
#NR: some of the ^1 x min cases going in here are already tagged as minutes.
#1 hr 30 cases
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "30"), 90, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "30"), "minutes", units_delayed)
)%>%
#1 hr 15 cases
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "15"), 75, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "15"), "minutes", units_delayed)
)%>%
#1 hr 20 cases
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "20"), 80, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "20"), "minutes", units_delayed)
)%>%
#1 hr 45 cases
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "45"), 105, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "45"), "minutes", units_delayed)
)%>%
#1 hr half
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "half"), 90, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "half"), "minutes", units_delayed)
)%>%
#1 h cases
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "[hH]"), 60, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "[hH]"), "minutes", units_delayed)
)%>%
#---has pattern "^[1-9][oO]", that is, o replaces 0
mutate(
time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^[1-9][oO]"), time_delayed*10, time_delayed),
units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^[1-9][oO]") , "minutes", units_delayed)
) %>%
#---final outliers and rejects
#just switch to minutes
mutate(
units_delayed = ifelse(How_Long_Delayed == "35 SM", "minutes", units_delayed),
units_delayed = ifelse(How_Long_Delayed == "30am", "minutes", units_delayed),
units_delayed = ifelse(How_Long_Delayed == "30INS", "minutes", units_delayed),
units_delayed = ifelse(How_Long_Delayed == "45 late", "minutes", units_delayed),
units_delayed = ifelse(How_Long_Delayed == "40n", "minutes", units_delayed),
units_delayed = ifelse(How_Long_Delayed == "45 am", "minutes", units_delayed),
units_delayed = ifelse(How_Long_Delayed == "45NIN", "minutes", units_delayed),
units_delayed = ifelse(How_Long_Delayed == "45 INUTES", "minutes", units_delayed),
units_delayed = ifelse(How_Long_Delayed == "45IN", "minutes", units_delayed)
) %>%
#needs time modifications
mutate(
time_delayed = ifelse(How_Long_Delayed == "!0 mins", 1, time_delayed),
time_delayed = ifelse(How_Long_Delayed == "0-15 Min", 5, time_delayed),
time_delayed = ifelse(How_Long_Delayed == "0-15 Min", 5, time_delayed),
time_delayed = ifelse(How_Long_Delayed == "MAYBE 1/2", 30, time_delayed),
units_delayed = ifelse(How_Long_Delayed == "half hour", "minutes", units_delayed),
time_delayed = ifelse(How_Long_Delayed == "Half hour", 30, time_delayed),
units_delayed = ifelse(How_Long_Delayed == "Half hour", "minutes", units_delayed),
time_delayed = ifelse(How_Long_Delayed == "one hour", 60, time_delayed),
units_delayed = ifelse(How_Long_Delayed == "one hour", "minutes", units_delayed),
time_delayed = ifelse(How_Long_Delayed == "ONE HOUR", 60, time_delayed),
units_delayed = ifelse(How_Long_Delayed == "ONE HOUR", "minutes", units_delayed),
time_delayed = ifelse(How_Long_Delayed == "HOUR", 60, time_delayed),
units_delayed = ifelse(How_Long_Delayed == "HOUR", "minutes", units_delayed),
time_delayed = ifelse(How_Long_Delayed == "hour", 60, time_delayed),
units_delayed = ifelse(How_Long_Delayed == "hour", "minutes", units_delayed),
time_delayed = ifelse(How_Long_Delayed == "HR1", 60, time_delayed),
units_delayed = ifelse(How_Long_Delayed == "HR1", "minutes", units_delayed),
time_delayed = ifelse(How_Long_Delayed == "1/ 15 min", 75, time_delayed),
time_delayed = ifelse(How_Long_Delayed == "1:20MIN", 80, time_delayed),
time_delayed = ifelse(How_Long_Delayed == "1 20 min", 80, time_delayed),
time_delayed = ifelse(How_Long_Delayed == "1:30 min", 90, time_delayed),
time_delayed = ifelse(How_Long_Delayed == "1:30?mins", 90, time_delayed),
time_delayed = ifelse(How_Long_Delayed == "1 30mnts", 90, time_delayed),
time_delayed = ifelse(How_Long_Delayed == "1 30 min", 90, time_delayed),
time_delayed = ifelse(How_Long_Delayed == "1:40 MINS", 90, time_delayed),
time_delayed = ifelse(How_Long_Delayed == "IHR40MIN", 100, time_delayed),
units_delayed = ifelse(How_Long_Delayed == "IHR40MIN", "minutes", units_delayed),
time_delayed = ifelse(How_Long_Delayed == "3 HRS", 180, time_delayed),
units_delayed = ifelse(How_Long_Delayed == "3 HRS", "minutes", units_delayed),
time_delayed = ifelse(How_Long_Delayed == "3 hr", 180, time_delayed),
units_delayed = ifelse(How_Long_Delayed == "3 hr", "minutes", units_delayed),
time_delayed = ifelse(How_Long_Delayed == "3 HOUR", 180, time_delayed),
units_delayed = ifelse(How_Long_Delayed == "3 HOUR", "minutes", units_delayed),
time_delayed = ifelse(How_Long_Delayed == "4 hours", 240, time_delayed),
units_delayed = ifelse(How_Long_Delayed == "4 hours", "minutes", units_delayed),
time_delayed = ifelse(How_Long_Delayed == "4 hrs", 240, time_delayed),
units_delayed = ifelse(How_Long_Delayed == "4 hrs", "minutes", units_delayed)
) %>%
#sometimes the record just reads 'mins' or equivalent. My 'units_delayed' picks this up as 'minutes'. We take a guess and set this to 5 minutes.
mutate(
time_delayed = ifelse(units_delayed == "minutes" & str_detect(How_Long_Delayed, "[mM][iI][nN]") & is.na(time_delayed), 5, time_delayed)
) %>%
filter(!(is.na(time_delayed))) %>%
filter(time_delayed <= 300) %>% #Any delays beyond 5 hours are probably typos, but also problematic if real.
filter(units_delayed == "minutes") %>% #222 820 / 259 637. The rest are garbage or NA.
select(
Busbreakdown_ID, School_Year, Route_Number, Schools_Serviced, Bus_Company_Name,
Has_Contractor_Notified_Schools, Has_Contractor_Notified_Parents, Have_You_Alerted_OPT, Reason,
Boro, Number_Of_Students_On_The_Bus, School_Age_or_PreK,
Created_On, Occurred_On,
time_delayed
)
#---plot of the time_delayed
library(RColorBrewer)
pal <- brewer.pal(12, "Paired")
reason_levels <- ii_times %>%
group_by(Reason) %>%
mutate(
total_of_reason = n()
) %>% distinct(Reason, total_of_reason) %>%
arrange(total_of_reason) %>%
ungroup() %>%
mutate(total_of_reason = row_number())
ii_times %>%
mutate(Reason = factor(Reason, levels = reason_levels[[1]])) %>%
mutate(time_delayed = time_delayed) %>%
ggplot(aes(x = time_delayed, fill = Reason)) +
geom_bar() +
theme_classic() +
scale_fill_manual(values = pal) +
coord_cartesian(expand = c(0,0), xlim = c(0, 100)) +
labs(x="time_delayed [min]",y="Observations") +
theme(text=element_text(family="Arial", size=16)) +
theme(axis.line = element_line(colour = 'black', size = 1)) +
theme(axis.ticks = element_line(colour = "black", size = 1)) +
ggtitle("Observations by Reason")
A bar graph broken up Reason shows that, as observed when cleaning up the delay data, these are 'human approximated delays'. The driver or attendant (or perhaps someone at operations) has approximated the delay to the nearest convenient time package: 5 minutes, 10 minutes, 15 minutes, 20, 25, 30, 40, 45, 50, 60, 90 minutes, and so on. Some delays have been figured out more accurately by comparing time differences, but most have not. One of the consequences is that if we decide to run a regression model for this data, we can never expect it to be more precise than 5-10 minutes of RMS error. Because that is how the data have been recorded.
Note that there are a few data with delays beyond 100 minutes, and these points will show up later. The graph is presented this way for clarity. Traffic constitutes the major source of delay in the data set; accident is the most rare problem. 'Other', meaning a missing reason or undisclosed reason, is the second most common value in this field.
We can have a look at some of the structure via density plots:
ii_times %>%
mutate(Reason = factor(Reason, levels = reason_levels[[1]])) %>%
mutate(time_delayed = time_delayed) %>%
ggplot(aes(x = time_delayed, fill = Reason)) +
geom_density(aes(y=..count..), size = 0, adjust = 4, position = "stack", color = NA) +
theme_classic() +
scale_fill_manual(values = pal) +
coord_cartesian(expand = c(0,0), xlim = c(0, 100)) +
labs(x="time_delayed [min]",y="Observation count") +
theme(text=element_text(family="Arial", size=16)) +
theme(axis.line = element_line(colour = 'black', size = 1)) +
theme(axis.ticks = element_line(colour = "black", size = 1)) +
ggtitle("Density: observations by Reason")
boro_levels <- ii_times %>%
mutate(
Boro = ifelse(is.na(Boro), "All Boroughs", Boro),
) %>%
group_by(Boro) %>%
mutate(
total_of_boro = n()
) %>% distinct(Boro, total_of_boro) %>%
arrange(total_of_boro) %>%
ungroup() %>%
mutate(total_of_boro = row_number())
ii_times %>%
mutate(
Boro = ifelse(is.na(Boro), "All Boroughs", Boro),
) %>%
mutate(Boro = factor(Boro, levels = boro_levels[[1]])) %>%
ggplot(aes(x = time_delayed, fill = Boro)) +
geom_density(aes(y=..count..), size = 0, adjust = 4, position = "stack", color = NA) +
theme_classic() +
scale_fill_manual(values = pal) +
coord_cartesian(expand = c(0,0), xlim = c(0, 100)) +
labs(x="time_delayed [min]",y="Observation count") +
theme(text=element_text(family="Arial", size=16)) +
theme(axis.line = element_line(colour = 'black', size = 1)) +
theme(axis.ticks = element_line(colour = "black", size = 1)) +
ggtitle("Density: observations by Borough")
Here we must be careful with interpretation, since the density plot presents counts convolved with a gaussian density kernel. This helps us to see structure, but we must be careful interpreting the counts. There are still very few observations with delay time of 22 minutes, for example.
But we can see some structure a little more clearly. For example, Manhattan produces more long-time delays than the Bronx or Staten Island.
In the next set of notes we will try to fit a regression, based only on this data. However, we can already see the need to join more data on as we have few predictors. Before building a model or joining to the other data sets, there is a little more cleaning and dummifying of categorical variables to do:
ii_spread <- ii_times %>%
mutate(
time_diff_report = abs(as.numeric(difftime(Created_On, Occurred_On, units = "mins")))
) %>%
mutate(
reported_before_resolved = (time_delayed > time_diff_report)*1L,
Created_On = NULL,
Has_Contractor_Notified_Schools = (Has_Contractor_Notified_Schools == "Yes")*1L,
Has_Contractor_Notified_Parents = (Has_Contractor_Notified_Parents == "Yes")*1L,
Have_You_Alerted_OPT = (Have_You_Alerted_OPT == "Yes")*1L,
School_Age = (School_Age_or_PreK == "School-Age")*1L,
School_Age_or_PreK = NULL
) %>%
mutate(dummy = 1L,
Reason = str_replace_all(Reason, " ", ""),
Reason = str_replace_all(Reason, "\`", "")
) %>%
spread(key = Reason, value = dummy, fill = 0L, sep = "_") %>%
mutate(Reason_Other = NULL) %>% #default case is Other
mutate(dummy = 1L,
Boro = ifelse(is.na(Boro), "All Boroughs",Boro),
Boro = str_replace_all(Boro, " ", "")
) %>%
spread(key = Boro, value = dummy, fill = 0L, sep = "_") %>%
mutate(Boro_AllBoroughs = NULL) %>% #default case is all
mutate(
Number_Of_Students_On_The_Bus = as.numeric(Number_Of_Students_On_The_Bus) #This field has some problems we will deal with later
) %>%
filter(Number_Of_Students_On_The_Bus <= 80) #Although this cut is arbitrary, we choose not to believe the records with more than 80 people on the bus.