This notebook is a companion to the MTAccessibility web dashboard which displays a subset of the project’s findings. The notebook and dashboard, taken together, are a submission to the 2024 MTA Open Data Challenge, and so between the two, there is a bit of overlap. The main difference is that in this notebook, we show the data transformations that led to the results, and a few other ancillary insights gained during exploratory data analysis which ultimately led to the creation of the dashboard (which itself focuses on only one finding related to Senior and Fair Fare ridership).
2 Introduction
New York City’s public transit system is a lifeline for millions, including seniors and low-income residents who rely on it for daily commutes, errands, and access to essential services. Understanding how these populations use the subway system is vital for ensuring that transit services remain accessible, equitable, and responsive to their needs.
This accessibility analysis aims to shed light on ridership patterns, particularly among seniors and Fair Fare riders (low-income individuals who qualify for reduced fare programs). By studying the spatial patterns of subway ridership relative to the demographics of surrounding neighborhoods, we can uncover critical insights into how well the MTA serves vulnerable populations and where opportunities for strategic intervention might exist. These insights will enable the MTA to allocate resources more effectively, improve accessibility where it is most needed, and ensure that no community is left behind in the provision of transit services.
The findings of this study could have significant implications for policy and planning. For instance, above-average ridership in certain areas might point to unmet transportation needs, while below-average ridership could reveal opportunities to enhance alternative modes of transit or improve service in underutilized areas.
3 Data and Methods
For this analysis, we used three key datasets:
MTA Subway Hourly Ridership: Beginning February 2022: This dataset provides ridership counts by fare group (seniors/disability and Fair Fare riders). It gives us a clear picture of how these populations use the subway system.
MTA Subway Entrances and Exits: 2024: This station-level dataset provides information about station accessibility, including elevators and escalators, which is crucial for evaluating station suitability for vulnerable populations.
Census Tract Data from the 2022 American Community Survey (ACS): Census tract data is used to estimate the proportion of seniors (age 65+) and low-income individuals (those below the poverty line), serving as proxies for the two ridership groups of interest.
The goal of this study was to assess the relationship between the demographic characteristics of census tracts and the ridership patterns of seniors and low-income populations at nearby subway stations. We assigned each census tract to its nearest subway station, then used linear regression to model the relationship between census data and station data in terms of senior and Fair Fare population proportion versus ridership proportion. Census tracts were then classified as having average, above average, or below average ridership based on how their ridership patterns compared to ridership at all other census tracts.
4 Exploratory Data Analysis
Load some libraries for this analysis.
Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
Code
library(data.table)
Attaching package: 'data.table'
The following objects are masked from 'package:lubridate':
hour, isoweek, mday, minute, month, quarter, second, wday, week,
yday, year
The following objects are masked from 'package:dplyr':
between, first, last
The following object is masked from 'package:purrr':
transpose
Our analysis is built around two main sources of Open MTA data, accessed 2024-10-13 from the NYC Open data Portal:
Hourly ridership data from February 2022 to October 2024. These represent entries to a station and contain payment information.
MTA station geolocation. We will use the station complex geolocation in this analysis, but a future study could analyze all entrances and exits in this dataset to determine how physical accessibility impacts different communities.
Let’s do some light exploratory data analysis (EDA) to get familiar with the data.
Code
# Data accessed 2024-10-13.# Read stations data frame and convert to spatial object. sta <-read_csv("data/MTA_Subway_Entrances_and_Exits__2024_20241013.csv") %>%select(-entrance_georeference) %>%clean_names() %>%mutate(entrance_type2 =ifelse(entrance_type =="Elevator", "Elevator", "Other")) %>%st_as_sf(coords =c("entrance_longitude", "entrance_latitude"), crs =4329) # hourly ridershiphr <-fread("data/MTA_Subway_Hourly_Ridership__Beginning_February_2022_20241013.csv")
Let’s view the stations on a map. Stations with elevators appear relatively even distributed across space.
Next, we filter the hourly ridership data to one calendar year, from 2023-01-01 to 2023-12-31, and aggregate all ridership groups to one of the following four groups: full fare, fair fare, student, and senior/disability. Then, we summarize the ridership count per fare group, day of week, hour of the day, and borough.
We observe that full fare riders comprise around 80-90% of riders and they are more like to ride during rush hour, and at night. Seniors and students don’t ride as much at night. Students interestingly are less likely to ride on the weekend.
Code
group_full_fare <-c("Metrocard - Full Fare", "Metrocard - Unlimited 7-Day", "Metrocard - Unlimited 30-Day", "OMNY - Full Fare", "Metrocard - Other", "OMNY - Other")group_fair_fare <-c("Metrocard - Fair Fare", "OMNY - Fair Fare")group_student <-c("Metrocard - Students", "OMNY - Students")group_senior <-c("OMNY - Seniors & Disability", "Metrocard - Seniors & Disability")# by boroughhrb <- hr %>%mutate(transit_timestamp =mdy_hms(transit_timestamp),hour =hour(transit_timestamp), dow = lubridate::wday(transit_timestamp, label =TRUE),fare_group =case_when( fare_class_category %in% group_senior ~"Seniors & Disability", fare_class_category %in% group_full_fare ~"Full Fare", fare_class_category %in% group_student ~"Students", fare_class_category %in% group_fair_fare ~"Fair Fare",TRUE~"Other"# optional to catch anything not covered by the case_when ) ) %>%filter( transit_timestamp >=ymd_hms("2023-01-01 00:00:00") & transit_timestamp <=ymd_hms("2023-12-31 24:00:00")) %>%group_by(fare_group, dow, hour, borough) %>%# because we're grouping by the day of week, and because, ignoring leap years,# there are 52 Mondays in a year (and so on), we divide the annual ridership# by 52 to arrive at an average weekly ridershipsummarize(ridership =sum(ridership)/52) %>%ungroup()hrb %>%group_by(dow, hour, borough) %>%mutate(total_ridership =sum(ridership), ridership_proportion = ridership / total_ridership) %>%ungroup() %>%ggplot(aes(hour, ridership_proportion, color = borough)) +geom_line() +scale_x_continuous(limits =c(0, 25),breaks =seq(0, 24, by =6), # Breaks every 6 hourslabels =sprintf("%02d:00", seq(0, 24, by =6)) # Format labels as "HH:00" ) + rcartocolor::scale_color_carto_d(palette ="Bold") +labs(title ="Average daily ridership, per hour, per bouroughs, across fare groups",x ="",y ="",color ="" ) +facet_grid(fare_group~dow, scales ="free") +theme_minimal(base_size =9) +theme(panel.grid.minor =element_blank(), panel.grid.major.y =element_blank(),axis.text.x =element_text(angle =90, vjust =0.5, hjust =1),legend.position ="bottom" )
4.2 Hourly Ridership Curves
By borough average stats are interesting, but let’s go one spatial level deeper and recalculate ridership per station complex. This will not go into our map, but it is a level of grouped summary that we can visualize for each station, and average ridership proportions per station, per fare group, per hour of the day will reveal the flow dynamics of urban commuters.
Code
# by station ridershiphrs <- hr %>%mutate(transit_timestamp =mdy_hms(transit_timestamp),hour =hour(transit_timestamp),dow = lubridate::wday(transit_timestamp, label =TRUE),fare_group =case_when( fare_class_category %in% group_senior ~"Seniors & Disability", fare_class_category %in% group_full_fare ~"Full Fare", fare_class_category %in% group_student ~"Students", fare_class_category %in% group_fair_fare ~"Fair Fare",TRUE~"Other"# optional to catch anything not covered by the case_when ) ) %>%filter( transit_timestamp >=ymd_hms("2023-01-01 00:00:00") & transit_timestamp <=ymd_hms("2023-12-31 24:00:00") ) %>%# Calculate the average annual ridership count per day of week and hour of daygroup_by(fare_group, dow, hour, station_complex, station_complex_id) %>%summarize(ridership =mean(ridership)) %>%ungroup()# For each fare group, calculate the total ridership and ridership proportionhrs <- hrs %>%group_by(dow, hour, station_complex, station_complex_id) %>%mutate(total_ridership =sum(ridership), ridership_proportion = ridership / total_ridership) %>%ungroup() %>%mutate(fare_group =factor( fare_group,levels =c("Seniors & Disability", "Fair Fare", "Full Fare", "Students") ))sta_mapgl <- hr %>%group_by(station_complex_id) %>%slice(1) %>%ungroup() %>%st_as_sf(coords =c("longitude", "latitude"), crs =4269) %>%select(station_complex_id, station_complex)# Write all plots to a PDFplots <-vector("list", nrow(sta_mapgl))for(i inseq_along(sta_mapgl$station_complex)){ station = sta_mapgl$station_complex[i] plots[[i]] = hrs %>%filter(station_complex == station) %>%mutate(dow_upscale =ifelse(dow %in%c("Sat","Sun"),"weekend","weekday")) %>%ggplot(aes(hour, ridership, group = dow, color = dow_upscale)) +geom_line(alpha =0.7, key_glyph ="timeseries") +scale_color_manual(values =c("#7570b3", "#1b9e77")) +scale_x_continuous(limits =c(0, 25),breaks =seq(0, 24, by =6), # Breaks every 6 hourslabels =sprintf("%02d:00", seq(0, 24, by =6)) # Format labels as "HH:00" ) +facet_wrap(~fare_group, ncol =1, scales ="free") +labs(title =NULL,subtitle =glue("{station}\nAverage hourly ridership"),y =NULL,x =NULL,color ="",caption ="Data from 2023-01-01 to 2023-12-31" ) +theme_minimal() +theme(panel.grid.minor =element_blank(), legend.position ="bottom" )}# Ran this once, but don't need to do it for the report# pdf("out/station_hourly_avg_per_rider.pdf", height = 4, width = 6)# plots# dev.off()
Morning peaking stations likely reflect the inflow of workers into the dense urban core, whereas evening peaking stations might reflect the opposite: the outflow of workers away from urban core. The examples below shown below seem to support this hypothesis. Stations with a more balanced morning and evening peak might be explained by a balance of residential and commercial surroundings that both supply and receive commuters.