This notebook is a companion to the MTAccessibility web dashboard which displays a subset of the project’s findings. The notebook and dashboard, taken together, are a submission to the 2024 MTA Open Data Challenge, and so between the two, there is a bit of overlap. The main difference is that in this notebook, we show the data transformations that led to the results, and a few other ancillary insights gained during exploratory data analysis which ultimately led to the creation of the dashboard (which itself focuses on only one finding related to Senior and Fair Fare ridership).
2 Introduction
New York City’s public transit system is a lifeline for millions, including seniors and low-income residents who rely on it for daily commutes, errands, and access to essential services. Understanding how these populations use the subway system is vital for ensuring that transit services remain accessible, equitable, and responsive to their needs.
This accessibility analysis aims to shed light on ridership patterns, particularly among seniors and Fair Fare riders (low-income individuals who qualify for reduced fare programs). By studying the spatial patterns of subway ridership relative to the demographics of surrounding neighborhoods, we can uncover critical insights into how well the MTA serves vulnerable populations and where opportunities for strategic intervention might exist. These insights will enable the MTA to allocate resources more effectively, improve accessibility where it is most needed, and ensure that no community is left behind in the provision of transit services.
The findings of this study could have significant implications for policy and planning. For instance, above-average ridership in certain areas might point to unmet transportation needs, while below-average ridership could reveal opportunities to enhance alternative modes of transit or improve service in underutilized areas.
3 Data and Methods
For this analysis, we used three key datasets:
MTA Subway Hourly Ridership: Beginning February 2022: This dataset provides ridership counts by fare group (seniors/disability and Fair Fare riders). It gives us a clear picture of how these populations use the subway system.
MTA Subway Entrances and Exits: 2024: This station-level dataset provides information about station accessibility, including elevators and escalators, which is crucial for evaluating station suitability for vulnerable populations.
Census Tract Data from the 2022 American Community Survey (ACS): Census tract data is used to estimate the proportion of seniors (age 65+) and low-income individuals (those below the poverty line), serving as proxies for the two ridership groups of interest.
The goal of this study was to assess the relationship between the demographic characteristics of census tracts and the ridership patterns of seniors and low-income populations at nearby subway stations. We assigned each census tract to its nearest subway station, then used linear regression to model the relationship between census data and station data in terms of senior and Fair Fare population proportion versus ridership proportion. Census tracts were then classified as having average, above average, or below average ridership based on how their ridership patterns compared to ridership at all other census tracts.
4 Exploratory Data Analysis
Load some libraries for this analysis.
Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
Code
library(data.table)
Attaching package: 'data.table'
The following objects are masked from 'package:lubridate':
hour, isoweek, mday, minute, month, quarter, second, wday, week,
yday, year
The following objects are masked from 'package:dplyr':
between, first, last
The following object is masked from 'package:purrr':
transpose
Our analysis is built around two main sources of Open MTA data, accessed 2024-10-13 from the NYC Open data Portal:
Hourly ridership data from February 2022 to October 2024. These represent entries to a station and contain payment information.
MTA station geolocation. We will use the station complex geolocation in this analysis, but a future study could analyze all entrances and exits in this dataset to determine how physical accessibility impacts different communities.
Let’s do some light exploratory data analysis (EDA) to get familiar with the data.
Code
# Data accessed 2024-10-13.# Read stations data frame and convert to spatial object. sta <-read_csv("data/MTA_Subway_Entrances_and_Exits__2024_20241013.csv") %>%select(-entrance_georeference) %>%clean_names() %>%mutate(entrance_type2 =ifelse(entrance_type =="Elevator", "Elevator", "Other")) %>%st_as_sf(coords =c("entrance_longitude", "entrance_latitude"), crs =4329) # hourly ridershiphr <-fread("data/MTA_Subway_Hourly_Ridership__Beginning_February_2022_20241013.csv")
Let’s view the stations on a map. Stations with elevators appear relatively even distributed across space.
Next, we filter the hourly ridership data to one calendar year, from 2023-01-01 to 2023-12-31, and aggregate all ridership groups to one of the following four groups: full fare, fair fare, student, and senior/disability. Then, we summarize the ridership count per fare group, day of week, hour of the day, and borough.
We observe that full fare riders comprise around 80-90% of riders and they are more like to ride during rush hour, and at night. Seniors and students don’t ride as much at night. Students interestingly are less likely to ride on the weekend.
Code
group_full_fare <-c("Metrocard - Full Fare", "Metrocard - Unlimited 7-Day", "Metrocard - Unlimited 30-Day", "OMNY - Full Fare", "Metrocard - Other", "OMNY - Other")group_fair_fare <-c("Metrocard - Fair Fare", "OMNY - Fair Fare")group_student <-c("Metrocard - Students", "OMNY - Students")group_senior <-c("OMNY - Seniors & Disability", "Metrocard - Seniors & Disability")# by boroughhrb <- hr %>%mutate(transit_timestamp =mdy_hms(transit_timestamp),hour =hour(transit_timestamp), dow = lubridate::wday(transit_timestamp, label =TRUE),fare_group =case_when( fare_class_category %in% group_senior ~"Seniors & Disability", fare_class_category %in% group_full_fare ~"Full Fare", fare_class_category %in% group_student ~"Students", fare_class_category %in% group_fair_fare ~"Fair Fare",TRUE~"Other"# optional to catch anything not covered by the case_when ) ) %>%filter( transit_timestamp >=ymd_hms("2023-01-01 00:00:00") & transit_timestamp <=ymd_hms("2023-12-31 24:00:00")) %>%group_by(fare_group, dow, hour, borough) %>%# because we're grouping by the day of week, and because, ignoring leap years,# there are 52 Mondays in a year (and so on), we divide the annual ridership# by 52 to arrive at an average weekly ridershipsummarize(ridership =sum(ridership)/52) %>%ungroup()hrb %>%group_by(dow, hour, borough) %>%mutate(total_ridership =sum(ridership), ridership_proportion = ridership / total_ridership) %>%ungroup() %>%ggplot(aes(hour, ridership_proportion, color = borough)) +geom_line() +scale_x_continuous(limits =c(0, 25),breaks =seq(0, 24, by =6), # Breaks every 6 hourslabels =sprintf("%02d:00", seq(0, 24, by =6)) # Format labels as "HH:00" ) + rcartocolor::scale_color_carto_d(palette ="Bold") +labs(title ="Average daily ridership, per hour, per bouroughs, across fare groups",x ="",y ="",color ="" ) +facet_grid(fare_group~dow, scales ="free") +theme_minimal(base_size =9) +theme(panel.grid.minor =element_blank(), panel.grid.major.y =element_blank(),axis.text.x =element_text(angle =90, vjust =0.5, hjust =1),legend.position ="bottom" )
4.2 Hourly Ridership Curves
By borough average stats are interesting, but let’s go one spatial level deeper and recalculate ridership per station complex. This will not go into our map, but it is a level of grouped summary that we can visualize for each station, and average ridership proportions per station, per fare group, per hour of the day will reveal the flow dynamics of urban commuters.
Code
# by station ridershiphrs <- hr %>%mutate(transit_timestamp =mdy_hms(transit_timestamp),hour =hour(transit_timestamp),dow = lubridate::wday(transit_timestamp, label =TRUE),fare_group =case_when( fare_class_category %in% group_senior ~"Seniors & Disability", fare_class_category %in% group_full_fare ~"Full Fare", fare_class_category %in% group_student ~"Students", fare_class_category %in% group_fair_fare ~"Fair Fare",TRUE~"Other"# optional to catch anything not covered by the case_when ) ) %>%filter( transit_timestamp >=ymd_hms("2023-01-01 00:00:00") & transit_timestamp <=ymd_hms("2023-12-31 24:00:00") ) %>%# Calculate the average annual ridership count per day of week and hour of daygroup_by(fare_group, dow, hour, station_complex, station_complex_id) %>%summarize(ridership =mean(ridership)) %>%ungroup()# For each fare group, calculate the total ridership and ridership proportionhrs <- hrs %>%group_by(dow, hour, station_complex, station_complex_id) %>%mutate(total_ridership =sum(ridership), ridership_proportion = ridership / total_ridership) %>%ungroup() %>%mutate(fare_group =factor( fare_group,levels =c("Seniors & Disability", "Fair Fare", "Full Fare", "Students") ))sta_mapgl <- hr %>%group_by(station_complex_id) %>%slice(1) %>%ungroup() %>%st_as_sf(coords =c("longitude", "latitude"), crs =4269) %>%select(station_complex_id, station_complex)# Write all plots to a PDFplots <-vector("list", nrow(sta_mapgl))for(i inseq_along(sta_mapgl$station_complex)){ station = sta_mapgl$station_complex[i] plots[[i]] = hrs %>%filter(station_complex == station) %>%mutate(dow_upscale =ifelse(dow %in%c("Sat","Sun"),"weekend","weekday")) %>%ggplot(aes(hour, ridership, group = dow, color = dow_upscale)) +geom_line(alpha =0.7, key_glyph ="timeseries") +scale_color_manual(values =c("#7570b3", "#1b9e77")) +scale_x_continuous(limits =c(0, 25),breaks =seq(0, 24, by =6), # Breaks every 6 hourslabels =sprintf("%02d:00", seq(0, 24, by =6)) # Format labels as "HH:00" ) +facet_wrap(~fare_group, ncol =1, scales ="free") +labs(title =NULL,subtitle =glue("{station}\nAverage hourly ridership"),y =NULL,x =NULL,color ="",caption ="Data from 2023-01-01 to 2023-12-31" ) +theme_minimal() +theme(panel.grid.minor =element_blank(), legend.position ="bottom" )}# Ran this once, but don't need to do it for the report# pdf("out/station_hourly_avg_per_rider.pdf", height = 4, width = 6)# plots# dev.off()
Morning peaking stations likely reflect the inflow of workers into the dense urban core, whereas evening peaking stations might reflect the opposite: the outflow of workers away from urban core. The examples below shown below seem to support this hypothesis. Stations with a more balanced morning and evening peak might be explained by a balance of residential and commercial surroundings that both supply and receive commuters.
Finally, we calculate a third grouped summary from our hourly data (the annual ridership count per station complex), which we will compare against Census tract level data. From this summary level, we can visualize the total ridership count per station. Let’s look at the top 25 stations. Just over 40 million riders entered Times Square in 2023.
Code
# this is for the maphrs_annual <- hr %>%mutate(transit_timestamp =mdy_hms(transit_timestamp),hour =hour(transit_timestamp),dow = lubridate::wday(transit_timestamp, label =TRUE),fare_group =case_when( fare_class_category %in% group_senior ~"Seniors & Disability", fare_class_category %in% group_full_fare ~"Full Fare", fare_class_category %in% group_student ~"Students", fare_class_category %in% group_fair_fare ~"Fair Fare",TRUE~"Other"# optional to catch anything not covered by the case_when ) ) %>%filter( transit_timestamp >=ymd_hms("2023-01-01 00:00:00") & transit_timestamp <=ymd_hms("2023-12-31 24:00:00") ) %>%# per fare group and station complex, what's the annual ridership countgroup_by(fare_group, station_complex, station_complex_id) %>%summarize(ridership =sum(ridership, na.rm =TRUE)) %>%ungroup()# now calculate the proportion of ridership at each station per fare grouphrs_annual <- hrs_annual %>%group_by(station_complex, station_complex_id) %>%mutate(total_ridership =sum(ridership), ridership_proportion = ridership / total_ridership) %>%ungroup() # And visualize the top 25 stations in YC by total annual ridershiphrs_annual %>%mutate(fare_group =factor( fare_group, levels =c("Fair Fare", "Seniors & Disability", "Students", "Full Fare")) ) %>%arrange(desc(total_ridership)) %>%slice(1:100) %>%# 4 fare groups, so this takes the top 25 stations# filter(station_complex %in% unique(hr$station_complex)[1:30]) %>%ggplot(aes(ridership, fct_reorder(station_complex, total_ridership), fill = fare_group)) +geom_col() + rcartocolor::scale_fill_carto_d(palette ="Bold", guide =guide_legend(reverse =TRUE) ) +scale_x_continuous(labels = scales::label_comma()) +labs(title ="Total annual ridership (2023) for the top 25 stations in NYC",x ="Annual Ridership Count",y ="",fill ="" ) +theme_minimal(base_size =9) +theme(panel.grid.minor =element_blank(), panel.grid.major.y =element_blank(),legend.position ="bottom" )
Now we join ridership data to the station spatial data.
Code
sta_complex_sf <- hr %>%group_by(station_complex_id) %>%slice(1) %>%ungroup() %>%st_as_sf(coords =c("longitude", "latitude"), crs =4269) %>%select(station_complex_id)# Who takes transit (full fare, seniors/disability, students, fair fare) by station complexsta_ridership <- hrs_annual %>%left_join(sta_complex_sf) # data frame for seniors and fare faresta_seniors <- sta_ridership %>%filter(fare_group =="Seniors & Disability") %>%st_as_sf() %>%rename(seniors_prop_sta = ridership_proportion,seniors_ridership_sta = ridership )sta_fair <- sta_ridership %>%filter(fare_group =="Fair Fare") %>%st_as_sf() %>%rename(low_income_prop_sta = ridership_proportion,low_income_ridership_sta = ridership )
We can visualize the ridership proportion of Seniors/Disability and Fair Fare populations per station and see that seniors1 and Fair Fare riders tend to use stations outside of denser urban metros. This trend is particularity evident for Fair Fare riders, where ridership proportions in Harlem and Queens can be 3-5 times the ridership proportion in downtown Manhattan.
Now it’s time to pull Census Tract data2. Critically, we are interested in total population, senior population, and population below the below the poverty line, which we will assume is a proxy for the Fair Fare riders3. From these variables we can calculate the proportion of potential senior and fair fare riders. From the Census tract spatial boundaries, we can find the nearest MTA Station.
Code
variables <-c(median_age ="B01002_001", median_income ="B19013_001",total_population ="B17021_001", below_poverty ="B17021_002")# Get the most recent ACS data with spatial polygons for New York Citynyc_acs <-get_acs(geography ="tract", # Get data at the census tract levelvariables = variables, # Age and income variablesstate ="NY", # State of New Yorkcounty =c("Bronx", "Kings", "New York", "Queens", "Richmond"), # NYC boroughsyear =2022, # Most recent ACS yearsurvey ="acs5", # 5-year ACS surveygeometry =TRUE# Include spatial data (polygons))# Define the breaks and custom labelsbreaks <-seq(0, 1, by =0.2)labels <-c("0-0.2", "0.2-0.4", "0.4-0.6", "0.6-0.8", "0.8-1")nyc <- nyc_acs %>%pivot_wider(names_from = variable, values_from =c(estimate, moe)) %>%mutate(estimate_below_poverty_prop = estimate_below_poverty/estimate_total_population,estimate_below_poverty_prop_bin =cut(estimate_below_poverty_prop, breaks = breaks, labels = labels, include.lowest =TRUE) )
Now we calculate the population and proportion of seniors (65 years of age or greater).
Code
# add population seniors# Variables for men and women 65+variables <-c(males_65_to_66 ="B01001_020",males_67_to_69 ="B01001_021",males_70_to_74 ="B01001_022",males_75_to_79 ="B01001_023",males_80_to_84 ="B01001_024",males_85_and_over ="B01001_025",females_65_to_66 ="B01001_044",females_67_to_69 ="B01001_045",females_70_to_74 ="B01001_046",females_75_to_79 ="B01001_047",females_80_to_84 ="B01001_048",females_85_and_over ="B01001_049")# Get the data for New York Citynyc_seniors_combined <-get_acs(geography ="tract",variables = variables,state ="NY",county =c("Bronx", "Kings", "New York", "Queens", "Richmond"), # NYC boroughsyear =2022,survey ="acs5")# Summarize the total senior population (65+) by GEOIDnyc_seniors_combined_totals <- nyc_seniors_combined %>%group_by(GEOID) %>%summarize(seniors_population_acs =sum(estimate, na.rm =TRUE))# join back to nycnyc <- nyc %>%left_join(nyc_seniors_combined_totals) %>%mutate(seniors_prop_acs = seniors_population_acs/estimate_total_population)
Next, we find the nearest MTA station to each Census Tract and join this back to our main dataframe of Census Tract data.
Code
# Find the index of the nearest subway station for each polygon in 'nyc'nearest_station_index <-st_nearest_feature(nyc, sta_complex_sf)# Use the index to add the closest subway station information to the 'nyc' polygonsnyc <- nyc %>%mutate(station_complex_id = sta_complex_sf$station_complex_id[nearest_station_index])# add seniors proportion from station datanyc <- nyc %>%left_join( sta_seniors %>%select(-fare_group) %>%st_drop_geometry() ) # add low income proportion from station datanyc <- nyc %>%rename(low_income_prop_acs = estimate_below_poverty_prop) %>%left_join( sta_fair %>%select(-fare_group) %>%st_drop_geometry() )
4.4 Senior and Fair Fare Ridership
Next, we rank every Census Tract in terms of its actual Senior and Fair Fare ridership proportion relative to what would theoretically be expected based on the population of these individuals in the Census Tract. We can think of this as a “station utilization” metric for these populations and use the results as a springboard for hypothesis generation and interventions.
To proceed, we assume a roughly linear relationship between the proportion of Seniors in a Census Tract and the proportion of Senior ridership at that tract’s closest MTA Station4. We make the same assumption for Fair Fare riders.
It follows then, that outliers from this linear trend–which we can quantify with the standard error of the least squares regression fit–represent deviations from “average ridership,” relative to all other Census Tracts. In other words, we ask: compared to all other census tracts, is Senior and Fair Fare Ridership average, above average, or below average? Put yet another way, if the station ridership proportion (y) falls within the standard error (σy^) envelope of the least squares fit (y^), which is the average trend, then the station has average ridership, and if the ridership proportion falls outside of this envelope, it has above or below ridership.
Average Ridership:Below Average Ridership:Above Average Ridership:y^−σy^≤y≤y^+σy^y<y^−σy^y>y^+σy^
Next, we fit the models and show the graphical interpretation of the approach.
Code
# cleaning for linear modelnyc <- nyc %>%# rm 4 rows where total population is 0, so seniors prop is infinitefilter(estimate_total_population >0& seniors_prop_acs <0.9) nyc <- nyc %>%mutate(residual_senior =residuals(lm(seniors_prop_sta ~ seniors_prop_acs, data = nyc)),residual_low_income =residuals(lm(low_income_prop_sta ~ low_income_prop_acs, data = nyc)) ) %>%# Get the standard error from the model for residual_senior and residual_low_incomemutate(se_senior =summary(lm(seniors_prop_sta ~ seniors_prop_acs, data = nyc))$sigma,se_low_income =summary(lm(low_income_prop_sta ~ low_income_prop_acs, data = nyc))$sigma ) %>%# Create categories based on residuals using standard error (SE) thresholdsmutate(seniors_category =case_when(# residual_senior > 2 * se_senior ~ "Highly Overutilized", residual_senior > se_senior ~"Above Average", residual_senior >=-se_senior & residual_senior <= se_senior ~"Average", # residual_senior < -2 * se_senior ~ "Highly Underutilized", residual_senior <-se_senior ~"Below Average" ),low_income_category =case_when(# residual_low_income > 2 * se_low_income ~ "Highly Overutilized", residual_low_income > se_low_income ~"Above Average", residual_low_income >=-se_low_income & residual_low_income <= se_low_income ~"Average",# residual_low_income < -2 * se_low_income ~ "Highly Underutilized", residual_low_income <-se_low_income ~"Below Average" ) )lvls <-c("Above Average", "Average", "Below Average")nyc <- nyc %>%# Convert utilization categories to an ordered factormutate(seniors_category =factor(seniors_category, levels = lvls, ordered =TRUE),low_income_category =factor(low_income_category, levels = lvls, ordered =TRUE) )# Fair Farep1 <- nyc %>%ggplot(aes(low_income_prop_acs, low_income_prop_sta)) +geom_point(aes(color = low_income_category), alpha =0.3) +geom_smooth(method ="lm", se =FALSE) +labs(x ="Proportion Below Poverty (Census)",y ="Proportion Fair Fare Ridership (MTA)",color ="" ) + rcartocolor::scale_color_carto_d(palette ="Bold") +theme_minimal() +theme(panel.grid.minor =element_blank(), legend.position ="bottom" )p2 <- nyc %>%ggplot(aes(low_income_prop_acs, low_income_prop_sta)) +geom_point(aes(color = residual_low_income)) +scale_color_gradient2(low ="red", mid ="white", high ="blue", midpoint =0) +geom_smooth(method ="lm", se =FALSE) +labs(x ="Proportion Below Poverty (Census)",y ="Proportion Fair Fare Ridership (MTA)",color ="Residual" ) +theme_minimal() +theme(panel.grid.minor =element_blank(), legend.position ="bottom" )# Seniorsp3 <- nyc %>%ggplot(aes(seniors_prop_acs, seniors_prop_sta)) +geom_point(aes(color = seniors_category), alpha =0.3) +geom_smooth(method ="lm", se =FALSE) +labs(x ="Proportion Senior (Census)",y ="Proportion Senior Ridership (MTA)",color ="" ) + rcartocolor::scale_color_carto_d(palette ="Vivid") +theme_minimal() +theme(panel.grid.minor =element_blank(), legend.position ="bottom" )p4 <- nyc %>%ggplot(aes(seniors_prop_acs, seniors_prop_sta)) +geom_point(aes(color = residual_senior)) +scale_color_gradient2(low ="red", mid ="white", high ="blue", midpoint =0) +geom_smooth(method ="lm", se =FALSE) +labs(x ="Proportion Senior (Census)",y ="Proportion Senior Ridership (MTA)",color ="Residual" ) +theme_minimal() +theme(panel.grid.minor =element_blank(), legend.position ="bottom" )