Novel Three-Stage Framework for Prioritizing and Selecting Feature Variables for Short-Term Metro Passenger Flow Prediction

Document Type

Journal Article

Publication Date


Subject Area

mode - subway/metro, place - asia, ridership - demand, planning - methods, technology - passenger information, technology - ticketing systems


Keywords

Metro, Automatic Fare Collection (AFC), passenger flow prediction


Abstract

Short-term metro passenger flow prediction is vital for the operation and management of metro systems. Most studies focus on achieving higher prediction accuracy with statistical and machine learning methods, but little attention has been paid to the prioritization and selection of feature variables, especially for different metro station types. This study analyzes the effect of feature variables on the prediction results and then selects appropriate predictor variables accordingly. A novel three-stage framework is proposed to prioritize feature variables for short-term metro passenger flow prediction, comprising station clustering, feature extraction, and variable prioritization. An agglomerative hierarchical clustering (AHC) algorithm is developed for station clustering, and its results are verified with K-means and the Davies-Bouldin (DB) index. Temporal, spatial, and external features are then extracted. Finally, the association between the variables and the prediction results is explored using tree-based models. The proposed framework is demonstrated and validated with data collected from the Shanghai Metro Automatic Fare Collection (AFC) system. The results show that the importance of feature variables for developing models varies between stations, while only a few variables explain most of the variation in the testing dataset. Different feature variables lead to distinct differences in prediction accuracy, and simply adding more predictor variables does not necessarily lead to higher prediction accuracy. In addition, the station type and prediction type (i.e., tap-in and tap-out) have little influence on the selection of feature variables.
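As an illustration of the clustering check mentioned in the abstract, the following is a minimal sketch of the standard Davies-Bouldin index, where lower values indicate more compact and better separated clusters. It is not the authors' implementation: the function names, the use of Euclidean distance, and the four two-dimensional station profiles are assumptions made up for this example, not data from the study.

```python
from math import dist
from statistics import fmean


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(fmean(axis) for axis in zip(*points))


def davies_bouldin(points, labels):
    """Davies-Bouldin index with Euclidean distance (lower is better).

    Assumes at least two clusters whose centroids do not coincide.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    centers = {k: centroid(v) for k, v in clusters.items()}
    # Scatter: mean distance from each member to its cluster centroid.
    scatter = {
        k: fmean(dist(p, centers[k]) for p in v) for k, v in clusters.items()
    }
    # For each cluster, keep its worst (largest) similarity ratio to any other.
    worst = [
        max(
            (scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return fmean(worst)


# Hypothetical station profiles (illustrative values only).
profiles = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
compact_labels = [0, 0, 1, 1]    # groups nearby stations together
scrambled_labels = [0, 1, 0, 1]  # groups distant stations together

print(davies_bouldin(profiles, compact_labels))
print(davies_bouldin(profiles, scrambled_labels))
```

In this toy case the compact labelling yields a much lower index than the scrambled one, which is the kind of comparison that could be used to choose between candidate clusterings.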


Permission to publish the abstract has been given by SAGE, copyright remains with them.