Have you ever wondered how technology can change the way we predict arrival times for taxis? Imagine a future where uncertainty in arrival times is largely a thing of the past, thanks to deep learning. In this post, we use neural networks (more precisely, the Kolmogorov-Arnold Networks, or KANs, that we reviewed in a previous post) together with feature engineering to predict the Estimated Time of Arrival (ETA) of taxis in the bustling streets of New York City.
What Is ETA Prediction?
Estimating the time of arrival (ETA) for taxis has long been a challenge, influenced by myriad factors such as traffic conditions, weather, and driver behavior. Traditional methods often struggled to provide precise ETAs, leading to frustration and inconvenience for passengers. However, the emergence of deep learning has unlocked new possibilities in this domain.
NYC Taxi Dataset
The dataset comes from a trip-duration prediction competition. It is derived from the 2016 NYC Yellow Cab trip records, which are available in BigQuery on the Google Cloud Platform. This data, initially released by the NYC Taxi and Limousine Commission (TLC), has been sampled and cleaned specifically for the competition. Participants are tasked with predicting the duration of each trip in the test set based on various trip attributes. The dataset comprises 1,458,644 records, with trip details such as IDs, timestamps, passenger counts, locations, and trip durations. After the feature engineering operations described below, the final dataset contains many more features.
Deep Learning: A Game-Changer in ETA Prediction
Deep learning, a subset of artificial intelligence, leverages neural networks to analyze vast amounts of data and extract meaningful patterns. By feeding these networks with historical taxi trip data, weather conditions, traffic updates, and other relevant variables, we can train them to make accurate predictions. In the context of NYC taxi services, these networks can learn from past experiences to anticipate future arrival times, taking into account real-time factors that impact travel durations. For this use case, we will use the KANs that we reviewed in a previous post as well as simpler neural networks. We will compare these deep neural networks to XGBoost and to a local OpenStreetMap (OSM) route engine.
Feature Engineering Operations
Often, a few basic features are enough to derive the important features that help us model the problem at hand. For the ETA task, the GPS points and timestamps can take us a long way. But first, let's look at some data quality issues.
It seems that some trips have durations that are way out of bounds, unless the passengers decided to live in the cab for some years. Of course, these samples are removed. In addition, based on their GPS coordinates, some samples appear to be in other countries or even in the water. We exclude these samples from the analysis as well.
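A minimal sketch of this kind of filtering is shown here. The duration cap and the bounding box are hypothetical values chosen for illustration; the post does not state the exact cut-offs that were used. A plain bounding box also cannot detect points that fall in the water, which would need a check against a land polygon.

```python
import numpy as np

# Hypothetical thresholds (assumptions, not the values used in the post).
MAX_DURATION_S = 6 * 3600      # treat trips longer than 6 hours as invalid
LAT_RANGE = (40.45, 41.00)     # rough latitude bounds around NYC
LON_RANGE = (-74.30, -73.60)   # rough longitude bounds around NYC

def in_bounds(lat, lon):
    """True where a (lat, lon) point falls inside the rough NYC bounding box."""
    lat, lon = np.asarray(lat), np.asarray(lon)
    return ((lat >= LAT_RANGE[0]) & (lat <= LAT_RANGE[1])
            & (lon >= LON_RANGE[0]) & (lon <= LON_RANGE[1]))

def valid_trip_mask(duration_s, p_lat, p_lon, d_lat, d_lon):
    """Boolean mask of trips with a plausible duration and in-bounds endpoints."""
    duration_s = np.asarray(duration_s)
    ok_duration = (duration_s > 0) & (duration_s <= MAX_DURATION_S)
    return ok_duration & in_bounds(p_lat, p_lon) & in_bounds(d_lat, d_lon)
```

The resulting mask can be used to index the trip table so that only the plausible trips are kept.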
The following features were engineered to enhance the predictive power of the model. Many of these features are merged from other sources, e.g., weather data, a holiday API, and the OSM route engine (most of the effort was put into building the feature space rather than finding the best model architecture).
PCA Transformed Coordinates:
pickup_pca0, pickup_pca1: Principal Component Analysis (PCA) was applied to the pickup coordinates (pickup_latitude, pickup_longitude) to create two new features representing the first and second principal components. These components capture the most significant directions of variance in the pickup location data.
dropoff_pca0, dropoff_pca1: Similarly, PCA was applied to the dropoff coordinates (dropoff_latitude, dropoff_longitude) to generate the first and second principal components for the dropoff locations.
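As a rough illustration, a two-component PCA on coordinates can be written with NumPy alone. The helper names `fit_pca_2d` and `transform_pca_2d` are hypothetical, and fitting once on the stacked pickup and dropoff points (so that both share the same axes) is an assumption; the post does not say whether the two were fitted jointly or separately.

```python
import numpy as np

def fit_pca_2d(coords):
    """Return (mean, components) of a PCA fitted on an (n, 2) array of (lat, lon)."""
    coords = np.asarray(coords, dtype=float)
    mean = coords.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(coords - mean, full_matrices=False)
    return mean, vt

def transform_pca_2d(coords, mean, components):
    """Project (lat, lon) pairs onto the principal directions."""
    return (np.asarray(coords, dtype=float) - mean) @ components.T

# Toy coordinates, for illustration only.
pickup = np.array([[40.75, -73.99], [40.76, -73.98], [40.71, -74.01]])
dropoff = np.array([[40.78, -73.96], [40.73, -74.00], [40.77, -73.97]])

mean, comps = fit_pca_2d(np.vstack([pickup, dropoff]))
pickup_pca = transform_pca_2d(pickup, mean, comps)    # columns: pickup_pca0, pickup_pca1
dropoff_pca = transform_pca_2d(dropoff, mean, comps)  # columns: dropoff_pca0, dropoff_pca1
```

With two input dimensions and two components, the transform is a centering plus a rotation (possibly with a reflection), so distances between points are preserved. The benefit is mostly that the new axes align with the dominant directions of the data, which can help tree-based models split on them.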
Distance Features:
distance_haversine: This feature represents the Haversine (great-circle) distance, which is the shortest path between two points along the surface of a sphere and therefore a good approximation on the Earth's surface. It is calculated using the latitude and longitude of the pickup and dropoff points.
distance_dummy_manhattan: This feature approximates the Manhattan distance (or "taxicab" distance) between the pickup and dropoff locations by summing the Haversine distances in the latitude and longitude directions separately.
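Both distances can be sketched with the standard library only. Reporting the result in kilometers and using a mean Earth radius of 6371 km are assumptions made here; the post does not state the units of these features.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed unit: kilometers)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def dummy_manhattan_km(lat1, lon1, lat2, lon2):
    """Sum of the Haversine distances along latitude and longitude separately."""
    north_south = haversine_km(lat1, lon1, lat2, lon1)  # change latitude only
    east_west = haversine_km(lat1, lon1, lat1, lon2)    # change longitude only
    return north_south + east_west
```

The "dummy" Manhattan distance ignores the actual orientation of the street grid, so it is only a cheap proxy for the distance a cab really drives.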
Direction Feature:
direction: This feature represents the compass bearing from the pickup location to the dropoff location. It provides information on the direction of travel.
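One common way to compute such a bearing is the initial great-circle bearing formula, sketched below. Returning the angle in degrees between -180 and 180 (0 for north, 90 for east) is an assumption; the post does not specify the range it uses.

```python
import math

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees within [-180, 180]."""
    lat1, lat2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(lat2)
    x = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x))
```

Because the bearing is circular (-180 and 180 point the same way), some models may benefit from encoding it as a sine and cosine pair instead of a raw angle.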
PCA-based Manhattan Distance:
pca_manhattan: This feature calculates a Manhattan-like distance in the PCA-transformed space. It sums the absolute differences between the PCA components of pickup and dropoff points, providing a different perspective on distance measurement.
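In code, this is a sum of absolute differences per trip. The sketch below assumes the PCA components are already available as arrays with one row per trip; the function name and the toy values are illustrative only.

```python
import numpy as np

def pca_manhattan(pickup_pca, dropoff_pca):
    """L1 distance between pickup and dropoff in PCA space, one value per trip."""
    diff = np.asarray(dropoff_pca, dtype=float) - np.asarray(pickup_pca, dtype=float)
    return np.abs(diff).sum(axis=1)

# Toy PCA coordinates: columns are (pca0, pca1), rows are trips.
pickup_pca = np.array([[0.01, -0.02], [0.00, 0.00]])
dropoff_pca = np.array([[0.03, 0.01], [-0.01, 0.02]])
distances = pca_manhattan(pickup_pca, dropoff_pca)
```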
Geographical Center:
center_latitude, center_longitude: These features represent the geographical midpoint between the pickup and dropoff locations. They provide a sense of the central point of each trip.
Temporal Features:
pickup_weekday: The day of the week on which the pickup occurred, extracted from pickup_datetime.
pickup_hour: The hour of the day when the pickup took place, derived from pickup_datetime.
pickup_minute: The minute of the hour when the pickup occurred, also derived from pickup_datetime.
pickup_dt: The total number of seconds elapsed between the earliest pickup time in the dataset and the pickup time of the trip. This feature provides a continuous time measure.
pickup_week_hour: The hour of the week when the pickup occurred, combining pickup_weekday and pickup_hour into a single feature ranging from 0 to 167.
Peak_Hour: This feature categorizes the pickup hour into three groups based on New York City's typical peak hours for taxi demand:
16:00-19:59 is assigned a value of 2, indicating a peak hour.
20:00-23:59 is assigned a value of 1, indicating a moderate demand period.
All other hours are assigned a value of 0, indicating off-peak hours.
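The temporal features above can be sketched with the standard library. The Monday = 0 weekday convention is an assumption (it is what Python's `datetime.weekday()` returns); the post does not say which convention it uses.

```python
from datetime import datetime

def peak_hour(hour):
    """Map an hour of the day to the three demand groups described above."""
    if 16 <= hour <= 19:
        return 2  # peak
    if 20 <= hour <= 23:
        return 1  # moderate demand
    return 0      # off-peak

def temporal_features(pickup, earliest):
    """Derive the temporal features from a pickup datetime."""
    weekday = pickup.weekday()  # Monday = 0 ... Sunday = 6 (assumed convention)
    return {
        "pickup_weekday": weekday,
        "pickup_hour": pickup.hour,
        "pickup_minute": pickup.minute,
        "pickup_dt": (pickup - earliest).total_seconds(),
        "pickup_week_hour": weekday * 24 + pickup.hour,  # 0..167
        "Peak_Hour": peak_hour(pickup.hour),
    }

features = temporal_features(datetime(2016, 1, 1, 17, 30), datetime(2016, 1, 1))
```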
Clustering-Based Features:
pickup_cluster and dropoff_cluster: These features were created using the KMeans clustering algorithm with 50 clusters, applied to a random sample of the latitude and longitude coordinates. Each pickup and dropoff location was assigned a cluster label, which helps capture common pickup and dropoff zones in the city, potentially identifying high-traffic areas or frequent routes.
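A sketch of this step with scikit-learn's `KMeans` follows. The synthetic coordinates, the sample size, and the `n_init` and `random_state` settings are assumptions made for illustration; only the choice of 50 clusters fitted on a random sample comes from the description above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the real (latitude, longitude) pairs.
coords = np.column_stack([
    rng.uniform(40.60, 40.90, size=5000),
    rng.uniform(-74.05, -73.75, size=5000),
])

# Fit on a random sample to keep training cheap, then label every point.
sample_idx = rng.choice(len(coords), size=2000, replace=False)
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(coords[sample_idx])
cluster_labels = kmeans.predict(coords)  # e.g. pickup_cluster when coords are pickups
```

Using the same fitted model for both pickup and dropoff points keeps the cluster labels comparable between the two features.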
Weather-Related Features:
Weather data was merged with the trip dataset to incorporate external conditions that might affect trip durations. The following weather-related features were added:
temperature: The temperature at the time of the trip, measured in degrees Celsius.
humidity: The humidity percentage at the time of the trip.
wind_dir: The wind direction in degrees at the time of the trip.
wind_speed: The wind speed in meters per second at the time of the trip.
weather_descr: A textual description of the weather conditions at the time of the trip (e.g., clear, rain, snow).
Holiday Indicators:
holidays_today: This feature indicates whether the pickup date (from pickup_datetime) is a public holiday. It is checked against a pre-loaded cache of United States (US) holidays (holiday_cache), taking the specific state into account where applicable. A value of 1 means it is a holiday, while 0 means it is not.
holidays_yesterday: This feature checks if the day before the pickup date (date_yesterday) was a public holiday. It follows the same logic as holidays_today, determining if the prior day being a holiday could impact travel patterns due to extended holidays or altered travel plans.
holidays_tomorrow: This feature indicates if the day after the pickup date (date_tomorrow) is a public holiday. The feature helps capture potential pre-holiday travel patterns, where travel behaviors might change in anticipation of a holiday.
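The three flags can be sketched as simple set lookups. The hard-coded `holiday_cache` below is an illustrative stand-in holding a few 2016 US federal holidays; the post fills its cache from a holiday API, and its exact structure is not shown.

```python
from datetime import date, timedelta

# Illustrative stand-in for the pre-loaded holiday cache (an assumption):
# New Year's Day, Martin Luther King Jr. Day, Presidents' Day, Memorial Day.
holiday_cache = {
    date(2016, 1, 1),
    date(2016, 1, 18),
    date(2016, 2, 15),
    date(2016, 5, 30),
}

def holiday_flags(pickup_date, cache=holiday_cache):
    """Return 0/1 flags for a holiday on, before, and after the pickup date."""
    date_yesterday = pickup_date - timedelta(days=1)
    date_tomorrow = pickup_date + timedelta(days=1)
    return {
        "holidays_today": int(pickup_date in cache),
        "holidays_yesterday": int(date_yesterday in cache),
        "holidays_tomorrow": int(date_tomorrow in cache),
    }
```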
Intersection Features:
nIntersection: The number of intersections encountered along the route. This feature indicates how many times the route crosses other roads, which can affect travel speed and complexity.
Traffic Control Features:
nStop: The number of stop signs along the route. Stops can contribute to delays and variations in travel time.
nCrossing: The number of pedestrian crossings along the route. Crossings may affect the flow of traffic due to frequent stops for pedestrians.
nTrafficSignals: The number of traffic signals present on the route. Traffic signals can impact travel time by causing stops and delays.
Road Type Features:
primary: A boolean feature indicating whether the route includes primary roads. Primary roads are major roads designed for higher traffic volumes and generally have higher speed limits.
secondary: A boolean feature indicating the presence of secondary roads. These roads serve as major connections but are typically less significant than primary roads.
residential: A boolean feature indicating whether the route includes residential streets. Residential streets are usually narrower and may have lower speed limits.
tertiary: A boolean feature indicating the presence of tertiary roads. These roads are less important than primary and secondary roads but still contribute to the route.
trunk: A boolean feature indicating whether the route includes trunk roads. Trunk roads are major roads connecting cities and regions, typically designed for higher speed and heavy traffic.
Route Metrics:
duration: The total time required to travel along the route. This metric helps understand the expected travel time based on the route characteristics.
distance: The total distance covered by the route, measured in meters. This feature provides the length of the route and helps calculate average speeds and other metrics.
Now that we have described most of the features, let's check how they correlate with our target.
Experimental Results: Method Comparison and Insights
Our task is to predict the trip duration. For this task, we will use the features from the previous section, and for the neural networks we will also add some extra features that are geospatial embeddings. For the evaluation, we perform 10-fold cross-validation and report the mean performance below.
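The evaluation protocol can be sketched as follows. The RMSE metric and the least-squares placeholder model are assumptions made to keep the example self-contained; the placeholder is not one of the models compared in this post, and the post does not name the metric it reports.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error (assumed metric, in the units of the target)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def cross_validate(X, y, fit, predict, n_folds=10, seed=0):
    """Mean RMSE over n_folds shuffled, non-overlapping folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(rmse(y[test_idx], predict(model, X[test_idx])))
    return float(np.mean(scores))

# Placeholder model: ordinary least squares with an intercept term.
def fit_linear(X, y):
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef
```

Any model can be plugged in through the `fit` and `predict` callables, which keeps the fold logic identical across the methods being compared.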
For the comparison, we employ the following methods:
KAN: A Kolmogorov-Arnold Network builds on the Kolmogorov-Arnold representation theorem to approximate complex functions and capture intricate, non-linear patterns in data. We use 4 KAN layers of varying size, an RMSE loss, and 15 training epochs.
NeuralNet: A standard feedforward neural network that models non-linear interactions between the features. We use 3 fully connected layers of varying size and 15 training epochs.
XGBoost (Extreme Gradient Boosting): XGBoost is a powerful and efficient gradient boosting algorithm known for its high performance in classification and regression tasks. It builds an ensemble of 100 decision trees sequentially.
OSM Route Engine: The OSM Route Engine uses data from OpenStreetMap to provide detailed route information, including road types, traffic signals, and intersections. This method helps in understanding and predicting travel times by analyzing route-specific features and road conditions, offering insights into the impact of route characteristics on travel duration.
The above figure shows the performance in seconds. The KAN and the neural network are the winners and very close to each other (around 2 minutes of error on average), while XGBoost follows with about 3 minutes of error. Last is the OSM route engine, which is not an ML-based approach, with about 8 minutes of error.
Conclusion
The role of artificial intelligence (AI) in taxi Estimated Time of Arrival (ETA) prediction has proven to be transformative and highly effective. By leveraging advanced machine learning techniques, AI models can analyze vast amounts of historical and real-time data to make accurate predictions about travel times. This capability significantly outperforms traditional methods, providing more precise and reliable ETAs for both drivers and passengers. The integration of AI allows for the consideration of complex variables, such as traffic patterns, weather conditions, and historical trends, which traditional approaches often struggle to incorporate effectively.
AI's ability to continually learn and adapt from new data further enhances its accuracy and robustness in ETA predictions. As the technology evolves, it promises even greater improvements in prediction quality, efficiency, and user satisfaction. Embracing AI in the taxi industry not only optimizes operational efficiency but also elevates the overall passenger experience, demonstrating AI's crucial role in modern transportation solutions. Stay tuned for the exciting journey ahead as deep learning continues to revolutionize the way we navigate urban landscapes!
Stay tuned for more insights and innovations from the world of AI and transportation!
If you are interested in other ML use cases, please contact me using the form and include a publicly available dataset for the case; I'm always curious to explore new problems.