Challenge 23: Predicting Flight Delays with Big Data
Level: Medium
Description: You are a data scientist working at a regional airport authority that’s grappling with a familiar problem: unpredictable flight delays. Every week, thousands of flight records are logged, containing valuable insights about scheduling, departures, arrivals, and delays. However, with such massive volumes of data (over 1 million rows), local data processing struggles to scale.
Therefore, you are tasked with designing a system that predicts flight delays before they happen, powered by distributed computing. Start by creating a local big data environment to handle the heavy data lifting. With the environment in place, load the Parquet dataset of historical flight status records into the Spark context and perform the data cleaning operations that you deem relevant (e.g., column filtering, missing value handling, dimensionality reduction) within the Spark context. With the cleaned and transformed data, train and apply two classification models in parallel, a Decision Tree and a Random Forest, to predict delays. Evaluate both models visually and with scoring metrics, making sure that model training and the computation of the scoring metrics are also performed within the Spark context. Finally, save the best model for future use. Can you help the regional airport authority predict flight delays with an accuracy above 80%?
Beginner-friendly objectives: 1. Set up a local big data environment and load the flight status data from the Parquet file into the Spark context (a minimal code sketch of this step appears after the objectives). 2. Clean and pre-process the data directly within the Spark context, ensuring each step enhances data quality to support effective classifier training.
Intermediate-friendly objectives: 1. Within the Spark context, train a Decision Tree model and a Random Forest model in parallel to predict flight delays, ensuring reproducibility and model robustness. 2. Apply the models and evaluate them using scoring metrics. 3. Also evaluate the models visually, balancing execution efficiency and informativeness. 4. Save the best model for future use in the local big data environment.
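The challenge is meant to be solved with KNIME's Spark integration, but a code-based view can make the objectives more concrete. Below is a minimal PySpark sketch of the first beginner objective, creating a local Spark session and loading the Parquet file; the file path is a placeholder, not part of the challenge data.

```python
from pyspark.sql import SparkSession

# Local Spark context using all available cores, standing in for the
# "Create Local Big Data Environment" step.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("flight-delay-prediction")
    .getOrCreate()
)

# Load the historical flight status records (> 1 million rows) into Spark.
flights = spark.read.parquet("data/flight_status.parquet")  # placeholder path
flights.printSchema()
print(f"Row count: {flights.count()}")
```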
Solution Summary: To solve this challenge, we set up a local big data environment with a Spark context to process flight status data from a large Parquet file (> 1 million rows). All data operations, model training, and scoring are performed within the Spark context, using the Spark nodes. The data undergoes cleaning and pre-processing, including column filtering and missing value handling. Principal Component Analysis (PCA) is applied to reduce dimensionality while retaining essential information. Next, we train two machine learning models in parallel, a Decision Tree and a Random Forest, to predict flight delays. Finally, we evaluate the models using accuracy metrics and ROC curves, and save the best model for future use in the local big data environment.
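Continuing the sketch above, the pre-processing could look roughly as follows in PySpark. The feature and label column names (e.g., DEP_DEL15 as a 0/1 delay label) are assumptions for illustration and should be adapted to the actual schema; the 95% variance threshold mirrors the setting used in the Spark PCA node.

```python
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler, PCA

# Column selection and missing value handling, analogous to the
# Spark Column Filter and Spark Missing Value nodes.
feature_cols = ["CRS_DEP_TIME", "DISTANCE", "DAY_OF_WEEK", "MONTH"]  # assumed columns
label_col = "DEP_DEL15"                                              # assumed 0/1 label
flights = flights.select(feature_cols + [label_col]).dropna()

# Stratified 80/20 split on the target column, seeded for reproducibility
# (analogous to the Spark Partitioning node with stratified sampling).
flights = flights.withColumn("row_id", F.monotonically_increasing_id())
train = flights.sampleBy(label_col, fractions={0: 0.8, 1: 0.8}, seed=42)
test = flights.join(train.select("row_id"), on="row_id", how="left_anti")

# Assemble the features into a vector, probe the explained variance once,
# and keep the smallest number of components that retains ~95% of it.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_vec = assembler.transform(train)
test_vec = assembler.transform(test)

probe = PCA(k=len(feature_cols), inputCol="features", outputCol="pca").fit(train_vec)
cumulative, k = 0.0, 0
for explained in probe.explainedVariance:
    cumulative += explained
    k += 1
    if cumulative >= 0.95:
        break

pca_model = PCA(k=k, inputCol="features", outputCol="pca").fit(train_vec)
train_pca = pca_model.transform(train_vec)
test_pca = pca_model.transform(test_vec)
```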
Solution Details: We begin our solution by using the Create Local Big Data Environment node to set up a local Spark context, configured for processing a large dataset stored in the current workflow data area. With the environment ready, we use the Parquet to Spark node to load the flight status data from the Parquet file. We use the Spark Column Filter node to select the relevant columns for analysis and exclude those that are not needed. Next, we partition the data using stratified sampling on the target column (train: 80%, test: 20%) with the Spark Partitioning node. We remove rows containing missing values with the Spark Missing Value node. To reduce dimensionality while preserving key information, we apply Principal Component Analysis (PCA) with the Spark PCA node, extracting principal components that retain 95% of the data’s variance. We replicate the same processing steps on the test set using the corresponding Spark (Apply) nodes, where available. At this point, the workflow splits into two branches for model training. We train a Decision Tree model with the Spark Decision Tree Learner node and, simultaneously, a Random Forest model with the Spark Random Forest Learner node. We use Spark Predictor (Classification) nodes to apply both models, storing predictions in columns with meaningful names. We then evaluate their performance with the Spark Scorer nodes, achieving around 70% accuracy. For the visual evaluation, we sample 2,500 rows from both prediction datasets using the Spark Row Sampling node with stratified sampling. We transfer the sampled data into KNIME tables via the Spark to Table node and visualize model performance through the ROC Curve node. Finally, we save the best-performing model to the file system of the local big data environment by adding an additional input port to the Model Writer node.
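To round off the sketch, the modeling half of the workflow might look like this in PySpark: training the two classifiers, scoring them within Spark, sampling predictions for a local ROC plot, and persisting the better model. The hyperparameters, sample fractions, and output path are illustrative assumptions rather than the exact node settings.

```python
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Train the two classifiers on the PCA-transformed training data
# (analogous to the Spark Decision Tree / Random Forest Learner nodes).
dt = DecisionTreeClassifier(featuresCol="pca", labelCol=label_col, seed=42)
rf = RandomForestClassifier(featuresCol="pca", labelCol=label_col, numTrees=100, seed=42)
dt_model = dt.fit(train_pca)
rf_model = rf.fit(train_pca)

# Apply both models to the test set and compute accuracy inside Spark,
# analogous to the Spark Predictor (Classification) and Spark Scorer nodes.
dt_pred = dt_model.transform(test_pca)
rf_pred = rf_model.transform(test_pca)
accuracy = MulticlassClassificationEvaluator(
    labelCol=label_col, predictionCol="prediction", metricName="accuracy"
)
dt_acc = accuracy.evaluate(dt_pred)
rf_acc = accuracy.evaluate(rf_pred)
print(f"Decision Tree accuracy: {dt_acc:.3f} | Random Forest accuracy: {rf_acc:.3f}")

# For the ROC curve, pull a small stratified sample of predictions to the
# driver (the workflow samples 2,500 rows) and plot it with a local tool.
roc_sample = (
    rf_pred.select(label_col, "probability")
    .sampleBy(label_col, fractions={0: 0.01, 1: 0.01}, seed=42)
    .toPandas()
)

# Persist the better-performing model for future use (illustrative path).
best_model = rf_model if rf_acc >= dt_acc else dt_model
best_model.write().overwrite().save("models/flight_delay_best_model")
```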