
JKI4_023_Predicting_Flight_Delays_with_Big_Data

KNIMEST #zassou.syk #KD勉強会 JKISeason4-23
Draft. Latest edits on Oct 19, 2025 2:59 AM

Challenge 23: Predicting Flight Delays with Big Data

Level: Medium

Description: You are a data scientist working at a regional airport authority that is grappling with a familiar problem: unpredictable flight delays. Every week, thousands of flight records are logged, containing valuable insights about scheduling, departures, arrivals, and delays. However, at this volume (more than 1 million rows), local data processing struggles to scale. You are therefore tasked with building a predictive system powered by distributed computing, for example by creating a local big data environment to handle the heavy data lifting. With the environment in place, load the Parquet dataset of historical flight status records into the Spark context and perform the data cleaning operations that you deem relevant within the Spark context. With the cleaned and transformed data, train and apply two classification models in parallel, a Decision Tree and a Random Forest, to predict delays. Score both models visually and using scoring metrics. Make sure that model training and the computation of scoring metrics are performed within the Spark context. Finally, save the best model for future use. Can you help the regional airport authority predict flight delays with an accuracy above 80%?

Beginner-friendly objectives: 1. Set up a local big data environment and load the flight status data from the Parquet file into the Spark context. 2. Clean and pre-process the data directly within the Spark context, ensuring each step enhances data quality to support effective classifier training.
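
For readers who think in code rather than nodes, the two beginner objectives can be approximated in PySpark. This is only a rough sketch and not the workflow's implementation, which uses KNIME nodes. The column names, the 15-minute delay threshold, and the function names are all assumptions made for illustration, since the challenge text does not describe the Parquet schema.

```python
from typing import Iterable, List

# Hypothetical column names: the real schema of the flight data is not
# stated in the challenge description.
REQUIRED_COLUMNS = ["DepDelay", "Origin", "Dest", "DayOfWeek"]

# Assumed definition of "delayed": departure more than 15 minutes late.
DELAY_THRESHOLD_MIN = 15


def missing_columns(available: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required columns that are absent, in the order requested."""
    have = set(available)
    return [name for name in required if name not in have]


def load_and_clean(parquet_path: str):
    """Start a local Spark session, load the Parquet file, and clean it.

    Returns the session and the cleaned Spark DataFrame. All processing
    stays inside Spark; nothing is collected to the driver.
    """
    # Imported lazily so this module can be loaded without pyspark installed.
    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder
        .master("local[*]")  # local mode, using all available cores
        .appName("flight-delays-sketch")
        .getOrCreate()
    )

    df = spark.read.parquet(parquet_path)

    # Fail early if the assumed columns are not present in the file.
    missing = missing_columns(df.columns, REQUIRED_COLUMNS)
    if missing:
        raise ValueError(f"Parquet file lacks expected columns: {missing}")

    cleaned = (
        df.dropDuplicates()                # remove exact duplicate records
          .dropna(subset=REQUIRED_COLUMNS) # drop rows missing key fields
          .withColumn(                     # derive a binary target column
              "Delayed",
              (F.col("DepDelay") > DELAY_THRESHOLD_MIN).cast("int"),
          )
    )
    return spark, cleaned
```

With pyspark installed, a call such as `spark, flights = load_and_clean("flights.parquet")` (the path is a placeholder) would return a DataFrame ready for model training.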

Intermediate-friendly objectives: 1. Within the Spark context, train in parallel a Decision Tree model and a Random Forest model to predict flight delays, ensuring reproducibility and model robustness. 2. Apply the models and evaluate them using scoring metrics. 3. Also evaluate the models visually, balancing execution efficiency and informativeness. 4. Save the best model for future use in the local big data environment.
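
The arithmetic behind the scoring metrics and the choice of the best model can be illustrated in plain Python. In the workflow these metrics are computed inside the Spark context; the sketch below only shows what the numbers mean. The confusion-matrix counts are invented for illustration and are not results from this workflow, and the names `Confusion`, `pick_best`, and `TARGET_ACCURACY` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Target stated in the challenge: accuracy above 80%.
TARGET_ACCURACY = 0.80


@dataclass(frozen=True)
class Confusion:
    """Binary confusion matrix, with 'delayed' as the positive class."""
    tp: int  # delayed flights predicted as delayed
    fp: int  # on-time flights predicted as delayed
    fn: int  # delayed flights predicted as on time
    tn: int  # on-time flights predicted as on time

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn


def accuracy(c: Confusion) -> float:
    return (c.tp + c.tn) / c.total if c.total else 0.0


def precision(c: Confusion) -> float:
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Confusion) -> float:
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f1(c: Confusion) -> float:
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def pick_best(results: Dict[str, Confusion]) -> Tuple[str, float]:
    """Return the name and accuracy of the most accurate model."""
    name = max(results, key=lambda n: accuracy(results[n]))
    return name, accuracy(results[name])


# Made-up counts on a hypothetical 1000-row test set.
example_results = {
    "decision_tree": Confusion(tp=300, fp=120, fn=100, tn=480),
    "random_forest": Confusion(tp=330, fp=80, fn=70, tn=520),
}

best_name, best_accuracy = pick_best(example_results)
meets_target = best_accuracy > TARGET_ACCURACY
print(best_name, best_accuracy, meets_target)
```

Accuracy alone can mislead when delayed flights are a minority of the records, which is why the sketch also exposes precision, recall, and F1 for comparison.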

External resources

  • Deciphering Air Travel Disruptions: A Machine Learning Approach
  • KNIME Extension for Apache Spark
  • 04 Model Building on Big Data - Solution
  • Building a Predictive Model on Big Data
  • Flight delay data on KNIME Community Hub

Used extensions & nodes

Created with KNIME Analytics Platform version 5.5.1
  • KNIME Base nodes (trusted extension), KNIME AG, Zurich, Switzerland, version 5.5.1
  • KNIME Extension for Apache Spark (trusted extension), KNIME AG, Zurich, Switzerland, version 5.5.1
  • KNIME Extension for Big Data File Formats (trusted extension), KNIME AG, Zurich, Switzerland, version 5.5.1
  • KNIME Extension for Local Big Data Environments (trusted extension), KNIME AG, Zurich, Switzerland, version 5.5.1
  • KNIME JavaScript Views (trusted extension), KNIME AG, Zurich, Switzerland, version 5.5.0
  • KNIME Statistics Nodes (trusted extension), KNIME AG, Zurich, Switzerland, version 5.5.0

