
Cleaning the NYC taxi dataset on Spark

Tags: Big data, Exploration, Visualization, Interactive, Webportal (+2 more)
Draft. Latest edits on Sep 5, 2018 10:56 AM
This workflow handles the preprocessing of the NYC taxi dataset (loading, cleaning, filtering, etc.). The dataset contains over 1 billion taxi trips in New York City between January 2009 and December 2017 and is provided by the NYC Taxi and Limousine Commission (TLC) [1]. It covers not only the regular yellow cabs, but also green taxis, which started in August 2013, and For-Hire Vehicles (e.g. Uber), starting from January 2015. Each taxi trip is recorded with information such as the pickup and dropoff locations, datetime, number of passengers, trip distance, fare amount, tip amount, etc.

Since the dataset was first published, the TLC has made several changes to it, e.g. renaming, adding, and removing some columns. Therefore, some preprocessing is needed before the data can be loaded into the database.

The goal of this workflow is to get the dataset from [1] and load it onto Spark for preprocessing. The preprocessing includes unifying the columns (names, values, datatypes), reverse geocoding (assigning GPS coordinates or location IDs to their corresponding taxi zones), and filtering out negative values that don't make sense. At the end, the cleaned data is stored in an Amazon S3 bucket in Parquet format, ready for further analysis.

[1] http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
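The column unification and negative-value filtering described above are done with KNIME Spark nodes in the workflow itself. As a rough illustration of the logic only, here is a minimal plain-Python sketch. The alias table, the list of numeric columns, and the function names are assumptions made for this example and are not taken from the workflow's configuration.

```python
from typing import Dict, Iterable, Iterator, Optional

# Hypothetical alias table: schema-specific column names -> unified names.
# These entries are illustrative assumptions, not the workflow's actual mapping.
COLUMN_ALIASES = {
    "trip_pickup_datetime": "pickup_datetime",
    "tpep_pickup_datetime": "pickup_datetime",
    "lpep_pickup_datetime": "pickup_datetime",
    "fare_amt": "fare_amount",
    "tip_amt": "tip_amount",
}

# Columns assumed to be numeric and never legitimately negative.
NUMERIC_COLUMNS = ("trip_distance", "fare_amount", "tip_amount")


def unify_record(raw: Dict[str, str]) -> Dict[str, Optional[object]]:
    """Rename columns to the unified schema and cast numeric fields to float."""
    unified: Dict[str, Optional[object]] = {}
    for name, value in raw.items():
        key = name.strip().lower()
        unified[COLUMN_ALIASES.get(key, key)] = value
    for col in NUMERIC_COLUMNS:
        value = unified.get(col)
        # Missing or empty values become None instead of failing the cast.
        unified[col] = None if value in (None, "") else float(value)
    return unified


def is_plausible(record: Dict[str, Optional[object]]) -> bool:
    """Reject records where any numeric column holds a negative value."""
    return all(record[c] is None or record[c] >= 0 for c in NUMERIC_COLUMNS)


def clean(records: Iterable[Dict[str, str]]) -> Iterator[Dict[str, Optional[object]]]:
    """Unify each record, then keep only the plausible ones."""
    for raw in records:
        record = unify_record(raw)
        if is_plausible(record):
            yield record
```

On Spark, the same idea would typically be expressed as column renames, casts, and a row filter on a DataFrame rather than as a per-record Python loop.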

External resources

  • Interactive Big Data Exploration and Visualization

Used extensions & nodes

Created with KNIME Analytics Platform version 4.0.0
  • KNIME Amazon Cloud Connectors (trusted extension)
    KNIME AG, Zurich, Switzerland
    Version 3.7.0
    knime
  • KNIME Core (trusted extension)
    KNIME AG, Zurich, Switzerland
    Version 4.0.0
    knime
  • KNIME Extension for Apache Spark (trusted extension)
    KNIME AG, Zurich, Switzerland
    Version 2.4.0
    knime
  • KNIME Shapefile Support
    Federal Institute for Risk Assessment (BfR)
    Version 1.5.0
    cthoens
  • KNIME XML-Processing (trusted extension)
    KNIME AG, Zurich, Switzerland
    Version 4.0.0
    knime
  • Palladian for KNIME (unknown extension)
    This is an unpublished or unknown extension.
    palladian.ws; Philipp Katz, Klemens Muthmann, David Urbansky.
    Version 1.7.0

Legal

By using or downloading the workflow, you agree to our terms and conditions.
