Cleaning the NYC taxi dataset on Spark

Big data Exploration Visualization Interactive Webportal
This workflow handles the preprocessing of the NYC taxi dataset (loading, cleaning, filtering, etc.). The NYC taxi dataset contains over 1 billion taxi trips in New York City between January 2009 and December 2017 and is provided by the NYC Taxi and Limousine Commission (TLC) [1]. It covers not only the regular yellow cabs, but also green taxis, which started in August 2013, and For-Hire Vehicles (e.g., Uber), starting from January 2015. Each taxi trip is recorded with information such as the pickup and dropoff locations, date and time, number of passengers, trip distance, fare amount, and tip amount.

Since the dataset was first published, the TLC has changed its schema several times, e.g., by renaming, adding, and removing columns. The data therefore needs some preprocessing before it can be loaded into the database. The goal of this workflow is to fetch the dataset from [1] and load it onto Spark for preprocessing. The preprocessing includes unifying the columns (names, values, data types), reverse geocoding (assigning GPS coordinates or location IDs to their corresponding taxi zones), and filtering out negative values that do not make sense. At the end, the cleaned data is stored in an Amazon S3 bucket in Parquet format, ready for further analysis.

[1] http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
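The three cleaning steps described above (unifying columns, reverse geocoding, and filtering negative values) can be illustrated with a minimal plain-Python sketch. This is not the workflow's actual implementation, which runs as KNIME nodes on Spark. The column aliases, the `NUMERIC_COLUMNS` list, the helper names, and the square "zone" polygon are all assumptions made for illustration, and the ray-casting test stands in for whatever zone lookup the workflow really uses.

```python
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]

# Hypothetical mapping from older column spellings to one unified name.
COLUMN_ALIASES: Dict[str, str] = {
    "trip_pickup_datetime": "pickup_datetime",
    "tpep_pickup_datetime": "pickup_datetime",
    "fare_amt": "fare_amount",
    "tip_amt": "tip_amount",
    "start_lon": "pickup_longitude",
    "start_lat": "pickup_latitude",
}

# Columns cast to float during unification (illustrative selection).
NUMERIC_COLUMNS = (
    "trip_distance", "fare_amount", "tip_amount",
    "pickup_longitude", "pickup_latitude",
)


def unify_record(raw: Dict[str, str]) -> Dict[str, object]:
    """Lower-case column names, apply aliases, and cast numeric fields."""
    unified: Dict[str, object] = {}
    for name, value in raw.items():
        key = name.strip().lower()
        key = COLUMN_ALIASES.get(key, key)
        unified[key] = float(value) if key in NUMERIC_COLUMNS else value
    return unified


def is_plausible(record: Dict[str, object]) -> bool:
    """Reject trips whose distance, fare, or tip is negative."""
    checked = ("trip_distance", "fare_amount", "tip_amount")
    return all(record.get(column, 0.0) >= 0 for column in checked)


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray-casting test: count polygon edges crossed by a ray going right."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_zone(point: Point, zones: Dict[str, Sequence[Point]]) -> Optional[str]:
    """Return the name of the first zone containing the point, if any."""
    for name, polygon in zones.items():
        if point_in_polygon(point, polygon):
            return name
    return None
```

A record would first pass through `unify_record`, then be kept only if `is_plausible` returns `True`, and finally have its pickup coordinates mapped to a zone with `assign_zone`. On Spark, the same logic would typically be expressed as column renames, casts, filters, and a spatial join rather than per-record Python functions.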

External resources

  • Interactive Big Data Exploration and Visualization

Used extensions & nodes

Created with KNIME Analytics Platform version 4.0.0
  • KNIME Amazon Cloud Connectors (trusted extension), KNIME AG, Zurich, Switzerland, version 3.7.0
  • KNIME Core (trusted extension), KNIME AG, Zurich, Switzerland, version 4.0.0
  • KNIME Extension for Apache Spark (trusted extension), KNIME AG, Zurich, Switzerland, version 2.4.0
  • KNIME Shapefile Support, Federal Institute for Risk Assessment (BfR), version 1.5.0
  • KNIME XML-Processing (trusted extension), KNIME AG, Zurich, Switzerland, version 4.0.0
  • Palladian for KNIME (unpublished or unknown extension), palladian.ws; Philipp Katz, Klemens Muthmann, David Urbansky, version 1.7.0
