Incremental Data Processing with Parquet

In this workflow, we use the NYC taxi dataset to showcase continuous preprocessing and publishing of event data. In place of the Group Loop Start node, this workflow could be executed once per week to preprocess and publish all data that arrived during that week. Each run writes its result as a separate Parquet file within the same folder. To ensure that the file name is unique for each run, we use the year and week of the run as the file prefix, which is set via a flow variable. Since the folder stays the same and Parquet readers pick up all files within a folder regardless of their file names, this folder can be exposed as an external table (e.g. in Hive or Impala) to power further analysis processes.
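
Outside of KNIME, the same naming scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workflow's actual implementation: the function names, the `taxi_events` file suffix, the `YYYY_WW` prefix layout, and the use of the ISO week are all hypothetical choices, and the write and read helpers assume the `pyarrow` package is installed.

```python
from datetime import date
from pathlib import Path


def weekly_prefix(run_date: date) -> str:
    """Build a 'YYYY_WW' prefix from the run date (assumption: ISO year and ISO week)."""
    iso_year, iso_week, _ = run_date.isocalendar()
    return f"{iso_year}_{iso_week:02d}"


def weekly_file_path(folder: Path, run_date: date) -> Path:
    """Return the target file for one run; the 'taxi_events' suffix is hypothetical."""
    return folder / f"{weekly_prefix(run_date)}_taxi_events.parquet"


def write_weekly_batch(columns: dict, folder: Path, run_date: date) -> Path:
    """Write one week's preprocessed data as its own Parquet file in the shared folder."""
    # pyarrow is imported lazily so the naming helpers work without it.
    import pyarrow as pa
    import pyarrow.parquet as pq

    folder.mkdir(parents=True, exist_ok=True)
    target = weekly_file_path(folder, run_date)
    pq.write_table(pa.table(columns), str(target))
    return target


def read_all_batches(folder: Path):
    """Read every Parquet file in the folder as one table, regardless of file names."""
    import pyarrow.parquet as pq

    return pq.read_table(str(folder))
```

Because the prefix changes with every week, a weekly run never overwrites an earlier file, while a reader pointed at the folder sees the union of all runs. Running the job twice within the same week would produce the same file name and replace that week's file.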

External resources

KNIME File Handling Guide

Legal

By using or downloading the workflow, you agree to our terms and conditions.