In this workflow, we use the NYC taxi dataset to showcase continuous preprocessing and publishing of event data. Instead of using the Group Loop Start node, this workflow could be executed once per week to preprocess and publish all data that has arrived during the week. The result of each run is written as a separate Parquet file within the same folder. To ensure that each run produces a uniquely named file, we use the year and week of the run as a file prefix, set via a flow variable. Since the folder stays the same and Parquet readers process all files within a folder regardless of their names, the folder can be exposed as an external table (e.g. in Hive or Impala) to power further analysis processes.
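The naming scheme can be sketched outside KNIME as well. The snippet below (a minimal Python sketch; folder path, file suffix, and run date are illustrative assumptions, not taken from the workflow) derives the ISO year and week of a run date as a file prefix, which is what the flow variable provides in the workflow:

```python
import datetime

def run_output_path(folder: str, run_date: datetime.date) -> str:
    # ISO year and week number make the file name unique per weekly run,
    # mirroring the year/week prefix set via the flow variable.
    iso = run_date.isocalendar()  # (ISO year, ISO week, weekday)
    return f"{folder}/{iso[0]}_week{iso[1]:02d}_trips.parquet"

# A run in the first ISO week of 2021:
print(run_output_path("/data/taxi_out", datetime.date(2021, 1, 4)))
# → /data/taxi_out/2021_week01_trips.parquet
```

Because Hive, Impala, and most Parquet readers treat every file in a folder as part of one table regardless of file name, each weekly run simply adds a new file to the table backing the external table.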
Used extensions & nodes
Created with KNIME Analytics Platform version 4.3.0
By using or downloading the workflow, you agree to our terms and conditions.