Splitting a table into two outputs containing a set number or percentage of records is a common step, especially when preparing data for predictive modeling. A typical example is splitting a table at random into two tables holding 70% and 30% of the original records, frequently referred to as the 'training' and 'testing' sets. The Partitioning node is the easiest way to do this.
There are two ways to specify how many records flow into the first output; the remaining records flow into the second:
- Absolute: You choose a specific number of records
- Relative: You choose a specific percentage of records
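The two sizing options can be illustrated outside KNIME with a few lines of plain Python. This is only a sketch, not the node's implementation: the function name `first_partition_size` and its parameters are hypothetical, the rounding rule for the relative case is an assumption, and it assumes the chosen size applies to the first output with the rest going to the second.

```python
def first_partition_size(n_rows, absolute=None, relative=None):
    """Return how many rows would go to the first output.

    Hypothetical helper: exactly one of `absolute` (a row count) or
    `relative` (a percentage, 0-100) must be given.
    """
    if (absolute is None) == (relative is None):
        raise ValueError("specify exactly one of absolute or relative")
    if absolute is not None:
        size = absolute
    else:
        # Assumption: round to the nearest whole row.
        size = round(n_rows * relative / 100)
    # Never return fewer than zero rows or more rows than the table has.
    return max(0, min(n_rows, size))


n_rows = 1000
train_rows = first_partition_size(n_rows, relative=70)
test_rows = n_rows - train_rows  # everything else goes to the second output
```

With a 1000-row table and a relative size of 70, this gives the 70%/30% split described above.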
Once you determine how many records to pass through each output port, there are four methods by which records can be chosen:
- Take from top: The specified number or percentage of records is taken starting from the first record and working down.
- Linear sampling: Always includes the first and last rows, then selects the remaining records at regular intervals across the table (for example, every Nth record), according to the size chosen above (absolute/relative).
- Draw randomly: Records are chosen at random by a random number generator. Set a specific random seed to make the selection reproducible.
- Stratified: Select a column, and each output will approximately match the distribution of values in that column.
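To make the four selection methods concrete, here is a rough plain-Python sketch of each one. It is not how the Partitioning node is implemented; all function names are hypothetical, the rows are ordinary Python lists, `k` is assumed to satisfy 0 <= k <= len(rows), and the stratified version rounds each group's share independently, which is an assumption.

```python
import random
from collections import defaultdict


def _split(rows, picked):
    """Split rows into (picked, not picked), keeping the original order."""
    first = [r for i, r in enumerate(rows) if i in picked]
    second = [r for i, r in enumerate(rows) if i not in picked]
    return first, second


def take_from_top(rows, k):
    # The first k rows go to the first output, the rest to the second.
    return list(rows[:k]), list(rows[k:])


def linear_sampling(rows, k):
    n = len(rows)
    if k <= 0:
        return [], list(rows)
    if k == 1:
        return _split(rows, {0})
    if k >= n:
        return list(rows), []
    # Evenly spaced indexes that always include row 0 and row n - 1.
    picked = {(i * (n - 1)) // (k - 1) for i in range(k)}
    return _split(rows, picked)


def draw_randomly(rows, k, seed=None):
    # A fixed seed makes the same rows come out on every run.
    rng = random.Random(seed)
    picked = set(rng.sample(range(len(rows)), k))
    return _split(rows, picked)


def stratified(rows, key, fraction, seed=None):
    # Sample the same fraction from every group so that both outputs
    # roughly keep the distribution of key(row).
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[key(row)].append(i)
    picked = set()
    for indexes in groups.values():
        picked.update(rng.sample(indexes, round(len(indexes) * fraction)))
    return _split(rows, picked)
```

For example, `linear_sampling(list(range(10)), 4)` puts rows 0, 3, 6 and 9 in the first output, and calling `draw_randomly` twice with the same seed returns the same split.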
Workflow
Partitioning
Used extensions & nodes
Created with KNIME Analytics Platform version 4.7.2