This Component automatically trains supervised machine learning models for regression. The component is able to automate the whole ML cycle by performing some data preparation, parameter optimization with cross validation, scoring, evaluation and selection. The component also captures the entire end-to-end process and outputs the deployment workflow using the KNIME Integrated Deployment Extension.
For solving an ML classification task, check instead the “AutoML” component (kni.me/c/33fQGaQzuZByy6hE).
STEP-BY-STEP GUIDE:
- Drag&drop the Component from KNIME Hub to KNIME Analytics Platform.
- Connect with your data table of features and target column. Consider using a subsample first.
- IMPORTANT! Execute all up-stream nodes.
- Double-click the Component to open its dialog and adjust the settings.
- Save your settings with “OK” and execute the Component.
- Wait for models to train, tune, validate, etc. and the best one to be selected and exported.
- Connect the Workflow Executor/Writer node to the Component output to reuse the model.
- (OPTIONAL) Right-click the Component and choose “Component” > “Open” to inspect the implementation and customize it.
- (IF PREVIOUSLY ENABLED) Right-click the Component and choose “Open Interactive View: AutoML” to inspect all trained models. Selecting a model manually (with “Apply&Close” in the controls at the bottom right corner of the local View) unfortunately requires training all models again.
DATA PREPARATION:
Before the models are trained, the data is cleaned by replacing missing values with the most frequent value for categorical columns and with the mean for numerical columns. Optionally, the categorical data can be one-hot encoded, and columns with too many unique values are removed based on a user-defined parameter. Numerical features and the target are all converted to double and normalized using Z-score normalization. The data is automatically split into train and test partitions using stratified sampling on the target column and an 80% split. The data preparation models are stored for deployment, both for pre-processing and post-processing the data around the model predictor.
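The imputation and normalization steps described above can be sketched in plain Python. This is only an illustration of the idea, not the Component's actual implementation: the function names and sample values are hypothetical, and the use of the population standard deviation is an assumption.

```python
from collections import Counter
from statistics import mean, pstdev

def impute_numeric(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = float(mean(observed))
    return [fill if v is None else float(v) for v in values]

def impute_categorical(values):
    """Replace None with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

def fit_zscore(values):
    """Return (mean, std) so the same transform can be reused in deployment."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # assumption: fall back to 1.0 for constant columns
    return mu, sigma

def apply_zscore(values, mu, sigma):
    """Pre-processing: map values to zero mean and unit variance."""
    return [(v - mu) / sigma for v in values]

def invert_zscore(values, mu, sigma):
    """Post-processing: map normalized values back to the original scale."""
    return [v * sigma + mu for v in values]

# Hypothetical columns, made up for illustration only.
ages = impute_numeric([20, None, 40])
colors = impute_categorical(["red", None, "red", "blue"])
mu, sigma = fit_zscore(ages)
normalized = apply_zscore(ages, mu, sigma)
restored = invert_zscore(normalized, mu, sigma)
```

Keeping the fitted mean and standard deviation separate from the transform is what allows the same parameters to be applied again to new data, and inverted on the predictions, once the model is deployed.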
MODEL TRAINING:
Each model has a number of parameters to be tuned using cross validation and the user-defined evaluation metric on train data. The extent of the parameter optimization, the optimization strategy as well as other settings of the model can be changed directly in the Component.
- Regression Tree: trained with optimized parameter “Min number records per node”
- Linear Regression: trained with default parameters
- Polynomial Regression: trained with optimized parameter “Polynomial degree”
- H2O Generalized Linear Model: trained with the KNIME H2O Machine Learning Integration with optimized parameters “alpha” and “lambda”
- XGBoost Linear Ensemble: trained with optimized parameters “alpha” and “lambda”
- XGBoost Tree Ensemble: trained with optimized parameters “eta” and “max depth”
- Gradient Boosted Trees: trained with optimized parameter “Number of trees”
- Random Forest: trained with optimized parameters “Tree Depth”, “Number of models” and “Minimum child node size”
- Deep Learning (Keras): trained with KNIME Deep Learning - Keras Integration with no parameter optimization and a simple architecture for regression determined with a few simple heuristics.
- H2O AutoML: trained with the KNIME H2O Machine Learning Integration, which uses H2O AutoML to train a group of models and select the best one
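Parameter optimization with cross validation can be illustrated with a small, self-contained sketch. It does not reproduce any of the learners listed above: a toy one-feature ridge regression stands in for the model, mean absolute error stands in for the user-defined metric, and all names and candidate values are hypothetical.

```python
from statistics import mean

def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, valid
        start += size

def fit_ridge_slope(xs, ys, lam):
    """Toy one-feature ridge regression without intercept (stand-in model)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mean_absolute_error(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def tune_lambda(xs, ys, candidates, k=3):
    """Return (best_lambda, cv_error) with the lowest mean error across folds."""
    results = []
    for lam in candidates:
        fold_errors = []
        for train, valid in k_fold_indices(len(xs), k):
            slope = fit_ridge_slope([xs[i] for i in train], [ys[i] for i in train], lam)
            preds = [slope * xs[i] for i in valid]
            fold_errors.append(mean_absolute_error([ys[i] for i in valid], preds))
        results.append((mean(fold_errors), lam))
    best_error, best_lam = min(results)
    return best_lam, best_error

# Hypothetical training data: y is exactly 2 * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]
best_lam, best_error = tune_lambda(xs, ys, candidates=[0.0, 1.0, 10.0])
```

Each candidate value is scored only on rows that were held out of that fold's training, so the chosen parameter is not simply the one that best memorizes the training data.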
MODEL SCORING AND SELECTION:
After the training of the specified models is completed and all models are stored in a single table, the system applies each model to the test set. The predictions of all models are scored against the ground truth and several performance metrics are computed. The best model is selected using the performance metric specified by the user.
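The scoring and selection logic can be sketched as follows. The metric names, the dictionary layout and the prediction values are assumptions made for illustration; the Component's own metric list may differ. The one detail worth noting is that some metrics are better when higher (such as R²) and others when lower (such as RMSE).

```python
from math import sqrt

def rmse(y_true, y_pred):
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    avg = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - avg) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical metric registry: name -> (function, higher_is_better)
METRICS = {"RMSE": (rmse, False), "MAE": (mae, False), "R2": (r_squared, True)}

def rank_models(y_true, predictions_by_model, metric_name):
    """Score every model on the test set and rank them, best first."""
    metric, higher_is_better = METRICS[metric_name]
    scores = {name: metric(y_true, preds) for name, preds in predictions_by_model.items()}
    ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
    return ranked, scores

# Hypothetical test-set ground truth and predictions, made up for illustration.
y_test = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "Linear Regression": [1.0, 2.0, 3.0, 4.5],
    "Random Forest": [1.5, 2.5, 2.5, 3.5],
}
ranked_rmse, scores_rmse = rank_models(y_test, predictions, "RMSE")
ranked_r2, scores_r2 = rank_models(y_test, predictions, "R2")
```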
DEPLOYMENT WORKFLOW:
The data pre-processing, the best model and the data post-processing are captured via the KNIME Integrated Deployment Extension. The end-to-end encapsulated workflow is provided at the output of the Component and it can be used to score raw new data in deployment. Connect to the Workflow Writer node or the Workflow Executor node to reuse the trained model wherever needed.
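Conceptually, the captured workflow chains three stages: pre-processing, prediction and post-processing. The sketch below mimics that chain as a single Python callable for one numeric feature. It is an analogy only, not how the KNIME Integrated Deployment Extension works internally, and every name and number in it is hypothetical.

```python
def make_scorer(feature_mu, feature_sigma, target_mu, target_sigma, predict):
    """Bundle pre-processing, model and post-processing into one callable."""
    def score(raw_value):
        normalized = (raw_value - feature_mu) / feature_sigma      # pre-processing
        normalized_prediction = predict(normalized)                # model predictor
        return normalized_prediction * target_sigma + target_mu    # post-processing
    return score

# Hypothetical stored statistics and an identity "model" on the normalized scale.
scorer = make_scorer(10.0, 2.0, 100.0, 20.0, predict=lambda z: z)
result = scorer(12.0)
```

Because the normalization statistics travel with the model, raw new data can be scored without repeating the training-time preparation by hand.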
AUTOML OUTPUT METADATA:
The Component additionally outputs flow variables for advanced users.
- "metric_auto" (String) : the name of the user-defined performance metric.
- "target_column" (String) : the name of the user-defined target column.
- "exported_model" (String) : the best model that was selected.
- "exported_model_params" (String Array) : list of the optimized parameters names and values for the exported model.
- "trained_models" (String Array) : list of all the selected models that were successfully trained and ranked by "metric_auto" metric.
- "trained_metrics" (Double Array) : list of the "metric_auto" metrics for all “trained_models”.
- "failed_models" (String Array) : list of all selected models that failed during training or testing.
- "extreme_preds_models" (String Array) : models that had at least one prediction out of range are additionally listed here when the “Remove Extreme Predictions” setting is on.
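As an illustration of how these flow variables might be consumed downstream, the sketch below pairs each trained model with its metric. It assumes that "trained_models" and "trained_metrics" are index-aligned, which the descriptions above suggest but do not state outright. All values are made up.

```python
# Hypothetical flow variable values, made up for illustration only.
flow_variables = {
    "metric_auto": "RMSE",
    "trained_models": ["Random Forest", "Linear Regression"],
    "trained_metrics": [0.41, 0.57],
    "failed_models": ["Deep Learning (Keras)"],
}

def build_leaderboard(variables):
    """Pair each trained model with its metric, assuming index-aligned arrays."""
    models = variables["trained_models"]
    metrics = variables["trained_metrics"]
    if len(models) != len(metrics):
        raise ValueError("trained_models and trained_metrics differ in length")
    return list(zip(models, metrics))

leaderboard = build_leaderboard(flow_variables)
```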
- Type: Table. Data to Train and Test Models: a KNIME Table with data rows containing input features and ground truth.