- Type: Table. Data to Train and Test Models. A KNIME Table with data rows with input features and ground truth.
This Component automatically trains supervised machine learning models for both binary and multiclass classification. The component is able to automate the whole ML cycle by performing some data preparation, parameter optimization with cross validation, scoring, evaluation and selection. The component also captures the entire end-to-end process and outputs the deployment workflow using the KNIME Integrated Deployment Extension.
For an ML regression task, use the “AutoML (Regression)” component instead (kni.me/c/5kzQcySUa8oukv0Y).
STEP-BY-STEP GUIDE:
- Drag and drop the Component from KNIME Hub to KNIME Analytics Platform.
- Connect it to your data table of features and target column. Consider using a subsample first.
- IMPORTANT! Execute all upstream nodes.
- Double-click the Component to open its dialog.
- Save your settings with “OK” and execute the Component.
- Wait while the models are trained, tuned and validated, and the best one is selected and exported.
- Connect the Workflow Executor/Writer node to the Component output to reuse the model.
- (OPTIONAL) Right-click the Component: “Component” > “Open” to inspect our implementation and customize it.
- (IF PREVIOUSLY ENABLED) Right-click the Component: “Open Interactive View: AutoML” to inspect all trained models. Selecting one manually (with “Apply&Close” in the bottom-right corner controls of the local view) unfortunately requires training all models again.
DATA PREPARATION:
Before the models are trained, the data is cleaned: missing values are replaced with the most frequent value in categorical columns and with the mean in numerical columns. Optionally the categorical data can be one-hot encoded, and columns with too many unique values are removed based on a user-defined parameter. Numerical features are all converted to double and normalized using Z-score normalization. The data is automatically split into train and test partitions using stratified sampling on the target class and an 80% split. The data preparation models are stored for deployment, so that data can be pre-processed and post-processed around the model predictor.
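The cleaning, normalization and splitting steps above can be illustrated with a minimal Python sketch. It is not the Component's implementation (that is built from KNIME nodes); the function names, the use of the population standard deviation, the random seed and the reading of "80% split" as 80% of each class going to the train partition are all assumptions made for illustration. For brevity the sketch normalizes a whole column at once, whereas a deployable pipeline would store the mean and standard deviation computed on the train partition and reuse them on new data.

```python
import random
import statistics
from collections import Counter, defaultdict


def impute(values):
    """Fill None with the mean (numeric column) or the most frequent value (categorical column)."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = statistics.mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def z_score(values):
    """Convert to float and rescale to zero mean and unit standard deviation."""
    values = [float(v) for v in values]
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices), sampling each target class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)
```

Because each class is shuffled and cut separately, the class proportions of the target column are approximately preserved in both partitions, which is the point of stratified sampling.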
MODEL TRAINING:
Each model has a number of parameters that are tuned on the train data using cross validation and the user-defined evaluation metric. The extent of the parameter optimization, the optimization strategy and other model settings can be changed directly in the Component.
- Naive Bayes: trained with optimized parameter “Default probability”.
- Logistic Regression: trained with optimized parameter “Step size”.
- Neural Network: an Rprop Multi-layer Perceptron (MLP) trained with optimized parameters “Number of hidden layers” and “Number of hidden neurons per layer”.
- Gradient Boosted Trees: trained with optimized parameter “Number of trees”.
- Decision Tree: trained with optimized parameter “Min number records per node”.
- Random Forest: trained with optimized parameters “Tree Depth”, “Number of models” and “Minimum child node size”.
- XGBoost Trees: trained with optimized parameters “eta” and “max depth”.
- Generalized Linear Model (H2O): trained with the KNIME H2O Machine Learning Integration with optimized parameters “lambda” and “alpha”.
- Deep Learning (Keras): trained with the KNIME Deep Learning - Keras Integration with no parameter optimization and two simple architectures, one for binary and one for multiclass classification, determined by a few simple heuristics.
- H2O AutoML: trained with the KNIME H2O Machine Learning Integration, using H2O AutoML to train a group of models and select the best one.
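To make the idea of tuning parameters with cross validation concrete, here is a small, self-contained Python sketch. It assumes an exhaustive grid search as the optimization strategy and accuracy as the evaluation metric; the Component lets the user change both, so this is only one possible configuration. The names `k_fold_indices`, `grid_search_cv` and `train_knn`, as well as the toy 1-D nearest-neighbours learner and its data, are hypothetical stand-ins and not part of the Component.

```python
from collections import Counter
from itertools import product


def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, validation
        start += size


def accuracy(truth, predicted):
    """Fraction of rows whose prediction matches the ground truth."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def grid_search_cv(rows, labels, train_fn, grid, k=5, metric=accuracy):
    """Return (best_params, best_score), averaging the metric over k folds."""
    best_params, best_score = None, float("-inf")
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for train, validation in k_fold_indices(len(rows), k):
            predict = train_fn([rows[i] for i in train], [labels[i] for i in train], **params)
            predicted = [predict(rows[i]) for i in validation]
            scores.append(metric([labels[i] for i in validation], predicted))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:  # assumes a higher metric value is better
            best_params, best_score = params, mean_score
    return best_params, best_score


def train_knn(rows, labels, n_neighbors):
    """Toy 1-D k-nearest-neighbours learner standing in for a real model."""
    def predict(x):
        nearest = sorted(zip(rows, labels), key=lambda pair: abs(pair[0] - x))[:n_neighbors]
        return Counter(label for _, label in nearest).most_common(1)[0][0]
    return predict


# Made-up, well-separated train data: class "a" near 0, class "b" near 10.
rows = [0.0, 10.0, 0.5, 10.5, 1.0, 11.0, 1.5, 11.5, 2.0, 12.0]
labels = ["a", "b"] * 5
best_params, best_score = grid_search_cv(rows, labels, train_knn, {"n_neighbors": [1, 3]}, k=5)
```

Only the train partition is passed to the search, so the test partition stays untouched for the final comparison between models.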
MODEL SCORING AND SELECTION:
After the training of the specified models is completed and all models are stored in a single table, the system applies each model to the test set. The predictions of all models are scored against the ground truth and several performance metrics are computed. The best model is selected using the performance metric specified by the user.
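The scoring and selection step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Component's actual logic: accuracy stands in for the user-defined metric, a higher value is assumed to be better, and the function name `rank_models`, the model predictions and the labels are made up. The sketch also sets aside models that always predict the majority class, which the Component reports separately in its output metadata.

```python
from collections import Counter


def accuracy(truth, predicted):
    """Fraction of rows whose prediction matches the ground truth."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def rank_models(truth, predictions_by_model, metric=accuracy):
    """Score every model on the test set and return (ranking, static_models)."""
    majority_class = Counter(truth).most_common(1)[0][0]
    ranking, static_models = [], []
    for name, predicted in predictions_by_model.items():
        if all(p == majority_class for p in predicted):
            static_models.append(name)  # never a candidate for selection
            continue
        ranking.append((name, metric(truth, predicted)))
    ranking.sort(key=lambda item: item[1], reverse=True)  # best model first
    return ranking, static_models


# Made-up test-set labels and predictions, for illustration only.
truth = ["yes", "no", "no", "yes", "no"]
predictions = {
    "Decision Tree": ["yes", "no", "yes", "yes", "no"],
    "Naive Bayes": ["no", "no", "no", "no", "no"],
    "Random Forest": ["yes", "no", "no", "yes", "no"],
}
ranking, static_models = rank_models(truth, predictions)
best_model = ranking[0][0]
```

In this toy example "Naive Bayes" predicts only the majority class and is set aside, and the remaining models are ordered by their score on the test set.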
DEPLOYMENT WORKFLOW:
The data pre-processing, the best model and the data post-processing are captured via the KNIME Integrated Deployment Extension. The end-to-end encapsulated workflow is provided at the output of the Component and it can be used to score raw new data in deployment. Connect to the Workflow Writer node or Workflow Executor node to reuse the trained model wherever needed.
AUTOML OUTPUT METADATA:
The Component additionally outputs flow variables for advanced users.
- "metric_auto" (String) : the name of the user-defined performance metric.
- "target_column" (String) : the name of the user-defined target column.
- "positive" (String) : the positive class used in binary classification.
- "exported_model" (String) : the best model that was selected.
- "exported_model_params" (String Array) : list of the optimized parameter names and values for the exported model.
- "trained_models" (String Array) : list of all selected models that were successfully trained, ranked by the "metric_auto" metric.
- "trained_metrics" (Double Array) : list of the "metric_auto" metrics for all "trained_models".
- "failed_models" (String Array) : list of all selected models that failed during training or testing.
- "static_prediction_models" (String Array) : models always predicting the majority class are discarded and listed here.
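As a rough illustration of how these variables relate to each other, the Python sketch below pairs "trained_models" with "trained_metrics". It assumes the two arrays are index-aligned, which the descriptions above suggest but do not state outright. The dictionary is a hand-written stand-in for the flow variables, and every value in it, including the metric name, is made up.

```python
def metrics_by_model(flow_variables):
    """Map each trained model name to its score, assuming index-aligned arrays."""
    models = flow_variables["trained_models"]
    metrics = flow_variables["trained_metrics"]
    if len(models) != len(metrics):
        raise ValueError("trained_models and trained_metrics differ in length")
    return dict(zip(models, metrics))


# Hypothetical values, written out by hand for illustration only.
example_variables = {
    "metric_auto": "Accuracy",
    "trained_models": ["Random Forest", "Decision Tree"],
    "trained_metrics": [0.91, 0.87],
    "failed_models": [],
    "static_prediction_models": ["Naive Bayes"],
}
scores = metrics_by_model(example_variables)
```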
Component details
Input ports
Output ports
- Type: Workflow Port Object. Trained Model. The best trained model, stored in a Workflow Object port of the KNIME Integrated Deployment Extension. Connect this output port to either the Workflow Writer or Workflow Executor node.
Used extensions & nodes
Created with KNIME Analytics Platform version 4.5.1
Legal
By using or downloading the component, you agree to our terms and conditions.