This component computes Global Feature Importance for classification models using up to four different techniques.
The component also offers an optional interactive view to explore the results (Right Click > Open Interactive View).
The model to be explained needs to be captured within a Workflow Object via Integrated Deployment.
The data provided should contain instances the model can process to compute predictions. Ideally, this is a sample similar to a test or validation set: representative of the entire distribution and never used during training.
Please note that using a surrogate model to explain an already interpretable model, such as a GLM / Logistic Regression, a Decision Tree or a Random Forest, is not recommended, although it is still possible.
Available Global Feature Importance methods/techniques:
A) GLOBAL SURROGATE MODELS:
Surrogate models are interpretable models trained to mimic the behaviour of the original model by overfitting its predictions. The intuition is that if the interpretable surrogate model can make the same predictions as the original model, it can be used to understand how the input features are connected to those predictions. The quality of the surrogate models is estimated with the user-defined performance metric.
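For intuition only, here is a minimal Python sketch of the surrogate idea using scikit-learn; the black-box model, data and fidelity metric are placeholders, not the component's actual implementation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical "black box": stands in for the model captured via Integrated Deployment.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

# The surrogate is trained on the black box's PREDICTIONS, not on the ground truth,
# so it learns to mimic the original model's behaviour.
y_hat = black_box.predict(X)
surrogate = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y_hat)

# Fidelity: how closely the surrogate reproduces the original model's output
# (the component instead uses a user-defined performance metric).
print("Surrogate fidelity:", accuracy_score(y_hat, surrogate.predict(X)))
```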
Before training the surrogate models:
- the data rows are cleaned by replacing missing values with the most frequent value for categorical columns or the mean for numerical columns;
- optionally, categorical columns with too many unique values can be removed, based on a user-defined parameter;
- numerical features are converted to double and normalized with min-max normalization (see the sketch after this list).
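A minimal Python sketch of this preprocessing with pandas; the column names and the cardinality threshold are hypothetical:

```python
import pandas as pd

# Hypothetical toy table with one categorical and one numerical column.
df = pd.DataFrame({
    "color": ["red", None, "blue", "red"],
    "size": [1.0, 2.0, None, 4.0],
})

# Missing values: most frequent value for categoricals, mean for numericals.
df["color"] = df["color"].fillna(df["color"].mode()[0])
df["size"] = df["size"].fillna(df["size"].mean())

# Optionally drop categorical columns with too many unique values.
max_unique = 50  # user-defined threshold (hypothetical value)
if df["color"].nunique() > max_unique:
    df = df.drop(columns=["color"])

# Min-max normalization of numerical features to [0, 1].
df["size"] = (df["size"] - df["size"].min()) / (df["size"].max() - df["size"].min())
print(df)
```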
Three interpretable models are available:
A1) Surrogate Generalized Linear Model (GLM):
The GLM is trained with the KNIME H2O Machine Learning Integration, optimizing the “lambda” and “alpha” parameters. The family (model type) is binomial for binary classification or multinomial for multiclass classification. The GLM coefficients measure feature importance. If categorical features are present, the surrogate GLM is not trained, as this would decrease interpretability.
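As a hedged illustration, the following scikit-learn sketch uses a cross-validated logistic regression as a stand-in for the H2O GLM; the component itself uses the KNIME H2O integration, not scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
# Min-max normalize so that coefficient magnitudes are comparable.
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Cross-validated regularized logistic regression: a rough analogue of
# tuning the H2O GLM's "lambda"/"alpha" parameters.
glm = LogisticRegressionCV(cv=5, max_iter=5000).fit(X, y)

# Coefficient magnitude serves as the feature importance measure.
importance = np.abs(glm.coef_[0])
for i in np.argsort(importance)[::-1]:
    print(f"feature_{i}: {importance[i]:.3f}")
```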
A2) Surrogate Decision Tree Model:
The Decision Tree is trained with the optimized parameter “Min number records per node”. The Decision Tree structure indicates the importance of the top-level features, since they separate the data into classes most effectively.
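A minimal sketch of this idea in Python with scikit-learn; min_samples_split is only a rough stand-in for the component's “Min number records per node” parameter:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
tree = DecisionTreeClassifier(min_samples_split=20, random_state=0).fit(X, y)

# Report the split features of the top levels: features splitting near the
# root separate the data into classes most effectively.
t = tree.tree_

def report(node=0, depth=0, max_depth=2):
    if depth > max_depth or t.children_left[node] == -1:  # -1 marks a leaf
        return
    print(f"depth {depth}: split on feature_{t.feature[node]}")
    report(t.children_left[node], depth + 1, max_depth)
    report(t.children_right[node], depth + 1, max_depth)

report()
```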
A3) Surrogate Random Forest Model:
The Random Forest is trained with optimized parameters “Tree Depth”, “Number of models” and “Minimum child node size”. Feature importance is calculated by counting how many times each feature has been selected for a split, and at which rank (level), among all available features (candidates) in the trees of the random forest.
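A sketch of rank-weighted split counting in Python with scikit-learn; the 1/(depth+1) weighting is an illustrative assumption, not the component's documented formula:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
# n_estimators, max_depth and min_samples_leaf loosely correspond to the
# optimized "Number of models", "Tree Depth" and "Minimum child node size".
forest = RandomForestClassifier(n_estimators=50, max_depth=5,
                                min_samples_leaf=5, random_state=0).fit(X, y)

counts = np.zeros(X.shape[1])

def walk(t, node=0, depth=0):
    if t.children_left[node] == -1:  # leaf
        return
    # Rank-weighted split count: splits near the root count more.
    counts[t.feature[node]] += 1.0 / (depth + 1)
    walk(t, t.children_left[node], depth + 1)
    walk(t, t.children_right[node], depth + 1)

for est in forest.estimators_:
    walk(est.tree_)

for i in np.argsort(counts)[::-1]:
    print(f"feature_{i}: weighted split count = {counts[i]:.1f}")
```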
B) PERMUTATION FEATURE IMPORTANCE:
Permutation feature importance measures the difference between the model performance score computed on predictions using all the original features and the score computed with one feature randomly permuted. If a feature is permuted several times, the average difference is reported. The process is repeated for each feature. The standard deviation of the score differences across permutations is provided as an additional output.
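A minimal Python sketch of this procedure, assuming accuracy as the performance metric and, for brevity, scoring on the training sample rather than a held-out set as recommended above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

baseline = accuracy_score(y, model.predict(X))
rng = np.random.default_rng(0)
n_repeats = 10

for j in range(X.shape[1]):
    diffs = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        rng.shuffle(X_perm[:, j])  # randomly permute a single feature
        diffs.append(baseline - accuracy_score(y, model.predict(X_perm)))
    # Average difference is the importance; its std dev is the extra output.
    print(f"feature_{j}: mean={np.mean(diffs):.4f}, std={np.std(diffs):.4f}")
```

scikit-learn also ships the same idea as sklearn.inspection.permutation_importance.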
More information at:
Molnar, Christoph. "Interpretable machine learning", 2019.
christophm.github.io/interpretable-ml-book
- Input Model (Type: Workflow Port Object): production workflow containing the input model, stored as a Workflow Object via Integrated Deployment nodes
- Data from Test Set Partition (Type: Table): data from the test set partition with an available Target (Ground Truth) column