Learns an ensemble of decision trees (such as random forest* variants). Each decision tree model is learned on a different set of rows (records) and/or a different set of columns (describing attributes), whereby the latter can also be a bit/byte vector descriptor (e.g. molecular fingerprint). The output model describes an ensemble of decision tree models and is applied in the corresponding predictor node using a simple majority vote.
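A minimal Python sketch of the idea described above (not this node's implementation, and assuming scikit-learn is available): each tree is fit on a bootstrap sample of the rows, considers a random subset of the attributes at every split, and prediction is a simple majority vote over the trees. Names such as fit_tree_ensemble and n_trees are illustrative.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def fit_tree_ensemble(X, y, n_trees=100, rng=np.random.default_rng(0)):
    """Fit an ensemble of decision trees on bootstrapped rows with
    per-split attribute sampling (a rough random-forest-style sketch)."""
    trees = []
    n_rows = X.shape[0]
    for _ in range(n_trees):
        # Bootstrap: sample as many rows as the data has, with replacement.
        idx = rng.integers(0, n_rows, size=n_rows)
        # max_features="sqrt": consider a random subset of attributes at each split.
        tree = DecisionTreeClassifier(
            criterion="gini",
            max_features="sqrt",
            random_state=int(rng.integers(1 << 31)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    """Apply the ensemble: simple majority vote over per-tree predictions."""
    votes = np.array([t.predict(X) for t in trees])
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```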
The following configuration settings learn a model that is similar to the random forest™ classifier described by Leo Breiman and Adele Cutler (see the sketch after this list):
- Tree Options - Split Criterion: Gini Index
- Tree Options - Limit number of levels (tree depth): unlimited
- Tree Options - Minimum node size: unlimited
- Ensemble Configuration - Number of models: Arbitrary (random forest arguably does not overfit)
- Ensemble Configuration - Data Sampling: Use all rows (fraction = 1) but choose sampling with replacement (bootstrapping)
- Ensemble Configuration - Attribute Sampling: Sample using a different set of attributes for each tree node split; usually square root of number of attributes but can vary
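For orientation only, a rough scikit-learn analogue of these settings (an assumption for illustration, not this node's own code) would look like the following: Gini split criterion, unlimited tree depth, no effective minimum node size, bootstrap row sampling, and square root of the number of attributes per split.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,     # "Number of models" is arbitrary
    criterion="gini",     # Split Criterion: Gini Index
    max_depth=None,       # Limit number of levels (tree depth): unlimited
    min_samples_leaf=1,   # no effective minimum node size constraint
    bootstrap=True,       # all rows, sampled with replacement
    max_features="sqrt",  # attribute sampling per tree node split
)
```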
The decision tree construction takes place in main memory (all data and all models are kept in memory).
(*) RANDOM FORESTS is a registered trademark of Minitab, LLC and is used with Minitab’s permission.