Prediction within Spreadsheets
This workflow shows how to load and prepare data from Excel files for training a predictive model.
All data is already stored within the workflow data area, and all three Excel Reader nodes point directly to this container.
In the first step, Value Lookup nodes join features coming from different data sources.
This node requires a key column that must be available in both data sources. In the first Value Lookup node the key is "athlete_id", while in the second it is "edition_id".
In this way, columns are added to the first connected table to enrich the information available.
The first Value Lookup node adds information about the athlete, such as the date of birth and gender.
The second Value Lookup node adds information about the Olympic Games edition, such as the host country and the start date.
After a Value Lookup operation, it is very likely that missing values are created for non-matching rows. In our case it is important to exclude these rows, as this lack of information could cause issues during model training.
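Outside KNIME, the same lookup-and-filter logic could be sketched in pandas. The key columns `athlete_id` and `edition_id` come from the workflow; the miniature tables and other column names below are purely illustrative:

```python
import pandas as pd

# Hypothetical miniature tables standing in for the three Excel sources
results = pd.DataFrame({"athlete_id": [1, 2, 3], "edition_id": [10, 10, 11],
                        "medal": ["Gold", "na", "Silver"]})
athletes = pd.DataFrame({"athlete_id": [1, 2],
                         "born": ["1990-01-01", "1985-06-15"],
                         "sex": ["F", "M"]})
editions = pd.DataFrame({"edition_id": [10], "host_country": ["JPN"],
                         "start_date": ["2021-07-23"]})

# Two left joins mimic the two Value Lookup nodes (keys: athlete_id, edition_id)
enriched = (results.merge(athletes, on="athlete_id", how="left")
                   .merge(editions, on="edition_id", how="left"))

# Non-matching rows produce missing values; drop them before training
clean = enriched.dropna()
print(clean)
```

Athlete 3 and edition 11 have no match in the lookup tables, so that row is dropped.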
The following steps perform feature processing aimed at creating new features and binning the target variable, that is, the kind of medal the athlete won during the Olympic Games.
The Rule Engine node bins the target variable using the following rules:
$medal$ = "na" => "No Medal"
$medal$ = "Bronze" OR $medal$ = "Silver" => "Bronze/Silver"
TRUE => $medal$
Hence, we aggregate the Bronze and Silver medals into a single category.
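The binning rules can be sketched in plain Python; as in the Rule Engine node, the first matching rule wins:

```python
def bin_medal(medal: str) -> str:
    # First matching rule wins, mirroring the KNIME Rule Engine node
    if medal == "na":
        return "No Medal"
    if medal in ("Bronze", "Silver"):
        return "Bronze/Silver"
    return medal  # TRUE => $medal$ keeps the remaining values (e.g. "Gold")

print([bin_medal(m) for m in ["na", "Bronze", "Silver", "Gold"]])
# → ['No Medal', 'Bronze/Silver', 'Bronze/Silver', 'Gold']
```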
Afterward, in the "Create two additional features" metanode, we use the Date&Time Difference node to compute the athlete's age and the Rule Engine node to create a binary variable indicating whether the athlete is competing in an edition hosted by their home country.
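These two derived features could be sketched in pandas as follows (the column names here are assumptions for illustration, not the workflow's exact ones):

```python
import pandas as pd

# Illustrative rows; names like "athlete_country" are assumed, not from the workflow
df = pd.DataFrame({
    "born": pd.to_datetime(["1990-01-01", "1985-06-15"]),
    "start_date": pd.to_datetime(["2021-07-23", "2021-07-23"]),
    "athlete_country": ["JPN", "USA"],
    "host_country": ["JPN", "JPN"],
})

# Age at the start of the edition (Date&Time Difference equivalent)
df["age"] = (df["start_date"] - df["born"]).dt.days // 365

# Binary flag: is the athlete competing at home?
df["home_edition"] = (df["athlete_country"] == df["host_country"]).astype(int)
print(df[["age", "home_edition"]])
```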
Within the "Prepare data for model training" metanode, we use the Table Cropper node to remove the columns that will not be used for model training. Before that, the Column Resorter node reorders the table so that all unnecessary columns sit at the first and last positions.
Afterward, we "crop" the table, keeping only the columns we are going to use to train the model.
A Domain Calculator node finally updates the domains of all variables, a necessary operation for the Gradient Boosting algorithm to work properly.
The Partitioning node creates the training set and the test set. The former is the table used to feed the learner node, while the latter is used to evaluate the model on data it has never seen, yielding realistic performance metrics.
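Outside KNIME, the Partitioning node corresponds to a train/test split; a minimal sketch with scikit-learn on a toy table:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy table; in KNIME the Partitioning node plays this role
data = pd.DataFrame({
    "age": [21, 25, 30, 34, 28, 23],
    "home_edition": [0, 1, 0, 1, 1, 0],
    "medal_bin": ["No Medal", "Gold", "Bronze/Silver",
                  "No Medal", "Gold", "No Medal"],
})

# Hold out part of the data so evaluation uses rows the model never saw
train, test = train_test_split(data, test_size=0.33, random_state=42)
print(len(train), len(test))
```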
Finally, a Gradient Boosting model is trained to predict whether the athlete will win a medal in a specific competition.
The learner node trains the model while the predictor node appends the prediction.
In addition to the class prediction, we ask the predictor node to append the individual class probabilities, which represent the estimated probability of winning (or not winning) a given medal.
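The learner/predictor pair maps onto a fit/predict pattern; a sketch with scikit-learn's `GradientBoostingClassifier` on synthetic data (the features and labels below are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Tiny synthetic dataset; 1 loosely stands for "wins a medal"
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X, y)  # learner-node equivalent

# Predictor-node equivalent: class predictions plus class probabilities
pred = model.predict(X[:5])
proba = model.predict_proba(X[:5])  # one column per class, rows sum to 1
print(pred, proba.shape)
```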
The model has been evaluated with a ROC Curve and a confusion matrix.
The ROC curve gives an objective model evaluation that does not depend on the subjective choice of a probability cut-off for assigning the predicted class, while the Scorer node provides a performance evaluation at a fixed cut-off, based on the actual predicted classes.
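The distinction between the two evaluations can be made concrete with scikit-learn: ROC AUC ranks the probabilities without any threshold, while the confusion matrix first needs a cut-off. The labels and probabilities below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 1, 0])
p_pos = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])

# ROC AUC is threshold-free: it only uses the ranking of the probabilities
auc = roc_auc_score(y_true, p_pos)

# The confusion matrix needs a hard cut-off (0.5 here) to turn scores into classes
y_pred = (p_pos >= 0.5).astype(int)
cm = confusion_matrix(y_true, y_pred)
print(auc, cm)
```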
Created with KNIME Analytics Platform version 5.2.1