Prediction within Spreadsheets
This workflow shows how to load and prepare data for training a predictive model using data from some Excel files.
In this example, we access various data about Olympics Games results and information about the athletes. We use this information to train a Gradient Boosted Trees model to predict, whether an athlete will win "No medal", "Bronze/Silver", or "Gold" medal.
💡 To view each node's configuration, select the node and see the configuration pane on the right side of the workflow editor.
Let's walk through the different nodes involved in this operation:
Excel Reader nodes:
Since the folder with the data is already included when you download the workflow, in the "File and Sheet" tab, we choose to "Read from" the "Current workflow data area" and select the dataset.
In the "Data Area" tab, we select to read the "Whole sheet". This configuration allows us to have the sheet as it is. The intent is to respect its original structure.
Column Renamer node:
In the Summer Olympic Games information sheet, we rename the column "country_noc" to "country_edition".
Value Lookup nodes:
With the first Value Lookup node, we use "athlete_id" as lookup and key column and add the gender and date of birth of the athletes to the athlete event results.
In the second Value Lookup node, we use "edition_id" as lookup and key column and add the country and start date of each event to the athlete event results.
Missing Value node:
After a Value Lookup operation, it is very likely that some missing rows might have originated when a correspondent key in the lookup table was not found. It's important to check for missing values as the lack of information could create issues during the model training later on.
We exclude rows that have missing values in the columns "country_edition", "sex", "born", or "start_date"as not found.
Expression + Bar Chart nodes:
Expression node: We define one expression to bin the target variable ("medal") into three categories: "No medal", "Bronze/Silver", and "Gold".
Bar Chart node: We plot the distribution of the newly created target variable ("Medal - Target Variable") in a Bar Chart.
Create additional features metanode:
This step is also referred to as Feature Engineering. We are creating two additional features:
Date&Time Difference node: We calculate the difference (in years) between the athlete's date of birth and the start date of the competition ("Athlete Age").
Expression node: We implement a logic to indicate whether the competition took place in the athlete's home country ("athlete_at_home").
Prepare data for model training metanode
Column Resorter node: We resort the table so that the important columns are moved to the beginning of the table.
Column Renamer node: We rename the column "country_noc" to "country_athlete" since we have two country columns in the dataset, one for the athlete and one for the game.
Table Cropper node: We crop the table to keep only the columns that are needed in the model training (column range: "Medal - Target Variable" to "athlete_at_home").
Domain Calculator node: We update all variables domains, a necessary operation to allow the Gradient Boosted Trees algorithm to do its job.
Table Partitioner node:
To train and test the model appropriately, we need a training and testing set.
Partition type: Relative - 80%/20% split
Sampling strategy: Stratified sampling on target column
Gradient Boosted Trees Learner + Gradient Boosted Trees Predictor nodes:
Gradient Boosted Trees Learner node: We use the training set (top port of the Partitioning node) to train the model with the following configuration:
Target column: Medal - Target Variable
We include all other columns in the training (we did the filtering already)
Tree depth: 3
Number of models: 400
Learning rate: 0.1 (default value)
Gradient Boosted Trees Predictor node: We connect the test set (bottom port of the Partitioning node) and the trained model (output of Learner node) to the Predictor node. Make sure to "append individual class probabilities".
Scorer + ROC Curve nodes:
To evaluate the model's accuracy.
Scorer node: Provides a static performance evaluation based on the actual value vs. prediction.
First column: "Medal - Target Variable"
Second column: "Prediction (Medal - Target Variable)"
ROC Curve node: Gives an objective model evaluation not affected by the subjective choice of choosing a cut-off value to the probabilities to assign the prediction.
Target column: Medal - Target Variable
Positive class value: Gold
In the Include/Exclude panel only include P (Medal - Target Variable=Gold)