Join and Clean Spreadsheets
In this workflow, we see how to join two datasets and how to clean up a table using KNIME Analytics Platform.
Sometimes, more than a VLOOKUP is needed. You may need to compare two datasets by joining them on a shared column key and see which rows they have in common and what is missing in one or the other table.
In this example, we have access to two datasets:
The athlete event results of the Summer Olympic Games (1896-2020)
The bio information of the athletes who participated in any of the Summer or Winter Olympics
We want to remove all athletes from the athletes' bio information sheet who have never participated in any Summer Olympic Games.
💡 To view each node's configuration, select the node and see the configuration pane on the right side of the workflow editor.
Let's walk through the different nodes involved in this operation:
Excel Reader nodes:
Since the folder with the data is already included when you download the workflow, in the "File and Sheet" tab, we choose to "Read from" the "Current workflow data area" and select the dataset.
In the "Data Area" tab, we select to read the "Whole sheet". This configuration allows us to have the sheet as it is. The intent is to respect its original structure.
Joiner node:
We use the Joiner node to combine both data tables on the "athlete_id" column. The athlete event results data (top input port) is the left table; the athletes' bio information sheet (bottom input port) is the right table. Keep in mind that one athlete could have participated in more than one event or Olympic Games edition.
In the "Include in Output" section, we want to include everything (matching rows, left unmatched rows, and right unmatched rows).
In the "Output Columns" section, we select which columns to include in the joined output table.
Lastly, we select to "split join result into multiple tables". By enabling this option, all three output ports of the Joiner nodes will be active:
Top port: The join result (athletes' bio information combined with their event results).
Middle port: Left unmatched rows. Information from the athlete event results data of which no bio information is available.
Bottom port: Right unmatched rows. Information from the athletes' bio information sheet of which no event data is available (i.e., those are the athletes that have never participated in any Summer Olympics Games).
Column Filter node:
We continue working with the right unmatched rows, i.e., the athletes that have never participated in any Summer Olympics Games.
We filter out the columns to keep only the "athlete_id" column. We will use it in the next step as a reference column in the Reference Row Filter node.
Reference Row Filter node:
The Reference Row Filter node has two input ports:
Top port: The data to be filtered (here: the athletes' bio information sheet).
Bottom port: The reference data (here: the list of athletes that have never participated in any Summer Olympics Games).
We use "athlete_id" as data and reference column and we exclude the matches. That way, we exclude all athletes from the bio information sheet that have never participated in any Summer Olympics Games.
Excel Writer node:
We create a new file called "Olympic_Athlete_Bio_cleaned" and save it in the workflow data area.
After executing the node, the file will open automatically.