Merge Files - Unstructured Data
This workflow merges different unstructured data sources to create a new table.
Data often comes from different sources, and datasets usually have different table structures and are not always structured identically.
In this example, we access data from three different Excel files. The goal is to combine the information contained in the three datasets, however, each file has a different, unstructured format that might be an obstacle when analyzing the data:
Olympic_Athlete_Bio.xlsx: Contains an American athlete whose nationality is missing.
Olympic_Athlete_Event_Results.xlsx: Contains a header on top that needs to be excluded from the table.
Olympic_Games.xlsx: Contains empty columns and rows.
💡 To view each node's configuration, select the node and see the configuration pane on the right side of the workflow editor.
Let's walk through the different nodes involved in this operation:
Excel Reader nodes:
Since the folder with the data is already included when you download the workflow, in the "File and Sheet" tab, we choose to "Read from" the "Current workflow data area" and select the dataset.
Variable Creator + Cell Updater nodes:
To fix the missing value in the Olympic_Athlete_Bio.xlsx file:
Variable Creator node: Create a flow variable with the value to replace the missing value with.
Cell Updater node: Specify the cell which should be replaced by the flow variable (Row 10 in column "country_noc").
Table Cropper nodes:
To exclude the empty header rows of the Olympic_Athlete_Event_Results.xlsx file; and
to exclude the empty columns and rows of the Olympic_Games.xlsx file.
Value Lookup nodes:
To add the gender and date of birth from the athletes' bio information sheet to the event results data; and
to add the start and end date from the Summer Olympic Games information sheet to the event results data.
Missing Value node:
To exclude rows that have missing values in the columns "sex", "born", "start_date", or "end_date".
In fact, after a Value Lookup operation it is very likely that some missing rows might have originated when a correspondent key in the lookup table was not found.
Excel Reader node:
We create a new file called "Merged_table" and save it in the workflow data area.
After executing the node, the file will open automatically.