The University of Saskatchewan
Ph.D. in Interdisciplinary Studies
Created by: Carlos Enrique Diaz, MBM, P.Eng.
Email: carlos.diaz@usask.ca
Supervisor: Lori Bradford, Ph.D.
Email: lori.bradford@usask.ca
Description:
This workflow demonstrates how to assess the quality of synthetic data generated using the Synthetic Data (Copulas) component in KNIME. It uses the well-known Iris dataset as a reference.
Section 1: Original Data Analysis with 150 Observations
Loads and preprocesses the Iris dataset (150 rows).
Uses Linear Correlation and Statistics nodes to explore the original data’s structure and relationships.
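For readers who want to see what this section computes outside KNIME, here is a minimal Python sketch of a Pearson correlation matrix and per-column summary statistics. It is not the implementation of the Linear Correlation or Statistics nodes. The helper names (pearson, correlation_matrix, summary) and the toy values are hypothetical and chosen for illustration only; the toy values are not real Iris measurements.

```python
import math
import statistics

def pearson(x, y):
    """Pearson linear correlation between two equal-length numeric sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of column name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

def summary(values):
    """Basic descriptive statistics for one numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation
    }

# Toy values for illustration only; NOT the actual Iris measurements.
toy = {
    "petal_length": [1.4, 1.5, 4.5, 4.7, 5.9, 6.1],
    "petal_width": [0.2, 0.3, 1.5, 1.4, 2.1, 2.3],
}

corr = correlation_matrix(toy)
stats = {name: summary(values) for name, values in toy.items()}
print(corr["petal_length"]["petal_width"])
print(stats["petal_length"])
```

These two outputs (a correlation matrix and a statistics table) are the baseline against which the later sections are compared.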
Section 2: Mixed Data with 650 Observations
Generates 500 synthetic rows using the Synthetic Data (Copulas) component.
Merges the synthetic data with the original data (total: 650 rows).
Applies the same analysis nodes to compare the combined dataset with the original.
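The idea behind copula-based generation and the merge step can be sketched in Python as follows. This is a generic two-column Gaussian copula, written under the assumption that a Gaussian copula with empirical marginals is a reasonable stand-in; the internals of the Synthetic Data (Copulas) component are not described here and may differ. All function names and the toy values are hypothetical, and the toy values are not real Iris measurements.

```python
import math
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()

def pearson(x, y):
    """Pearson linear correlation between two equal-length numeric sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def normal_scores(values):
    """Map each value to a standard-normal score based on its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores

def empirical_quantile(sorted_values, u):
    """Linearly interpolated quantile of a sorted sample at probability u."""
    pos = u * (len(sorted_values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def sample_gaussian_copula(x, y, n_rows, seed=0):
    """Draw n_rows synthetic (x, y) pairs that mimic the dependence of x and y."""
    rng = random.Random(seed)
    rho = pearson(normal_scores(x), normal_scores(y))
    resid = math.sqrt(max(0.0, 1.0 - rho * rho))  # guard against rounding
    xs, ys = sorted(x), sorted(y)
    rows = []
    for _ in range(n_rows):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + resid * rng.gauss(0.0, 1.0)  # correlated normals
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)  # uniform marginals
        rows.append((empirical_quantile(xs, u1), empirical_quantile(ys, u2)))
    return rows

# Toy values for illustration only; NOT the actual Iris measurements.
toy_x = [1.4, 1.5, 4.5, 4.7, 5.9, 6.1]
toy_y = [0.2, 0.3, 1.5, 1.4, 2.1, 2.3]

original = list(zip(toy_x, toy_y))
synthetic = sample_gaussian_copula(toy_x, toy_y, n_rows=500)

# Merge step: tag each row with its source so it can be filtered later.
combined = [(a, b, "original") for a, b in original]
combined += [(a, b, "synthetic") for a, b in synthetic]
print(len(combined))
```

In the workflow itself the original table has 150 rows, so the merged table has 650; the toy table here is much smaller. Tagging each row with its source is what makes the later filtering step possible.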
Section 3: Pure Synthetic Data with 500 Observations
Filters the combined table to keep only the 500 synthetic rows.
Runs correlation and statistical analysis again to evaluate the synthetic data on its own.
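One way to turn this comparison into numbers is to profile the original and synthetic rows separately and look at the gaps. The sketch below assumes each row carries a source tag ("original" or "synthetic"); that tag, the function names, and the toy rows are hypothetical and are not taken from the workflow. The toy values are made up and are not real Iris measurements or real component output.

```python
import math
import statistics

def pearson(x, y):
    """Pearson linear correlation between two equal-length numeric sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def split_by_source(rows):
    """Filter step: separate tagged rows into original and synthetic groups."""
    groups = {"original": [], "synthetic": []}
    for x, y, source in rows:
        groups[source].append((x, y))
    return groups

def profile(pairs):
    """Summary statistics and correlation for a list of (x, y) pairs."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    return {
        "mean_x": statistics.fmean(xs),
        "std_x": statistics.stdev(xs),
        "mean_y": statistics.fmean(ys),
        "std_y": statistics.stdev(ys),
        "corr_xy": pearson(xs, ys),
    }

def compare(original, synthetic):
    """Synthetic-minus-original gap for every profiled quantity."""
    po, ps = profile(original), profile(synthetic)
    return {key: ps[key] - po[key] for key in po}

# Made-up tagged rows for illustration only.
rows = [
    (1.4, 0.2, "original"), (4.5, 1.5, "original"),
    (5.9, 2.1, "original"), (1.5, 0.3, "original"),
    (1.45, 0.25, "synthetic"), (4.6, 1.45, "synthetic"),
    (5.5, 2.0, "synthetic"), (1.6, 0.3, "synthetic"),
]

groups = split_by_source(rows)
gaps = compare(groups["original"], groups["synthetic"])
for name, gap in gaps.items():
    print(f"{name}: {gap:+.3f}")
```

Gaps close to zero suggest the synthetic rows reproduce the original means, spreads, and correlation; large gaps point to the columns where the generator drifts.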
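This workflow offers a simple way to compare the statistical quality of synthetic data against its source using only built-in KNIME nodes. If the synthetic data is faithful, the correlation matrices and summary statistics from Sections 1 and 3 should closely match.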