The University of Saskatchewan
Ph.D. in Interdisciplinary Studies
Created by: Carlos Enrique Diaz, MBM, B.Eng.
Email: carlos.diaz@usask.ca
Supervisor: Lori Bradford, Ph.D.
Email: lori.bradford@usask.ca
Description:
This KNIME component generates synthetic tabular data using copula-based multivariate models, preserving both marginal distributions and inter-variable dependencies with the help of the Python copulas library.
In the Open View (F10), the component displays Spearman and Pearson correlograms for both the original and synthetic datasets, colour-coded from red (-1) to blue (1) for quick visual comparison.
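As a rough illustration of the two coefficients shown in the correlograms, the sketch below computes Pearson and Spearman correlations in plain Python. It is not the component's own code. The helper names (pearson, average_ranks, spearman) and the sample data are hypothetical and chosen only for this example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: y grows monotonically but not linearly with x.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print("Pearson:", pearson(x, y))
print("Spearman:", spearman(x, y))
```

For a monotonic but non-linear relationship like this one, Spearman reaches 1 while Pearson stays below it, which is why comparing both correlograms between the original and synthetic tables can be informative.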
Configuration Options:
Multivariate Distribution:
Choose between two copula-based modelling approaches:
Gaussian Copula
Vine Copula
Univariate Distribution (Only for Gaussian Copula):
Select the marginal distribution for each numeric column:
GaussianUnivariate (Default)
BetaUnivariate
GammaUnivariate
GaussianKDE
TruncatedGaussian
Vine Type (Only for Vine Copula):
Choose the vine structure:
Center
Regular
Direct
Synthetic Sample Size:
Number of synthetic rows to generate.
Deactivate Correlogram View for Faster Running:
Disables the interactive correlogram view to speed up large-scale or automated executions. Keeping the view active is recommended only during initial visual validation.
Numeric Columns:
Select the numeric features to model and synthesize.
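To make the Gaussian Copula option with Gaussian marginals more concrete, here is a minimal two-column sketch in plain Python. It is an illustration of the general technique under stated assumptions, not the component's implementation and not the copulas library API. The function names, the heights and weights data, and the fixed seed are all hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal used by the Gaussian copula

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    return cov / sqrt(sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))

def fit_gaussian_copula(x, y):
    """Fit a Gaussian marginal per column and the correlation of the normal scores."""
    marg_x, marg_y = NormalDist.from_samples(x), NormalDist.from_samples(y)
    # Marginal CDF maps data to (0, 1); the inverse normal CDF maps that to scores.
    zx = [STD_NORMAL.inv_cdf(marg_x.cdf(v)) for v in x]
    zy = [STD_NORMAL.inv_cdf(marg_y.cdf(v)) for v in y]
    return marg_x, marg_y, pearson(zx, zy)

def sample_gaussian_copula(marg_x, marg_y, rho, n_rows, seed=0):
    """Draw correlated normal scores, then map them back through the marginals."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        z1 = rng.gauss(0, 1)
        # Second score is correlated with the first by construction.
        z2 = rho * z1 + sqrt(max(0.0, 1 - rho ** 2)) * rng.gauss(0, 1)
        rows.append((marg_x.inv_cdf(STD_NORMAL.cdf(z1)),
                     marg_y.inv_cdf(STD_NORMAL.cdf(z2))))
    return rows

# Hypothetical original data (two numeric columns).
heights = [150, 155, 160, 165, 170, 175, 180, 185]
weights = [52, 55, 61, 60, 68, 72, 75, 83]

marg_h, marg_w, rho = fit_gaussian_copula(heights, weights)
synthetic = sample_gaussian_copula(marg_h, marg_w, rho, n_rows=2000)
print("fitted correlation:", rho)
print("first synthetic row:", synthetic[0])
```

Swapping the per-column marginal (here a normal distribution) for another family changes the shape of each synthetic column while the dependence structure is still carried by the correlation of the normal scores. Note that values sampled this way are not limited to the range of the original data.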
Key Feature – Real-Value Substitution:
To enhance realism, each synthetic numeric value is replaced by the closest real value found in the original dataset.
Because every substituted value occurs in the original data, this post-processing step keeps all values within the observed, domain-valid ranges.
The resulting table with real-value substitution is available in Port 1.
The raw synthetic data (possibly outside the original range) is available in Port 2.
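The substitution step can be sketched as a nearest-neighbour lookup on a single column. This is a minimal illustration, assuming that "closest" means smallest absolute difference within the same column and that ties resolve to the lower value. The component's exact rule is not stated here, and the function name snap_to_real and the sample values are hypothetical.

```python
from bisect import bisect_left

def snap_to_real(synthetic, real):
    """Replace each synthetic value with the nearest value observed in `real`."""
    real_sorted = sorted(real)
    snapped = []
    for value in synthetic:
        pos = bisect_left(real_sorted, value)
        # Candidates are the neighbours on either side of the insertion point.
        candidates = real_sorted[max(pos - 1, 0):pos + 1]
        # min() keeps the first (lower) candidate on ties: an assumed tie rule.
        snapped.append(min(candidates, key=lambda r: abs(r - value)))
    return snapped

# Hypothetical column: observed ages and raw synthetic draws.
real_ages = [23, 31, 35, 42, 58]
raw_synthetic = [19.4, 33.2, 47.9, 63.0]

snapped_ages = snap_to_real(raw_synthetic, real_ages)
print(snapped_ages)  # [23, 35, 42, 58]
```

In this example the raw draws 19.4 and 63.0 fall outside the observed range of 23 to 58, and the substitution pulls them back to the nearest observed values.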
Use Cases:
Data anonymization and privacy preservation
Machine learning pipeline testing
Prototyping with realistic mock data
Secure exploration of sensitive datasets
Requirements:
Python environment with the copulas library installed
R environment with the corrplot package installed