The University of Saskatchewan
Ph.D. in Interdisciplinary Studies
Created by: Carlos Enrique Diaz, MBM, P.Eng.
Email: carlos.diaz@usask.ca
Supervisor: Lori Bradford, Ph.D.
Email: lori.bradford@usask.ca
Description:
This KNIME component generates synthetic tabular data using a copula-based statistical model, capturing both marginal distributions and multivariate dependencies present in the original dataset. It leverages the copulas Python library to model the data and produce synthetic samples that mimic the statistical structure of the input.
A Gaussian multivariate copula is used to model the joint distribution of all numeric columns, while a user-selected univariate distribution is applied to each individual feature. Additionally, one string-type column is kept in the synthetic output as an identifier column, with each synthetic row given a generated label (e.g., "Synthetic_1", "Synthetic_2", etc.).
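To make the idea of a Gaussian copula concrete, here is a minimal NumPy-only sketch. It is not the copulas library's implementation: it assumes rank-based empirical marginals instead of the parametric distributions offered by this component, and the function name sample_gaussian_copula is hypothetical.

```python
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()
_to_normal = np.vectorize(_STD_NORMAL.inv_cdf)  # uniform (0, 1) -> normal score
_to_uniform = np.vectorize(_STD_NORMAL.cdf)     # normal score -> uniform (0, 1)


def sample_gaussian_copula(data, n_samples, seed=0):
    """Draw rows whose marginals and correlations resemble those of `data`.

    data: 2-D array-like, one column per numeric feature (no missing values).
    """
    data = np.asarray(data, dtype=float)
    n_rows, n_cols = data.shape
    rng = np.random.default_rng(seed)

    # 1. Marginals: turn each column into ranks in (0, 1), then into normal scores.
    ranks = data.argsort(axis=0).argsort(axis=0) + 1  # 1..n_rows per column
    normal_scores = _to_normal(ranks / (n_rows + 1))

    # 2. Dependence: the Gaussian copula is described by the score correlations.
    corr = np.atleast_2d(np.corrcoef(normal_scores, rowvar=False))

    # 3. Sample correlated normals, then map each column back through the
    #    empirical quantile function of the matching original column.
    z = rng.multivariate_normal(np.zeros(n_cols), corr, size=n_samples)
    u = _to_uniform(z)
    return np.column_stack(
        [np.quantile(data[:, j], u[:, j]) for j in range(n_cols)]
    )
```

The split into steps 1 and 2 is the point of a copula: the shape of each column and the dependence between columns are modelled separately.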
Configuration Options:
Sample Size (Sample size):
Number of synthetic rows to generate.
Distribution (Distribution):
Type of univariate distribution to apply to each numeric feature:
BetaUnivariate
GammaUnivariate
GaussianKDE
GaussianUnivariate
TruncatedGaussian
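As a rough sketch of how the two options above could be passed to the copulas library: the option names match classes exported by copulas.univariate, and GaussianMultivariate accepts such a class through its distribution argument. The helper name fit_and_sample and its parameters are hypothetical, and the component's actual script may differ.

```python
# Option labels as listed above; each is a class name in copulas.univariate.
DISTRIBUTION_OPTIONS = (
    "BetaUnivariate",
    "GammaUnivariate",
    "GaussianKDE",
    "GaussianUnivariate",
    "TruncatedGaussian",
)


def fit_and_sample(numeric_df, distribution, sample_size):
    """Fit a Gaussian copula with one marginal type and draw synthetic rows.

    numeric_df: pandas DataFrame holding only the numeric columns.
    distribution: one of DISTRIBUTION_OPTIONS.
    sample_size: number of synthetic rows to generate.
    """
    if distribution not in DISTRIBUTION_OPTIONS:
        raise ValueError(f"Unknown distribution option: {distribution!r}")

    # Imported lazily so an invalid option is reported before the library loads.
    import copulas.univariate as univariate
    from copulas.multivariate import GaussianMultivariate

    model = GaussianMultivariate(distribution=getattr(univariate, distribution))
    model.fit(numeric_df)
    return model.sample(sample_size)
```

A call such as fit_and_sample(numeric_df, "GaussianKDE", 100) would then return a DataFrame of 100 synthetic rows with the same numeric columns.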
Key Feature: Real-Value Substitution:
After generating synthetic numeric values using the copula model, the component performs a post-processing step to increase realism and interpretability. Each synthetic numeric value is replaced by the closest real value found in the corresponding column of the original dataset.
This ensures that all values in the synthetic data fall within the range of observed values and reflect realistic, domain-valid entries. The synthetic data thus maintains the original format and semantics, while each synthetic row is a new combination of values sampled from the fitted model rather than a copy of a real record.
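One possible way to implement this nearest-value substitution for a single column is sketched below. The function name snap_to_observed is hypothetical, and the sketch assumes the observed column contains no missing values; the component's own code may handle this step differently.

```python
import numpy as np


def snap_to_observed(synthetic, observed):
    """Replace each synthetic value with the closest value in `observed`."""
    synthetic = np.asarray(synthetic, dtype=float)
    candidates = np.unique(observed)  # sorted, duplicates removed

    # Index of the first candidate >= each synthetic value, kept inside bounds
    # so that both a left and a right neighbour always exist.
    right = np.clip(np.searchsorted(candidates, synthetic), 1, len(candidates) - 1)
    left = right - 1

    # Pick whichever neighbour is closer (ties go to the smaller value).
    use_right = (candidates[right] - synthetic) < (synthetic - candidates[left])
    return np.where(use_right, candidates[right], candidates[left])


print(snap_to_observed([0.2, 2.6, 9.0], [1, 2, 3, 3]))  # -> [1 3 3]
```

Because every output is drawn from the observed values, integer columns stay integer and out-of-range samples such as 9.0 are pulled back to the nearest observed extreme.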
How it works:
The input KNIME table is read and separated into numeric and string columns.
A Gaussian multivariate copula is fitted to the numeric portion using the selected univariate distribution.
Synthetic samples are drawn from the fitted copula model.
Each synthetic value is replaced by the closest real value from the original column.
The final synthetic dataset is output in the same format as the original.
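The table handling around the model (the first and last steps above) might look like the following pandas sketch. The names build_synthetic_table and sampler are hypothetical; sampler stands in for the fit, sample, and substitution steps, and the sketch assumes the first non-numeric column is the one used as the identifier.

```python
import pandas as pd


def build_synthetic_table(original, sampler, sample_size):
    """Assemble a synthetic table laid out like `original`.

    sampler: callable (numeric_df, n) -> DataFrame with n synthetic numeric rows;
             a stand-in for the copula fit, sampling, and substitution steps.
    """
    # Step 1: separate numeric and string columns.
    numeric = original.select_dtypes(include="number")
    string_cols = original.select_dtypes(exclude="number").columns

    # Steps 2-4 are delegated to the sampler.
    synthetic = sampler(numeric, sample_size).reset_index(drop=True)

    # Label the rows Synthetic_1, Synthetic_2, ... in the identifier column.
    if len(string_cols) > 0:
        synthetic[string_cols[0]] = [
            f"Synthetic_{i}" for i in range(1, sample_size + 1)
        ]

    # Step 5: restore the original column order for the columns carried over.
    ordered = [c for c in original.columns if c in synthetic.columns]
    return synthetic[ordered]
```

Passing the sampler in as a function keeps the table logic independent of the statistical model, so it can be tried out with a simple resampling stand-in before the copula is plugged in.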
Use Cases:
Data anonymization and privacy preservation
Machine learning model testing
Prototyping with realistic mock data
Exploring data distributions without exposing sensitive records
Requirements:
Python environment with the copulas library installed