Synthetic Data (Copulas)

The University of Saskatchewan

Ph.D. in Interdisciplinary Studies

Created by: Carlos Enrique Diaz, MBM, B.Eng.

Email: carlos.diaz@usask.ca

Supervisor: Lori Bradford, Ph.D.

Email: lori.bradford@usask.ca

Description:

This KNIME component generates synthetic tabular data using copula-based multivariate models, preserving both marginal distributions and inter-variable dependencies with the help of the Python copulas library.
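The idea behind a Gaussian copula can be illustrated with a minimal two-column sketch that uses only the Python standard library. It is not the component's implementation: it assumes empirical (rank-based) marginals instead of the parametric marginals offered by the copulas library, and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def normal_scores(values):
    """Map a column to standard-normal scores via its ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) stays strictly inside (0, 1)
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def empirical_quantile(sorted_values, u):
    """Return the observed value at quantile level u in [0, 1]."""
    index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


def sample_gaussian_copula(xs, ys, n_rows, seed=0):
    """Draw synthetic (x, y) pairs that mimic the marginals and dependence."""
    rng = random.Random(seed)
    # Dependence is captured as the correlation of the normal scores.
    rho = correlation(normal_scores(xs), normal_scores(ys))
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    sorted_x, sorted_y = sorted(xs), sorted(ys)
    rows = []
    for _ in range(n_rows):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + residual * rng.gauss(0.0, 1.0)
        # Back to uniforms, then to the original scale of each column.
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        rows.append((empirical_quantile(sorted_x, u1),
                     empirical_quantile(sorted_y, u2)))
    return rows
```

Because the marginals here are empirical, every sampled value is one that already occurs in the input column; a parametric marginal would instead produce new values drawn from the fitted distribution.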

In the Open View (F10), the component displays Spearman and Pearson correlograms for both the original and synthetic datasets, colour-coded from red (-1) to blue (1) for quick visual comparison.
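The two coefficients shown in the correlograms can also be computed directly to compare the original and synthetic tables numerically. The following is a standard-library sketch with hypothetical function names; the component itself renders its view with the R corrplot library, so this is only an illustration of the underlying quantities.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: linear association of the raw values."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def correlation_matrix(columns, method):
    """columns maps a column name to a list of numbers."""
    names = list(columns)
    return {a: {b: method(columns[a], columns[b]) for b in names}
            for a in names}
```

Calling `correlation_matrix` once on the original columns and once on the synthetic ones, for both `pearson` and `spearman`, gives four matrices whose cell-by-cell differences indicate how well the dependencies were preserved.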

Configuration Options:

Multivariate Distribution:

Choose between two copula-based modelling approaches:

Gaussian Copula
Vine Copula

Univariate Distribution (Only for Gaussian Copula):

Select the marginal distribution for each numeric column:

GaussianUnivariate (Default)
BetaUnivariate
GammaUnivariate
GaussianKDE
TruncatedGaussian

Vine Type (Only for Vine Copula):

Choose the vine structure:

Center
Regular
Direct

Synthetic Sample Size:

Number of synthetic rows to generate.
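As a rough illustration, the options above could map onto the copulas library as sketched below. This is an assumption about how the settings relate to the library, not the component's actual code; `build_model` and `synthesize` are hypothetical helpers, and `data` is assumed to be a pandas DataFrame holding only the selected numeric columns.

```python
# Option labels from the configuration dialog mapped to the fully
# qualified class names that copulas accepts for marginal distributions.
UNIVARIATE_CLASSES = {
    "GaussianUnivariate": "copulas.univariate.GaussianUnivariate",
    "BetaUnivariate": "copulas.univariate.BetaUnivariate",
    "GammaUnivariate": "copulas.univariate.GammaUnivariate",
    "GaussianKDE": "copulas.univariate.GaussianKDE",
    "TruncatedGaussian": "copulas.univariate.TruncatedGaussian",
}

# Vine type labels mapped to the identifiers expected by VineCopula.
VINE_TYPES = {"Center": "center", "Regular": "regular", "Direct": "direct"}


def build_model(multivariate, univariate="GaussianUnivariate",
                vine_type="Center"):
    """Return an unfitted copulas model for the chosen options."""
    if multivariate == "Gaussian Copula":
        if univariate not in UNIVARIATE_CLASSES:
            raise ValueError(f"Unknown univariate distribution: {univariate}")
        # Imported here so the option tables load without copulas installed.
        from copulas.multivariate import GaussianMultivariate
        return GaussianMultivariate(
            distribution=UNIVARIATE_CLASSES[univariate])
    if multivariate == "Vine Copula":
        if vine_type not in VINE_TYPES:
            raise ValueError(f"Unknown vine type: {vine_type}")
        from copulas.multivariate import VineCopula
        return VineCopula(VINE_TYPES[vine_type])
    raise ValueError(f"Unknown multivariate distribution: {multivariate}")


def synthesize(data, n_rows, multivariate, **options):
    """Fit the chosen model on `data` and draw `n_rows` synthetic rows."""
    model = build_model(multivariate, **options)
    model.fit(data)
    return model.sample(n_rows)
```

For example, `synthesize(df, 1000, "Vine Copula", vine_type="Regular")` would fit a regular vine on `df` and return 1000 synthetic rows. The univariate option is ignored for vine copulas, matching the "Only for Gaussian Copula" note above.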

Deactivate Correlogram View for Faster Running:

Disables the interactive correlogram view to speed up large-scale or automated executions. Keeping the view active is recommended only during initial visual validation.

Numeric Columns:

Select the numeric features to model and synthesize.

Key Feature – Real-Value Substitution:

To enhance realism, each synthetic numeric value is replaced by the closest real value found in the original dataset.

This post-processing step ensures all values stay within domain-valid ranges.
The resulting table with real-value substitution is available in Port 1.
The raw synthetic data (possibly outside the original range) is available in Port 2.
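The substitution step can be sketched as a nearest-value lookup. The version below assumes that matching is done column by column and that ties go to the lower real value; neither detail is stated above, and the function names are hypothetical.

```python
from bisect import bisect_left


def nearest_real_value(sorted_real, value):
    """Return the element of sorted_real closest to value."""
    pos = bisect_left(sorted_real, value)
    if pos == 0:
        return sorted_real[0]       # below the observed range
    if pos == len(sorted_real):
        return sorted_real[-1]      # above the observed range
    below, above = sorted_real[pos - 1], sorted_real[pos]
    # Assumption: on a tie, prefer the lower real value.
    return below if value - below <= above - value else above


def substitute_real_values(real_column, synthetic_column):
    """Replace each synthetic value with its closest real value."""
    sorted_real = sorted(real_column)
    return [nearest_real_value(sorted_real, v) for v in synthetic_column]
```

With real values `[10, 20, 35]`, the synthetic values `-5` and `99` map to `10` and `35`, which is how out-of-range raw samples are pulled back into the observed range.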

Use Cases:

Data anonymization and privacy preservation
Machine learning pipeline testing
Prototyping with realistic mock data
Secure exploration of sensitive datasets

Requirements:

Python environment with the copulas library installed
R environment with the corrplot library installed

KNIME Base nodes

KNIME Expressions

KNIME Interactive R Statistics Integration

KNIME Python Integration

KNIME Quick Forms

KNIME Views
