
Synthetic Data (Copulas)

Tags: Data augmentation, Copulas, Synthetic data, Data generation, Copula
Version 1.0 (latest), created on May 14, 2025 11:39 PM

The University of Saskatchewan
Ph.D. in Interdisciplinary Studies

Created by: Carlos Enrique Diaz, MBM, P.Eng.
Email: carlos.diaz@usask.ca

Supervisor: Lori Bradford, Ph.D.
Email: lori.bradford@usask.ca

Description:

This KNIME component generates synthetic tabular data using a copula-based statistical model, capturing both marginal distributions and multivariate dependencies present in the original dataset. It leverages the copulas Python library to model the data and produce synthetic samples that mimic the statistical structure of the input.

A Gaussian multivariate copula is used to model the joint distribution of all numeric columns, while a user-selected univariate distribution is applied to each individual feature. Additionally, one string-type column is retained as an identifier column in the synthetic output, holding generated labels (e.g., "Synthetic_1", "Synthetic_2", etc.).

Configuration Options:

  • Sample Size (Sample size):
    Number of synthetic rows to generate.

  • Distribution (Distribution):
    Type of univariate distribution to apply to each numeric feature:

    • BetaUnivariate

    • GammaUnivariate

    • GaussianKDE

    • GaussianUnivariate

    • TruncatedGaussian
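
As a rough illustration of how the Distribution option could be wired to the copulas library, the sketch below maps an option name to the class of the same name in `copulas.univariate` and passes it to `GaussianMultivariate`. The helper names `resolve_distribution` and `fit_and_sample` are hypothetical and are not taken from the component, and `numeric_df` is assumed to be a pandas DataFrame holding only the numeric columns.

```python
import importlib

# The five option names listed above. Each matches a class name
# exported by the copulas.univariate module.
DISTRIBUTION_OPTIONS = (
    "BetaUnivariate",
    "GammaUnivariate",
    "GaussianKDE",
    "GaussianUnivariate",
    "TruncatedGaussian",
)


def resolve_distribution(name):
    """Return the copulas.univariate class for a configured option name."""
    if name not in DISTRIBUTION_OPTIONS:
        raise ValueError(f"Unknown distribution option: {name!r}")
    # Imported lazily so the option check runs even where copulas is absent.
    univariate = importlib.import_module("copulas.univariate")
    return getattr(univariate, name)


def fit_and_sample(numeric_df, distribution_name, sample_size):
    """Fit a Gaussian copula to a numeric DataFrame and draw synthetic rows.

    Hypothetical helper: the component's actual script may differ.
    """
    from copulas.multivariate import GaussianMultivariate

    # The same univariate distribution is applied to every numeric column.
    model = GaussianMultivariate(
        distribution=resolve_distribution(distribution_name)
    )
    model.fit(numeric_df)
    return model.sample(sample_size)
```

A call such as `fit_and_sample(numeric_df, "GaussianKDE", 500)` would then return a DataFrame of 500 synthetic rows with the same columns as `numeric_df`.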

Key Feature: Real-Value Substitution:

After generating synthetic numeric values using the copula model, the component performs a post-processing step to increase realism and interpretability. Each synthetic numeric value is replaced by the closest real value found in the corresponding column of the original dataset.

This ensures that every value in the synthetic data is one that was actually observed in the corresponding column, so it falls within the observed range and is a realistic, domain-valid entry. The synthetic data thus maintains the original format and semantics, while its rows are newly sampled combinations of values rather than copies of real records.
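
The substitution step can be sketched for a single column as a nearest-neighbour lookup in the sorted observed values. This is a minimal illustration using only the standard library, not the component's own code. The function name `snap_to_observed` is hypothetical, and resolving ties toward the lower neighbour is an assumption, since the description does not say how ties are handled.

```python
from bisect import bisect_left


def snap_to_observed(synthetic_values, observed_values):
    """Replace each synthetic value with the closest observed value."""
    pool = sorted(set(observed_values))
    if not pool:
        raise ValueError("observed_values must not be empty")
    snapped = []
    for value in synthetic_values:
        i = bisect_left(pool, value)
        if i == 0:
            # Below the observed range: clamp to the minimum.
            snapped.append(pool[0])
        elif i == len(pool):
            # Above the observed range: clamp to the maximum.
            snapped.append(pool[-1])
        else:
            lower, upper = pool[i - 1], pool[i]
            # Ties go to the lower neighbour (an arbitrary choice).
            snapped.append(lower if value - lower <= upper - value else upper)
    return snapped


print(snap_to_observed([0.2, 2.6, 9.0, 4.0], [1, 2, 3, 5]))  # [1, 3, 5, 3]
```

Because every output is drawn from `observed_values`, values outside the observed range are clamped to the column minimum or maximum.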

How it works:

  1. The input KNIME table is read and separated into numeric and string columns.

  2. A Gaussian multivariate copula is fitted to the numeric portion using the selected univariate distribution.

  3. Synthetic samples are drawn from the fitted copula model.

  4. Each synthetic value is replaced by the closest real value from the original column.

  5. The final synthetic dataset is output in the same format as the original.
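
The five steps above can be sketched end to end with the standard library alone. This is an illustration under stated assumptions, not the component's implementation: it replaces the copulas library with a hand-written Gaussian copula that uses Gaussian marginals (the analogue of the GaussianUnivariate option), represents the table as a list of dicts, and assumes at least two rows and no constant or perfectly collinear numeric columns. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def generate(rows, sample_size, seed=None):
    """Generate synthetic rows from a list of dicts (one dict per row)."""
    rng = random.Random(seed)
    first = rows[0]
    # Step 1: separate numeric columns from the first string column.
    numeric = [c for c, v in first.items()
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    id_column = next((c for c, v in first.items() if isinstance(v, str)), None)
    columns = {c: [r[c] for r in rows] for c in numeric}
    # Step 2: fit Gaussian marginals, map each column to normal scores
    # (for a Gaussian marginal this is plain standardisation), and
    # estimate the copula correlation matrix from those scores.
    marginals = {c: NormalDist.from_samples(columns[c]) for c in numeric}
    scores = {c: [(x - marginals[c].mean) / marginals[c].stdev
                  for x in columns[c]] for c in numeric}
    corr = [[correlation(scores[a], scores[b]) for b in numeric]
            for a in numeric]
    factor = cholesky(corr)
    synthetic = []
    for n in range(1, sample_size + 1):
        # Step 3: draw correlated standard normals, map back to data scale.
        noise = [rng.gauss(0.0, 1.0) for _ in numeric]
        z = [sum(factor[i][k] * noise[k] for k in range(i + 1))
             for i in range(len(numeric))]
        row = {}
        for c, zc in zip(numeric, z):
            value = marginals[c].mean + marginals[c].stdev * zc
            # Step 4: replace with the closest real value in the column.
            row[c] = min(columns[c], key=lambda real: abs(real - value))
        if id_column is not None:
            row[id_column] = f"Synthetic_{n}"
        # Step 5: emit columns in the original order.
        synthetic.append({c: row[c] for c in first if c in row})
    return synthetic
```

With other marginals (for example a kernel density estimate), only the two mappings between data scale and normal scores change; the correlation and sampling steps stay the same.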

Use Cases:

  • Data anonymization and privacy preservation

  • Machine learning model testing

  • Prototyping with realistic mock data

  • Exploring data distributions without exposing sensitive records

Requirements:

  • Python environment with the copulas library installed

Component details

Input ports
  1. Port 1 (Type: Table)
Output ports
  1. Port 1 (Type: Table)

External resources

  • Workflow Example

Used extensions & nodes

Created with KNIME Analytics Platform version 5.4.2
  • KNIME Base nodes (trusted extension), KNIME AG, Zurich, Switzerland, version 5.4.1
  • KNIME Python Integration (trusted extension), KNIME AG, Zurich, Switzerland, version 5.4.1
  • KNIME Quick Forms (trusted extension), KNIME AG, Zurich, Switzerland, version 5.4.1

