
Data Reduction

Version 1.0 (latest), created on Jun 9, 2025, 5:45 PM

The University of Saskatchewan

Ph.D. in Interdisciplinary Studies

Created by: Carlos Enrique Diaz, MBM, B.Eng.

Email: carlos.diaz@usask.ca

Supervisor: Lori Bradford, Ph.D.

Email: lori.bradford@usask.ca

Description

The Data Reduction component reduces the size of a dataset, focusing on removing duplicate values while preserving the original distribution's shape and density. It is particularly useful for tasks such as visualization, lightweight prototyping, or downstream sampling when working with large datasets generated through bootstrapping or processes that involve duplication.

Configuration Options

  • Column selection: Select the numeric column to which the reduction is applied.

  • Percentage after reduction: Define the target percentage of rows to retain.

  • Significance level (alpha): Used to evaluate whether the reduced distribution remains statistically similar to the original.
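
As a concrete illustration, the three settings correspond to values along these lines (the dictionary keys and the column name are placeholders for this example, not the dialog's internal option names):

    # Illustrative configuration only; keys and column name are made up for this sketch
    config = {
        "column": "measurement",            # numeric column to reduce
        "percentage_after_reduction": 30,   # retain roughly 30 % of the rows
        "alpha": 0.05,                      # significance level for the KS similarity check
    }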

How It Works

  1. Histogram Binning: The component uses the Freedman-Diaconis rule (bin width = 2 × IQR / n^(1/3)) to determine the optimal bin width and number of bins; a small code sketch of steps 1–3 follows this list.

  2. Density-Based Reduction: For each bin, a target count is calculated based on bin density and the overall reduction percentage target. Duplicated values are prioritized for removal to increase variability. If no duplicates are present, the reduction is applied uniformly across the bin.

  3. Statistical Validation: The component uses the Kolmogorov-Smirnov (KS) Test, which provides:

    • A hypothesis test: whether the original and reduced samples are drawn from the same distribution.

    • A KS statistic: which serves as a divergence metric between the empirical distributions, useful alongside other measures like Jensen-Shannon divergence.

  4. Normalized Comparison: Frequency distributions are normalized to probability densities before statistical testing. This improves reliability, especially when comparing datasets of different sizes.
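
Taken together, steps 1–3 can be sketched in a few lines of Python. This is an illustration only, not the component's internal code: the names (freedman_diaconis_bins, reduce_by_bins, target_pct) are invented, and the validation here applies SciPy's two-sample KS test directly to the raw samples, whereas the component normalizes binned frequencies to densities before testing (step 4).

    import numpy as np
    import pandas as pd
    from scipy import stats

    def freedman_diaconis_bins(x: np.ndarray) -> int:
        """Number of bins from the Freedman-Diaconis rule: width = 2*IQR / n^(1/3)."""
        iqr = np.subtract(*np.percentile(x, [75, 25]))
        width = 2 * iqr / len(x) ** (1 / 3)
        if width == 0:
            return 1
        return max(1, int(np.ceil((x.max() - x.min()) / width)))

    def reduce_by_bins(values: pd.Series, target_pct: float, seed: int = 42) -> pd.Series:
        """Keep roughly target_pct of the rows in each bin, dropping duplicated values first."""
        rng = np.random.default_rng(seed)
        x = values.to_numpy()
        n_bins = freedman_diaconis_bins(x)
        edges = np.linspace(x.min(), x.max(), n_bins + 1)
        bin_idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
        kept_parts = []
        for b in range(n_bins):
            members = values[bin_idx == b]
            if members.empty:
                continue
            target = max(1, int(round(len(members) * target_pct)))
            uniques = members.drop_duplicates()
            if len(uniques) >= target:
                # Enough unique values: duplicates are removed first, then uniques are thinned
                kept = uniques.sample(n=target, random_state=rng)
            else:
                # Keep every unique value and top up with randomly chosen duplicates
                dupes = members[members.duplicated()]
                kept = pd.concat([uniques, dupes.sample(n=target - len(uniques), random_state=rng)])
            kept_parts.append(kept)
        return pd.concat(kept_parts)

    # Example: reduce a bootstrapped-looking sample (many duplicates) to ~30 % of its rows
    original = pd.Series(np.random.default_rng(0).normal(size=5000)).round(1)
    reduced = reduce_by_bins(original, target_pct=0.30)

    # Statistical validation: two-sample KS test at alpha = 0.05
    ks_stat, p_value = stats.ks_2samp(original, reduced)
    print(f"KS statistic = {ks_stat:.4f}, p-value = {p_value:.4f}, similar = {p_value > 0.05}")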

Output Ports

  • Port 1 – Reduced Data (Duplicate-aware): The reduced dataset produced by the component using a bin-aware strategy that prioritizes the removal of duplicate values.

  • Port 2 – KS Test Results (Duplicate-aware Output): Results of the Kolmogorov-Smirnov test comparing the original density distribution to the reduced dataset density distribution from Port 1.

  • Port 3 – Reduced Data (Row Sampling Baseline): A reduced dataset generated using simple random row sampling with the same target percentage. Serves as a baseline for performance comparison.

  • Port 4 – KS Test Results (Row Sampling Output): Kolmogorov-Smirnov test results comparing the original density distribution to the randomly reduced dataset density distribution from Port 3.

Flow Variables

  • check_value: The minimum count remaining in any bin after reduction (0 means at least one bin became empty).

  • bin_number: The number of bins computed via Freedman-Diaconis.
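
Inside a KNIME Python Script node, these two variables could be published roughly as follows (a minimal sketch assuming the knime.scripting.io API of KNIME 5.x; bin_idx_reduced and n_bins stand in for values produced by the reduction step sketched above):

    import numpy as np
    import knime.scripting.io as knio

    # Stand-in values: the bin index of every retained row and the Freedman-Diaconis bin count
    n_bins = 12
    bin_idx_reduced = np.random.default_rng(0).integers(0, n_bins, size=300)

    bin_counts = np.bincount(bin_idx_reduced, minlength=n_bins)

    # 0 means at least one bin lost all of its rows during the reduction
    knio.flow_variables["check_value"] = int(bin_counts.min())
    # number of bins computed via the Freedman-Diaconis rule
    knio.flow_variables["bin_number"] = int(n_bins)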

Interactive View (F10)

Displays three histograms:

  • Top: Original data.

  • Bottom Left: Reduced data with duplicate-aware strategy.

  • Bottom Right: Reduced data using random row sampling.

Histograms use a density heatmap (blue to red) to visually assess how closely each reduced sample matches the original distribution.
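
A plain matplotlib approximation of that layout might look like the sketch below. It is not the component's actual view code; it reuses the illustrative original and reduced Series from the earlier example and adds a random-row baseline for the third panel.

    import matplotlib.pyplot as plt
    import numpy as np

    def density_hist(ax, data, bins, title):
        """Histogram whose bars are coloured by relative density, blue (low) to red (high)."""
        counts, _, patches = ax.hist(data, bins=bins, density=True)
        for count, patch in zip(counts, patches):
            patch.set_facecolor(plt.cm.coolwarm(count / counts.max()))
        ax.set_title(title)

    row_sampled = original.sample(frac=0.30, random_state=0)   # simple random-row baseline
    edges = np.histogram_bin_edges(original, bins="fd")        # Freedman-Diaconis edges

    fig, axes = plt.subplot_mosaic([["top", "top"], ["left", "right"]], figsize=(10, 6))
    density_hist(axes["top"], original, edges, "Original data")
    density_hist(axes["left"], reduced, edges, "Duplicate-aware reduction")
    density_hist(axes["right"], row_sampled, edges, "Random row sampling")
    plt.tight_layout()
    plt.show()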

Note on Row Sampling Comparison

The Row Sampling node (represented by Port 3) serves as a strong baseline when evaluating the effectiveness of the duplicate-aware, bin-based reduction method (Port 1). In some cases, random sampling achieves better KS test results—especially with well-distributed data and no duplicates. However, in other scenarios, the strategy used in Port 1 outperforms Row Sampling by better preserving local density, unique value diversity, and bin structure, as reflected in both the KS statistic and the visual histograms.
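
One way to run that comparison is to put the two KS results side by side (again a sketch, reusing the illustrative reduce_by_bins and original from the example above):

    from scipy import stats

    duplicate_aware = reduce_by_bins(original, target_pct=0.30)   # Port 1 strategy
    row_sampling = original.sample(frac=0.30, random_state=0)     # Port 3 baseline

    ks_dup = stats.ks_2samp(original, duplicate_aware)
    ks_rows = stats.ks_2samp(original, row_sampling)
    print(f"Duplicate-aware: D = {ks_dup.statistic:.4f}, p = {ks_dup.pvalue:.4f}")
    print(f"Row sampling:    D = {ks_rows.statistic:.4f}, p = {ks_rows.pvalue:.4f}")
    # The smaller KS statistic (and larger p-value) indicates the closer match to the original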

Recommendation

Use Port 1 when you suspect the presence of duplicate values or when maintaining visual and statistical fidelity to the original distribution is essential. Otherwise, compare the results of Ports 1 and 3 and choose the one that best suits your specific use case.

Component details

Input ports
  1. Port 1 (Table): no description available
Output ports
  1. Port 1 (Table): no description available
  2. Port 2 (Table): no description available
  3. Port 3 (Table): no description available
  4. Port 4 (Table): no description available

Used extensions & nodes

Created with KNIME Analytics Platform version 5.4.2
  • KNIME Base nodes (trusted extension), KNIME AG, Zurich, Switzerland, version 5.4.1
  • KNIME Expressions (trusted extension), KNIME AG, Zurich, Switzerland, version 5.4.1
  • KNIME Javasnippet (trusted extension), KNIME AG, Zurich, Switzerland, version 5.4.0
  • KNIME Python Integration (trusted extension), KNIME AG, Zurich, Switzerland, version 5.4.1
  • KNIME Quick Forms (trusted extension), KNIME AG, Zurich, Switzerland, version 5.4.1
  • KNIME Statistics Nodes (Labs) (trusted extension), KNIME AG, Zurich, Switzerland, version 5.4.0

This component does not have nested components or related workflows.

Legal

By using or downloading the component, you agree to our terms and conditions.
