The University of Saskatchewan
Ph.D. in Interdisciplinary Studies
Created by: Carlos Enrique Diaz, MBM, B.Eng.
Email: carlos.diaz@usask.ca
Supervisor: Lori Bradford, Ph.D.
Email: lori.bradford@usask.ca
Description
The Data Reduction component reduces the size of a dataset, focusing on removing duplicate values while preserving the original distribution's shape and density. It is particularly useful for tasks such as visualization, lightweight prototyping, or downstream sampling when working with large datasets generated through bootstrapping or processes that involve duplication.
Configuration Options
Column selection: Select the numeric column to which the reduction is applied.
Percentage after reduction: Define the target percentage of rows to retain.
Significance level alpha: The significance level used to test whether the reduced distribution remains statistically similar to the original.
How It Works
Histogram Binning: The component uses the Freedman-Diaconis rule to determine the optimal bin width and number of bins.
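As a rough illustration (not the component's internal code), the Freedman-Diaconis rule sets the bin width to 2·IQR/n^(1/3) and derives the bin count from the data range; NumPy also ships the rule directly via bins="fd":

```python
import numpy as np

def freedman_diaconis(values):
    """Bin width h = 2*IQR / n^(1/3); bin count covers the data range."""
    values = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    h = 2 * (q75 - q25) / len(values) ** (1 / 3)
    if h == 0:
        # Degenerate case (zero IQR): fall back to a single bin.
        return 1, float(values.max() - values.min()) or 1.0
    n_bins = int(np.ceil((values.max() - values.min()) / h))
    return max(n_bins, 1), h

rng = np.random.default_rng(0)
data = rng.normal(size=10_000)
n_bins, bin_width = freedman_diaconis(data)

# NumPy's built-in equivalent of the same rule:
edges = np.histogram_bin_edges(data, bins="fd")
```

The hand-rolled function and NumPy's "fd" option use the same formula, so their bin counts agree on well-behaved data.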
Density-Based Reduction: For each bin, a target count is calculated based on bin density and the overall reduction percentage target. Duplicated values are prioritized for removal to increase variability. If no duplicates are present, the reduction is applied uniformly across the bin.
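A simplified sketch of the per-bin logic described above (the helper name and details are hypothetical, not the component's exact code): keep every distinct value first, and only thin uniformly once duplicates are exhausted.

```python
import pandas as pd

def reduce_bin(bin_values, keep_fraction):
    """Keep roughly keep_fraction of a bin, dropping duplicates first."""
    s = pd.Series(bin_values)
    target = max(1, int(round(len(s) * keep_fraction)))
    uniques = s.drop_duplicates()  # one copy of each distinct value
    if len(uniques) >= target:
        # No duplicates need sacrificing: thin uniformly across the bin.
        return uniques.sample(n=target, random_state=0)
    # Keep all unique values, then refill with a uniform sample of the rest.
    extras = s.drop(index=uniques.index).sample(
        n=target - len(uniques), random_state=0)
    return pd.concat([uniques, extras])

bin_data = [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0]
reduced = reduce_bin(bin_data, keep_fraction=0.5)
```

With the sample bin above, halving seven rows keeps four, and all three duplicate copies of 1.0 are removed before any unique value is touched.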
Statistical Validation: The component uses the Kolmogorov-Smirnov (KS) Test, which provides:
A hypothesis test: whether the original and reduced samples are drawn from the same distribution.
A KS statistic: which serves as a divergence metric between the empirical distributions, useful alongside other measures like Jensen-Shannon divergence.
Normalized Comparison: Frequency distributions are normalized to probability densities before statistical testing. This improves reliability, especially when comparing datasets of different sizes.
Output Ports
Port 1 – Reduced Data (Duplicate-aware): The reduced dataset produced by the component using a bin-aware strategy that prioritizes the removal of duplicate values.
Port 2 – KS Test Results (Duplicate-aware Output): Results of the Kolmogorov-Smirnov test comparing the original density distribution to the reduced dataset density distribution from Port 1.
Port 3 – Reduced Data (Row Sampling Baseline): A reduced dataset generated using simple random row sampling with the same target percentage. Serves as a baseline for performance comparison.
Port 4 – KS Test Results (Row Sampling Output): Kolmogorov-Smirnov test results comparing the original density distribution to the randomly reduced dataset density distribution from Port 3.
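The Port 3 baseline amounts to plain random row sampling at the same target percentage; in pandas that is essentially a one-liner (illustrative only, with made-up data and variable names):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"value": rng.normal(size=2_000)})

percentage_after_reduction = 25  # same target the component applies
baseline = df.sample(frac=percentage_after_reduction / 100, random_state=7)
```

Because each row is kept or dropped independently of its bin, this baseline ignores local density and duplicates, which is exactly what the duplicate-aware strategy on Port 1 tries to improve on.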
Flow Variables
check_value: The minimum count remaining in any bin after reduction (0 means at least one bin became empty).
bin_number: The number of bins computed via Freedman-Diaconis.
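Both flow variables can be derived from a single histogram pass over the reduced column; a minimal sketch, assuming the bins are recomputed with the Freedman-Diaconis rule as described above (variable names chosen to mirror the flow variables):

```python
import numpy as np

reduced_column = np.array([0.1, 0.4, 0.45, 0.9, 0.95])

# Freedman-Diaconis bin edges, then per-bin counts.
edges = np.histogram_bin_edges(reduced_column, bins="fd")
counts, _ = np.histogram(reduced_column, bins=edges)

bin_number = len(counts)
check_value = int(counts.min())  # 0 => at least one bin emptied out
```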
Interactive View (F10)
Displays three histograms:
Top: Original data.
Bottom Left: Reduced data with duplicate-aware strategy.
Bottom Right: Reduced data using random row sampling.
Histograms use a density heatmap (blue to red) to visually assess how closely each reduced sample matches the original distribution.
Note on Row Sampling Comparison
The row sampling baseline (Port 3) serves as a strong reference point when evaluating the effectiveness of the duplicate-aware, bin-based reduction method (Port 1). In some cases, random sampling achieves better KS test results, especially with well-distributed data and no duplicates. In other scenarios, the Port 1 strategy outperforms row sampling by better preserving local density, unique-value diversity, and bin structure, as reflected in both the KS statistic and the visual histograms.
Recommendation
Use Port 1 when you suspect the presence of duplicate values or when maintaining visual and statistical fidelity to the original distribution is essential. Otherwise, compare the results of Ports 1 and 3 and choose the one that best suits your specific use case.