The University of Saskatchewan
Ph.D. in Interdisciplinary Studies
Created by: Carlos Enrique Diaz, MBM, B.Eng.
Email: carlos.diaz@usask.ca
Supervisor: Lori Bradford, Ph.D.
Email: lori.bradford@usask.ca
Description
The Data Reduction component reduces the size of a dataset, focusing on removing duplicate values while preserving the original distribution's shape and density. It is particularly useful for tasks such as visualization, lightweight prototyping, or downstream sampling when working with large datasets generated through bootstrapping or processes that involve duplication.
Configuration Options
Column selection: Select the numeric column to which the reduction is applied.
Percentage after reduction: Define the target percentage of rows to retain.
Significance level alpha: The significance level used to test whether the reduced distribution remains statistically similar to the original.
How It Works
Histogram Binning: The component uses the Freedman-Diaconis rule to determine the optimal bin width and number of bins.
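As a rough illustration (not the component's internal code), the Freedman-Diaconis rule sets the bin width to 2·IQR/n^(1/3) and derives the bin count from the data range; NumPy also ships the rule directly via bins="fd":

```python
import numpy as np

def freedman_diaconis(values):
    """Bin width h = 2*IQR / n^(1/3); bin count covers the data range."""
    values = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    h = 2 * (q75 - q25) / len(values) ** (1 / 3)
    if h == 0:
        # Degenerate case (zero IQR): fall back to a single bin.
        return 1, float(values.max() - values.min()) or 1.0
    n_bins = int(np.ceil((values.max() - values.min()) / h))
    return max(n_bins, 1), h

rng = np.random.default_rng(0)
data = rng.normal(size=10_000)
n_bins, bin_width = freedman_diaconis(data)

# NumPy's built-in equivalent of the same rule:
edges = np.histogram_bin_edges(data, bins="fd")
```

The hand-rolled function and NumPy's "fd" option use the same formula, so their bin counts agree on well-behaved data.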
Density-Based Reduction: For each bin, a target count is calculated based on bin density and the overall reduction percentage target. Duplicated values are prioritized for removal to increase variability. If no duplicates are present, the reduction is applied uniformly across the bin.
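A simplified sketch of the per-bin logic described above (the helper name and details are hypothetical, not the component's exact code): keep every distinct value first, and only thin uniformly once duplicates are exhausted.

```python
import pandas as pd

def reduce_bin(bin_values, keep_fraction):
    """Keep roughly keep_fraction of a bin, dropping duplicates first."""
    s = pd.Series(bin_values)
    target = max(1, int(round(len(s) * keep_fraction)))
    uniques = s.drop_duplicates()  # one copy of each distinct value
    if len(uniques) >= target:
        # No duplicates need sacrificing: thin uniformly across the bin.
        return uniques.sample(n=target, random_state=0)
    # Keep all unique values, then refill with a uniform sample of the rest.
    extras = s.drop(index=uniques.index).sample(
        n=target - len(uniques), random_state=0)
    return pd.concat([uniques, extras])

bin_data = [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0]
reduced = reduce_bin(bin_data, keep_fraction=0.5)
```

With the sample bin above, halving seven rows keeps four, and all three duplicate copies of 1.0 are removed before any unique value is touched.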
Statistical Validation: The component uses the Kolmogorov-Smirnov (KS) Test, which provides:
A hypothesis test: whether the original and reduced samples are drawn from the same distribution.
A KS statistic: which serves as a divergence metric between the empirical distributions, useful alongside other measures like Jensen-Shannon divergence.
Normalized Comparison: Frequency distributions are normalized to probability densities before statistical testing. This improves reliability, especially when comparing datasets of different sizes.
Output Ports
Port 1 – Reduced Data (Duplicate-aware): The reduced dataset produced by the component using a bin-aware strategy that prioritizes the removal of duplicate values.
Port 2 – KS Test Results (Duplicate-aware Output): Results of the Kolmogorov-Smirnov test comparing the original density distribution to the reduced dataset density distribution from Port 1.
Port 3 – Reduced Data (Row Sampling Baseline): A reduced dataset generated using simple random row sampling with the same target percentage. Serves as a baseline for performance comparison.
Port 4 – KS Test Results (Row Sampling Output): Kolmogorov-Smirnov test results comparing the original density distribution to the randomly reduced dataset density distribution from Port 3.
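The Port 3 baseline amounts to plain random row sampling at the same target percentage; in pandas that is essentially a one-liner (illustrative only, with made-up data and variable names):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"value": rng.normal(size=2_000)})

percentage_after_reduction = 25  # same target the component applies
baseline = df.sample(frac=percentage_after_reduction / 100, random_state=7)
```

Because each row is kept or dropped independently of its bin, this baseline ignores local density and duplicates, which is exactly what the duplicate-aware strategy on Port 1 tries to improve on.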
Flow Variables
check_value: The minimum count remaining in any bin after reduction (0 means at least one bin became empty).
bin_number: The number of bins computed via Freedman-Diaconis.
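Both flow variables can be derived from a single histogram pass over the reduced column; a minimal sketch, assuming the bins are recomputed with the Freedman-Diaconis rule as described above (variable names chosen to mirror the flow variables):

```python
import numpy as np

reduced_column = np.array([0.1, 0.4, 0.45, 0.9, 0.95])

# Freedman-Diaconis bin edges, then per-bin counts.
edges = np.histogram_bin_edges(reduced_column, bins="fd")
counts, _ = np.histogram(reduced_column, bins=edges)

bin_number = len(counts)
check_value = int(counts.min())  # 0 => at least one bin emptied out
```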
Interactive View (F10)
Displays three histograms:
Top: Original data.
Bottom Left: Reduced data with duplicate-aware strategy.
Bottom Right: Reduced data using random row sampling.
Histograms use a density heatmap (blue to red) to visually assess how closely each reduced sample matches the original distribution.
Note on Row Sampling Comparison
The row sampling baseline (Port 3) serves as a strong reference point when evaluating the effectiveness of the duplicate-aware, bin-based reduction method (Port 1). In some cases, random sampling achieves better KS test results, especially with well-distributed data and no duplicates. In other scenarios, the Port 1 strategy outperforms row sampling by better preserving local density, unique-value diversity, and bin structure, as reflected in both the KS statistic and the visual histograms.
Recommendation
Use Port 1 when you suspect the presence of duplicate values or when maintaining visual and statistical fidelity to the original distribution is essential. Otherwise, compare the results of Ports 1 and 3 and choose the one that best suits your specific use case.