🧹 Cleaning Noisy Categories for ML
This workflow demonstrates how to clean categorical labels before training a machine learning model.
Real-world datasets often contain inconsistent or misspelled category values (e.g., Logiystics, Eduzcation, Healthcar). If used directly, each misspelling is treated as its own category, which fragments the data and can reduce model accuracy.
🔑 Steps in this workflow:
📂 Load Product Sales Data: dataset with features Units Sold, Purchase Probability, Sales Channel, and a noisy Category column.
🏷️ Reference Category Labels: define the valid set of canonical categories (Electronics, Logistics, Education, Healthcare, Finance).
🔍 Approximate String Matcher: apply Levenshtein distance to align each noisy category value with its closest valid label.
✅ Result: A cleaned dataset where all category labels are consistent and ML-ready.
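
The matching step above can be sketched in plain Python. This is a minimal illustration, not the workflow's own implementation: the function names `levenshtein` and `closest_category` and the `max_distance` cutoff are assumptions introduced here, and the noisy values are the examples quoted earlier.

```python
from typing import Optional, Sequence

# Canonical categories listed in the workflow description.
REFERENCE = ["Electronics", "Logistics", "Education", "Healthcare", "Finance"]


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def closest_category(
    value: str,
    reference: Sequence[str] = REFERENCE,
    max_distance: int = 2,  # hypothetical cutoff; tune for your data
) -> Optional[str]:
    """Return the nearest canonical label, or None if nothing is close enough."""
    cleaned = value.strip().lower()
    best = min(reference, key=lambda ref: levenshtein(cleaned, ref.lower()))
    if levenshtein(cleaned, best.lower()) <= max_distance:
        return best
    return None


noisy = ["Logiystics", "Eduzcation", "Healthcar", "Electronics"]
cleaned = [closest_category(v) for v in noisy]
print(cleaned)
```

Returning `None` for values beyond the cutoff keeps unrelated strings from being silently forced into a valid category; such rows can then be reviewed by hand or dropped before training.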