Explainable Fuzzy Matching on Payee Data

This workflow demonstrates how to use Approximate String Matching to reconcile noisy, user-entered payee names with a clean reference list of canonical entities. Beyond generating similarity scores, the workflow provides explainable error statistics to highlight where and how mismatches occur.

🔹 Steps in the Workflow

📂 Load Data
- Reference Data: clean list of canonical payee names.
- Payee Data with Typos: noisy, real-world names entered by users.
🔍 Approximate String Matching (Levenshtein)
- Matches each entered payee name against the reference list.
- Produces a Match Sequence (e.g., oooo=ooo=ox=+) that explains differences character by character:
  - o → match
  - = → substitution (wrong character in the entered name)
  - + → insertion (extra character in the entered name)
  - x → deletion (character missing from the entered name)
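
The match sequence above can be reproduced with a standard Levenshtein table followed by a backtrace. The following is a minimal sketch, not the workflow's actual implementation: the function name `match_sequence` is hypothetical, and it assumes that insertions and deletions are read relative to the entered name (an extra or missing character compared with the reference). When several alignments have the same cost, the tie-breaking here (prefer match/substitution, then insertion, then deletion) may differ from what the matching step produces.

```python
def match_sequence(reference: str, entered: str) -> tuple[int, str]:
    """Return (edit distance, match sequence) for entered vs. reference.

    Symbols: o = match, = = substitution, + = insertion, x = deletion.
    Assumption: insertion/deletion describe the entered name.
    """
    n, m = len(reference), len(entered)
    # dist[i][j] = edit distance between reference[:i] and entered[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == entered[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i][j - 1] + 1,         # insertion (extra entered char)
                dist[i - 1][j] + 1,         # deletion (missing entered char)
            )

    # Walk back from the bottom-right cell to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == entered[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops.append("o" if cost == 0 else "=")
                i, j = i - 1, j - 1
                continue
        if j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ops.append("+")
            j -= 1
        else:
            ops.append("x")
            i -= 1
    return dist[n][m], "".join(reversed(ops))


if __name__ == "__main__":
    # Hypothetical typo: the "a" in "Bank" is missing from the entered name.
    print(match_sequence("Deutsche Bank AG", "Deutsche Bnk AG"))
```

The sequence has one symbol per aligned position, so its length equals the number of matches plus the edit distance.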
🧮 Error Type Analysis
- Counts substitutions, insertions, deletions, and matches.
- Calculates error ratios, edit distance, and match accuracy.
- Provides explainable quality metrics for each match.
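
As a rough illustration of this step, the counts can be read directly off a match sequence. The sketch below is an assumption about how the metrics might be defined, not the workflow's exact formulas: each error ratio is taken as the symbol count divided by the sequence length, edit distance as the sum of substitutions, insertions, and deletions, and match accuracy as matches divided by the sequence length. The function name `error_profile` and the dictionary keys are hypothetical.

```python
from collections import Counter


def error_profile(sequence: str) -> dict:
    """Summarize a match sequence (o, =, +, x) into counts and ratios.

    Assumed definitions: ratio = count / len(sequence),
    accuracy = matches / len(sequence).
    """
    counts = Counter(sequence)
    matches = counts["o"]
    substitutions = counts["="]
    insertions = counts["+"]
    deletions = counts["x"]
    total = len(sequence)

    def ratio(count: int) -> float:
        # Guard against an empty sequence (two empty strings compared).
        return count / total if total else 0.0

    return {
        "matches": matches,
        "substitutions": substitutions,
        "insertions": insertions,
        "deletions": deletions,
        "edit_distance": substitutions + insertions + deletions,
        "substitution_ratio": ratio(substitutions),
        "insertion_ratio": ratio(insertions),
        "deletion_ratio": ratio(deletions),
        "accuracy": ratio(matches),
    }


if __name__ == "__main__":
    # The example sequence from the matching step: 8 matches,
    # 3 substitutions, 1 insertion, 1 deletion over 13 positions.
    print(error_profile("oooo=ooo=ox=+"))
```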
📊 Aggregation & Statistics
- Groups results by reference payee.
- Computes the average error profile per entity (e.g., “Deutsche Bank AG entries often miss characters”).
- Rounds and formats values for readability.
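
The grouping and averaging could look roughly like the sketch below. It uses only the Python standard library, and everything in it is illustrative: the function name `average_profiles`, the row keys, and the sample rows are hypothetical and not taken from the workflow's data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical metric names; one value per matched entry.
METRICS = ("substitution_ratio", "insertion_ratio", "deletion_ratio", "accuracy")


def average_profiles(rows: list[dict], digits: int = 2) -> dict:
    """Group per-match metrics by reference payee and average them."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["reference"]].append(row)
    return {
        reference: {
            metric: round(mean(r[metric] for r in group), digits)
            for metric in METRICS
        }
        for reference, group in grouped.items()
    }


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = [
        {"reference": "Deutsche Bank AG", "substitution_ratio": 0.0,
         "insertion_ratio": 0.0, "deletion_ratio": 0.5, "accuracy": 0.5},
        {"reference": "Deutsche Bank AG", "substitution_ratio": 0.0,
         "insertion_ratio": 0.0, "deletion_ratio": 0.0, "accuracy": 1.0},
    ]
    for reference, profile in average_profiles(rows).items():
        print(reference, profile)
```

An entity whose average deletion ratio dominates the other error ratios is one whose entries tend to miss characters, which is the kind of per-entity statement this step is meant to support.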
📈 Interactive Dashboard
- Table of canonical payees with their average match accuracy.
- Bar chart showing the distribution of error types (substitution, insertion, deletion).
- Clear insights into where manual review may be needed and which vendors/customers are most error-prone.

🔹 Business Value

- Data Quality Monitoring → Understand how user-entered names deviate from reference data.
- Explainable Matching → Not just similarity scores, but insights into why mismatches occur.
- Operational Efficiency → Identify entities requiring frequent manual corrections.
- Compliance Support → Improve accuracy for KYC, AML, and financial reconciliation tasks.