Cluster Analysis of Binary Data
The University of Saskatchewan
Ph.D. in Interdisciplinary Studies
Created by: Carlos Enrique Diaz, MBM, B.Eng.
Email: carlos.diaz@usask.ca
Supervisor: Lori Bradford, Ph.D.
Email: lori.bradford@usask.ca
This workflow begins by converting binary categorical data into a numerical 0/1 format so that it can be clustered with the k-Medoids algorithm using Manhattan distance, which for 0/1 vectors equals the number of attributes on which two records differ. k-Means is unsuitable for this type of data because its cluster centers are averages, and the average of binary values is not a meaningful category. The Silhouette Coefficient is evaluated visually across candidate values to determine the optimal number of clusters (k). Additionally, a novel approach is introduced to estimate the maximum value of k by identifying the preset value k = n at which the k-Medoids algorithm produces fewer clusters than the number specified. The analysis also includes a co-occurrence examination of the categorical values.
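The steps described above can be illustrated with a minimal sketch. It is not the workflow's actual implementation: the toy records, the helper names (`manhattan_matrix`, `init_medoids`, `k_medoids`, `mean_silhouette`, `max_useful_k`), the deterministic farthest-first initialization, and the simple alternating medoid update are all assumptions made for illustration, using only NumPy. The maximum-k check shown is one plausible reading of the idea: increase k until the algorithm leaves at least one cluster empty.

```python
import numpy as np

# Hypothetical toy data: six records, four Yes/No attributes.
records = [
    ["Yes", "Yes", "No", "No"],
    ["Yes", "Yes", "No", "No"],
    ["Yes", "Yes", "Yes", "No"],
    ["No", "No", "Yes", "Yes"],
    ["No", "No", "Yes", "Yes"],
    ["No", "Yes", "Yes", "Yes"],
]
# Encode the binary categories numerically: Yes -> 1, No -> 0.
X = np.array([[1 if v == "Yes" else 0 for v in row] for row in records])

def manhattan_matrix(X):
    """Pairwise Manhattan distances; for 0/1 rows this counts mismatches."""
    return np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

def init_medoids(D, k):
    """Assumed initialization: most central point, then farthest-first."""
    medoids = [int(np.argmin(D.sum(axis=1)))]
    while len(medoids) < k:
        nearest = D[:, medoids].min(axis=1).astype(float)
        nearest[medoids] = -1.0  # never pick the same record twice
        medoids.append(int(np.argmax(nearest)))
    return np.array(medoids)

def k_medoids(D, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix."""
    medoids = init_medoids(D, k)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) > 0:  # an empty cluster keeps its old medoid
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)

def mean_silhouette(D, labels):
    """Mean Silhouette Coefficient from a precomputed distance matrix."""
    clusters = np.unique(labels)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for i in range(len(D)):
        same = labels == labels[i]
        if same.sum() == 1:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = D[i, same].sum() / (same.sum() - 1)
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return float(np.mean(scores))

def max_useful_k(D):
    """Largest k before k-medoids returns fewer non-empty clusters than k."""
    for k in range(2, len(D) + 1):
        _, labels = k_medoids(D, k)
        if len(np.unique(labels)) < k:
            return k - 1
    return len(D)

D = manhattan_matrix(X)
for k in range(2, max_useful_k(D) + 1):
    _, labels = k_medoids(D, k)
    print(k, labels.tolist(), round(mean_silhouette(D, labels), 3))

# Co-occurrence counts: entry (i, j) is the number of records in which
# attributes i and j are both "Yes".
cooccurrence = X.T @ X
print(cooccurrence)
```

On this toy data the k = 2 run separates the first three records from the last three, and the maximum-k check returns 4, which matches the number of distinct rows: once k exceeds it, a medoid that duplicates another record attracts no points. In practice the printed silhouette scores would be plotted against k for the visual evaluation.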