An algorithm for clustering categorical data with set-valued features
Authors: Fuyuan Cao, Joshua Zhexue Huang, Jiye Liang, Xingwang Zhao, Yinfeng Meng, Kai Feng, Yuhua Qian
Abstract:
In data mining, objects are often represented by a set of features, where each feature of an object has only one value. In reality, however, some features can take multiple values; for instance, a person may have several job titles, hobbies, and email addresses. Such features can be referred to as set-valued features, and existing data mining algorithms usually handle them by converting them to dummy features. In this paper, we propose the SV-k-modes algorithm, which clusters categorical data with set-valued features. In this algorithm, a distance function is defined between two objects with set-valued features, and a set-valued mode representation of cluster centers is proposed. We develop a heuristic method to update cluster centers in the iterative clustering process and an initialization algorithm to select the initial cluster centers. The convergence and complexity of the SV-k-modes algorithm are analyzed. Experiments are conducted on both synthetic data and real data from five different applications. The experimental results show that the SV-k-modes algorithm performs better on the real data than three other categorical clustering algorithms and that the algorithm is scalable to large data.
Keywords: Categorical data; set-valued feature; set-valued modes; SV-k-modes algorithm
Fri Dec 28 19:50:00 CST 2018
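
The abstract mentions a distance function between objects with set-valued features but does not state its form. As a minimal sketch only, the Python below assumes a per-feature Jaccard distance summed over features; this choice, along with the names `jaccard_distance`, `object_distance`, and `dummy_encode` and the sample objects, is an illustrative assumption and not taken from the paper. The last helper shows the dummy-feature encoding that the abstract describes as the usual workaround.

```python
from typing import FrozenSet, List, Sequence

# One object = one set of categorical values per feature (hypothetical layout).
SetValuedObject = Sequence[FrozenSet[str]]


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Assumed per-feature distance: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        # Two empty value sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def object_distance(x: SetValuedObject, y: SetValuedObject) -> float:
    """Sum the per-feature distances over all features."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of features")
    return sum(jaccard_distance(a, b) for a, b in zip(x, y))


def dummy_encode(values: FrozenSet[str], vocabulary: Sequence[str]) -> List[int]:
    """Dummy-feature view of one set-valued feature: one 0/1 column per value."""
    return [1 if v in values else 0 for v in vocabulary]


# Two people described by (job titles, hobbies); the data is made up.
alice = [frozenset({"engineer", "lecturer"}), frozenset({"chess"})]
bob = [frozenset({"engineer"}), frozenset({"chess", "golf"})]

print(object_distance(alice, bob))
print(dummy_encode(alice[0], ["engineer", "lecturer", "manager"]))
```

Working on the sets directly keeps one feature as one unit of comparison, whereas dummy encoding spreads a single feature across as many columns as it has distinct values.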