Robust Partitional Clustering by Outlier and Density

Robust Partitional Clustering by Outlier and Density Mohammad Al Hasan Vineet Chaoji Saeed Salem Mohammed Zaki The partitional clustering technique, k-means, is one of themost computationally efficient clustering methods. However, it produces a local optimal solution that strongly depends on its initial seeds. Bad initial seeds can also causethe splitting or merging of natural clusters even if the clusters are well separated. In this paper, we propose, ROBIN,a novel method for initial seed selection in k-means typesof algorithms. ROBIN is a deterministic and robust initialization method that is virtually insensitive to outliers inthe data, and it can also handle variable density or multi-scale clusters. Since it is deterministic, only one run suffices to obtain good initial seeds, as opposed to traditionalrandom seed selection approaches that need many runs toobtain good seeds that lead to satisfactory clustering. Wedid a comprehensive evaluation of ROBIN against state-of-the-art seeding methods on a wide range of synthetic andreal datasets. We show that ROBIN consistently outperforms existing approaches in terms of the clustering quality(squared error). Department of Computer Science, Rensselaer Polytechnic Institute cs-08-04

Robust Partitional Clustering by Outlier and Density

Mohammad Al Hasan

Vineet Chaoji

Saeed Salem

Mohammed Zaki

The partitional clustering technique, k-means, is one of themost computationally efficient clustering methods. However, it produces a local optimal solution that strongly depends on its initial seeds. Bad initial seeds can also causethe splitting or merging of natural clusters even if the clusters are well separated. In this paper, we propose, ROBIN,a novel method for initial seed selection in k-means typesof algorithms. ROBIN is a deterministic and robust initialization method that is virtually insensitive to outliers inthe data, and it can also handle variable density or multi-scale clusters. Since it is deterministic, only one run suffices to obtain good initial seeds, as opposed to traditionalrandom seed selection approaches that need many runs toobtain good seeds that lead to satisfactory clustering. Wedid a comprehensive evaluation of ROBIN against state-of-the-art seeding methods on a wide range of synthetic andreal datasets. We show that ROBIN consistently outperforms existing approaches in terms of the clustering quality(squared error).

Department of Computer Science, Rensselaer Polytechnic Institute

cs-08-04