Clusterability Detection and Initial Seed Selection in Large Data Sets
Scott Epter
Mukkai Krishnamoorthy
Mohammed Zaki
The need for a preliminary assessment of the clustering tendency, or clusterability, of massive data sets is well known. A good clusterability detection method should inform the decision of whether to cluster at all, and should also provide useful seed input to the chosen clustering algorithm. We present a framework for defining the clusterability of a data set from a distance-based perspective. We discuss a graph-based system for detecting clusterability and generating seed information, including an estimate of the value of k, the number of clusters in the data set, which is an input parameter to many distance-based clustering methods. The output of our method is tunable to accommodate a wide variety of clustering methods. We have conducted a number of experiments using our methodology with stock market data and with the well-known BIRCH data sets, in two as well as higher dimensions. Based on these experiments, we find that our methodology can serve as the basis for much future work in this area. We report our results and discuss promising future directions.
Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY
cs-99-06