Exploring similarities across high-dimensional datasets

Karlton Sequeira

Mohammed Zaki

Related data are often collected by a number of sources that may be unable to share their entire datasets for reasons such as confidentiality agreements or dataset size. However, these sources may be willing to share a condensed model of their datasets. If some substructure of the condensed models of such datasets from different sources is found to be unusually similar, policies successfully applied to one may be successfully applied to the others. In this paper, we propose a framework for constructing condensed models of datasets and algorithms to find similar substructure in these models. The algorithms are based on the tensor product and Gibbs sampling. We test our framework on synthetic datasets and compare our algorithms with an existing one. Finally, we apply it to two time-course microarray datasets of different dimensionality. The results are statistically more interesting than results obtained from independent analysis of the datasets.

Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY

cs-05-05