VOGUE: A Novel Variable Order-Gap State
Bouchra Bouqata
Christopher D. Carothers
Boleslaw K. Szymanski
Mohammed J. Zaki
Data Mining
Data Modeling
Hidden Markov Models
We present VOGUE, a new state machine that combines two separate techniques for modeling complex patterns in sequential data: data mining and data modeling. VOGUE relies on a novel Variable-Gap Sequence miner (VGS) to mine frequent patterns with different lengths and gaps between elements. It then uses these mined sequences to build the state machine. Moreover, we propose two variations of VOGUE: C-VOGUE, which further decreases the state space complexity of VOGUE by pruning frequent sequences that are artifacts of other primary frequent sequences; and K-VOGUE, which allows sequences to form the same frequent pattern even if their elements do not match exactly in all positions, provided that the differing elements share similar characteristics. We apply VOGUE to the task of protein sequence classification on real data from the PROSITE and SCOP protein families. We show that VOGUE's classification sensitivity outperforms that of higher-order Hidden Markov Models and of HMMER, a state-of-the-art method for protein classification, by decreasing the state space complexity and improving the accuracy and coverage.
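To make the idea of mining patterns "with different lengths and gaps between elements" concrete, the following is a minimal sketch of gapped-pair counting. It is not the authors' VGS algorithm: the function name `mine_gapped_pairs`, the parameters `max_gap` and `min_support`, the restriction to two-element patterns, and the choice to count support once per sequence are all assumptions made for illustration.

```python
from collections import defaultdict


def mine_gapped_pairs(sequences, max_gap, min_support):
    """Return pairs (a, b, gap) whose support is at least min_support.

    A pair (a, b, gap) occurs in a sequence when b follows a with exactly
    `gap` other elements in between. Support is counted once per sequence
    (an assumption of this sketch, not taken from the abstract).
    """
    support = defaultdict(int)
    for seq in sequences:
        seen = set()  # pairs found in this sequence, deduplicated
        for i, a in enumerate(seq):
            for gap in range(max_gap + 1):
                j = i + gap + 1
                if j >= len(seq):
                    break
                seen.add((a, seq[j], gap))
        for key in seen:
            support[key] += 1
    return {k: v for k, v in support.items() if v >= min_support}


# Hypothetical toy input: three short sequences over a small alphabet.
sequences = ["ACB", "ADB", "AB"]
frequent = mine_gapped_pairs(sequences, max_gap=1, min_support=2)
```

On this toy input, the only pattern kept is A followed by B with one element in between: it appears in the first two sequences, whereas A directly followed by B appears only in the third. A state machine built from such patterns could then treat the gap as its own state, which is the role the abstract assigns to the mined sequences.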
Department of Computer Science, Rensselaer Polytechnic Institute
cs-07-11