VOGUE: A Novel Variable Order-Gap State

Bouchra Bouqata

Christopher D. Carothers

Boleslaw K. Szymanski

Mohammed J. Zaki

Data Mining

Data Modeling

Hidden Markov Models

We present VOGUE, a new state machine that combines two separate techniques for modeling complex patterns in sequential data: data mining and data modeling. VOGUE relies on a novel Variable-Gap Sequence miner (VGS) to mine frequent patterns with different lengths and gaps between elements. It then uses these mined sequences to build the state machine. Moreover, we propose two variations of VOGUE: C-VOGUE, which tends to decrease the state space complexity of VOGUE even further by pruning frequent sequences that are artifacts of other primary frequent sequences; and K-VOGUE, which allows sequences to form the same frequent pattern even if they do not match exactly in every position, provided that the differing elements share similar characteristics. We apply VOGUE to the task of protein sequence classification on real data from the PROSITE and SCOP protein families. We show that VOGUE's classification sensitivity outperforms that of higher-order Hidden Markov Models and of HMMER, a state-of-the-art method for protein classification, by decreasing the state space complexity and improving the accuracy and coverage.
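
To make the idea of mining frequent patterns with gaps concrete, the following is a minimal sketch, not the VGS algorithm itself. It assumes patterns of exactly two elements, a hypothetical `max_gap` limit on the number of skipped elements, and a support measure that counts each sequence at most once. The function name `mine_gapped_pairs` and the toy sequences are illustrative assumptions, not taken from the report.

```python
from collections import defaultdict


def mine_gapped_pairs(sequences, max_gap, min_support):
    """Return {(first, second, gap): support} for frequent gapped pairs.

    gap is the number of elements skipped between first and second.
    support is the number of sequences containing the pair at least once.
    (Illustrative assumption: the report's VGS miner may define these differently.)
    """
    support = defaultdict(int)
    for seq in sequences:
        seen = set()  # each pattern counts at most once per sequence
        for i, first in enumerate(seq):
            # Pair position i with every position up to max_gap elements later.
            for j in range(i + 1, min(i + max_gap + 2, len(seq))):
                seen.add((first, seq[j], j - i - 1))
        for pattern in seen:
            support[pattern] += 1
    # Keep only patterns that meet the minimum support threshold.
    return {p: s for p, s in support.items() if s >= min_support}


# Hypothetical toy data: A is followed by B with one element skipped in two sequences.
toy_sequences = ["AXB", "AYB", "AB"]
frequent = mine_gapped_pairs(toy_sequences, max_gap=1, min_support=2)
print(frequent)
```

In this toy run the only pair reaching the threshold is A followed by B with a gap of one, which a miner restricted to contiguous patterns would not report as frequent.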

Department of Computer Science, Rensselaer Polytechnic Institute

cs-07-11