The Relative Inefficiency of Sequence Weights in Determining a Nucleotide Consensus Distribution


Lee A. Newberg

Lee Ann McCue

Charles E. Lawrence

The use of sequence weights to estimate a consensus distribution of nucleotides at any position of an alignment of nucleic acid sequences is found to perform poorly in comparison to a maximum-likelihood method based upon phylogenetic relationships. We derive optimal sequence weights for sequences related by a phylogenetic tree but find that, among a collection of primate sequences, the sequence-weights approach is only 51% as efficient as the maximum-likelihood approach in making use of the alignment data. To estimate the distribution of bases at each aligned sequence position, we therefore recommend an alternative recipe in place of sequence weights. First, seek a phylogenetic tree that connects the species in question, perhaps employing the aligned sequence data set itself for that purpose. Second, use the tree topology and edge lengths to calculate a maximum-likelihood distribution of bases at each position in the aligned sequence.
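The second step of the recipe can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of the report itself. It assumes a Felsenstein-1981-style substitution model in which, along an edge of length t, a base is kept with probability exp(-t) and otherwise redrawn from the consensus distribution theta. The tree, its edge lengths, the species names, and the helper names `partial_likelihoods`, `log_likelihood`, and `ml_theta` are all hypothetical. The maximization is a coarse grid search over the interior of the probability simplex, chosen for brevity rather than accuracy.

```python
import math
from itertools import product

BASES = "ACGT"

# Hypothetical toy tree. An internal node is a list of (child, edge_length)
# pairs; a leaf is a species name. Edge lengths are illustrative only.
TREE = [
    ([("human", 0.05), ("chimp", 0.05)], 0.10),
    ("baboon", 0.30),
]


def partial_likelihoods(node, column, theta):
    """Felsenstein pruning: P(observed leaves below node | base at node)."""
    if isinstance(node, str):  # leaf: indicator vector of the observed base
        return [1.0 if column[node] == b else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, t in node:
        below = partial_likelihoods(child, column, theta)
        stay = math.exp(-t)  # assumed model: no substitution event on this edge
        redraw = (1.0 - stay) * sum(p * x for p, x in zip(theta, below))
        for i in range(4):
            result[i] *= stay * below[i] + redraw
    return result


def log_likelihood(theta, column, tree=TREE):
    """Log-likelihood of one alignment column, with the root drawn from theta."""
    root = partial_likelihoods(tree, column, theta)
    return math.log(sum(p * x for p, x in zip(theta, root)))


def ml_theta(column, tree=TREE, steps=20):
    """Grid-search the interior of the simplex for the best theta."""
    best, best_ll = None, -math.inf
    for i, j, k in product(range(1, steps), repeat=3):
        m = steps - i - j - k
        if m < 1:  # keep every component positive so the log is defined
            continue
        theta = [i / steps, j / steps, k / steps, m / steps]
        ll = log_likelihood(theta, column, tree)
        if ll > best_ll:
            best, best_ll = theta, ll
    return best
```

Calling `ml_theta({"human": "A", "chimp": "A", "baboon": "G"})` returns a distribution over A, C, G, T (in that order) that favors A and G over the unobserved bases. Because the likelihood accounts for the shared ancestry of human and chimp, their two A observations are not treated as two independent pieces of evidence, which is the effect that sequence weights try to approximate. A practical implementation would replace the grid search with a proper numerical optimizer and an inferred tree.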

Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY

08/07/2003

cs-03-10