Effective species count and motif efficiency: The value of comparative
Lee A. Newberg
Background: The identification and characterization of functional, non-coding DNA sequence elements is key to the understanding of cell function, differentiation, and pathology because the elements affect when and to what extent nearby genes are expressed. The proliferation of completed genomic sequences during the past few years has provided impetus for numerous comparative-genomics efforts to identify such elements, while simultaneously underscoring the profound difficulty of accurate and exhaustive identification. In particular, when there is little evolutionary separation between the species, the data from phylogenetically related sites are significantly correlated, and the advantage of having multiple genomes is significantly diminished. Little work has been done to quantify the utility of obtaining additional genomes for the characterization of a DNA motif.
Results: We provide a mathematical formalism and an algorithm for evaluating a phylogenetic tree in terms of its utility for constructing a nucleotide equilibrium probability distribution for each multiply aligned DNA sequence position. “Motif efficiency” is measured via Fisher Information and the Cramér-Rao Inequality, and is scaled so that a set of indistinguishable genomes is deemed to have a 0% motif efficiency, and a set of well-separated genomes is deemed to have 100% motif efficiency. We analyze several standardized phylogenetic trees and several phylogenetic trees from the literature.
Conclusions: In our analysis of the standardized phylogenetic trees, we find that inadequate species separation is a particular matter for concern when the number of species is large or when the DNA sequence positions to be characterized have a nucleotide equilibrium probability distribution that is dominated by a pair of nucleotides. In our analysis of phylogenetic trees from the literature, we find that for a phylogenetic tree of nine mammals and for a phylogenetic tree of 45 vertebrates, motif efficiency is around 10%, and that, for a set of 14 prokaryotes, motif efficiency is around 33%.
∗Also: The Center for Bioinformatics, Wadsworth Center, New York State Department of Health, Albany, NY 12208-3425, USA.
Department of Computer Science, Rensselaer Polytechnic Institute
08/03/2007
cs-07-09