Search | VHL Regional Portal

1.

Improved selection of canonical proteins for reference proteomes.

Insana, Giuseppe; Martin, Maria J; Pearson, William R.

NAR Genom Bioinform ; 6(2): lqae066, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38863529

ABSTRACT

The 'canonical' protein sets distributed by UniProt are widely used for similarity searching, and functional and structural annotation. For many investigators, canonical sequences are the only version of a protein examined. However, higher eukaryotes often encode multiple isoforms of a protein from a single gene. For unreviewed (UniProtKB/TrEMBL) protein sequences, the longest sequence in a Gene-Centric group is chosen as canonical. This choice can create inconsistencies, selecting >95% identical orthologs with dramatically different lengths, which is biologically unlikely. We describe the ortho2tree pipeline, which examines Reference Proteome canonical and isoform sequences from sets of orthologous proteins, builds multiple alignments, constructs gap-distance trees, and identifies low-cost clades of isoforms with similar lengths. After examining 140 000 proteins from eight mammals in UniProtKB release 2022_05, ortho2tree proposed 7804 canonical changes for release 2023_01, while confirming 53 434 canonicals. Gap distributions for isoforms selected by ortho2tree are similar to those in bacterial and yeast alignments, organisms unaffected by isoform selection, suggesting ortho2tree canonicals more accurately reflect genuine biological variation. 82% of ortho2tree proposed-changes agreed with MANE; for confirmed canonicals, 92% agreed with MANE. Ortho2tree can improve canonical assignment among orthologous sequences that are >60% identical, a group that includes vertebrates and higher plants.
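The contrast drawn in this abstract, between the longest-isoform rule and a choice that keeps ortholog lengths consistent, can be illustrated with a toy sketch. This is not the ortho2tree algorithm, which works from multiple alignments and gap-distance trees. It is a simplified stand-in that only compares lengths, and the isoform names and lengths are invented for illustration.

```python
from itertools import product

def longest_per_group(groups):
    """Pick the longest isoform in each gene-centric group (the rule the abstract describes)."""
    return {species: max(isoforms, key=lambda iso: iso[1])
            for species, isoforms in groups.items()}

def most_length_consistent(groups):
    """Pick one isoform per species so that the spread of lengths is smallest.

    Toy stand-in for a length-aware selection; brute force, so only for tiny inputs.
    """
    species = sorted(groups)
    def spread(combo):
        lengths = [length for _, length in combo]
        return max(lengths) - min(lengths)
    best = min(product(*(groups[s] for s in species)), key=spread)
    return dict(zip(species, best))

# Hypothetical ortholog set: (isoform id, length in residues)
groups = {
    "human": [("H-1", 512), ("H-2", 498)],
    "mouse": [("M-1", 497), ("M-2", 731)],
    "rat":   [("R-1", 499)],
}

by_length = longest_per_group(groups)
by_consistency = most_length_consistent(groups)
```

In this invented example the longest-isoform rule picks a 731-residue mouse sequence next to roughly 500-residue orthologs, while the length-consistent choice selects isoforms within a few residues of one another.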

2.

Comparison of detection methods and genome quality when quantifying nuclear mitochondrial insertions in vertebrate genomes.

Triant, Deborah A; Pearson, William R.

Front Genet ; 13: 984513, 2022.

Article in English | MEDLINE | ID: mdl-36482890

ABSTRACT

The integration of mitochondrial genome fragments into the nuclear genome is well documented, and the transfer of these mitochondrial nuclear pseudogenes (numts) is thought to be an ongoing evolutionary process. With the increasing number of eukaryotic genomes available, genome-wide distributions of numts are often surveyed. However, inconsistencies in genome quality can reduce the accuracy of numt estimates, and methods used for identification can be complicated by the diverse sizes and ages of numts. Numts have been previously characterized in rodent genomes and it was postulated that they might be more prevalent in a group of voles with rapidly evolving karyotypes. Here, we examine 37 rodent genomes, and an additional 26 vertebrate genomes, while also considering numt detection methods. We identify numts using DNA:DNA and protein:translated-DNA similarity searches and compare numt distributions among rodent and vertebrate taxa to assess whether some groups are more susceptible to transfer. A combination of protein sequence comparisons (protein:translated-DNA) and BLASTN genomic DNA searches detects 50% more numts than genomic DNA:DNA searches alone. In addition, higher-quality RefSeq genomes produce lower estimates of numts than GenBank genomes, suggesting that lower quality genome assemblies can overestimate numt abundance. Phylogenetic analysis shows that mitochondrial transfers are not associated with karyotypic diversity among rodents. Surprisingly, we did not find a strong correlation between numt counts and genome size. Estimates using DNA:DNA analyses can underestimate the amount of mitochondrial DNA that is transferred to the nucleus.

3.

Barriers to integration of bioinformatics into undergraduate life sciences education: A national study of US life sciences faculty uncover significant barriers to integrating bioinformatics into undergraduate instruction.

Williams, Jason J; Drew, Jennifer C; Galindo-Gonzalez, Sebastian; Robic, Srebrenka; Dinsdale, Elizabeth; Morgan, William R; Triplett, Eric W; Burnette, James M; Donovan, Samuel S; Fowlks, Edison R; Goodman, Anya L; Grandgenett, Nealy F; Goller, Carlos C; Hauser, Charles; Jungck, John R; Newman, Jeffrey D; Pearson, William R; Ryder, Elizabeth F; Sierk, Michael; Smith, Todd M; Tosado-Acevedo, Rafael; Tapprich, William; Tobin, Tammy C; Toro-Martínez, Arlín; Welch, Lonnie R; Wilson, Melissa A; Ebenbach, David; McWilliams, Mindy; Rosenwald, Anne G; Pauley, Mark A.

PLoS One ; 14(11): e0224288, 2019.

Article in English | MEDLINE | ID: mdl-31738797

ABSTRACT

Bioinformatics, a discipline that combines aspects of biology, statistics, mathematics, and computer science, is becoming increasingly important for biological research. However, bioinformatics instruction is not yet generally integrated into undergraduate life sciences curricula. To understand why, we studied how bioinformatics is being included in biology education in the US by conducting a nationwide survey of faculty at two- and four-year institutions. The survey asked several open-ended questions that probed barriers to integration, the answers to which were analyzed using a mixed-methods approach. The barrier most frequently reported by the 1,260 respondents was lack of faculty expertise/training, but other deterrents (lack of student interest, overly full curricula, and lack of student preparation) were also common. Interestingly, the barriers faculty face depended strongly on whether they are members of an underrepresented group and on the Carnegie Classification of their home institution. We were surprised to discover that the cohort of faculty who were awarded their terminal degree most recently reported the most preparation in bioinformatics but teach it at the lowest rate.

Subject(s)

Biology/education , Computational Biology/education , Curriculum , Faculty/statistics & numerical data , Female , Humans , Male , Motivation , Students/psychology , Surveys and Questionnaires/statistics & numerical data , United States

4.

Using SQL Databases for Sequence Similarity Searching and Analysis.

Pearson, William R; Mackey, Aaron J.

Curr Protoc Bioinformatics ; 59: 9.4.1-9.4.22, 2017 09 13.

Article in English | MEDLINE | ID: mdl-28902397

ABSTRACT

Relational databases can integrate diverse types of information and manage large sets of similarity search results, greatly simplifying genome-scale analyses. By focusing on taxonomic subsets of sequences, relational databases can reduce the size and redundancy of sequence libraries and improve the statistical significance of homologs. In addition, by loading similarity search results into a relational database, it becomes possible to explore and summarize the relationships between all of the proteins in an organism and those in other biological kingdoms. This unit describes how to use relational databases to improve the efficiency of sequence similarity searching and demonstrates various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. The unit also introduces search_demo, a database that stores sequence similarity search results. The search_demo database is then used to explore the evolutionary relationships between E. coli proteins and proteins in other organisms in a large-scale comparative genomic analysis. © 2017 by John Wiley & Sons, Inc.
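As a minimal sketch of the approach this abstract describes, the example below loads a few similarity search results into a relational database and summarizes homologs by taxon. The schema, table names, accessions, and E-values are hypothetical and are not the seqdb_demo or search_demo schemas from the unit; SQLite is used here only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (acc TEXT PRIMARY KEY, taxon TEXT);               -- hypothetical schema
CREATE TABLE hit (query_acc TEXT, subject_acc TEXT, evalue REAL);
""")

# Invented library proteins and search hits, for illustration only
conn.executemany("INSERT INTO protein VALUES (?, ?)", [
    ("S1", "human"), ("S2", "human"), ("S3", "yeast"), ("S4", "archaea"),
])
conn.executemany("INSERT INTO hit VALUES (?, ?, ?)", [
    ("Q1", "S1", 1e-30), ("Q1", "S3", 1e-8),
    ("Q2", "S2", 1e-12), ("Q2", "S4", 0.5),
    ("Q3", "S1", 1e-3),
])

def queries_with_homologs_by_taxon(conn, max_evalue=1e-6):
    """Count distinct query proteins with a significant hit in each taxon."""
    rows = conn.execute("""
        SELECT p.taxon, COUNT(DISTINCT h.query_acc)
        FROM hit AS h
        JOIN protein AS p ON p.acc = h.subject_acc
        WHERE h.evalue <= ?
        GROUP BY p.taxon
        ORDER BY p.taxon
    """, (max_evalue,)).fetchall()
    return dict(rows)

summary = queries_with_homologs_by_taxon(conn)
```

Keeping the significance cutoff as a query parameter means the same table of hits can be re-summarized at different thresholds without rerunning the search.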

Subject(s)

Computational Biology/methods , Databases, Protein , Sequence Analysis, Protein/methods , Software , Escherichia coli/genetics , Evolution, Molecular , Proteins/chemistry , Proteins/genetics , Sequence Alignment , Sequence Homology, Amino Acid

5.

Query-seeded iterative sequence similarity searching improves selectivity 5-20-fold.

Pearson, William R; Li, Weizhong; Lopez, Rodrigo.

Nucleic Acids Res ; 45(7): e46, 2017 04 20.

Article in English | MEDLINE | ID: mdl-27923999

ABSTRACT

Iterative similarity search programs, like psiblast, jackhmmer, and psisearch, are much more sensitive than pairwise similarity search methods like blast and ssearch because they build a position specific scoring model (a PSSM or HMM) that captures the pattern of sequence conservation characteristic of a protein family. But models are subject to contamination; once an unrelated sequence has been added to the model, homologs of the unrelated sequence will also produce high scores, and the model can diverge from the original protein family. Examination of alignment errors during psiblast PSSM contamination suggested a simple strategy for dramatically reducing PSSM contamination. psiblast PSSMs are built from the query-based multiple sequence alignment (MSA) implied by the pairwise alignments between the query model (PSSM, HMM) and the subject sequences in the library. When the original query sequence residues are inserted into gapped positions in the aligned subject sequence, the resulting PSSM rarely produces alignment over-extensions or alignments to unrelated sequences. This simple step, which tends to anchor the PSSM to the original query sequence and slightly increase target percent identity, can reduce the frequency of false-positive alignments more than 20-fold compared with psiblast and jackhmmer, with little loss in search sensitivity.
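The query-seeding step described in this abstract can be sketched for a single pairwise alignment. The function name and example sequences are invented, and the published implementation may handle details differently; the sketch only shows the idea of filling gapped positions in the aligned subject with the query's own residues when building a query-based MSA row.

```python
def query_seeded_row(query_aln, subject_aln, gap="-"):
    """Build the subject's row of a query-based MSA, seeding gaps with query residues.

    Columns where the query has a gap (insertions in the subject) have no
    query position and are dropped; columns where the subject has a gap are
    filled with the query residue.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have equal length")
    row = []
    for q, s in zip(query_aln, subject_aln):
        if q == gap:
            continue                      # no query column for this subject residue
        row.append(q if s == gap else s)  # seed the gap with the query residue
    return "".join(row)

# Invented aligned pair: one insertion in the subject, two gapped subject positions
seeded = query_seeded_row("MKT-LLVA", "MRTGL--A")
```

The resulting row has one residue per query position, so every column of the MSA used to build the PSSM contains either the subject residue or the query residue, never a gap.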

Subject(s)

Sequence Alignment/methods , Sequence Analysis, Protein/methods , Protein Domains , Software

6.

Finding Protein and Nucleotide Similarities with FASTA.

Pearson, William R.

Curr Protoc Bioinformatics ; 53: 3.9.1-3.9.25, 2016 Mar 24.

Article in English | MEDLINE | ID: mdl-27010337

ABSTRACT

The FASTA programs provide a comprehensive set of rapid similarity searching tools (fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal, local, and global similarity searches (ssearch36, ggsearch36), and for searching with short peptides and oligonucleotides (fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity. The FASTA programs can produce "BLAST-like" alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases, and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including mySQL and postgreSQL databases. The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons.
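For pipelines that consume the "BLAST-like" tabular output mentioned in this abstract, a small parser is often enough. The sketch below assumes the standard 12-column, tab-separated BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score); the example lines are invented, and the column layout should be checked against the output of the program version in use.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    query: str
    subject: str
    pident: float
    length: int
    evalue: float
    bits: float

def parse_tabular(text):
    """Parse 12-column BLAST-style tabular output, skipping blank and comment lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11])))
    return hits

# Invented example output
example = (
    "# comment line\n"
    "Q1\tS1\t45.20\t210\t110\t3\t5\t212\t8\t217\t2e-45\t160.2\n"
    "Q1\tS2\t28.10\t180\t120\t6\t20\t195\t33\t210\t0.003\t38.5\n"
)
hits = parse_tabular(example)
significant = [h.subject for h in hits if h.evalue <= 1e-6]
```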

Subject(s)

Nucleotides/chemistry , Proteins/chemistry , Sequence Alignment , Sequence Homology, Amino Acid , Sequence Homology, Nucleic Acid , Databases, Nucleic Acid , Databases, Protein

7.

Protein Function Prediction: Problems and Pitfalls.

Pearson, William R.

Curr Protoc Bioinformatics ; 51: 4.12.1-4.12.8, 2015 Sep 03.

Article in English | MEDLINE | ID: mdl-26334923

ABSTRACT

The characterization of new genomes based on their protein sets has been revolutionized by new sequencing technologies, but biologists seeking to exploit new sequence information are often frustrated by the challenges associated with accurately assigning biological functions to newly identified proteins. Here, we highlight some of the challenges in functional inference from sequence similarity. Investigators can improve the accuracy of function prediction by (1) being conservative about the evolutionary distance to a protein of known function; (2) considering the ambiguous meaning of "functional similarity," and (3) being aware of the limitations of annotations in functional databases. Protein function prediction does not offer "one-size-fits-all" solutions. Prediction strategies work better when the idiosyncrasies of function and functional annotation are better understood.

Subject(s)

Databases, Protein , Proteins/chemistry , Proteins/metabolism , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Data Mining/methods , Molecular Sequence Data , Structure-Activity Relationship

8.

Most partial domains in proteins are alignment and annotation artifacts.

Triant, Deborah A; Pearson, William R.

Genome Biol ; 16: 99, 2015 May 15.

Article in English | MEDLINE | ID: mdl-25976240

ABSTRACT

BACKGROUND: Protein domains are commonly used to assess the functional roles and evolutionary relationships of proteins and protein families. Here, we use the Pfam protein family database to examine a set of candidate partial domains. Pfam protein domains are often thought of as evolutionarily indivisible, structurally compact, units from which larger functional proteins are assembled; however, almost 4% of Pfam27 PfamA domains are shorter than 50% of their family model length, suggesting that more than half of the domain is missing at those locations. To better understand the structural nature of partial domains in proteins, we examined 30,961 partial domain regions from 136 domain families contained in a representative subset of PfamA domains (RefProtDom2 or RPD2). RESULTS: We characterized three types of apparent partial domains: split domains, bounded partials, and unbounded partials. We find that bounded partial domains are over-represented in eukaryotes and in lower quality protein predictions, suggesting that they often result from inaccurate genome assemblies or gene models. We also find that a large percentage of unbounded partial domains produce long alignments, which suggests that their annotation as a partial is an alignment artifact; yet some can be found as partials in other sequence contexts. CONCLUSIONS: Partial domains are largely the result of alignment and annotation artifacts and should be viewed with caution. The presence of partial domain annotations in proteins should raise the concern that the prediction of the protein's gene may be incomplete. In general, protein domains can be considered the structural building blocks of proteins.
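The working definition of a partial domain in this abstract, a domain annotation that covers less than 50% of its family model length, can be written as a short check. The function and parameter names are illustrative assumptions, and model coordinates are taken to be 1-based and inclusive.

```python
def model_coverage(model_start, model_end, model_length):
    """Fraction of the family model covered by a domain annotation (1-based, inclusive)."""
    if not 1 <= model_start <= model_end <= model_length:
        raise ValueError("model coordinates out of range")
    return (model_end - model_start + 1) / model_length

def is_partial_domain(model_start, model_end, model_length, threshold=0.5):
    """True when the annotation covers less than `threshold` of the model."""
    return model_coverage(model_start, model_end, model_length) < threshold

full = is_partial_domain(1, 100, 100)      # covers the whole model
partial = is_partial_domain(10, 49, 100)   # covers 40 of 100 model positions
```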

Subject(s)

Molecular Sequence Annotation , Protein Structure, Tertiary , Proteins/chemistry , Sequence Alignment , Animals , Databases, Genetic , Databases, Protein , Drosophila/genetics , Humans , Mice , Models, Molecular , Software

9.

The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes.

Furnham, Nicholas; Holliday, Gemma L; de Beer, Tjaart A P; Jacobsen, Julius O B; Pearson, William R; Thornton, Janet M.

Nucleic Acids Res ; 42(Database issue): D485-9, 2014 Jan.

Article in English | MEDLINE | ID: mdl-24319146

ABSTRACT

Understanding which are the catalytic residues in an enzyme and what function they perform is crucial to many biology studies, particularly those leading to new therapeutics and enzyme design. The original version of the Catalytic Site Atlas (CSA) (http://www.ebi.ac.uk/thornton-srv/databases/CSA) published in 2004, which catalogs the residues involved in enzyme catalysis in experimentally determined protein structures, had only 177 curated entries and employed a simplistic approach to expanding these annotations to homologous enzyme structures. Here we present a new version of the CSA (CSA 2.0), which greatly expands the number of both curated (968) and automatically annotated catalytic sites in enzyme structures, utilizing a new method for annotation transfer. The curated entries are used, along with the variation in residue type from the sequence comparison, to generate 3D templates of the catalytic sites, which in turn can be used to find catalytic sites in new structures. To ease the transfer of CSA annotations to other resources a new ontology has been developed: the Enzyme Mechanism Ontology, which has permitted the transfer of annotations to Mechanism, Annotation and Classification in Enzymes (MACiE) and UniProt Knowledge Base (UniProtKB) resources. The CSA database schema has been re-designed and both the CSA data and search capabilities are presented in a new modern web interface.

Subject(s)

Catalytic Domain , Databases, Protein , Enzymes/chemistry , Biological Ontologies , Internet , Sequence Analysis, Protein

10.

BLAST and FASTA similarity searching for multiple sequence alignment.

Pearson, William R.

Methods Mol Biol ; 1079: 75-101, 2014.

Article in English | MEDLINE | ID: mdl-24170396

ABSTRACT

BLAST, FASTA, and other similarity searching programs seek to identify homologous proteins and DNA sequences based on excess sequence similarity. If two sequences share much more similarity than expected by chance, the simplest explanation for the excess similarity is common ancestry (homology). The most effective similarity searches compare protein sequences, rather than DNA sequences, for sequences that encode proteins, and use expectation values, rather than percent identity, to infer homology. The BLAST and FASTA packages of sequence comparison programs provide programs for comparing protein and DNA sequences to protein databases (the most sensitive searches). Protein and translated-DNA comparisons to protein databases routinely allow evolutionary look-back times from 1 to 2 billion years; DNA:DNA searches are 5-10-fold less sensitive. BLAST and FASTA can be run on popular web sites, but can also be downloaded and installed on local computers. With local installation, target databases can be customized for the sequence data being characterized. With today's very large protein databases, search sensitivity can also be improved by searching smaller comprehensive databases, for example, a complete protein set from an evolutionarily neighboring model organism. By default, BLAST and FASTA use scoring strategies targeted for distant evolutionary relationships; for comparisons involving short domains or queries, or searches that seek relatively close homologs (e.g. mouse-human), shallower scoring matrices will be more effective. Both BLAST and FASTA provide very accurate statistical estimates, which can be used to reliably identify protein sequences that diverged more than 2 billion years ago.
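To make concrete why expectation values depend on database size while percent identity does not, the sketch below uses the standard relation between a bit score S' and an expectation value, E = m * n * 2^(-S'), where m is the query length and n is the total database length in residues. This is a simplification: real programs apply effective-length corrections, and FASTA estimates significance empirically from the score distribution, so the numbers here are illustrative only.

```python
import math

def evalue_from_bits(bits, query_len, db_len):
    """Expectation value for a bit score: E = m * n * 2**(-bits)."""
    return query_len * db_len * 2.0 ** (-bits)

def bits_for_evalue(evalue, query_len, db_len):
    """Bit score needed to reach a given E-value in a search of this size."""
    return math.log2(query_len * db_len / evalue)

# Invented search sizes: a 300-residue query against small and large databases
e_small_db = evalue_from_bits(50, 300, 10_000_000)
e_large_db = evalue_from_bits(50, 300, 10_000_000_000)
```

The same 50-bit alignment is a thousand times less significant in the database that is a thousand times larger, which is the reasoning behind searching smaller, comprehensive databases.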

Subject(s)

Computational Biology/methods , Sequence Alignment/methods , Software , Amino Acid Sequence , Data Mining , Databases, Protein , Humans , Molecular Sequence Data , Sequence Homology, Amino Acid

11.

Adjusting scoring matrices to correct overextended alignments.

Mills, Lauren J; Pearson, William R.

Bioinformatics ; 29(23): 3007-13, 2013 Dec 01.

Article in English | MEDLINE | ID: mdl-23995390

ABSTRACT

MOTIVATION: Sequence similarity searches performed with BLAST, SSEARCH and FASTA achieve high sensitivity by using scoring matrices (e.g. BLOSUM62) that target low identity (<33%) alignments. Although such scoring matrices can effectively identify distant homologs, they can also produce local alignments that extend beyond the homologous regions. RESULTS: We measured local alignment start/stop boundary accuracy using a set of queries where the correct alignment boundaries were known, and found that 7% of BLASTP and 8% of SSEARCH alignment boundaries were overextended. Overextended alignments include non-homologous sequences; they occur most frequently between sequences that are more closely related (>33% identity). Adjusting the scoring matrix to reflect the identity of the homologous sequence can correct higher identity overextended alignment boundaries. In addition, the scoring matrix that produced a correct alignment could be reliably predicted based on the sequence identity seen in the original BLOSUM62 alignment. Realigning with the predicted scoring matrix corrected 37% of all overextended alignments, resulting in more correct alignments than using BLOSUM62 alone.

Subject(s)

Computational Biology/methods , Position-Specific Scoring Matrices , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Databases, Protein , Molecular Sequence Data , Sequence Homology, Amino Acid

12.

An introduction to sequence similarity ("homology") searching.

Pearson, William R.

Curr Protoc Bioinformatics ; Chapter 3: 3.1.1-3.1.8, 2013 Jun.

Article in English | MEDLINE | ID: mdl-23749753

ABSTRACT

Sequence similarity searching, typically with BLAST, is the most widely used and most reliable strategy for characterizing newly determined sequences. Sequence similarity searches can identify "homologous" proteins or genes by detecting excess similarity: statistically significant similarity that reflects common ancestry. This unit provides an overview of the inference of homology from significant similarity, and introduces other units in this chapter that provide more details on effective strategies for identifying homologs.

Subject(s)

Proteins/chemistry , Sequence Alignment/methods , Databases, Protein , Proteins/genetics , Sequence Analysis , Sequence Homology

13.

Selecting the Right Similarity-Scoring Matrix.

Pearson, William R.

Curr Protoc Bioinformatics ; 43: 3.5.1-3.5.9, 2013.

Article in English | MEDLINE | ID: mdl-24509512

ABSTRACT

Protein sequence similarity searching programs like BLASTP, SSEARCH (UNIT 3.10), and FASTA use scoring matrices that are designed to identify distant evolutionary relationships (BLOSUM62 for BLAST, BLOSUM50 for SSEARCH and FASTA). Different similarity scoring matrices are most effective at different evolutionary distances. "Deep" scoring matrices like BLOSUM62 and BLOSUM50 target alignments with 20 - 30% identity, while "shallow" scoring matrices (e.g. VTML10 - VTML80) target alignments that share 90 - 50% identity, reflecting much less evolutionary change. While "deep" matrices provide very sensitive similarity searches, they also require longer sequence alignments and can sometimes produce alignment overextension into non-homologous regions. Shallower scoring matrices are more effective when searching for short protein domains, or when the goal is to limit the scope of the search to sequences that are likely to be orthologous between recently diverged organisms. Likewise, in DNA searches, the match and mismatch parameters set evolutionary look-back times and domain boundaries. In this unit, we will discuss the theoretical foundations that drive practical choices of protein and DNA similarity scoring matrices and gap penalties. Deep scoring matrices (BLOSUM62 and BLOSUM50) should be used for sensitive searches with full-length protein sequences, but short domains or restricted evolutionary look-back require shallower scoring matrices.
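The rule of thumb in this abstract can be expressed as a small lookup. Only the matrix names and the approximate identity ranges (about 90% for VTML10, about 50% for VTML80, 20 to 30% for the deep BLOSUM matrices) come from the abstract; the exact cutoffs below are hypothetical, and the intermediate VTML matrices are omitted for brevity.

```python
def pick_matrix(expected_identity_pct):
    """Suggest a scoring matrix for an expected percent identity.

    Cutoffs are illustrative assumptions, not values from the unit.
    """
    if not 0 <= expected_identity_pct <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if expected_identity_pct >= 90:
        return "VTML10"    # very shallow: nearly identical sequences
    if expected_identity_pct >= 50:
        return "VTML80"    # shallow: close homologs, short domains
    return "BLOSUM62"      # deep: distant relationships, full-length proteins

choices = [pick_matrix(p) for p in (95, 60, 25)]
```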

Subject(s)

Position-Specific Scoring Matrices , Amino Acid Sequence , Amino Acid Substitution , DNA , Molecular Sequence Data , Sequence Alignment , Sequence Homology, Amino Acid

14.

PSI-Search: iterative HOE-reduced profile SSEARCH searching.

Li, Weizhong; McWilliam, Hamish; Goujon, Mickael; Cowley, Andrew; Lopez, Rodrigo; Pearson, William R.

Bioinformatics ; 28(12): 1650-1, 2012 Jun 15.

Article in English | MEDLINE | ID: mdl-22539666

ABSTRACT

UNLABELLED: Iterative similarity searches with PSI-BLAST position-specific score matrices (PSSMs) find many more homologs than single searches, but PSSMs can be contaminated when homologous alignments are extended into unrelated protein domains, an error called homologous over-extension (HOE). PSI-Search combines an optimal Smith-Waterman local alignment sequence search, using SSEARCH, with the PSI-BLAST profile construction strategy. An optional sequence boundary-masking procedure, which prevents alignments from being extended after they are initially included, can reduce HOE errors in the PSSM profile. Preventing HOE improves selectivity for both PSI-BLAST and PSI-Search, but PSI-Search has ~4-fold better selectivity than PSI-BLAST and similar sensitivity at 50% and 60% family coverage. PSI-Search also produces 2- to 4-fold fewer false-positives than JackHMMER, but is ~5% less sensitive. AVAILABILITY AND IMPLEMENTATION: PSI-Search is available from the authors as a standalone implementation written in Perl for Linux-compatible platforms. It is also available through a web interface (www.ebi.ac.uk/Tools/sss/psisearch) and SOAP and REST Web Services (www.ebi.ac.uk/Tools/webservices).

Subject(s)

Amino Acid Motifs , Sequence Alignment/methods , Software , Computational Biology/methods , Databases, Protein , Internet , Programming Languages

15.

MACiE: exploring the diversity of biochemical reactions.

Holliday, Gemma L; Andreini, Claudia; Fischer, Julia D; Rahman, Syed Asad; Almonacid, Daniel E; Williams, Sophie T; Pearson, William R.

Nucleic Acids Res ; 40(Database issue): D783-9, 2012 Jan.

Article in English | MEDLINE | ID: mdl-22058127

ABSTRACT

MACiE (which stands for Mechanism, Annotation and Classification in Enzymes) is a database of enzyme reaction mechanisms, and can be accessed from http://www.ebi.ac.uk/thornton-srv/databases/MACiE/. This article presents the release of Version 3 of MACiE, which not only extends the dataset to 335 entries, covering 182 of the EC sub-subclasses with a crystal structure available (~90%), but also incorporates greater chemical and structural detail. This version of MACiE represents a shift in emphasis for new entries, from non-homologous representatives covering EC reaction space to enzymes with mechanisms of interest to our users and collaborators with a view to exploring the chemical diversity of life. We present new tools for exploring the data in MACiE and comparing entries as well as new analyses of the data and new searches, many of which can now be accessed via dedicated Perl scripts.

Subject(s)

Databases, Protein , Enzymes/chemistry , Biocatalysis , Biochemical Phenomena , Catalytic Domain , Coenzymes/chemistry , Enzymes/classification , Internet , Molecular Sequence Annotation

16.

RefProtDom: a protein database with improved domain boundaries and homology relationships.

Gonzalez, Mileidy W; Pearson, William R.

Bioinformatics ; 26(18): 2361-2, 2010 Sep 15.

Article in English | MEDLINE | ID: mdl-20693322

ABSTRACT

UNLABELLED: RefProtDom provides a set of divergent query domains, originally selected from Pfam, and full-length proteins containing their homologous domains, with diverse architectures, for evaluating pair-wise and iterative sequence similarity searches. Pfam homology and domain boundary annotations in the target library were supplemented using local and semi-global searches, PSI-BLAST searches, and SCOP and CATH classifications. AVAILABILITY: RefProtDom is available from http://faculty.virginia.edu/wrpearson/fasta/PUBS/gonzalez09a.

Subject(s)

Databases, Protein , Protein Structure, Tertiary , Proteins , Software

17.

Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments.

Sierk, Michael L; Smoot, Michael E; Bass, Ellen J; Pearson, William R.

BMC Bioinformatics ; 11: 146, 2010 Mar 22.

Article in English | MEDLINE | ID: mdl-20307279

ABSTRACT

BACKGROUND: While the pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins (proteins that share a common ancestor and a similar structure), pairwise sequence alignments often fail to accurately represent the structural alignments inferred from three-dimensional coordinates. Since sequence alignment algorithms produce optimal alignments, the best structural alignments must reflect suboptimal sequence alignment scores. Thus, we have examined a range of suboptimal sequence alignments and a range of scoring parameters to understand better which sequence alignments are likely to be more structurally accurate. RESULTS: We compared near-optimal protein sequence alignments produced by the Zuker algorithm and a set of probabilistic alignments produced by the probA program with structural alignments produced by four different structure alignment algorithms. There is significant overlap between the solution spaces of structural alignments and both the near-optimal sequence alignments produced by commonly used scoring parameters for sequences that share significant sequence similarity (E-values < 10^-5) and the ensemble of probA alignments. We constructed a logistic regression model incorporating three input variables derived from sets of near-optimal alignments: robustness, edge frequency, and maximum bits-per-position. A ROC analysis shows that this model more accurately classifies amino acid pairs (edges in the alignment path graph) according to the likelihood of appearance in structural alignments than the robustness score alone. We investigated various trimming protocols for removing incorrect edges from the optimal sequence alignment; the most effective protocol is to remove matches from the semi-global optimal alignment that are outside the boundaries of the local alignment, although trimming according to the model-generated probabilities achieves a similar level of improvement.
The model can also be used to generate novel alignments by using the probabilities in lieu of a scoring matrix. These alignments are typically better than the optimal sequence alignment, and include novel correct structural edges. We find that the probA alignments sample a larger variety of alignments than the Zuker set, which more frequently results in alignments that are closer to the structural alignments, but that using the probA alignments as input to the regression model does not increase performance. CONCLUSIONS: The pool of suboptimal pairwise protein sequence alignments substantially overlaps structure-based alignments for pairs with statistically significant similarity, and a regression model based on information contained in this alignment pool improves the accuracy of pairwise alignments with respect to structure-based alignments.

Subject(s)

Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein

18.

Homologous over-extension: a challenge for iterative similarity searches.

Gonzalez, Mileidy W; Pearson, William R.

Nucleic Acids Res ; 38(7): 2177-89, 2010 Apr.

Article in English | MEDLINE | ID: mdl-20064877

ABSTRACT

We have characterized a novel type of PSI-BLAST error, homologous over-extension (HOE), using embedded Pfam domain queries on searches against a reference library containing Pfam-annotated UniProt sequences and random synthetic sequences. PSI-BLAST makes two types of errors: alignments to non-homologous regions and HOE alignments that begin in a homologous region, but extend beyond the homology into neighboring sequence regions. When the neighboring sequence region contains a non-homologous domain, PSI-BLAST can incorporate the unrelated sequence into its position specific scoring matrix, which then finds non-homologous proteins with significant expectation values. HOE accounts for the largest fraction of the initial false positive (FP) errors, and the largest fraction of FPs at iteration 5. In searches against complete protein sequences, 5-9% of alignments at iteration 5 are non-homologous. HOE frequently begins in a partial protein domain; when partial domains are removed from the library, HOE errors decrease from 16 to 3% of weighted coverage (hard queries; 35 to 5% for sampled queries) and no-error searches increase from 2 to 58% weighted coverage (hard; 16 to 78% sampled). When HOE is reduced by not extending previously found sequences, PSI-BLAST specificity improves 4-8-fold, with little loss in sensitivity.
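The two error types distinguished in this abstract can be told apart with simple interval arithmetic once the homologous domain boundaries in the subject sequence are known. The sketch below is an illustration under that assumption, not the evaluation procedure used in the paper; coordinates are 1-based and inclusive, and the example values are invented.

```python
def overextension(aln_start, aln_end, dom_start, dom_end):
    """Residues of an alignment that fall outside the homologous domain.

    Returns (left, right) counts of over-extended residues, or None when the
    alignment does not overlap the domain at all (an alignment to a
    non-homologous region rather than an over-extension).
    """
    if aln_end < dom_start or aln_start > dom_end:
        return None
    left = max(0, dom_start - aln_start)
    right = max(0, aln_end - dom_end)
    return left, right

# Invented subject coordinates: homologous domain at residues 50-200
hoe = overextension(40, 260, 50, 200)        # starts in the domain region, runs past it
clean = overextension(60, 190, 50, 200)      # stays inside the domain
unrelated = overextension(300, 400, 50, 200) # never touches the domain
```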

Subject(s)

Sequence Alignment/methods , Sequence Homology, Amino Acid , Phylogeny , Position-Specific Scoring Matrices , Protein Structure, Tertiary , Proteins/chemistry , Proteins/classification , Proteins/genetics

19.

Globally, unrelated protein sequences appear random.

Lavelle, Daniel T; Pearson, William R.

Bioinformatics ; 26(3): 310-8, 2010 Feb 01.

Article in English | MEDLINE | ID: mdl-19948773

ABSTRACT

MOTIVATION: To test whether protein folding constraints and secondary structure sequence preferences significantly reduce the space of amino acid words in proteins, we compared the frequencies of four- and five-amino acid word clumps (independent words) in proteins to the frequencies predicted by four random sequence models. RESULTS: While the human proteome has many overrepresented word clumps, these words come from large protein families with biased compositions (e.g. Zn-fingers). In contrast, in a non-redundant sample of Pfam-AB, only 1% of four-amino acid word clumps (4.7% of 5mer words) are 2-fold overrepresented compared with our simplest random model [MC(0)], and 0.1% (4mers) to 0.5% (5mers) are 2-fold overrepresented compared with a window-shuffled random model. Using a false discovery rate q-value analysis, the number of exceptional four- or five-letter words in real proteins is similar to the number found when comparing words from one random model to another. Consensus overrepresented words are not enriched in conserved regions of proteins, but four-letter words are enriched 1.18- to 1.56-fold in alpha-helical secondary structures (but not beta-strands). Five-residue consensus exceptional words are enriched for alpha-helix 1.43- to 1.61-fold. Protein word preferences in regular secondary structure do not appear to significantly restrict the use of sequence words in unrelated proteins, although the consensus exceptional words have a secondary structure bias for alpha-helix. Globally, words in protein sequences appear to be under very few constraints; for the most part, they appear to be random. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
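The comparison in this abstract, observed word frequencies against a composition-only random model, can be sketched as follows. The sketch treats MC(0) as a zeroth-order model in which a word's expected frequency is the product of its residue frequencies, and it counts simple overlapping words rather than the "word clumps" used in the paper; the toy sequences and the 1.5-fold cutoff in the example are invented.

```python
from collections import Counter
from math import prod

def word_counts(seqs, k):
    """Count overlapping k-letter words in a set of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def mc0_expected(seqs, k):
    """Return a function giving the expected count of a word under a composition-only model."""
    residues = Counter("".join(seqs))
    total = sum(residues.values())
    freq = {aa: n / total for aa, n in residues.items()}
    n_words = sum(max(len(s) - k + 1, 0) for s in seqs)
    return lambda word: n_words * prod(freq[aa] for aa in word)

def overrepresented(seqs, k, fold=2.0):
    """Words whose observed count is at least `fold` times the expected count."""
    expected = mc0_expected(seqs, k)
    return sorted(w for w, n in word_counts(seqs, k).items() if n / expected(w) >= fold)

toy = ["ACACACAC", "CCAA"]        # two-letter alphabet keeps the arithmetic easy to follow
counts = word_counts(toy, 2)
expected_ac = mc0_expected(toy, 2)("AC")
enriched = overrepresented(toy, 2, fold=1.5)
```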

Subject(s)

Amino Acid Sequence , Proteins/chemistry , Sequence Analysis, Protein/methods , Databases, Protein , Protein Folding , Protein Structure, Secondary

20.

The limits of protein sequence comparison?

Pearson, William R; Sierk, Michael L.

Curr Opin Struct Biol ; 15(3): 254-60, 2005 Jun.

Article in English | MEDLINE | ID: mdl-15919194

ABSTRACT

Modern sequence alignment algorithms are used routinely to identify homologous proteins, proteins that share a common ancestor. Homologous proteins always share similar structures and often have similar functions. Over the past 20 years, sequence comparison has become both more sensitive, largely because of profile-based methods, and more reliable, because of more accurate statistical estimates. As sequence and structure databases become larger, and comparison methods become more powerful, reliable statistical estimates will become even more important for distinguishing similarities that are due to homology from those that are due to analogy (convergence). The newest sequence alignment methods are more sensitive than older methods, but more accurate statistical estimates are needed for their full power to be realized.

Subject(s)

Algorithms , Databases, Protein , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Molecular Sequence Data , Proteins/analysis , Proteins/classification , Sequence Alignment/trends , Sequence Analysis, Protein/trends , Sequence Homology, Amino Acid , Structure-Activity Relationship
