1.
Sci Rep ; 13(1): 17203, 2023 10 11.
Article in English | MEDLINE | ID: mdl-37821494

ABSTRACT

Invasive plant pathogenic fungi have a global impact, with devastating economic and environmental effects on crops and forests. Biosurveillance, a critical component of threat mitigation, requires risk prediction based on fungal lifestyles and traits. Recent studies have revealed distinct genomic patterns associated with specific groups of plant pathogenic fungi. We sought to establish whether these phytopathogenic genomic patterns hold across diverse taxonomic and ecological groups from the Ascomycota and Basidiomycota, and furthermore, if those patterns can be used in a predictive capacity for biosurveillance. Using a supervised machine learning approach that integrates phylogenetic and genomic data, we analyzed 387 fungal genomes to test a proof-of-concept for the use of genomic signatures in predicting fungal phytopathogenic lifestyles and traits during biosurveillance activities. Our machine learning feature sets were derived from genome annotation data of carbohydrate-active enzymes (CAZymes), peptidases, secondary metabolite clusters (SMCs), transporters, and transcription factors. We found that machine learning could successfully predict fungal lifestyles and traits across taxonomic groups, with the best predictive performance coming from feature sets comprising CAZyme, peptidase, and SMC data. While phylogeny was an important component in most predictions, the inclusion of genomic data improved prediction performance for every lifestyle and trait tested. Plant pathogenicity was one of the best-predicted traits, showing the promise of predictive genomics for biosurveillance applications. Furthermore, our machine learning approach revealed expansions in the number of genes from specific CAZyme and peptidase families in the genomes of plant pathogens compared to non-phytopathogenic genomes (saprotrophs, endo- and ectomycorrhizal fungi). 
Such genomic feature profiles give insight into the evolution of fungal phytopathogenicity and could be useful to predict the risks of unknown fungi in future biosurveillance activities.


Subject(s)
Ascomycota , Genome, Fungal , Humans , Phylogeny , Genome, Fungal/genetics , Ascomycota/genetics , Genomics , Peptide Hydrolases/genetics , Life Style , Machine Learning
2.
Stud Mycol ; 96: 141-153, 2020 Jun.
Article in English | MEDLINE | ID: mdl-32206138

ABSTRACT

Dothideomycetes is the largest class of kingdom Fungi and comprises an incredible diversity of lifestyles, many of which have evolved multiple times. Plant pathogens represent a major ecological niche of the class Dothideomycetes and they are known to infect most major food crops and feedstocks for biomass and biofuel production. Studying the ecology and evolution of Dothideomycetes has significant implications for our fundamental understanding of fungal evolution, their adaptation to stress and host specificity, and practical implications with regard to the effects of climate change and on the food, feed, and livestock elements of the agro-economy. In this study, we present the first large-scale, whole-genome comparison of 101 Dothideomycetes introducing 55 newly sequenced species. The availability of whole-genome data produced a high-confidence phylogeny leading to reclassification of 25 organisms, provided a clearer picture of the relationships among the various families, and indicated that pathogenicity evolved multiple times within this class. We also identified gene family expansions and contractions across the Dothideomycetes phylogeny linked to ecological niches providing insights into genome evolution and adaptation across this group. Using machine-learning methods we classified fungi into lifestyle classes with >95 % accuracy and identified a small number of gene families that positively correlated with these distinctions. This can become a valuable tool for genome-based prediction of species lifestyle, especially for rarely seen and poorly studied species.

3.
Stud Mycol ; 91: 61-78, 2018 Sep.
Article in English | MEDLINE | ID: mdl-30425417

ABSTRACT

The fungal kingdom is too large to be discovered exclusively by classical genetics. The access to omics data opens a new opportunity to study the diversity within the fungal kingdom and how adaptation to new environments shapes fungal metabolism. Genomes are the foundation of modern science but their quality is crucial when analysing omics data. In this study, we demonstrate how one gold-standard genome can improve functional prediction across closely related species to be able to identify key enzymes, reactions and pathways with the focus on primary carbon metabolism. Based on this approach we identified alternative genes encoding various steps of the different sugar catabolic pathways, and as such provided leads for functional studies into this topic. We also revealed significant diversity with respect to genome content, although this did not always correlate to the ability of the species to use the corresponding sugar as a carbon source.

5.
Nature ; 452(7183): 88-92, 2008 Mar 06.
Article in English | MEDLINE | ID: mdl-18322534

ABSTRACT

Mycorrhizal symbioses--the union of roots and soil fungi--are universal in terrestrial ecosystems and may have been fundamental to land colonization by plants. Boreal, temperate and montane forests all depend on ectomycorrhizae. Identification of the primary factors that regulate symbiotic development and metabolic activity will therefore open the door to understanding the role of ectomycorrhizae in plant development and physiology, allowing the full ecological significance of this symbiosis to be explored. Here we report the genome sequence of the ectomycorrhizal basidiomycete Laccaria bicolor (Fig. 1) and highlight gene sets involved in rhizosphere colonization and symbiosis. This 65-megabase genome assembly contains approximately 20,000 predicted protein-encoding genes and a very large number of transposons and repeated sequences. We detected unexpected genomic features, most notably a battery of effector-type small secreted proteins (SSPs) with unknown function, several of which are only expressed in symbiotic tissues. The most highly expressed SSP accumulates in the proliferating hyphae colonizing the host root. The ectomycorrhizae-specific SSPs probably have a decisive role in the establishment of the symbiosis. The unexpected observation that the genome of L. bicolor lacks carbohydrate-active enzymes involved in degradation of plant cell walls, but maintains the ability to degrade non-plant cell wall polysaccharides, reveals the dual saprotrophic and biotrophic lifestyle of the mycorrhizal fungus that enables it to grow within both soil and living plant roots. The predicted gene inventory of the L. bicolor genome, therefore, points to previously unknown mechanisms of symbiosis operating in biotrophic mycorrhizal fungi. 
The availability of this genome provides an unparalleled opportunity to develop a deeper understanding of the processes by which symbionts interact with plants within their ecosystem to perform vital functions in the carbon and nitrogen cycles that are fundamental to sustainable plant productivity.


Subject(s)
Basidiomycota/genetics , Basidiomycota/physiology , Genome, Fungal/genetics , Mycorrhizae/genetics , Mycorrhizae/physiology , Plant Roots/microbiology , Symbiosis/physiology , Abies/microbiology , Abies/physiology , Basidiomycota/enzymology , Fungal Proteins/classification , Fungal Proteins/genetics , Fungal Proteins/metabolism , Gene Expression Regulation , Genes, Fungal/genetics , Hyphae/genetics , Hyphae/metabolism , Mycorrhizae/enzymology , Plant Roots/physiology , Symbiosis/genetics
6.
Science ; 313(5793): 1596-604, 2006 Sep 15.
Article in English | MEDLINE | ID: mdl-16973872

ABSTRACT

We report the draft genome of the black cottonwood tree, Populus trichocarpa. Integration of shotgun sequence assembly with genetic mapping enabled chromosome-scale reconstruction of the genome. More than 45,000 putative protein-coding genes were identified. Analysis of the assembled genome revealed a whole-genome duplication event; about 8000 pairs of duplicated genes from that event survived in the Populus genome. A second, older duplication event is indistinguishably coincident with the divergence of the Populus and Arabidopsis lineages. Nucleotide substitution, tandem gene duplication, and gross chromosomal rearrangement appear to proceed substantially more slowly in Populus than in Arabidopsis. Populus has more protein-coding genes than Arabidopsis, ranging on average from 1.4 to 1.6 putative Populus homologs for each Arabidopsis gene. However, the relative frequency of protein domains in the two genomes is similar. Overrepresented exceptions in Populus include genes associated with lignocellulosic wall biosynthesis, meristem development, disease resistance, and metabolite transport.


Subject(s)
Gene Duplication , Genome, Plant , Populus/genetics , Sequence Analysis, DNA , Arabidopsis/genetics , Chromosome Mapping , Computational Biology , Evolution, Molecular , Expressed Sequence Tags , Gene Expression , Genes, Plant , Oligonucleotide Array Sequence Analysis , Phylogeny , Plant Proteins/chemistry , Plant Proteins/genetics , Polymorphism, Single Nucleotide , Populus/growth & development , Populus/metabolism , Protein Structure, Tertiary , RNA, Plant/analysis , RNA, Untranslated/analysis
7.
Biochem Soc Trans ; 28(2): 269-75, 2000 Feb.
Article in English | MEDLINE | ID: mdl-10816141

ABSTRACT

The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously and the probable effects of substitutions in key functional residues carefully assessed.


Subject(s)
Databases, Factual , Genome , Algorithms , Protein Conformation , Protein Structure, Tertiary , Structure-Activity Relationship
8.
Genome Res ; 10(4): 516-22, 2000 Apr.
Article in English | MEDLINE | ID: mdl-10779491

ABSTRACT

Ab initio gene identification in the genomic sequence of Drosophila melanogaster was obtained using (human gene predictor) and Fgenesh programs that have organism-specific parameters for human, Drosophila, plants, yeast, and nematode. We did not use information about cDNA/EST in most predictions to model a real situation for finding new genes because information about complete cDNA is often absent or based on very small partial fragments. We investigated the accuracy of gene prediction on different levels and designed several schemes to predict an unambiguous set of genes (annotation CGG1), a set of reliable exons (annotation CGG2), and the most complete set of exons (annotation CGG3). For 49 genes, protein products of which have clear homologs in protein databases, predictions were recomputed by the Fgenesh+ program. The first annotation serves as the optimal computational description of a new sequence to be presented in a database. Reliable exons from the second annotation serve as good candidates for selecting the PCR primers for experimental work for gene structure verification. Our results show that we can identify approximately 90% of coding nucleotides with 20% false positives. At the exon level we accurately predicted 65% of exons and 89% including overlapping exons with 49% false positives. Optimizing accuracy of prediction, we designed a gene identification scheme using Fgenesh, which provided sensitivity (Sn) = 98% and specificity (Sp) = 86% at the base level, Sn = 81% (97% including overlapping exons) and Sp = 58% at the exon level and Sn = 72% and Sp = 39% at the gene level (estimating sensitivity on std1 set and specificity on std3 set). In general, these results showed that computational gene prediction can be a reliable tool for annotating new genomic sequences, giving accurate information on 90% of coding sequences with 14% false positives. 
However, exact gene prediction (especially at the gene level) needs additional improvement using gene prediction algorithms. The program was also tested for predicting genes of human Chromosome 22 (the last variant of Fgenesh can analyze the whole chromosome sequence). This analysis has demonstrated that 88% of manually annotated exons in Chromosome 22 were among the ab initio predicted exons. The suite of gene identification programs is available through the WWW server of the Computational Genomics Group at http://genomic.sanger.ac.uk/gf.html.
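
The base-level Sn and Sp figures quoted in this abstract can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes the convention common in gene-finding evaluations, where sensitivity is the fraction of truly coding nucleotides that are predicted as coding and "specificity" is the fraction of predicted coding nucleotides that are truly coding. The function name and the coordinates are made up for illustration.

```python
def base_level_accuracy(true_coding, predicted_coding):
    """Return (sensitivity, specificity) at the nucleotide level.

    Both arguments are sets of 0-based positions labelled as coding.
    Assumption: specificity is computed as TP / (TP + FP), the usual
    convention in gene-prediction benchmarks (not stated in the abstract).
    """
    tp = len(true_coding & predicted_coding)   # coding and predicted coding
    fn = len(true_coding - predicted_coding)   # coding but missed
    fp = len(predicted_coding - true_coding)   # predicted but not coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


# Toy annotation with invented coordinates: 100 true coding bases,
# 120 predicted coding bases, 90 of them shared.
true_exons = set(range(100, 200))
pred_exons = set(range(110, 230))
sn, sp = base_level_accuracy(true_exons, pred_exons)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")  # Sn = 0.90, Sp = 0.75
```

Under this convention, "Sp = 86%" corresponds to 14% false positives among predicted coding bases, which matches the phrasing of the abstract.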


Subject(s)
Algorithms , DNA/genetics , Drosophila melanogaster/genetics , Genes, Insect/genetics , Software , Animals , Computational Biology/methods , Databases, Factual , Humans
9.
Protein Sci ; 8(4): 771-7, 1999 Apr.
Article in English | MEDLINE | ID: mdl-10211823

ABSTRACT

We describe the results of a procedure for maximizing the number of sequences that can be reliably linked to a protein of known three-dimensional structure. Unlike other methods, which try to increase sensitivity through the use of fold recognition software, we only use conventional sequence alignment tools, but apply them in a manner that significantly increases the number of relationships detected. We analyzed 11 genomes and found that, depending on the genome, between 23 and 32% of the ORFs had significant matches to proteins of known structure. In all cases, the aligned region consisted of either >100 residues or >50% of the smaller sequence. Slightly higher percentages could be attained if smaller motifs were also included. This is significantly higher than most previously reported methods, even those that have a fold-recognition component. We survey the biochemical and structural characteristics of the most frequently occurring proteins, and discuss the extent to which alignment methods can realistically assign function to gene products.


Subject(s)
Protein Conformation , Sequence Analysis, DNA/methods , Algorithms , Computer Simulation , Databases, Factual , Protein Structure, Tertiary , Sensitivity and Specificity , Sequence Alignment/methods
10.
Protein Eng ; 12(2): 95-100, 1999 Feb.
Article in English | MEDLINE | ID: mdl-10195280

ABSTRACT

Using data from the CATH structure classification, we have assessed the blastp, fasta, smith-waterman and gapped-blast algorithms, developed a portable normalization scheme and identified safe thresholds for database searching. Of the four methods assessed, fasta, smith-waterman and gapped-blast perform similarly, whereas the sensitivity of blastp was much lower. Introduction of an intermediate sequence search substantially improved the results. When tested on a set of relationships that could not be identified by blastp, intermediate sequences were able to find double the number of relationships identified by the smith-waterman algorithm alone. However, we found that the benefit of using intermediates varied considerably between each family and depended not only on the number of available sequences, but also their diversity. In an attempt to increase sensitivity further, a multiple intermediate sequence search (MISS) procedure was developed. When assessed on 1906 cases from a wide range of homologous families that could not be detected by the previous approaches, MISS was able to identify 241 additional relationships. MISS uses the full extent of sequence diversity to detect additional relationships, but does not consider any structure-specific information. For this reason, it is more generally applicable than fold recognition and threading methods, which require a library of known structures.
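
The idea of an intermediate sequence search can be sketched as a two-step lookup: a target that the query does not match directly may still be reached through a third sequence that matches both. The sketch below is a simplified illustration only. The `hits` table is a hypothetical stand-in for precomputed significant matches from any pairwise search tool, and the function name is invented here; it does not reproduce the normalization scheme or thresholds of the paper.

```python
def intermediate_hits(query, hits, targets):
    """Find targets matched directly and via one intermediate sequence.

    hits    -- dict: sequence id -> set of ids it matches significantly
               (hypothetical precomputed search results)
    targets -- set of ids with known structure
    Returns (direct, via), where via maps a target id to the list of
    intermediate ids that link it to the query.
    """
    first_round = hits.get(query, set())
    direct = first_round & targets
    via = {}
    for mid in sorted(first_round - targets):      # candidate intermediates
        for target in sorted(hits.get(mid, set()) & targets):
            if target not in direct:
                via.setdefault(target, []).append(mid)
    return direct, via


# Invented example: Q matches T1 directly and reaches T2 through S1.
hits = {"Q": {"S1", "T1"}, "S1": {"T2", "Q"}, "S2": {"T3"}}
targets = {"T1", "T2", "T3"}
direct, via = intermediate_hits("Q", hits, targets)
print(direct, via)  # {'T1'} {'T2': ['S1']}
```

A multiple-intermediate procedure in the spirit of MISS would repeat the second step over more than one link, which this sketch does not attempt.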


Subject(s)
Databases, Factual , Sequence Alignment/methods , Sequence Homology, Amino Acid , Computer Simulation , Models, Statistical , Protein Conformation , Sensitivity and Specificity
11.
Nucleic Acids Res ; 27(1): 248-50, 1999 Jan 01.
Article in English | MEDLINE | ID: mdl-9847192

ABSTRACT

INFOGENE is a database of known and predicted gene structures with descriptions of basic functional signals and gene components. It provides a possibility to create compilations of sequences with a given gene feature as well as to accumulate and analyze predicted genes in finished and unfinished sequences from genome sequencing projects. Protein sequence similarity searches in the database of predicted proteins are offered through the BLASTP program. INFOGENE is realized under the Sequence Retrieval System that provides useful links with the other informational databases. The database is available through the WWW server of the Computational Genomics Group at http://genomic.sanger.ac.uk/db.html


Subject(s)
Databases, Factual , Genes , Genome , Proteins/genetics , Sequence Analysis, DNA , Animals , Arabidopsis/genetics , Base Sequence , Drosophila/genetics , Exons/genetics , Expressed Sequence Tags , Human Genome Project , Humans , Information Storage and Retrieval , Internet , Mice , Proteins/chemistry , Sequence Homology, Amino Acid , Software
12.
Bioinformatics ; 14(5): 384-90, 1998 Jun.
Article in English | MEDLINE | ID: mdl-9682051

ABSTRACT

MOTIVATION: In cDNA sequencing projects, it is vital to know whether the protein coding region of a sequence is complete, or whether errors have occurred during library construction. Here we present a linear discriminant approach that predicts this completeness by estimating the probability of each ATG being the initiation codon. RESULTS: Because of the current shortage of full-length cDNA data on which to base this work, tests were performed on a non-redundant set of 660 initiation codon-containing DNA sequences that had been conceptually spliced into mRNA/cDNA. We also used an edited set of the same sequences that only contained the region following the initiation codon as a negative control. Using the criterion that only a single prediction is allowed for each sequence, a cut-off was selected at which discrimination of both positive and negative sets was equal. At this cut-off, 67% of each set could be correctly distinguished, with the correct ATG codon also being identified in the positive set. Reliability could be increased further by raising the cut-off or including homologues, the relative merits of which are discussed. AVAILABILITY: The prediction program, called ATGpr, and other data are available at http://www.hri.co.jp/atgpr CONTACT: swintech@hri.co.jp


Subject(s)
DNA, Complementary/genetics , Proteins/genetics , Sequence Analysis, DNA , Base Sequence , Codon, Initiator/genetics , Computational Biology , Databases, Factual , Humans , Open Reading Frames , RNA, Messenger/genetics
13.
J Mol Biol ; 268(1): 31-6, 1997 Apr 25.
Article in English | MEDLINE | ID: mdl-9149139

ABSTRACT

The accuracy of secondary structure prediction methods has been improved significantly by the use of aligned protein sequences. The PHD method and the NNSSP method reach 71 to 72% of sustained overall three-state accuracy when multiple sequence alignments are used with neural networks and nearest-neighbor algorithms, respectively. We introduce a variant of the nearest-neighbor approach that can achieve similar accuracy using a single sequence as the query input. We compute the 50 best non-intersecting local alignments of the query sequence with each sequence from a set of proteins with known 3D structures. Each position of the query sequence is aligned with the database amino acids in alpha-helical, beta-strand or coil states. The predicted type of secondary structure is selected as the type of aligned position with the maximal total score. On the dataset of 124 non-membrane non-homologous proteins, used earlier as a benchmark for secondary structure predictions, our method reaches an overall three-state accuracy of 71.2%. The performance accuracy is verified by an additional test on 461 non-homologous proteins giving an accuracy of 71.0%. The main strength of the method is the high level of prediction accuracy for proteins without any known homolog. Using multiple sequence alignments as input, the method has a prediction accuracy of 73.5%. Prediction of secondary structure by the SSPAL method is available via the Baylor College of Medicine World Wide Web server.


Subject(s)
Algorithms , Models, Molecular , Protein Structure, Secondary , Sequence Alignment/methods , Amino Acid Sequence , Databases, Factual , Molecular Sequence Data , Proteins/chemistry
14.
Comput Appl Biosci ; 13(1): 23-8, 1997 Feb.
Article in English | MEDLINE | ID: mdl-9088705

ABSTRACT

We have developed a computer program POLYAH and an algorithm for the identification of 3'-processing sites of human mRNA precursors. The algorithm is based on a linear discriminant function (LDF) trained to discriminate real poly(A) signal regions from the other regions of human genes possessing the AATAAA sequence which is most likely non-functional. As the parameters of the LDF, various significant contextual characteristics of sequences surrounding AATAAA signals were used. The accuracy of the method has been estimated on a set of 131 poly(A) regions and 1466 regions of human genes having the AATAAA sequence. When the threshold was set to predict 86% of poly(A) regions correctly, a specificity of 51% and a correlation coefficient of 0.62 were achieved. The precision of this approach is better than for the other methods and has been tested on a larger data set. POLYAH can be used through the World Wide Web (at the Gene-Finder Home page: URL http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html) or by sending files with uncharacterized human sequences to the University of Houston or Weizmann Institute of Science e-mail servers.


Subject(s)
Algorithms , RNA Precursors/metabolism , RNA Processing, Post-Transcriptional , RNA, Messenger/metabolism , Software , Base Sequence , Binding Sites , Computer Communication Networks , Computer Simulation , Databases, Factual , Discriminant Analysis , Humans , RNA Precursors/chemistry , RNA Precursors/genetics , RNA, Messenger/chemistry , RNA, Messenger/genetics
15.
Article in English | MEDLINE | ID: mdl-9322052

ABSTRACT

We present a suite of new programs for promoter, 3'-processing site, splice site, coding exon and gene structure identification in genomic DNA of several model species. The human gene structure prediction program FGENEH, the exon prediction program FEXH and the splice site prediction program HSPL have been modified for sequence analysis of Drosophila (FGENED, FEXD and DSPL), C. elegans (FGENEN, FEXN and NSPL), Yeast (FEXY and YSPL) and Plant (FGENEA, FEXA and ASPL) genomic sequences. We recomputed all frequency and discriminant function parameters for these organisms and adjusted organism-specific minimal intron lengths. The accuracy of coding region prediction for these programs is similar to the observed accuracy of FEXH and FGENEH. We have developed FEXHB and FGENEHB programs combining pattern recognition features and information about similarity of predicted exons with known sequences in protein databases. These programs have approximately 10% higher average accuracy of coding region recognition. Two new programs for human promoter site prediction (TSSG and TSSW) have been developed which use the Ghosh (1993) and Wingender (1994) databases of functional motifs, respectively. The POLYAH program was designed for prediction of 3'-processing regions in human genes and the CDSB program was developed for bacterial gene prediction. We have developed a new approach to predict multiple genes based on double dynamic programming, which is very important for analysis of long genomic DNA fragments generated by genome sequencing projects. Analysis of uncharacterized sequences based on our methods is available through the University of Houston, Weizmann Institute of Science email servers and several Web pages at Baylor College of Medicine.


Subject(s)
Genome, Human , Genome , Software , Animals , DNA/genetics , Databases, Factual , Exons , Genes, Bacterial , Humans , Models, Genetic , Promoter Regions, Genetic , RNA Splicing , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data
16.
J Mol Biol ; 247(1): 11-5, 1995 Mar 17.
Article in English | MEDLINE | ID: mdl-7897654

ABSTRACT

Recently Yi & Lander used a neural network and nearest-neighbor method with a scoring system that combined a sequence-similarity matrix with the local structural environment scoring scheme described by Bowie and co-workers for predicting protein secondary structure. We have improved their scoring system by taking into consideration N and C-terminal positions of alpha-helices and beta-strands and also beta-turns as distinctive types of secondary structure. Another improvement, which also decreases the time of computation, is achieved by restricting the database to a smaller subset of proteins that are similar to the query sequence. Using multiple sequence alignments rather than single sequences and a simple jury decision procedure, our method reaches a sustained overall three-state accuracy of 72.2%, which is better than that observed for the most accurate multilayered neural-network approach, tested on the same data set of 126 non-homologous protein chains.


Subject(s)
Protein Structure, Secondary , Proteins/chemistry , Algorithms , Biological Evolution , Sequence Alignment , Sequence Homology, Amino Acid , Software
17.
Article in English | MEDLINE | ID: mdl-7584460

ABSTRACT

Development of advanced techniques to identify gene structure is one of the main challenges of the Human Genome Project. Discriminant analysis was applied to the construction of recognition functions for various components of gene structure. Linear discriminant functions for splice sites, 5'-coding, internal exon, and 3'-coding region recognition have been developed. A gene structure prediction system FGENE has been developed based on the exon recognition functions. We compute a graph of mutual compatibility of different exons and present gene structure models as paths of this directed acyclic graph. For optimal model selection we apply a variant of the dynamic programming algorithm to search for the path in the graph with the maximal value of the corresponding discriminant functions. Prediction by FGENE for 185 complete human gene sequences has 81% exact exon recognition accuracy and 91% accuracy at the level of individual exon nucleotides, with a correlation coefficient (C) of 0.90. Testing FGENE on 35 genes not used in the development of discriminant functions shows 71% accuracy of exact exon prediction and 89% at the nucleotide level (C = 0.86). FGENE compares very favorably with the other programs currently used to predict protein-coding regions. Analysis of uncharacterized human sequences based on our methods for splice site (HSPL, RNASPL), internal exons (HEXON), all types of exons (FEXH) and human (FGENEH) and bacterial (CDSB) gene structure prediction and recognition of human and bacterial sequences (HBR) (to test a library for E. coli contamination) is available through the University of Houston, Weizmann Institute of Science network server and a WWW page of the Human Genome Center at Baylor College of Medicine.
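
The path search described in this abstract can be sketched as a longest-path computation over a directed acyclic graph of candidate exons. The code below is a minimal illustration, not the FGENE implementation: it assumes each candidate exon already carries a single numeric score (standing in for a discriminant function value), and it treats two exons as compatible simply when the first ends before the second starts. Reading-frame and intron-length constraints are deliberately left out, and all names and numbers are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int, float]  # (start, end inclusive, score)


def best_exon_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the maximal total score and the chain of exons achieving it.

    Compatibility (an assumption of this sketch): exon i may precede
    exon j only if i ends strictly before j starts.
    """
    if not exons:
        return 0.0, []
    # Sorting by start guarantees every valid predecessor comes earlier.
    order = sorted(exons, key=lambda e: (e[0], e[1]))
    best = [e[2] for e in order]   # best chain score ending at each exon
    prev = [-1] * len(order)       # back-pointers for path recovery
    for j, (start_j, _, score_j) in enumerate(order):
        for i in range(j):
            end_i = order[i][1]
            if end_i < start_j and best[i] + score_j > best[j]:
                best[j] = best[i] + score_j
                prev[j] = i
    k = max(range(len(order)), key=best.__getitem__)
    total = best[k]
    chain = []
    while k != -1:                 # walk back-pointers to rebuild the path
        chain.append(order[k])
        k = prev[k]
    return total, chain[::-1]


# Invented candidates: overlapping exons compete, compatible ones chain.
candidates = [(10, 50, 2.0), (40, 90, 3.5), (60, 120, 1.5),
              (100, 150, 2.5), (130, 180, 1.0)]
total, chain = best_exon_chain(candidates)
print(total, chain)  # 6.0 [(40, 90, 3.5), (100, 150, 2.5)]
```

This quadratic-time version is enough to show the principle; a production gene finder would add frame compatibility and typed first, internal and last exons to the edge test.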


Subject(s)
Algorithms , DNA/chemistry , DNA/genetics , Exons , Human Genome Project , Software , Base Sequence , Discriminant Analysis , Genes, Bacterial , Humans , Models, Genetic , Molecular Sequence Data , Open Reading Frames
18.
Nucleic Acids Res ; 22(24): 5156-63, 1994 Dec 11.
Article in English | MEDLINE | ID: mdl-7816600

ABSTRACT

A new method which predicts internal exon sequences in human DNA has been developed. The method is based on a splice site prediction algorithm that uses the linear discriminant function to combine information about significant triplet frequencies of various functional parts of splice site regions and preferences of oligonucleotides in protein coding and intron regions. The accuracy of our splice site recognition function is 97% for donor splice sites and 96% for acceptor splice sites. For exon prediction, we combine in a discriminant function the characteristics describing the 5'-intron region, donor splice site, coding region, acceptor splice site and 3'-intron region for each open reading frame flanked by GT and AG base pairs. The accuracy of precise internal exon recognition on a test set of 451 exon and 246693 pseudoexon sequences is 77% with a specificity of 79%. The recognition quality computed at the level of individual nucleotides is 89% for exon sequences and 98% for intron sequences. This corresponds to a correlation coefficient for exon prediction of 0.87. The precision of this approach is better than other methods and has been tested on a larger data set. We have also developed a means for predicting exon-exon junctions in cDNA sequences, which can be useful for selecting optimal PCR primers.
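
The candidate set mentioned in this abstract, open reading frames flanked by AG and GT dinucleotides, can be enumerated with a few lines of code. The sketch below only generates candidates; it does not score them, and it is not the authors' program. It assumes an internal exon starts right after an AG (acceptor) and ends right before a GT (donor), and that a candidate qualifies if at least one of its three frames is free of stop codons. The function names and the `min_len` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_frames(seq):
    """Return the frames (0, 1, 2) in which seq has no in-frame stop codon."""
    frames = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(frame)
    return frames


def candidate_internal_exons(dna, min_len=6):
    """List (start, end, frames) for segments flanked by AG ... GT.

    start is the first base after an AG, end is the index of the G of a
    downstream GT (exclusive end), frames are the stop-free frames.
    min_len is an arbitrary illustrative cutoff.
    """
    dna = dna.upper()
    starts = [i + 2 for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    ends = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    candidates = []
    for start in starts:
        for end in ends:
            if end - start >= min_len:
                frames = open_frames(dna[start:end])
                if frames:
                    candidates.append((start, end, frames))
    return candidates


# Invented toy sequence: ...AG | ATGGCTGACC | GT...
print(candidate_internal_exons("CCAGATGGCTGACCGTAA"))
# [(4, 14, [0, 1])]  (frame 2 contains the stop codon TGA)
```

In the method described above, each such candidate would then be scored by a discriminant function that combines splice-site and coding-potential features; most candidates are pseudoexons, which is why the test set contains far more pseudoexons than real exons.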


Subject(s)
Algorithms , Base Composition , Exons/genetics , Open Reading Frames/genetics , Base Sequence , Databases, Factual , Discriminant Analysis , Humans , Introns/genetics , Molecular Sequence Data , Oligodeoxyribonucleotides/genetics , RNA Splicing/genetics
19.
Comput Appl Biosci ; 10(6): 661-9, 1994 Dec.
Article in English | MEDLINE | ID: mdl-7704665

ABSTRACT

All current methods of protein secondary structure prediction are based on evaluation of a single residue state. Although the accuracy of the best of them is approximately 60-70%, for reliable prediction of tertiary structure it is more useful to predict an approximate location of alpha-helix and beta-strand segments, especially prolonged ones. We have developed a simple method for protein secondary structure prediction which is oriented toward the location of secondary structure segments. The method uses linear discriminant analysis to assign segments of a given amino acid sequence a particular type of secondary structure, by taking into account the amino acid composition of internal parts of segments as well as their terminal and adjacent regions. Four linear discriminant functions were constructed for recognition of short and long alpha-helix and beta-strand segments respectively. These functions combine three characteristics: hydrophobic moment, segment singlet, and pair preferences to an alpha-helix or beta-strand. The last two characteristics are calculated by summing the preference parameters of single residues and pairs of residues located in a segment and its adjacent regions. The final program SSP predicts all possible potential alpha-helices and beta-strands and resolves some possible overlap between them. Overall three-state (alpha, beta, c) prediction gives approximately 65.1% correctly predicted residues on 126 non-homologous proteins using the jackknife test procedure. Analysis of the prediction results shows a high prediction accuracy of long secondary structure segments (approximately 89% of alpha-helices of length > 8 and approximately 71% of beta-strands of length > 6 are correctly located with probability of correct prediction 0.82 and 0.78 respectively). (ABSTRACT TRUNCATED AT 250 WORDS)


Subject(s)
Protein Structure, Secondary , Proteins/chemistry , Software , Algorithms , Amino Acid Sequence , Computer Simulation , Discriminant Analysis , Models, Molecular , Molecular Sequence Data , Molecular Structure , Protein Folding , Protein Structure, Tertiary , Sequence Alignment
20.
Article in English | MEDLINE | ID: mdl-7584412

ABSTRACT

Discriminant analysis is applied to the problem of recognizing 5'-, internal and 3'-exons in human DNA sequences. Specific recognition functions were developed for revealing exons of particular types. The method is based on a splice site prediction algorithm that uses the linear Fisher discriminant to combine the information about significant triplet frequencies of various functional parts of splice site regions and preferences of oligonucleotides in protein coding and intron regions (Solovyev, Lawrence, 1994). The accuracy of our splice site recognition function is about 97%. A discriminant function for 5'-exon prediction includes hexanucleotide composition of the upstream region, triplet composition around the ATG codon, ORF coding potential, donor splice site potential and composition of the downstream intron region. For internal exon prediction, we combine in a discriminant function the characteristics describing the 5'-intron region, donor splice site, coding region, acceptor splice site and 3'-intron region for each open reading frame flanked by GT and AG base pairs. The accuracy of precise internal exon recognition on a test set of 451 exon and 246693 pseudoexon sequences is 77% with a specificity of 79% and a level of pseudoexon ORF prediction of 99.96%. The recognition quality computed at the level of individual nucleotides is 89% for exon sequences and 98% for intron sequences. A discriminant function for 3'-exon prediction includes octanucleotide composition of the upstream intron region, triplet composition around the stop codon, ORF coding potential, acceptor splice site potential and hexanucleotide composition of the downstream region. (ABSTRACT TRUNCATED AT 250 WORDS)


Subject(s)
Computer Simulation , Discriminant Analysis , Exons/genetics , Models, Theoretical , Sequence Analysis , Humans , Oligonucleotides/genetics