Search | VHL Regional Portal

Splice site prediction in Arabidopsis thaliana pre-mRNA by combining local and global sequence information.

Hebsgaard, S M; Korning, P G; Tolstrup, N; Engelbrecht, J; Rouzé, P; Brunak, S.

Nucleic Acids Res ; 24(17): 3439-52, 1996 Sep 01.

Article in English | MEDLINE | ID: mdl-8811101

ABSTRACT

Artificial neural networks have been combined with a rule-based system to predict intron splice sites in the dicot plant Arabidopsis thaliana. A two-step prediction scheme, where a global prediction of the coding potential regulates a cutoff level for a local prediction of splice sites, is refined by rules based on splice site confidence values, prediction scores, coding context and distances between potential splice sites. In this approach, the predictions of splice sites mutually affect each other in a non-local manner. The combined approach drastically reduces the large number of false-positive splice sites that normally plague splice site prediction. An analysis of the errors made by the networks in the first step of the method revealed a previously unknown feature, a frequent T-tract prolongation containing cryptic acceptor sites in the 5' end of exons. The method presented here has been compared with three other approaches, GeneFinder, Gene-Mark and Grail. Overall, the method presented here is an order of magnitude better. We show that the new method is able to find a donor site in the coding sequence for the jellyfish Green Fluorescent Protein, exactly at the position that was experimentally observed in A. thaliana transformants. Predictions for alternatively spliced genes are also presented, together with examples of genes from other dicots, monocots and algae. The method has been made available through electronic mail (NetPlantGene@cbs.dtu.dk), or the WWW at http://www.cbs.dtu.dk/NetPlantGene.html

Subject(s)

Arabidopsis/genetics , Artificial Intelligence , Models, Genetic , RNA Precursors/genetics , RNA Splicing/genetics , RNA, Plant/genetics , Algorithms , DNA, Plant/genetics , Databases, Factual , Exons , Expert Systems , Forecasting , Green Fluorescent Proteins , Introns , Luminescent Proteins/genetics , Molecular Sequence Data , Neural Networks, Computer , Reproducibility of Results
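
The two-step idea in the abstract above, where a global coding-potential prediction regulates the cutoff applied to local splice site scores, can be illustrated with a minimal sketch. This is not the NetPlantGene implementation: the function names, the window size, the linear cutoff formula and all parameter values are assumptions chosen for illustration, and the rule-based refinement step is omitted.

```python
from typing import List, Sequence


def window_mean(values: Sequence[float], start: int, end: int) -> float:
    """Mean of values[start:end], clipped to the sequence; 0.0 if the slice is empty."""
    start, end = max(start, 0), min(end, len(values))
    if start >= end:
        return 0.0
    return sum(values[start:end]) / (end - start)


def predict_sites(
    local_scores: Sequence[float],      # per-position splice site score (hypothetical input)
    coding_potential: Sequence[float],  # per-position coding potential in [0, 1] (hypothetical input)
    kind: str,                          # "donor" or "acceptor"
    base_cutoff: float = 0.9,           # assumed strict default cutoff
    sensitivity: float = 0.5,           # assumed strength of the global regulation
    window: int = 30,                   # assumed window for measuring the coding transition
) -> List[int]:
    """Return positions whose local score passes a position-dependent cutoff."""
    if len(local_scores) != len(coding_potential):
        raise ValueError("score tracks must have the same length")
    if kind not in ("donor", "acceptor"):
        raise ValueError("kind must be 'donor' or 'acceptor'")

    # A donor marks a coding -> non-coding transition; an acceptor the reverse.
    sign = 1.0 if kind == "donor" else -1.0
    sites = []
    for i, score in enumerate(local_scores):
        upstream = window_mean(coding_potential, i - window, i)
        downstream = window_mean(coding_potential, i, i + window)
        transition = sign * (upstream - downstream)
        # Lower the cutoff only where the global signal supports this kind of site.
        cutoff = base_cutoff - sensitivity * max(transition, 0.0)
        if score >= cutoff:
            sites.append(i)
    return sites


# Toy data: coding region of 50 positions followed by 50 non-coding positions.
coding = [1.0] * 50 + [0.0] * 50
scores = [0.0] * 100
scores[20] = 0.6  # moderate score inside the coding region
scores[50] = 0.6  # same score at the coding/non-coding boundary
donors = predict_sites(scores, coding, "donor")
acceptors = predict_sites(scores, coding, "acceptor")
```

In this toy example the two candidate positions have identical local scores, but only the one at the coding boundary passes as a donor, which is the kind of false-positive reduction the abstract attributes to combining local and global information.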

Cleaning the GenBank Arabidopsis thaliana data set.

Korning, P G; Hebsgaard, S M; Rouzé, P; Brunak, S.

Nucleic Acids Res ; 24(2): 316-20, 1996 Jan 15.

Article in English | MEDLINE | ID: mdl-8628656

ABSTRACT

Data-driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, the possibilities are drastically impaired if the stored data is unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A. thaliana entries in GenBank. A number of simple 'sanity' checks, based on the nature of the data, revealed an alarmingly high error rate. More than 15% of the most important entries extracted contained erroneous information. In addition, a number of entries had directly conflicting assignments of exons and introns, not stemming from alternative splicing. In a few cases the errors are due to mere typographical misprints, which may be corrected by comparison to the original papers, but errors caused by wrong assignments of splice sites from experimental data are the most common. It is proposed that the level of error correction should be increased and that gene structure sanity checks should be incorporated, also at the submitter level, to avoid or reduce the problem in the future. A non-redundant and error-corrected subset of the data for A. thaliana is made available through anonymous FTP.

Subject(s)

Arabidopsis/genetics , Databases, Factual , Algorithms , Base Sequence , DNA, Plant/genetics , Genome, Plant , Introns , Molecular Sequence Data , Neural Networks, Computer , RNA Splicing/genetics
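
The abstract above does not list the individual sanity checks, so the following is only a sketch of what a gene structure check of this kind might look like. The specific checks (coordinates in range, exons ordered and non-overlapping, introns bounded by the canonical GT...AG dinucleotides), the function name and the coordinate convention are all assumptions. Non-canonical introns do exist, so a flagged entry is a candidate for manual review rather than a confirmed error.

```python
from typing import List, Tuple


def check_gene_structure(sequence: str, exons: List[Tuple[int, int]]) -> List[str]:
    """Return a list of problems found in an annotated gene structure.

    `exons` holds 0-based, half-open (start, end) intervals in 5'->3' order
    (an assumed convention for this sketch, not the GenBank feature format).
    """
    seq = sequence.upper()
    problems = []

    # Check 1: every exon must lie within the sequence and be non-empty.
    for start, end in exons:
        if not (0 <= start < end <= len(seq)):
            problems.append(f"exon {start}-{end} lies outside the sequence")
    if problems:
        return problems  # coordinates unusable, skip the intron checks

    # Check 2: consecutive exons must be ordered and separated by an intron.
    # Check 3: each intron should start with GT and end with AG (canonical sites).
    for (s1, e1), (s2, e2) in zip(exons, exons[1:]):
        if s2 <= e1:
            problems.append(f"exons {s1}-{e1} and {s2}-{e2} overlap or touch")
            continue
        intron = seq[e1:s2]
        if not intron.startswith("GT"):
            problems.append(f"intron {e1}-{s2} does not start with GT")
        if not intron.endswith("AG"):
            problems.append(f"intron {e1}-{s2} does not end with AG")
    return problems


# Toy entry: two 6 bp exons around a 13 bp intron (made-up sequence).
toy_sequence = "ATGGCC" + "GTAAGTTTTTCAG" + "GGCTAA"
clean = check_gene_structure(toy_sequence, [(0, 6), (19, 25)])
shifted = check_gene_structure(toy_sequence, [(0, 7), (19, 25)])
overlapping = check_gene_structure(toy_sequence, [(0, 6), (4, 25)])
```

Here `clean` is empty, while `shifted` reports a donor boundary displaced by one base, the sort of wrong splice site assignment the abstract describes as the most common error.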
