Evaluation of EST-data using the genome assembly.

Murray, Christian G; Larsson, Thomas P; Hill, Tobias; Björklind, Rikard; Fredriksson, Robert; Schiöth, Helgi B.

Biochem Biophys Res Commun ; 331(4): 1566-76, 2005 Jun 17.

Article in English | MEDLINE | ID: mdl-15883052

ABSTRACT

Using expressed sequence tag (EST) data for genome-wide studies requires a thorough understanding of the problems involved in handling these sequences. We investigated how EST clustering performs when the genome is used as guidance, compared with pairwise sequence alignment methods. We show that clustering with the genome as a template outperforms the sequence similarity methods used to create other EST clusters, such as the UniGene set, with respect to the extent to which ESTs originating from the same transcriptional unit are separated into disjunct clusters. Using our approach, approximately 80% of the RefSeq genes were represented by a single EST cluster and 20% by two or more EST clusters. In contrast, approximately 25% of all RefSeq genes were found to be represented by a single cluster with the UniGene clustering method. The approach minimizes the risk of overestimation due to the number of disjunct clusters originating from the same transcript. We have also investigated the quality of EST data by aligning ESTs to the genome. The results show how many ESTs are not adequately trimmed with respect to vector sequences and low-quality regions. Moreover, we identified important problems related to ESTs aligned to the genome using BLAT, such as inferring splice junctions, and explained this aspect by simulations with synthetic data. EST clusters created with the method are available upon request from the authors.
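
The idea of clustering with the genome as a template can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes each EST has already been aligned to the genome (for example with BLAT) and reduced to a single hit with half-open coordinates, and it applies a simple overlap rule that is an assumption of this example. The function name `cluster_by_genome` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def cluster_by_genome(alignments):
    """Group ESTs whose genome alignments overlap on the same chromosome and strand.

    alignments: iterable of (est_id, chrom, strand, start, end) tuples,
    with half-open coordinates [start, end). This layout is an assumption
    of the sketch, not a format taken from the paper.
    """
    # Bucket alignments by locus key so only same-chromosome,
    # same-strand hits can be merged.
    by_locus = defaultdict(list)
    for est_id, chrom, strand, start, end in alignments:
        by_locus[(chrom, strand)].append((start, end, est_id))

    clusters = []
    for key in sorted(by_locus):
        current, current_end = [], None
        # Sweep the hits in order of start position.
        for start, end, est_id in sorted(by_locus[key]):
            # A hit starting at or after the running end does not overlap
            # the current cluster, so a new cluster begins.
            if current and start >= current_end:
                clusters.append(current)
                current, current_end = [], None
            current.append(est_id)
            current_end = end if current_end is None else max(current_end, end)
        if current:
            clusters.append(current)
    return clusters


example = [
    ("est1", "chr1", "+", 100, 500),
    ("est2", "chr1", "+", 400, 900),
    ("est3", "chr1", "+", 2000, 2500),
    ("est4", "chr1", "-", 450, 800),
    ("est5", "chr2", "+", 100, 300),
]
print(cluster_by_genome(example))
```

In this sketch two ESTs join the same cluster only if their genomic footprints overlap, so ESTs covering opposite ends of one gene can still end up in separate clusters. Grouping by an annotated gene span, such as a RefSeq locus, would be one possible way to merge them.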

Subject(s)

Expressed Sequence Tags , Genome, Human , Base Sequence , Humans , RNA Splicing , RNA, Messenger/genetics , Sequence Alignment , Sequence Homology, Nucleic Acid

Comparison of the current RefSeq, Ensembl and EST databases for counting genes and gene discovery.

Larsson, Thomas P; Murray, Christian G; Hill, Tobias; Fredriksson, Robert; Schiöth, Helgi B.

FEBS Lett ; 579(3): 690-8, 2005 Jan 31.

Article in English | MEDLINE | ID: mdl-15670830

ABSTRACT

Large amounts of refined sequence material in the form of predicted, curated and annotated genes and expressed sequence tags (ESTs) have recently been added to the NCBI databases. We matched the transcript sequences of RefSeq, Ensembl and dbEST in an attempt to provide an updated overview of how many unique human genes can be found. The results indicate that there are about 25,000 unique genes in the union of RefSeq and Ensembl, with 12-18% and 8-13% of the genes in each set, respectively, being unique relative to the other set. About 20% of all genes had splice variants. A considerable number of ESTs (2,200,000) do not match the identified genes, and we used an in-house pipeline to identify 22 novel genes from Genscan predictions that have considerable EST coverage. The study provides insight into the current status of human gene catalogues and shows that considerable refinement of methods and datasets is needed to arrive at a conclusive gene count.

Subject(s)

Expressed Sequence Tags , Alternative Splicing , Genome, Human , Humans , RNA, Messenger/genetics
