ABSTRACT
Pseudogenes, alternative transcripts, noncoding RNA, and polymorphisms each add extensive complexity to the mammalian transcriptome and confound estimation of the total number of genes. Despite advanced algorithms for gene prediction and several large-scale efforts to obtain cDNA clones for all human open reading frames (ORFs), no single collection is complete. To enhance this effort, we have developed a high-throughput pipeline for reverse transcription PCR (RT-PCR) gene recovery. Most importantly, we have developed and optimized novel molecular strategies for improving the RT-PCR yield of transcripts that have been difficult to isolate by other means, together with computational strategies for clone sequence validation. This systematic gene recovery pipeline both allows rescue of predicted human and rat genes and provides insight into the complexity of the transcriptome through comparisons with existing data sets.
Subject(s)
DNA, Complementary/genetics , Reverse Transcriptase Polymerase Chain Reaction/methods , Automation , Cloning, Molecular , DNA, Complementary/biosynthesis , RNA, Messenger/genetics , Sequence Analysis, DNA
ABSTRACT
We describe a high-throughput cDNA sequencing pipeline (http://www.hgsc.bcm.tmc.edu/projects/cdna) built in response to the emerging need for rapid sequencing of large cDNA collections. Using this strategy, cDNA inserts are purified and concatenated into large molecules. These 'pseudo-BACs' are subjected to random shotgun sequencing, whereby the majority of cDNA inserts in the pool are sequenced. Using this concatenation cDNA sequencing platform, we have contributed more than 13,000 full-length cDNA sequences from human and mouse to the Mammalian Gene Collection (MGC).