1.
BMC Genomics ; 8: 391, 2007 Oct 26.
Article in English | MEDLINE | ID: mdl-17963481

ABSTRACT

BACKGROUND: Repeats are present in all genomes, and often have important functions. However, in large genome sequencing projects, many repetitive regions remain uncharacterized. The genome of the protozoan parasite Trypanosoma cruzi consists of more than 50% repeats. These repeats include surface molecule genes, and several other gene families. In the T. cruzi genome sequencing project, it was clear that not all copies of repetitive genes were present in the assembly, due to collapse of nearly identical repeats. However, at the time of publication of the T. cruzi genome, it was not clear to what extent this had occurred. RESULTS: We have developed a pipeline to estimate the genomic repeat content, where shotgun reads are aligned to the genomic sequence and the gene copy number is estimated using the average shotgun coverage. This method was applied to the genome of T. cruzi and copy numbers of all protein coding sequences and pseudogenes were estimated. The 22,640 results were stored in a database available online. 18% of all protein coding sequences and pseudogenes were estimated to exist in 14 or more copies in the T. cruzi CL Brener genome. The average coverage of the annotated protein coding sequences and pseudogenes indicates a total gene copy number, including allelic gene variants, of over 40,000. CONCLUSION: Our results indicate that the number of protein coding sequences and pseudogenes in the T. cruzi genome may be twice the previous estimate. We have constructed a database of the T. cruzi gene repeat data that is available as a resource to the community. The main purpose of the database is to enable biologists interested in repeated, unfinished regions to closely examine and resolve these regions themselves using all available shotgun data, instead of having to rely on annotated consensus sequences that are often erroneous and possibly misleading.
Five repetitive genes were studied in more detail, in order to illustrate how the database can be used to analyze and extract information about gene repeats with different characteristics in Trypanosoma cruzi.


Subject(s)
Databases, Genetic , Genetic Variation , Repetitive Sequences, Nucleic Acid , Trypanosoma cruzi/genetics , Amino Acid Sequence , Animals , Antigens, Surface/genetics , Conserved Sequence , DNA, Protozoan , Gene Amplification , Gene Dosage , Genes, Protozoan/physiology , Genome, Protozoan , Membrane Proteins/genetics , Models, Biological , Molecular Sequence Data , Sequence Homology, Amino Acid
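
The coverage-based estimate described in record 1 can be illustrated with a minimal sketch. It assumes that per-base read depth over a gene is already available and that the depth expected for a single-copy region is known; the function name `estimate_copy_number` and both parameters are hypothetical and are not taken from the published pipeline, whose actual statistics may differ.

```python
from statistics import mean


def estimate_copy_number(gene_coverage, single_copy_coverage):
    """Estimate copy number as mean read depth over a gene divided by
    the depth expected for a single-copy region.

    gene_coverage        -- per-base read depths across the gene (assumed input)
    single_copy_coverage -- expected depth of a unique region (assumed input)
    """
    if not gene_coverage:
        raise ValueError("gene_coverage must not be empty")
    if single_copy_coverage <= 0:
        raise ValueError("single_copy_coverage must be positive")
    return mean(gene_coverage) / single_copy_coverage


if __name__ == "__main__":
    # Illustrative numbers only: a gene covered about 30x in a data set
    # where unique regions are covered about 10x.
    per_base_depth = [28, 30, 32, 30]
    print(estimate_copy_number(per_base_depth, 10.0))
```

A ratio near 1 would suggest a single-copy gene under these assumptions, while larger ratios suggest that several nearly identical copies collapsed onto one locus in the assembly.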
2.
Comput Methods Programs Biomed ; 86(1): 87-92, 2007 Apr.
Article in English | MEDLINE | ID: mdl-17292508

ABSTRACT

Modern alignment methods designed to work rapidly and efficiently with large datasets often do so at the cost of method sensitivity. To overcome this, we have developed a novel alignment program, GRAT, built to accurately align short, highly similar DNA sequences. The program runs rapidly and requires no more memory and CPU power than a desktop computer. In addition, specificity is ensured by statistically separating the true alignments from spurious matches using phred quality values. An efficient separation is especially important when searching large datasets and whenever there are repeats present in the dataset. Results are superior in comparison to widely used existing software, and analysis of two large genomic datasets shows the usefulness and scalability of the algorithm.


Subject(s)
Sequence Alignment/instrumentation , Sequence Analysis, DNA , Software Design , Algorithms , Animals , Chickens
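
The abstract of record 2 does not spell out how phred quality values are used, so the following is only a sketch of the general idea, not GRAT's statistic. It relies on the standard phred definition, in which a quality Q corresponds to an error probability of 10^(-Q/10), and counts mismatches that fall on high-quality bases, since those are unlikely to be sequencing errors. The function names and the `min_q` threshold are hypothetical.

```python
def phred_to_error_prob(q):
    """Convert a phred quality value to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def high_quality_mismatches(read, ref, quals, min_q=20):
    """Count mismatches between an aligned read and reference that occur
    at positions whose phred quality is at least min_q.

    All three inputs are assumed to be gap-free and of equal length.
    """
    if not (len(read) == len(ref) == len(quals)):
        raise ValueError("read, ref and quals must have equal length")
    return sum(
        1
        for r, g, q in zip(read, ref, quals)
        if r != g and q >= min_q
    )


if __name__ == "__main__":
    # A mismatch at a Q30 base counts; a mismatch at a Q10 base does not.
    print(high_quality_mismatches("ACGT", "ACCT", [30, 30, 30, 10]))
    print(high_quality_mismatches("ACGA", "ACGT", [30, 30, 30, 10]))
```

Under this simplification, an alignment with several high-quality mismatches is more likely to be a spurious match or a different repeat copy than a true placement with sequencing errors.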
3.
BMC Bioinformatics ; 7: 155, 2006 Mar 20.
Article in English | MEDLINE | ID: mdl-16549006

ABSTRACT

BACKGROUND: Many genome projects are left unfinished due to complex, repeated regions. Finishing is the most time consuming step in sequencing and current finishing tools are not designed with particular attention to the repeat problem. RESULTS: We have developed DNPTrapper, a shotgun sequence finishing tool, specifically designed to address the problems posed by the presence of repeated regions in the target sequence. The program detects and visualizes single base differences between nearly identical repeat copies, and offers the overview and flexibility needed to rapidly resolve complex regions within a working session. The use of a database allows large amounts of data to be stored and handled, and allows viewing of mammalian size genomes. The program is available under an Open Source license. CONCLUSION: With DNPTrapper, it is possible to separate repeated regions that previously were considered impossible to resolve, and finishing tasks that previously took days or weeks can be resolved within hours or even minutes.


Subject(s)
Algorithms , DNA/genetics , Documentation/methods , Repetitive Sequences, Nucleic Acid/genetics , Sequence Analysis, DNA/methods , Software , User-Computer Interface , Base Sequence , DNA/analysis , DNA/chemistry , Molecular Sequence Data
4.
Science ; 309(5733): 409-15, 2005 Jul 15.
Article in English | MEDLINE | ID: mdl-16020725

ABSTRACT

Whole-genome sequencing of the protozoan pathogen Trypanosoma cruzi revealed that the diploid genome contains a predicted 22,570 proteins encoded by genes, of which 12,570 represent allelic pairs. Over 50% of the genome consists of repeated sequences, such as retrotransposons and genes for large families of surface molecules, which include trans-sialidases, mucins, gp63s, and a large novel family (>1300 copies) of mucin-associated surface protein (MASP) genes. Analyses of the T. cruzi, T. brucei, and Leishmania major (Tritryp) genomes imply differences from other eukaryotes in DNA repair and initiation of replication and reflect their unusual mitochondrial DNA. Although the Tritryp lack several classes of signaling molecules, their kinomes contain a large and diverse set of protein kinases and phosphatases; their size and diversity imply previously unknown interactions and regulatory processes, which may be targets for intervention.


Subject(s)
Genome, Protozoan , Protozoan Proteins/genetics , Sequence Analysis, DNA , Trypanosoma cruzi/genetics , Animals , Chagas Disease/drug therapy , Chagas Disease/parasitology , DNA Repair , DNA Replication , DNA, Mitochondrial/genetics , DNA, Protozoan/genetics , Genes, Protozoan , Humans , Meiosis , Membrane Proteins/chemistry , Membrane Proteins/genetics , Membrane Proteins/physiology , Multigene Family , Protozoan Proteins/chemistry , Protozoan Proteins/physiology , Recombination, Genetic , Repetitive Sequences, Nucleic Acid , Retroelements , Signal Transduction , Telomere/genetics , Trypanocidal Agents/pharmacology , Trypanocidal Agents/therapeutic use , Trypanosoma cruzi/chemistry , Trypanosoma cruzi/physiology
5.
Gene ; 341: 149-65, 2004 Oct 27.
Article in English | MEDLINE | ID: mdl-15474298

ABSTRACT

Although microsatellites with functional effects have been described, generally, these repeats are considered as "junk" DNA in the same way as other repetitive sequences. Our aim was to investigate whether certain microsatellites can have a functional role as cis-regulatory elements. A database was created of all short tandem repeats, from 2 to 10 bases, located in the first 10-kb 5' of the transcription start sites of all annotated genes of the human genome. Of 114 microsatellites selected based on their size and location in the promoter, 51 were found to be polymorphic. Using electrophoretic mobility shift assay (EMSA), we studied five repetitive motifs; three of them, which were found in 12 of the polymorphic microsatellites, displayed specific protein binding. An interesting microsatellite is the CTC/GAG repeat which, as double-stranded (DS) DNA, bound specificity protein 1 (SP1) with high affinity, formed triplexes in vitro and displayed differences in SP1 binding and triplex formation capacity for repeats with distinct numbers of repeat units. Interestingly, the polypyrimidine strand of the repeat (CTC) bound other proteins such as polypyrimidine tract-binding protein 1 (PTBP1) as single-stranded (SS) DNA, and a model with two alternative DNA conformations is proposed for these repeats. Distinct protein binding to DS DNA was also observed for different numbers of AAACA and AAAAT repeats. Our results suggest that certain microsatellites may act as cis-regulatory elements, controlling gene expression through transcription factor binding and/or secondary DNA structure formation. Due to their high polymorphism and abundance, they might represent an important source of quantitative genetic variation.


Subject(s)
Microsatellite Repeats/genetics , Regulatory Sequences, Nucleic Acid/genetics , Transcription Factors/metabolism , Base Sequence , Binding Sites/genetics , Chromatography, High Pressure Liquid/methods , Competitive Bidding , DNA/chemistry , DNA/genetics , DNA/metabolism , DNA-Binding Proteins/metabolism , Databases, Nucleic Acid , Electrophoretic Mobility Shift Assay , Genotype , HeLa Cells , Humans , Molecular Sequence Data , Oligonucleotides/genetics , Oligonucleotides/metabolism , Polymorphism, Genetic , Promoter Regions, Genetic/genetics , Protein Binding , Sequence Analysis, DNA , Sequence Homology, Nucleic Acid , Sp1 Transcription Factor/metabolism
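
Record 5 describes a database of short tandem repeats with units of 2 to 10 bases. One simple way to locate such repeats in a sequence is a regular expression with a back-reference, sketched below. The function name `find_tandem_repeats` and the `min_copies` threshold are assumptions for illustration; the abstract does not state how the authors detected or filtered repeats.

```python
import re


def find_tandem_repeats(seq, min_unit=2, max_unit=10, min_copies=3):
    """Return (start, unit, copies) for perfect tandem repeats in seq.

    The unit is matched lazily, so the shortest repeating unit is reported.
    Note that homopolymer runs such as AAAAAA are reported with unit 'AA'.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


if __name__ == "__main__":
    # Four copies of the CTC motif flanked by unrelated bases.
    print(find_tandem_repeats("GGCTCCTCCTCCTCAA"))
```

Only perfect repeats are found by this sketch; interrupted or degenerate microsatellites would need a more tolerant search.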
6.
Bioinformatics ; 20(5): 803-4, 2004 Mar 22.
Article in English | MEDLINE | ID: mdl-14751967

ABSTRACT

Finishing, i.e. gap closure and editing, is the most time-consuming part of genome sequencing. Repeated sequences together with sequencing errors complicate the assembly and often result in misassemblies that are difficult to correct. Repeat Discrepancy Tagger (ReDiT) is a tool designed to aid in the finishing step. This software processes assembly results produced by any fragment assembly program that outputs ace files. The input sequences are analyzed to determine possible differences between repeated sequences. The output is written as tags in an ace file that can be viewed by, e.g. the Consed sequence editor. AVAILABILITY: The ReDiT program is freely available at http://web.cgb.ki.se/redit


Subject(s)
Chromosome Mapping/methods , Documentation/methods , Expressed Sequence Tags , Repetitive Sequences, Nucleic Acid/genetics , Sequence Analysis, DNA/methods , Software , User-Computer Interface , Algorithms , Base Sequence , Computer Graphics , Gene Expression Profiling , Genome , Molecular Sequence Data , Sequence Alignment/methods , Word Processing/methods
7.
Nucleic Acids Res ; 31(15): 4663-72, 2003 Aug 01.
Article in English | MEDLINE | ID: mdl-12888528

ABSTRACT

Sequencing errors in combination with repeated regions cause major problems in shotgun sequencing, mainly due to the failure of assembly programs to distinguish single base differences between repeat copies from erroneous base calls. In this paper, a new strategy designed to correct errors in shotgun sequence data using defined nucleotide positions, DNPs, is presented. The method distinguishes single base differences from sequencing errors by analyzing multiple alignments consisting of a read and all its overlaps with other reads. The construction of multiple alignments is performed using a novel pattern matching algorithm, which takes advantage of the symmetry between indices that can be computed for similar words of the same length. This allows for rapid construction of multiple alignments, with no previous pair-wise matching of sequence reads required. Results from a C++ implementation of this method show that up to 99% of sequencing errors can be corrected, while up to 87% of the single base differences remain and up to 80% of the corrected reads contain at most one error. The results also show that the method outperforms the error correction method used in the EULER assembler. The prototype software, MisEd, is freely available from the authors for academic use.


Subject(s)
Sequence Analysis, DNA/methods , Algorithms , Genome , Repetitive Sequences, Nucleic Acid , Sequence Alignment/methods , Software , Time Factors
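
The distinction drawn in record 7 between true single base differences and sequencing errors can be illustrated with a deliberately simplified column test on a multiple alignment. This is not the MisEd algorithm: the sketch assumes the reads are already aligned to equal length, and it uses a plain support threshold (`min_support`, a hypothetical parameter) where the published method uses its own statistical criteria.

```python
from collections import Counter


def classify_columns(aligned_reads, min_support=2):
    """Split variable alignment columns into candidate DNPs and likely errors.

    A column is a candidate defined nucleotide position (DNP) when its
    second most common base is seen in at least min_support reads;
    otherwise the deviation is treated as a probable sequencing error.
    Gap characters ('-') are ignored.
    """
    if not aligned_reads:
        return [], []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    dnp_columns, error_columns = [], []
    for col in range(length):
        counts = Counter(r[col] for r in aligned_reads if r[col] != "-")
        ranked = counts.most_common()
        if len(ranked) < 2:
            continue  # column is uniform, nothing to classify
        _, second_count = ranked[1]
        if second_count >= min_support:
            dnp_columns.append(col)
        else:
            error_columns.append(col)
    return dnp_columns, error_columns


if __name__ == "__main__":
    reads = ["ACGTA", "ACGTA", "ACCTA", "ACCTA", "ACGTT"]
    # Column 2 has G in three reads and C in two; column 4 has a lone T.
    print(classify_columns(reads))
```

In this toy alignment the recurring G/C difference is kept as a candidate repeat difference, while the base seen in only one read is flagged for correction.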