Results 1 - 6 of 6
1.
PLoS One ; 8(5): e62803, 2013.
Article in English | MEDLINE | ID: mdl-23658777

ABSTRACT

There is considerable interest in studying sequenced variations. However, while the positions of substitutions are uniquely identifiable by sequence alignment, the location of insertions and deletions still poses problems. Each insertion and deletion causes a change of sequence. Yet, due to low complexity or repetitive sequence structures, the same indel can sometimes be annotated in different ways. Two indels which differ in allele sequence and position can be one and the same, i.e. the alternative sequence of the whole chromosome is identical in both cases and, therefore, the two deletions are biologically equivalent. In such a case, it is impossible to identify the exact position of an indel merely based on sequence alignment. Thus, variation entries in a mutation database are not necessarily uniquely defined. We prove the existence of a contiguous region around an indel in which all deletions of the same length are biologically identical. Databases often show only one of several possible locations for a given variation. Furthermore, different database entries can represent equivalent variation events. We identified 1,045,590 such problematic entries of insertions and deletions out of 5,860,408 indel entries in the current human database of Ensembl. Equivalent indels are found in sequence regions of different functions like exons, introns or 5' and 3' UTRs. One and the same variation can be assigned to several different functional classifications of which only one is correct. We implemented an algorithm that determines for each indel database entry its complete set of equivalent indels which is uniquely characterized by the indel itself and a given interval of the reference sequence.


Subject(s)
Databases, Genetic , INDEL Mutation/genetics , Molecular Sequence Annotation/methods , Algorithms , Ataxia/genetics , Base Sequence , Codon, Initiator/genetics , Humans , Molecular Sequence Data , Parkinsonian Disorders/genetics , RNA Splice Sites/genetics , Repetitive Sequences, Nucleic Acid/genetics
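
The contiguous region of equivalent deletions described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation; it assumes the common shift-left/shift-right technique on a plain reference string with 0-based coordinates, and the function names `equivalent_deletion_starts` and `apply_deletion` are hypothetical. Only deletions are handled; insertions would need an analogous treatment.

```python
def apply_deletion(ref: str, start: int, length: int) -> str:
    """Return the sequence obtained by deleting ref[start:start+length]."""
    return ref[:start] + ref[start + length:]


def equivalent_deletion_starts(ref: str, start: int, length: int) -> tuple[int, int]:
    """Return (leftmost, rightmost) 0-based start positions of all deletions
    of the given length that produce the same alternative sequence.

    Assumption: shifting a deletion by one base keeps the result unchanged
    exactly when the base entering the deleted window equals the base
    leaving it.
    """
    if length <= 0 or start < 0 or start + length > len(ref):
        raise ValueError("deletion must lie inside the reference")

    # Shift left while the base before the window equals the window's last base.
    left = start
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1

    # Shift right while the window's first base equals the base just after it.
    right = start
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1

    return left, right


if __name__ == "__main__":
    reference = "GATTTTC"  # toy example: a homopolymer run of four T
    lo, hi = equivalent_deletion_starts(reference, 3, 1)
    print(lo, hi)
    # Every start in [lo, hi] yields the same alternative sequence.
    print({apply_deletion(reference, s, 1) for s in range(lo, hi + 1)})
```

In this toy case a single-base deletion annotated at any position inside the T run gives the same alternative sequence, so a database could legitimately list it at several coordinates.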
2.
PLoS One ; 8(12)2013.
Article in English | MEDLINE | ID: mdl-29294467

ABSTRACT

[This corrects the article DOI: 10.1371/journal.pone.0062803.].

3.
Genomics Insights ; 3: 1-8, 2010.
Article in English | MEDLINE | ID: mdl-26279623

ABSTRACT

We compare the results of three different assembler programs, Celera, Phrap and Mira2, for the same set of about a hundred thousand Sanger reads derived from an unknown bacterial genome. In contrast to previous assembly comparisons we do not focus on speed of computation and numbers of assembled contigs but on how the different sequence assemblies agree by content. Threefold consistently assembled genome regions are identified in order to estimate a lower bound of erroneously identified single nucleotide polymorphisms (SNP) caused by nothing but the process of mathematical sequence assembly. We identified 509 sequence triplets common to all three de-novo assemblies spanning only 34% (3.3 Mb) of the bacterial genome with 175 of these regions (~1.5 Mb) including erroneous SNPs and insertions/deletions. Within these triplets this on average leads to one error per 7,155 base pairs. Replacing the assembler Mira2 by the most recent version Mira3, the latter number even drops to 5,923. Our results therefore suggest that a considerably high number of erroneous SNPs may be present in current sequence data and mathematicians should urgently take up research on numerical stability of sequence assembly algorithms. Furthermore, even the latest versions of currently used assemblers produce erroneous SNPs that depend on the order in which reads are used as input. Such errors will severely hamper molecular diagnostics as well as relating genome variation and disease. This issue needs to be addressed urgently as the field is moving fast into clinical applications.

4.
BMC Bioinformatics ; 8 Suppl 5: S7, 2007 May 24.
Article in English | MEDLINE | ID: mdl-17570866

ABSTRACT

BACKGROUND: Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. Gene annotation pipelines match these sequences in order to identify genes and their alternative splice forms. However, the software currently available cannot simultaneously compare sets of sequences as large as necessary, especially if errors must be considered. RESULTS: We therefore present a new algorithm for the identification of almost perfectly matching substrings in very large sets of sequences. Its implementation, called ClustDB, is considerably faster and can handle 16 times more data than VMATCH, the most memory efficient exact program known today. ClustDB simultaneously generates large sets of exactly matching substrings of a given minimum length as seeds for a novel method of match extension with errors. It generates alignments of maximum length with a considered maximum number of errors within each overlapping window of a given size. Such alignments are not optimal in the usual sense but faster to calculate and often more appropriate than traditional alignments for genomic sequence comparisons, EST and full-length cDNA matching, and genomic sequence assembly. The method is used to check the overlaps and to reveal possible assembly errors for 1377 Medicago truncatula BAC-size sequences published at http://www.medicago.org/genome/assembly_table.php?chr=1. CONCLUSION: The program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogeneous alignment quality, as expected in case of random errors, and to detect systematic errors resulting from sequence contaminations. Such inserts are systematically overlooked in long alignments controlled by only tuning penalties for mismatches and gaps. ClustDB is freely available for academic use.


Subject(s)
Genomics/methods , Medicago truncatula/genetics , Sequence Alignment/methods , Sequence Analysis/methods , Software , Algorithms , Chromosomes, Artificial, Bacterial , Sample Size
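
The window criterion in the abstract above (a bounded number of errors within each overlapping window of a given size) can be sketched as follows. This is an illustrative simplification and not ClustDB's algorithm: it assumes ungapped extension to the right of a seed, counts only mismatches as errors, and the function name `extend_right` and its parameters are hypothetical.

```python
from collections import deque


def extend_right(a: str, b: str, i: int, j: int, window: int, max_errors: int) -> int:
    """Extend an ungapped match rightward from a[i] and b[j].

    Extension stops before any run of `window` consecutive columns would
    contain more than `max_errors` mismatches. Returns the number of
    columns accepted.
    """
    recent = deque()  # mismatch flags (0 or 1) of the last `window` columns
    errors = 0        # mismatches currently inside the sliding window
    n = 0             # number of columns accepted so far

    while i + n < len(a) and j + n < len(b):
        mismatch = int(a[i + n] != b[j + n])
        # Slide the window: drop the column that falls out on the left.
        if len(recent) == window:
            errors -= recent.popleft()
        # Reject the column if it would push the window over the limit.
        if errors + mismatch > max_errors:
            break
        recent.append(mismatch)
        errors += mismatch
        n += 1

    return n


if __name__ == "__main__":
    # A single isolated mismatch is tolerated with max_errors=1 ...
    print(extend_right("ACGTACGTAC", "ACGTTCGTAC", 0, 0, window=4, max_errors=1))
    # ... but two mismatches inside one window of four columns stop the extension.
    print(extend_right("AAAAAAAA", "AATATAAA", 0, 0, window=4, max_errors=1))
```

Because the limit applies per window rather than to the whole alignment, isolated errors spread along a long match are accepted while a local cluster of errors ends the extension, which is the behaviour the abstract contrasts with penalty-tuned global alignments.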
5.
Nature ; 445(7123): 47-52, 2007 Jan 04.
Article in English | MEDLINE | ID: mdl-17183269

ABSTRACT

We observe that the time of appearance of cellular compartmentalization correlates with atmospheric oxygen concentration. To explore this correlation, we predict and characterize the topology of all transmembrane proteins in 19 taxa and correlate differences in topology with historical atmospheric oxygen concentrations. Here we show that transmembrane proteins, individually and as a group, were probably selectively excluding oxygen in ancient ancestral taxa, and that this constraint decreased over time when atmospheric oxygen levels rose. As this constraint decreased, the size and number of communication-related transmembrane proteins increased. We suggest the hypothesis that atmospheric oxygen concentrations affected the timing of the evolution of cellular compartmentalization by constraining the size of domains necessary for communication across membranes.


Subject(s)
Evolution, Molecular , Membrane Proteins/chemistry , Oxygen/analysis , Animals , Atmosphere/chemistry , Cell Compartmentation/physiology , Cell Membrane/metabolism , Eukaryotic Cells/metabolism , Intracellular Membranes/metabolism , Membrane Proteins/metabolism , Models, Biological , Oxidation-Reduction , Oxygen/metabolism , Prokaryotic Cells/metabolism , Protein Structure, Tertiary , Proteome/metabolism , Time Factors
6.
In Silico Biol ; 7(6): 613-21, 2007.
Article in English | MEDLINE | ID: mdl-18467774

ABSTRACT

The surprisingly low number of about 25,000 genes in the human genome [1] confirmed a fairly accurate estimate given by King and Jukes in 1969 based on population genetic arguments [2]. On the other hand, the number of different transcripts vastly exceeds gene number. This fact intensified the search for alternatively spliced genes. Recent results [1,3,4-7] suggest that more than 60% of the human genes are alternatively spliced, some of them with a myriad of different splice forms. Alternative splicing is found in all higher eukaryotic species in varying frequency. In this paper we focus on a particular form of alternative splicing, the so-called mutually exclusive exon usage (MEEU). In most known examples mutually exclusive exons are arranged in cassettes of highly similar exons suggesting that they have been derived by exon duplication [8-10]. Since classical gene-finding programs may fail to correctly predict such genes [11-16], we present a method, which is based on local similarity of exons, to detect gene candidates with mutually exclusive exon usage. We have screened the entire genome of D. melanogaster and found five new genes with MEEU in addition to eight previously described cases. An additional 1,703 candidate regions of putative mutually exclusive exons were identified.


Subject(s)
Exons/genetics , Genome, Human , Animals , Base Sequence , Chromosome Mapping , Chromosomes, Human/genetics , Drosophila melanogaster/genetics , Expressed Sequence Tags , Humans , Sequence Alignment