Results 1 - 20 of 36
1.
Proc Natl Acad Sci U S A ; 103(52): 19824-9, 2006 Dec 26.
Article in English | MEDLINE | ID: mdl-17189424

ABSTRACT

We propose an approach for identifying microinversions across different species and show that microinversions provide a source of low-homoplasy evolutionary characters. These characters may be used as "certificates" to verify different branches in a phylogenetic tree, turning the challenging problem of phylogeny reconstruction into a relatively simple algorithmic problem. We estimate that hundreds of thousands of microinversions exist in the genomes of mammals covered by comparative sequencing projects, an untapped source of new phylogenetic characters.
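
Below is a minimal sketch, not the paper's pipeline, of how a candidate microinversion might be flagged: a short window in which one genome matches the reverse complement of the other far better than it matches it directly. The function names, window size, threshold, and toy sequences are invented for illustration.

```python
# A minimal sketch (not the paper's method): flag a candidate microinversion as a
# short window where one sequence matches the reverse complement of the other much
# better than it matches it directly. Window size and threshold are illustrative.

def revcomp(s):
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def candidate_microinversions(seq_a, seq_b, window=20, min_gain=16):
    """Return window start positions where seq_b looks inverted relative to seq_a."""
    assert len(seq_a) == len(seq_b)
    hits = []
    for i in range(len(seq_a) - window + 1):
        a, b = seq_a[i:i + window], seq_b[i:i + window]
        direct = sum(x == y for x, y in zip(a, b))
        inverted = sum(x == y for x, y in zip(a, revcomp(b)))
        if inverted - direct >= min_gain:
            hits.append(i)
    return hits

# Toy example: positions 10-29 of the second sequence are reverse-complemented.
a = "ACGTACGTAC" + "AAAAAAAAAACCCCCCCCCC" + "ACGTACGTAC"
b = a[:10] + revcomp(a[10:30]) + a[30:]
print(candidate_microinversions(a, b))  # -> [10]
```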


Subject(s)
Biological Evolution , Animals , Computational Biology , Genetic Vectors/genetics , Humans , Mammals
2.
J Proteome Res ; 5(10): 2554-66, 2006 Oct.
Article in English | MEDLINE | ID: mdl-17022627

ABSTRACT

We have employed recently developed blind modification search techniques to generate the most comprehensive map of post-translational modifications (PTMs) in the human lens constructed to date. Three aged lenses, two of which had moderate cataract, and one young control lens were analyzed using multidimensional liquid chromatography mass spectrometry. In total, 491 modification sites in lens proteins were identified. There were 155 in vivo PTM sites in crystallins: 77 previously reported sites and 78 newly detected PTM sites. Several of these sites had modifications previously undetected by mass spectrometry in the lens, including carboxymethyl lysine (+58 Da), carboxyethyl lysine (+72 Da), and an arginine modification of +55 Da with an as yet unknown chemical structure. These new modifications were observed in all three aged lenses but were not found in the young lens. Several new sites of cysteine methylation were identified, indicating that this modification is more extensive in the lens than previously thought. The results were used to estimate the extent of modification at specific sites by spectral counting. We tested the long-standing hypothesis that PTMs contribute to age-related loss of crystallin solubility by comparing spectral counts between the water-soluble and water-insoluble fractions of the aged lenses and found that the extent of deamidation was significantly increased in the water-insoluble fractions. On the basis of spectral counting, the most abundant PTMs in aged lenses were deamidations and methylated cysteines, with other PTMs present at lower levels.
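
The following is a small sketch of the spectral-counting comparison described above, assuming hypothetical per-fraction counts of spectra that cover a site with and without the modification; it is not the study's actual data or software, and all names and numbers are invented.

```python
# A minimal sketch of spectral counting, assuming made-up per-fraction counts of
# spectra covering one site with and without the modification.

def modification_extent(modified_spectra, total_spectra):
    """Fraction of spectra covering a site that carry the modification."""
    return modified_spectra / total_spectra if total_spectra else 0.0

# Hypothetical spectral counts for one deamidation site in an aged lens:
fractions = {
    "water-soluble":   {"modified": 12, "total": 80},
    "water-insoluble": {"modified": 34, "total": 75},
}
for name, c in fractions.items():
    print(f"{name}: {modification_extent(c['modified'], c['total']):.2f}")
```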


Subject(s)
Amides/analysis , Crystallins/analysis , Lens, Crystalline/chemistry , Protein Processing, Post-Translational , Age Factors , Aged , Aged, 80 and over , Amino Acid Sequence , Cysteine/analysis , Humans , Infant, Newborn , Male , Methylation , Molecular Sequence Data , Peptides/analysis , Solubility
3.
Bioinformatics ; 18(10): 1374-81, 2002 Oct.
Article in English | MEDLINE | ID: mdl-12376382

ABSTRACT

MOTIVATION: Gene activity is often affected by the binding of transcription factors to short DNA fragments called motifs. Identification of subtle regulatory motifs in a DNA sequence is a difficult pattern recognition problem. In this paper we design a new motif finding algorithm that can detect very subtle motifs. RESULTS: We introduce the notion of a multiprofile and use it for finding subtle motifs in DNA sequences. Multiprofiles generalize the notion of a profile and allow one to detect subtle patterns that escape detection by standard profiles. Our MULTIPROFILER algorithm outperforms other leading motif finding algorithms in a number of synthetic models. Moreover, it can be shown that in some previously studied motif models, MULTIPROFILER is capable of pushing the performance envelope to its theoretical limits. AVAILABILITY: http://www-cse.ucsd.edu/groups/bioinformatics/software.html
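
As context for how multiprofiles extend the baseline representation, here is a brief sketch of the standard positional profile (count matrix plus consensus) that they generalize; this is not MULTIPROFILER itself, and the motif instances and function names are invented.

```python
# A minimal sketch of the standard profile that multiprofiles generalize: a
# position-by-nucleotide count matrix over aligned motif occurrences, plus its
# consensus. (Baseline representation only, not MULTIPROFILER.)

def build_profile(instances):
    length = len(instances[0])
    profile = [{n: 0 for n in "ACGT"} for _ in range(length)]
    for inst in instances:
        for i, n in enumerate(inst):
            profile[i][n] += 1
    return profile

def consensus(profile):
    return "".join(max(col, key=col.get) for col in profile)

instances = ["ACGTGA", "ACGTTA", "TCGTGA", "ACCTGA"]
print(consensus(build_profile(instances)))  # -> "ACGTGA"
```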


Subject(s)
Algorithms , Amino Acid Motifs/genetics , Regulatory Sequences, Nucleic Acid/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Base Sequence , Benchmarking , Consensus Sequence/genetics , DNA/genetics , DNA-Binding Proteins/genetics , Escherichia coli/genetics , Escherichia coli/metabolism , Molecular Sequence Data , Promoter Regions, Genetic/genetics , Quality Control , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Sensitivity and Specificity
4.
Bioinformatics ; 18(10): 1382-90, 2002 Oct.
Article in English | MEDLINE | ID: mdl-12376383

ABSTRACT

MOTIVATION: What constitutes a subtle motif? Intuitively, it is a motif that is almost indistinguishable, in the statistical sense, from random motifs. This question has important practical consequences: consider, for example, a biologist who is generating a sample of upstream regulatory sequences with the goal of finding a regulatory pattern that is shared by these sequences. If the sequences are too short, then one risks losing some of the regulatory patterns that are located further upstream. Conversely, if the sequences are too long, the motif becomes too subtle and one is then likely to encounter random motifs that are at least as significant statistically as the regulatory pattern itself. In practical terms, one would like to recognize the sequence length threshold, or the twilight zone, beyond which the motifs are in some sense too subtle. RESULTS: The paper defines the motif twilight zone, where every motif finding algorithm would be exposed to random motifs that are as significant as the one being sought. We also propose an objective tool for evaluating the performance of subtle motif finding algorithms. Finally, we apply these tools to evaluate the success of our MULTIPROFILER algorithm in detecting subtle motifs.
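
A back-of-envelope sketch of the twilight-zone reasoning follows: it estimates the expected number of spurious (l, d)-motifs shared by t random sequences of length n, treating overlapping positions as independent. This approximation and the function name are ours, not the paper's exact derivation.

```python
# Rough estimate (not the paper's derivation) of the expected number of "random"
# (l, d)-motifs, i.e. l-mers occurring with at most d mismatches in every one of
# t random sequences of length n. Overlapping positions are treated as independent.
from math import comb

def expected_random_motifs(l, d, t, n):
    # Probability that a fixed l-mer matches a fixed position with at most d mismatches.
    p_pos = sum(comb(l, i) * 3 ** i for i in range(d + 1)) / 4 ** l
    # Probability that it occurs somewhere in one random sequence of length n.
    p_seq = 1 - (1 - p_pos) ** (n - l + 1)
    # Expected number of l-mers present in all t sequences.
    return 4 ** l * p_seq ** t

# The planted-motif setting of 20 random sequences of length 600:
for l, d in [(15, 4), (13, 4)]:
    print(f"({l},{d}): {expected_random_motifs(l, d, 20, 600):.2g}")
# Roughly none for (15,4), but several spurious motifs for (13,4).
```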


Subject(s)
Amino Acid Motifs/genetics , Models, Genetic , Models, Statistical , Regulatory Sequences, Nucleic Acid/genetics , Sequence Analysis, DNA/methods , Algorithms , Base Sequence , Benchmarking , Consensus Sequence/genetics , DNA/genetics , DNA Mutational Analysis/methods , Molecular Sequence Data , Quality Control , Reproducibility of Results , Sensitivity and Specificity , Sequence Alignment/methods , Software , Stochastic Processes
5.
Pac Symp Biocomput ; : 235-46, 2002.
Article in English | MEDLINE | ID: mdl-11928479

ABSTRACT

Recognition of regulatory sites in unaligned DNA sequences is an old and well-studied problem in computational molecular biology. Recently, large-scale expression studies and comparative genomics have brought this problem into the spotlight by generating a large number of samples with unknown regulatory signals. Here we develop algorithms for recognition of signals in corrupted samples (where only a fraction of sequences contain sites) with biased nucleotide composition. We further benchmark these and other algorithms on several bacterial and archaeal sites in a setting specifically designed to imitate the situations arising in comparative genomics studies.


Subject(s)
Base Sequence , DNA/chemistry , DNA/genetics , Computational Biology/methods , Gene Expression , Regulatory Sequences, Nucleic Acid , Reproducibility of Results , Software
7.
Proc Natl Acad Sci U S A ; 98(17): 9748-53, 2001 Aug 14.
Article in English | MEDLINE | ID: mdl-11504945

ABSTRACT

For the last 20 years, fragment assembly in DNA sequencing followed the "overlap-layout-consensus" paradigm that is used in all currently available assembly tools. Although this approach proved useful in assembling clones, it faces difficulties in genomic shotgun assembly. We abandon the classical "overlap-layout-consensus" approach in favor of a new EULER algorithm that, for the first time, resolves the 20-year-old "repeat problem" in fragment assembly. Our main result is the reduction of fragment assembly to a variation of the classical Eulerian path problem that allows one to generate accurate solutions of large-scale sequencing problems. EULER, in contrast to the Celera assembler, does not mask such repeats but uses them instead as a powerful fragment assembly tool.
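
A toy illustration of the Eulerian-path reduction is sketched below, assuming error-free reads and k-mers that each occur once in the genome; EULER itself additionally performs error correction and handles repeats through the graph structure. The function name and reads are invented.

```python
# Toy sketch of the Eulerian-path idea: each k-mer becomes an edge from its
# (k-1)-mer prefix to its (k-1)-mer suffix, and the assembly is read off an
# Eulerian path (Hierholzer's algorithm). Assumes error-free reads.
from collections import defaultdict

def assemble(reads, k):
    # Collapse duplicate k-mers (assumes each k-mer occurs once in the genome).
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph, out_deg, in_deg = defaultdict(list), defaultdict(int), defaultdict(int)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
        out_deg[kmer[:-1]] += 1
        in_deg[kmer[1:]] += 1
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next((v for v in graph if out_deg[v] - in_deg[v] == 1), next(iter(graph)))
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(v[-1] for v in path[1:])

reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT", "GCAATCA"]
print(assemble(reads, 4))  # -> "ATGGCGTGCAATCA"
```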


Subject(s)
Algorithms , Contig Mapping/methods , Sequence Analysis, DNA/methods , Campylobacter jejuni/genetics , DNA, Bacterial/genetics , Genome, Bacterial , Lactococcus lactis/genetics , Models, Theoretical , Neisseria meningitidis/genetics , Sequence Alignment/methods , Software
8.
Bioinformatics ; 17 Suppl 1: S225-33, 2001.
Article in English | MEDLINE | ID: mdl-11473013

ABSTRACT

For the last twenty years, fragment assembly was dominated by the "overlap-layout-consensus" algorithms that are used in all currently available assembly tools. However, the limits of these algorithms are being tested in the era of genomic sequencing, and it is not clear whether they are the best choice for large-scale assemblies. Although the "overlap-layout-consensus" approach proved to be useful in assembling clones, it faces difficulties in genomic assemblies: the existing algorithms make assembly errors even in bacterial genomes. We abandoned the "overlap-layout-consensus" approach in favour of a new Eulerian Superpath approach that outperforms the existing algorithms for genomic fragment assembly (Pevzner et al., 2001, in Proceedings of the Fifth Annual International Conference on Computational Molecular Biology (RECOMB-01), 256-26). In this paper we describe our new EULER-DB algorithm that, similarly to the Celera assembler, takes advantage of clone-end sequencing by using double-barreled data. However, in contrast to the Celera assembler, EULER-DB does not mask repeats but uses them instead as a powerful tool for contig ordering. We also describe a new approach to the Copy Number Problem: "How many times is a given repeat present in the genome?" For long, nearly perfect repeats this question is notoriously difficult, and some copies of such repeats may be "lost" in genomic assemblies. We describe our EULER-CN algorithm for the Copy Number Problem, which proved to be successful in difficult sequencing projects.


Subject(s)
Algorithms , Sequence Analysis, DNA/statistics & numerical data , Computational Biology , Genetic Techniques/statistics & numerical data , Genome , Models, Genetic , Nucleic Acid Hybridization
9.
Bioinformatics ; 17(4): 327-37, 2001 Apr.
Article in English | MEDLINE | ID: mdl-11301301

ABSTRACT

The Smith-Waterman algorithm for local sequence alignment is one of the most important techniques in computational molecular biology. This ingenious dynamic programming approach was designed to reveal highly conserved fragments by discarding poorly conserved initial and terminal segments. However, the existing notion of local similarity has a serious flaw: it does not discard poorly conserved intermediate segments. The Smith-Waterman algorithm finds the local alignment with maximal score, but it is unable to find the local alignment with the maximum degree of similarity (e.g. the maximal percentage of matches). Moreover, there is still no efficient algorithm that answers the following natural question: do two sequences share a (sufficiently long) fragment with more than 70% similarity? As a result, the local alignment sometimes produces a mosaic of well-conserved fragments artificially connected by poorly conserved or even unrelated fragments. This may lead to problems in comparison of long genomic sequences and comparative gene prediction, as recently pointed out by Zhang et al. (Bioinformatics, 15, 1012-1019, 1999). In this paper we propose a new sequence comparison algorithm (normalized local alignment) that reports the regions with the maximum degree of similarity. The algorithm is based on fractional programming and its running time is O(n^2 log n). In practice, normalized local alignment is only 3-5 times slower than the standard Smith-Waterman algorithm.
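
The sketch below illustrates one way the fractional-programming idea can be realized: a Dinkelbach-style loop that repeatedly solves a parameterized Smith-Waterman with scores shifted by lambda and updates lambda to score/(aligned length + L). The scoring parameters, function names, and sequences are illustrative assumptions, and this is a simplified reading of the approach rather than the paper's implementation.

```python
# Sketch of normalized local alignment via fractional (Dinkelbach-style)
# programming, assuming illustrative match/mismatch/gap scores and L.

def smith_waterman(s, t, match, mismatch, gap):
    """Return (score, |I| + |J|) of a best local alignment under the given scores."""
    score = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    length = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = (0.0, 0)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cands = [(0.0, 0),
                     (score[i - 1][j - 1] + sub, length[i - 1][j - 1] + 2),
                     (score[i - 1][j] + gap, length[i - 1][j] + 1),
                     (score[i][j - 1] + gap, length[i][j - 1] + 1)]
            score[i][j], length[i][j] = max(cands)
            best = max(best, (score[i][j], length[i][j]))
    return best

def normalized_local_alignment(s, t, match=1.0, mismatch=-1.0, gap=-2.0, L=20):
    lam = 0.0
    for _ in range(50):                       # Dinkelbach iterations; usually just a few
        sc, ln = smith_waterman(s, t, match - 2 * lam, mismatch - 2 * lam, gap - lam)
        new_lam = (sc + lam * ln) / (ln + L)  # original score / (|I| + |J| + L)
        if abs(new_lam - lam) < 1e-9:
            break
        lam = new_lam
    return lam

print(round(normalized_local_alignment("ACGTACGTTTTT", "ACGTACGTAAAA"), 3))  # -> 0.222
```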


Subject(s)
Algorithms , Sequence Alignment/methods , Software
10.
Genome Res ; 11(2): 290-9, 2001 Feb.
Article in English | MEDLINE | ID: mdl-11157792

ABSTRACT

Although protein identification by matching tandem mass spectra (MS/MS) against protein databases is a widespread tool in mass spectrometry, the question of how reliable such searches are remains open. The absence of rigorous significance scores in MS/MS database searches makes it difficult to discard random database hits and may lead to erroneous protein identification, particularly in the case of mutated or post-translationally modified peptides. This problem is especially important for high-throughput MS/MS projects when the possibility of expert analysis is limited. Thus, algorithms that sort out reliable database hits from unreliable ones and identify mutated and modified peptides are sought. Most MS/MS database search algorithms rely on variations of the Shared Peaks Count approach that scores pairs of spectra by the peaks (masses) they have in common. Although this approach proved to be useful, it has a high error rate in identification of mutated and modified peptides. We describe new MS/MS database search tools, MS-CONVOLUTION and MS-ALIGNMENT, which implement the spectral convolution and spectral alignment approaches to peptide identification. We further analyze these approaches to the identification of modified peptides and demonstrate their advantages over the Shared Peaks Count. We also use the spectral alignment approach as a filter in a new database search algorithm that reliably identifies peptides differing by up to two mutations/modifications from a peptide in a database.
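
Here is a small sketch of the spectral convolution idea: histogram the pairwise mass differences between two spectra and look for a dominant non-zero offset, which hints at a modification of that mass. The peak lists and function name are invented, and real spectra require a mass tolerance when binning.

```python
# Minimal sketch of spectral convolution: count how often each mass offset appears
# among pairwise differences of peak masses from two spectra. The peak lists below
# are made up; a dominant non-zero offset suggests a modification of that mass.
from collections import Counter

def spectral_convolution(spectrum_a, spectrum_b, precision=0):
    diffs = Counter()
    for a in spectrum_a:
        for b in spectrum_b:
            diffs[round(a - b, precision)] += 1
    return diffs

unmodified = [97, 226, 339, 452, 581, 710, 823]
modified   = [97, 226, 339, 468, 597, 726, 839]   # +16 Da on the later fragments
print(spectral_convolution(modified, unmodified).most_common(3))  # offsets +16 and 0 dominate
```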


Subject(s)
Computational Biology/methods , Databases, Factual , Mass Spectrometry/methods , Mutation , Proteins/analysis , Proteins/genetics , Algorithms , Amino Acid Sequence , Computational Biology/statistics & numerical data , Database Management Systems , Databases, Factual/statistics & numerical data , Fungal Proteins/analysis , Fungal Proteins/genetics , Fungal Proteins/metabolism , Mass Spectrometry/statistics & numerical data , Molecular Sequence Data , Peptide Fragments/analysis , Peptide Fragments/genetics , Peptide Fragments/metabolism , Proteins/metabolism
11.
Article in English | MEDLINE | ID: mdl-10977088

ABSTRACT

Signal finding (pattern discovery in unaligned DNA sequences) is a fundamental problem in both computer science and molecular biology, with important applications in locating regulatory sites and identifying drug targets. Despite many studies, this problem is far from being resolved: most signals in DNA sequences are so complicated that we do not yet have good models or reliable algorithms for their recognition. We complement existing statistical and machine learning approaches to this problem with a combinatorial approach that has proved successful in identifying very subtle signals.


Subject(s)
Algorithms , DNA/analysis , DNA/genetics , Sequence Analysis, DNA , Animals , Humans
12.
J Comput Biol ; 7(6): 777-87, 2000.
Article in English | MEDLINE | ID: mdl-11382361

ABSTRACT

Database search in tandem mass spectrometry is a powerful tool for protein identification. High-throughput spectral acquisition raises the problem of dealing with genetic variation and peptide modifications within a population of related proteins. A method that cross-correlates and clusters related spectra in large collections of uncharacterized spectra (i.e., from normal and diseased individuals) would be very valuable in functional proteomics. This problem is far from simple, since very similar peptides may have very different spectra. We introduce a new notion of spectral similarity that allows one to identify related spectra even if the corresponding peptides have multiple modifications/mutations. Based on this notion, we developed a new algorithm for mutation-tolerant database search as well as a method for cross-correlating related uncharacterized spectra.


Subject(s)
Algorithms , Image Processing, Computer-Assisted , Mass Spectrometry/methods , Mutation , Proteins/genetics , Databases, Factual , Proteins/chemistry , Software
13.
Microb Comp Genomics ; 4(3): 167-72, 1999.
Article in English | MEDLINE | ID: mdl-10587944

ABSTRACT

Expressed sequence tag (EST) data provide a powerful tool for identification of transcribed DNA sequences. However, as ESTs are relatively short, many exons are poorly covered by ESTs, reducing the utility of EST data. Recently, signature sequence tag (SST) fingerprints were proposed as an alternative to EST fingerprints. Given a fingerprint set of probes, the SST of a clone is the subset of probes from the fingerprint set that hybridize with the clone. We demonstrate that, besides being a powerful technique for screening cDNA libraries, SST technology provides very accurate gene predictions. Even with a small fingerprint set (600-800 probes), SST-based gene recognition outperforms many conventional and EST-based methods. Increasing the fingerprint set to 1500 probes provides almost perfect gene recognition. Even more importantly, SST-based gene predictions miss very few exons and therefore provide an opportunity to bypass the cDNA sequencing step on the way from finished genomic sequence to mutation detection in gene-hunting projects. Because SST data can be obtained in a highly parallel and inexpensive way, SST technology has the potential to complement EST technology for gene hunting.
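
A toy sketch of an SST fingerprint follows, reducing "hybridizes" to an exact substring match against either strand of the clone; the probe set, clone sequence, and function names are made up, and real hybridization is far noisier.

```python
# Toy sketch of an SST fingerprint: the subset of probes from a fixed fingerprint
# set that "hybridize" to a clone, reduced here to exact matches on either strand.

def revcomp(s):
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def sst_fingerprint(clone, probes):
    return {p for p in probes if p in clone or revcomp(p) in clone}

probes = ["ACGTACG", "TTTTCCC", "GGGATCC", "CCCGTTA"]
clone = "AAACGTACGAAAGGATCCC"
print(sorted(sst_fingerprint(clone, probes)))  # -> ['ACGTACG', 'GGGATCC']
```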


Subject(s)
DNA, Complementary/genetics , Expressed Sequence Tags , Gene Expression Profiling , Gene Library , Oligonucleotide Probes , Computational Biology , DNA Fingerprinting , Humans , Oligonucleotide Array Sequence Analysis , Software
14.
J Comput Biol ; 6(3-4): 327-42, 1999.
Article in English | MEDLINE | ID: mdl-10582570

ABSTRACT

Peptide sequencing via tandem mass spectrometry (MS/MS) is one of the most powerful tools in proteomics for identifying proteins. Because complete genome sequences are accumulating rapidly, the recent trend in interpretation of MS/MS spectra has been database search. However, de novo MS/MS spectral interpretation remains an open problem typically involving manual interpretation by expert mass spectrometrists. We have developed a new algorithm, SHERENGA, for de novo interpretation that automatically learns fragment ion types and intensity thresholds from a collection of test spectra generated from any type of mass spectrometer. The test data are used to construct optimal path scoring in the graph representations of MS/MS spectra. A ranked list of high scoring paths corresponds to potential peptide sequences. SHERENGA is most useful for interpreting sequences of peptides resulting from unknown proteins and for validating the results of database search algorithms in fully automated, high-throughput peptide sequencing.
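
The sketch below strips the spectrum-graph idea down to a toy: peaks are treated as prefix masses, two masses are connected when they differ by an amino acid residue mass (nominal integer masses), and a peptide is read off the longest path. SHERENGA additionally learns fragment ion types and intensity-based scoring, which this sketch omits; the function name and example are invented.

```python
# Toy spectrum-graph sketch: connect putative prefix masses that differ by a
# residue mass and read a peptide off the longest path. Uses nominal integer
# masses and an ideal, prefix-only spectrum for simplicity.

RESIDUES = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "L": 113,
            "N": 114, "D": 115, "Q": 128, "E": 129, "M": 131, "F": 147, "W": 186}

def de_novo(prefix_masses):
    masses = sorted(set(prefix_masses) | {0})
    by_mass = {v: k for k, v in RESIDUES.items()}
    best = [None] * len(masses)          # best[i] = peptide explaining masses[0..i]
    best[0] = ""
    for i, m in enumerate(masses):
        if best[i] is None:
            continue
        for j in range(i + 1, len(masses)):
            aa = by_mass.get(masses[j] - m)
            if aa and (best[j] is None or len(best[i]) + 1 > len(best[j])):
                best[j] = best[i] + aa
    return max((p for p in best if p), key=len)

# Ideal prefix masses of the peptide "PASTA" (97, 97+71, ...):
print(de_novo([97, 168, 255, 356, 427]))  # -> "PASTA"
```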


Subject(s)
Algorithms , Mass Spectrometry/methods , Peptides/chemistry , Sequence Analysis/methods , Amino Acid Sequence , Databases, Factual , Evaluation Studies as Topic , Mass Spectrometry/statistics & numerical data , Sequence Analysis/statistics & numerical data
15.
Article in English | MEDLINE | ID: mdl-10786293

ABSTRACT

One current approach to quality control in DNA array manufacturing is to synthesize a small set of test probes that detect variation in the manufacturing process. These fidelity probes consist of identical copies of the same probe, but they are deliberately manufactured using different steps of the manufacturing process. A known target is hybridized to these probes, and the hybridization results are indicative of the quality of the manufacturing process. It is desirable not only to detect variations, but also to analyze the variations that occur, indicating in which step of the process the manufacture changed. We describe a combinatorial approach which constructs a small set of fidelity probes that not only detect variations, but also point out the manufacturing step in which a variation has occurred. This algorithm is currently being used in mass production of DNA arrays at Affymetrix.
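
The following is an illustrative sketch of the combinatorial idea, not necessarily the paper's exact construction: give each fidelity probe a distinct subset of synthesis steps so that the pattern of failed probes uniquely identifies the faulty step, here by binary encoding of the step index. Function names and numbers are invented.

```python
# Illustrative sketch (not the paper's construction): probe p is synthesized using
# exactly the steps whose index has bit p set, so the set of failed probes spells
# out the faulty step in binary.

def design_probe_steps(num_steps, num_probes):
    """Map each probe index to the set of synthesis steps it depends on."""
    return {p: {s for s in range(num_steps) if s >> p & 1} for p in range(num_probes)}

def identify_faulty_step(failed_probes):
    """Decode the faulty step index from the set of probes that failed to hybridize."""
    return sum(1 << p for p in failed_probes)

probe_steps = design_probe_steps(num_steps=16, num_probes=4)
print(probe_steps[0])                    # steps that probe 0 depends on
print(identify_faulty_step({0, 2}))      # probes 0 and 2 failed -> step 5
```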


Subject(s)
DNA Probes , Oligonucleotide Array Sequence Analysis , Algorithms , Combinatorial Chemistry Techniques , Quality Control , Reproducibility of Results , Software
16.
Genomics ; 51(3): 332-9, 1998 Aug 01.
Article in English | MEDLINE | ID: mdl-9721203

ABSTRACT

An important and still unsolved problem in gene prediction is designing an algorithm that not only predicts genes but also estimates the quality of individual predictions. Since experimental biologists are interested mainly in the reliability of individual predictions (rather than in the average reliability of an algorithm), we attempted to develop a gene recognition algorithm that guarantees a certain quality of predictions. We demonstrate here that the similarity level with a related protein is a reliable quality estimator for the spliced alignment approach to gene recognition. We also study the average performance of the spliced alignment algorithm for different targets on a complete set of human genomic sequences with known relatives and demonstrate that the average performance of the method remains high even for very distant targets. Using plant, fungal, and prokaryotic target proteins for recognition of human genes leads to accurate predictions with 95, 93, and 91% correlation coefficients, respectively. For target proteins with a similarity score above 60%, not only is the average correlation coefficient very high (97% and up), but the quality of individual predictions is also guaranteed to be at least 82%. This indicates that, at this level of similarity, the worst-case performance of the spliced alignment algorithm is better than the average-case performance of many statistical gene recognition methods.
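
For reference, the sketch below computes the nucleotide-level correlation coefficient commonly used to score gene predictions, from TP/FP/TN/FN counts over coding/non-coding labels; the toy annotation and function name are invented and do not come from the paper.

```python
# Minimal sketch of the nucleotide-level correlation coefficient for gene
# predictions: a Matthews-style CC computed from TP/FP/TN/FN counts.
from math import sqrt

def coding_cc(predicted, actual):
    """predicted/actual are equal-length 0/1 lists marking coding nucleotides."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

actual    = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
predicted = [0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
print(round(coding_cc(predicted, actual), 2))  # -> 0.71
```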


Subject(s)
Genome, Human , Sequence Alignment , Sequence Analysis, DNA/methods , Algorithms , DNA/chemistry , Databases as Topic , Exons/genetics , Humans , Proteins/chemistry , RNA Splicing/genetics , Software
17.
Bioinformatics ; 14(1): 14-9, 1998.
Article in English | MEDLINE | ID: mdl-9520497

ABSTRACT

MOTIVATION: Gene annotation is the final goal of gene prediction algorithms. However, these algorithms frequently make mistakes, and therefore the use of gene predictions for sequence annotation is hardly possible. As a result, biologists are forced to conduct time-consuming gene identification experiments by designing appropriate PCR primers to test cDNA libraries or by applying RT-PCR, exon trapping/amplification, or other techniques. This process often amounts to 'guessing' PCR primers on top of unreliable gene predictions and frequently leads to wasted experimental effort. RESULTS: The present paper proposes a simple and reliable algorithm for experimental gene identification which bypasses the unreliable gene prediction step. Studies of the performance of the algorithm on a sample of human genes indicate that an experimental protocol based on the algorithm's predictions achieves accurate gene identification with relatively few PCR primers. Predictions of PCR primers may be used for exon amplification in preliminary mutation analysis during an attempt to identify a gene responsible for a disease. We propose a simple approach to finding a short region of a genomic sequence that with high probability overlaps some exon of the gene. The algorithm is enhanced to find one or more segments that are probably contained in the translated region of the gene and can be used as PCR primers to select appropriate clones in cDNA libraries by selective amplification. The algorithm is further extended to locate a set of PCR primers that uniformly cover all translated regions and can be used for RT-PCR and further sequencing of (unknown) mRNA.


Subject(s)
Algorithms , Genes , Software , Arabidopsis , DNA Primers , Humans , Open Reading Frames , Polymerase Chain Reaction
18.
Genomics ; 47(2): 171-9, 1998 Jan 15.
Article in English | MEDLINE | ID: mdl-9479489

ABSTRACT

We propose a new experimental protocol, ExonPCR, which is able to identify exon boundaries in a cDNA even in the absence of any genomic clones. ExonPCR can bypass the isolation, characterization, and DNA sequencing of subclones of genomic DNA to determine exon boundaries: a major effort in the process of positional cloning. Given a cDNA sequence, ExonPCR uses a series of "adaptive" steps to analyze the PCR products from cDNA and genomic DNA, thereby revealing the approximate positions of "hidden" exon boundaries in the cDNA. The nucleotide sequence of adjacent intronic regions is determined by ligation-mediated PCR. Primers adjacent to the "hidden" exon boundaries are used to amplify genomic DNA followed by limited DNA sequencing of the PCR product. The method was successfully tested on the 3-kb hMSH2 cDNA with 16 known exons and the 9-kb PRDII-BF1 cDNA with a previously unknown number of exons. We subsequently developed the ExonPCR algorithm and software to direct the experimental protocol using a strategy that is analogous to that used in the game "Twenty Questions." Through the use of ExonPCR, the search for disease-causing mutations can be initiated almost immediately after cDNA clones in a genetically mapped region become available. This approach would be most valuable in gene discovery strategies that focus initially on cDNA isolation.
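
A sketch of the "Twenty Questions" strategy follows: bisect the cDNA with primer pairs, asking for each interval whether the genomic PCR product is longer than the cDNA predicts (i.e., the interval spans an intron). The "experiment" is simulated by a list of true boundary positions, the sketch handles a single boundary per interval, and all names and numbers are invented, unlike the real adaptive protocol.

```python
# Sketch of the adaptive bisection behind ExonPCR: each "question" asks whether a
# cDNA interval spans an intron; here the answer is simulated rather than measured
# by a PCR reaction, and only one boundary per interval is assumed.

def locate_boundary(spans_intron, lo, hi, resolution=10):
    """Narrow down one exon boundary inside the cDNA interval (lo, hi]."""
    while hi - lo > resolution:
        mid = (lo + hi) // 2
        if spans_intron(lo, mid):      # left half still contains the boundary
            hi = mid
        else:                          # boundary must be in the right half
            lo = mid
    return lo, hi

# Simulated 3000-bp cDNA with a hidden exon boundary at position 1234.
true_boundaries = [1234]
oracle = lambda a, b: any(a < x <= b for x in true_boundaries)
print(locate_boundary(oracle, 0, 3000))  # -> a small window containing 1234
```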


Subject(s)
Cloning, Molecular/methods , DNA, Complementary/isolation & purification , Exons , Genes/genetics , Sequence Analysis, DNA/methods , Animals , DNA-Binding Proteins/genetics , Humans , Introns , Mice , MutS Homolog 2 Protein , Polymerase Chain Reaction/methods , Proto-Oncogene Proteins/genetics , Transcription Factors
19.
J Comput Biol ; 4(3): 297-309, 1997.
Article in English | MEDLINE | ID: mdl-9278061

ABSTRACT

Recently, Gelfand, Mironov and Pevzner (1996) proposed a spliced alignment approach to gene recognition that provides 99% accurate recognition of human genes if a related mammalian protein is available. However, even 99% accurate gene predictions are insufficient for automated sequence annotation in large-scale sequencing projects and therefore have to be complemented by experimental gene verification. One hundred percent accurate gene predictions would lead to a substantial reduction of experimental work on gene identification. Our goal is to develop an algorithm that either predicts an exon assembly with accuracy sufficient for sequence annotation or warns a biologist that the accuracy of a prediction is insufficient and further experimental work is required. We study suboptimal and error-tolerant spliced alignment problems as the first steps towards such an algorithm, and report an algorithm which provides 100% accurate recognition of human genes in 37% of cases (if a related mammalian protein is available). In 52% of genes, the algorithm predicts at least one exon with 100% accuracy.


Subject(s)
Algorithms , Genes , Nucleic Acid Conformation , RNA Splicing , Amino Acid Sequence , Animals , Binding Sites , Humans , Sequence Alignment/methods
20.
Comput Appl Biosci ; 13(2): 205-10, 1997 Apr.
Article in English | MEDLINE | ID: mdl-9146969

ABSTRACT

Sequencing by hybridization (SBH) is a promising alternative approach to DNA sequencing and mutation detection. Analysis of the resolving power of SBH involves rather difficult combinatorial and probabilistic problems, and sometimes computer simulation is the only way to estimate the parameters and limitations of SBH experiments. This paper describes a software package, DNA-SPECTRUM, which allows one to analyze the resolving power and parameters of SBH. We also introduce a technique for visualizing multiple SBH reconstructions and describe applications of DNA-SPECTRUM to estimating various SBH parameters. DNA-SPECTRUM is available at http://www-hto.usc.edu/software/sbh/index.html.


Subject(s)
Sequence Analysis, DNA/methods , Software , Base Sequence , Computer Graphics , Computer Simulation , DNA/genetics , Evaluation Studies as Topic , Nucleic Acid Hybridization , Sequence Analysis, DNA/statistics & numerical data