1.
Metab Eng ; 2(3): 159-77, 2000 Jul.
Article in English | MEDLINE | ID: mdl-11056059

ABSTRACT

In the past few years, pattern discovery has been emerging as a generic tool of choice for tackling problems from the computational biology domain. In this presentation, and after defining the problem in its generality, we review some of the algorithms that have appeared in the literature and describe several applications of pattern discovery to problems from computational biology.


Subject(s)
Computational Biology , Pattern Recognition, Automated , Algorithms , Amino Acid Sequence , Biomedical Engineering , DNA/genetics , Gene Expression , Molecular Sequence Data , Sequence Alignment
2.
Proteins ; 37(2): 264-77, 1999 Nov 01.
Article in English | MEDLINE | ID: mdl-10584071

ABSTRACT

Using Teiresias, a pattern discovery method that identifies all motifs present in any given set of protein sequences without requiring alignment or explicit enumeration of the solution space, we have explored the GenPept sequence database and built a dictionary of all sequence patterns with two or more instances. The entries of this dictionary, henceforth named seqlets, cover 98.12% of all amino acid positions in the input database and in essence provide a comprehensive finite set of descriptors for protein sequence space. As such, seqlets can be effectively used to describe almost every naturally occurring protein. In fact, seqlets can be thought of as building blocks of protein molecules that are a necessary (but not sufficient) condition for function or family equivalence memberships. Thus, seqlets can either define conserved family signatures or cut across molecular families and previously undetected sequence signals deriving from functional convergence. Moreover, we show that seqlets also can capture structurally conserved motifs. The availability of a dictionary of seqlets that has been derived in such an unsupervised, hierarchical manner is generating new opportunities for addressing problems that range from reliable classification and the correlation of sequence fragments with functional categories to faster and sensitive engines for homology searches, evolutionary studies, and protein structure prediction.


Subject(s)
Amino Acid Motifs , Dictionaries, Chemical as Topic , Proteins/chemistry , Computational Biology , Databases, Factual , Models, Molecular , Protein Conformation , Sequence Alignment
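The coverage figure quoted in the abstract above (the share of amino acid positions that fall inside at least one seqlet instance) can be illustrated with a minimal sketch. This is not Teiresias: the patterns are assumed to be given already, written as residue letters with `.` as a single-position wildcard, and the function name `coverage` and the `min_instances` parameter are hypothetical names chosen for this illustration.

```python
import re

def coverage(sequences, patterns, min_instances=2):
    """Fraction of sequence positions covered by at least one pattern instance.

    Assumption: each pattern is a string of residue letters and '.' wildcards,
    so it can be used directly as a regular expression.
    """
    covered = [[False] * len(s) for s in sequences]
    for pat in patterns:
        # A lookahead with a capture group finds overlapping instances too.
        rx = re.compile("(?=(" + pat + "))")
        hits = [
            (i, m.start(1), m.end(1))
            for i, s in enumerate(sequences)
            for m in rx.finditer(s)
        ]
        # Keep only patterns with enough instances (two or more by default).
        if len(hits) < min_instances:
            continue
        for i, start, end in hits:
            for k in range(start, end):
                covered[i][k] = True
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    return sum(sum(row) for row in covered) / total


if __name__ == "__main__":
    seqs = ["ACDEFG", "XXCDEF"]
    print(coverage(seqs, ["C.EF", "ACD"]))
```

In this toy input, `C.EF` occurs once in each sequence and so counts toward coverage, while `ACD` has a single instance and is ignored.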
3.
Article in English | MEDLINE | ID: mdl-10786305

ABSTRACT

We have used the Teiresias algorithm to carry out unsupervised pattern discovery in a database containing the unaligned ORFs from the 17 publicly available complete archaeal and bacterial genomes and build a 1D dictionary of motifs. These motifs, which we refer to as seqlets, account for and cover 97.88% of this genomic input at the level of amino acid positions. Each of the seqlets in this 1D dictionary was located among the sequences in Release 38.0 of the Protein Data Bank, and the structural fragments corresponding to each seqlet's instances were identified and aligned in three dimensions; those seqlets that resulted in RMSD errors below a pre-selected threshold of 2.5 Angstroms were entered in a 3D dictionary of structurally conserved seqlets. These two dictionaries can be thought of as cross-indices that facilitate tasks such as automated functional annotation of genomic sequences, local homology identification, local structure characterization, and comparative genomics.


Subject(s)
Amino Acid Motifs , Genome, Archaeal , Genome, Bacterial , Algorithms , Amino Acid Sequence , Databases, Factual , Models, Molecular , Molecular Sequence Data , Open Reading Frames , Probability , Sequence Homology, Amino Acid , Software
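The RMSD filter mentioned in the abstract above can be sketched as follows. The abstract does not say how the fragments were superposed or how RMSD was aggregated over several instances, so this sketch assumes the coordinates are already superposed and compares every fragment against the first one; `rmsd` and `is_structurally_conserved` are hypothetical names, and only the 2.5 Angstrom threshold comes from the source.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z).

    Assumption: the two fragments have already been superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("fragments must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def is_structurally_conserved(fragments, threshold=2.5):
    """True if every fragment is within `threshold` Angstroms RMSD of the first.

    Comparing against the first fragment is an illustrative choice, not the
    procedure described in the abstract.
    """
    reference = fragments[0]
    return all(rmsd(reference, frag) < threshold for frag in fragments[1:])


if __name__ == "__main__":
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(rmsd(a, b), is_structurally_conserved([a, b]))
```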
4.
Article in English | MEDLINE | ID: mdl-11072327

ABSTRACT

Given a set of N sequences, the Multiple Sequence Alignment problem is to align these N sequences, possibly with gaps, in a way that brings out the best commonality of the N sequences. MUSCA is a two-stage approach to the alignment problem that identifies two relatively simpler sub-problems whose solutions are used to obtain the alignment of the sequences. We first discover motifs in the N sequences and then extract an appropriate subset of compatible motifs to obtain a good alignment. The motifs of interest to us are the irredundant motifs, whose number is only polynomial in the input size. In practice, however, the number is much smaller (sub-linear). Notice that this step aids in a direct N-wise alignment, as opposed to composing the alignments from lower order (say pairwise) alignments, and the solution is also independent of the order of the input sequences; hence the algorithm works very well when dealing with a large number of sequences. The second part of the problem, which deals with obtaining a good alignment, is solved using a graph-theoretic approach that computes an induced subgraph satisfying certain simple constraints. We reduce a version of this problem to that of solving an instance of a set covering problem, and thus offer the best possible approximate solution to the problem (provided P ≠ NP). Our experimental results, while preliminary, indicate that this approach is efficient, particularly on large numbers of long sequences, and gives good alignments when tested on biological data such as DNA and protein sequences. We introduce the notion of an alignment number K (2
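
The set covering step named in the abstract above can be illustrated with the standard greedy heuristic, which repeatedly picks the subset that covers the most still-uncovered elements and carries a logarithmic approximation guarantee. The abstract does not state which set cover algorithm MUSCA uses, so this is a generic sketch; the function name `greedy_set_cover` and the toy input are assumptions made for illustration.

```python
def greedy_set_cover(universe, subsets):
    """Return indices of subsets chosen by the greedy set cover heuristic.

    At each step, choose the subset covering the largest number of elements
    that are still uncovered. Raises ValueError if the universe cannot be
    covered by the given subsets.
    """
    subsets = [set(s) for s in subsets]
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # max() returns the first index among ties, so the result is deterministic.
        best = max(range(len(subsets)), key=lambda i: len(uncovered & subsets[i]))
        gain = uncovered & subsets[best]
        if not gain:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= gain
    return chosen


if __name__ == "__main__":
    universe = {1, 2, 3, 4, 5}
    subsets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
    print(greedy_set_cover(universe, subsets))
```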

5.
J Comput Biol ; 5(4): 725-39, 1998.
Article in English | MEDLINE | ID: mdl-10072087

ABSTRACT

Optical Mapping is an emerging technology for constructing ordered restriction maps of DNA molecules. The underlying computational problems for this technology have been studied and several models have been proposed in recent literature. Most of these propose combinatorial models; some of them also present statistical approaches. However, it is not a priori clear as to how these models relate to one another and to the underlying problem. We present a uniform framework for the restriction map problems where each of these various models is a specific instance of the basic framework. We achieve this by identifying two "signature" functions f() and g() that characterize the models. We identify the constraints these two functions must satisfy, thus opening up the possibility of exploring other plausible models. We show that for all of the combinatorial models proposed in literature, the signature functions are semi-algebraic. We also analyze a proposed statistical method in this framework and show that the signature functions are transcendental for this model. We also believe that this framework would provide useful guidelines for dealing with other inferencing problems arising in practice. Finally, we indicate the open problems by including a survey of the best known results for these problems.


Subject(s)
Models, Molecular , Restriction Mapping/methods , Algorithms , False Positive Reactions , Models, Biological , Models, Statistical , Optics and Photonics