Search | VHL Regional Portal

A new method to predict the consensus secondary structure of a set of unaligned RNA sequences.

Bouthinon, D; Soldano, H.

Bioinformatics ; 15(10): 785-98, 1999 Oct.

Article in English | MEDLINE | ID: mdl-10705432

ABSTRACT

MOTIVATION: To predict the consensus secondary structure, possibly including pseudoknots, of a set of unaligned RNA sequences. RESULTS: We have designed a method based on a new representation of any RNA secondary structure as a set of structural relationships between the helices of the structure. We refer to this representation as a structural pattern. In a first step, we use thermodynamic parameters to select, for each sequence, the best secondary structures according to energy minimization, and we represent each of them using its corresponding structural pattern. In a second step, we search for the repeated structural patterns, i.e. the largest structural patterns that occur in at least one sequence, i.e. included in at least one of the structural patterns associated with each sequence. Thanks to an efficient encoding of structural patterns, this search comes down to identifying the largest repeated word suffixes in a dictionary. In a third step, we compute the plausibility of each repeated structural pattern by checking whether it occurs more frequently in the studied sequences than in random RNA sequences. We then suppose that the consensus secondary structure corresponds to the repeated structural pattern that displays the highest plausibility. We present several experiments concerning tRNA, fragments of 16S rRNA and 10Sa RNA (including pseudoknots); in each of them, we found the putative consensus secondary structure.
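The abstract states that the search for repeated structural patterns reduces to finding the largest repeated word suffixes in a dictionary, but it does not give the encoding. The sketch below only illustrates that last step on plain strings. The function name `longest_repeated_suffixes`, the support threshold `q`, and the sample words are assumptions made for illustration, not the authors' encoding or implementation.

```python
from collections import defaultdict


def longest_repeated_suffixes(words, q=2):
    """Return the longest suffixes shared by at least q distinct words.

    Hypothetical helper: each word stands in for an encoded structural
    pattern; the real encoding is not described in the abstract.
    """
    support = defaultdict(set)  # suffix -> indices of the words ending with it
    for idx, word in enumerate(words):
        for start in range(len(word)):
            support[word[start:]].add(idx)

    repeated = [s for s, ids in support.items() if len(ids) >= q]
    if not repeated:
        return []
    best = max(len(s) for s in repeated)
    return sorted(s for s in repeated if len(s) == best)


if __name__ == "__main__":
    print(longest_repeated_suffixes(["abcde", "xxcde", "yde"], q=2))
```

This brute-force version enumerates every suffix, which is quadratic in word length; a suffix tree or a trie over reversed words would be the usual choice for larger dictionaries.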

Subject(s)

Computational Biology , Nucleic Acid Conformation , RNA/chemistry , RNA/genetics , Algorithms , Base Sequence , Consensus Sequence , Escherichia coli/chemistry , Escherichia coli/genetics , Molecular Sequence Data , RNA, Bacterial/chemistry , RNA, Bacterial/genetics , RNA, Ribosomal, 16S/chemistry , RNA, Ribosomal, 16S/genetics , RNA, Transfer/chemistry , RNA, Transfer/genetics , Repetitive Sequences, Nucleic Acid , Thermodynamics

Pairwise and multiple identification of three-dimensional common substructures in proteins.

Escalier, V; Pothier, J; Soldano, H; Viari, A.

J Comput Biol ; 5(1): 41-56, 1998.

Article in English | MEDLINE | ID: mdl-9541870

ABSTRACT

In this paper, we present an algorithm to find three-dimensional substructures common to two or more molecules. The basic algorithm is devoted to pairwise structural comparison. Given two sets of atomic coordinates, it finds the largest subsets of atoms which are "similar" in the sense that all internal distances are approximately conserved. The basic idea of the algorithm is to recursively build subsets of increasing sizes, combining two sets of size k to build a set of size k + 1. The algorithm can be used "as is" for small molecules or local parts of proteins (about 30 atoms). When a high number of atoms is involved, we use a two-step procedure. First we look for common "local" fragments by using the previous algorithm, and then we gather these fragments by using a Branch and Bound technique. We also extend the basic algorithm to perform multiple comparisons, by using one of the structures as a reference point (pivot) to which all other structures are compared. The solution consists of the largest subsets of atoms common to the pivot and at least q other structures. Although both algorithms are theoretically exponential in the number of atoms, experiments performed on biological data and using realistic parameters show that the solution is obtained within a few minutes. Finally, an application to the determination of the structural core of seven globins is presented.
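The similarity criterion described above (all internal distances approximately conserved) can be illustrated with a small sketch. It uses a plain depth-first search that grows a matching one atom pair at a time, which is a simpler stand-in for the paper's scheme of combining two sets of size k into a set of size k + 1. The names `compatible`, `largest_common_subset` and the tolerance `tol` are assumptions for illustration.

```python
from math import dist  # Python 3.8+


def compatible(a, b, pairs, new_pair, tol):
    """Check that adding new_pair keeps all internal distances within tol."""
    i, j = new_pair
    return all(abs(dist(a[i], a[k]) - dist(b[j], b[l])) <= tol for k, l in pairs)


def largest_common_subset(a, b, tol):
    """Exhaustive search for a largest distance-conserving atom matching.

    Exponential in the number of atoms; intended for tiny inputs only.
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    best = []

    def extend(pairs, start):
        nonlocal best
        if len(pairs) > len(best):
            best = list(pairs)
        used_b = {l for _, l in pairs}
        for i in range(start, len(a)):
            for j in range(len(b)):
                if j not in used_b and compatible(a, b, pairs, (i, j), tol):
                    extend(pairs + [(i, j)], i + 1)

    extend([], 0)
    return best


if __name__ == "__main__":
    a = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (5, 5, 5)]
    b = [(10, 0, 0), (11, 0, 0), (10, 1, 0), (10, 0, 9)]
    print(largest_common_subset(a, b, tol=0.1))
```

In the example, the first three atoms of `b` are a translated copy of the first three atoms of `a`, so they form the common substructure, while the fourth atoms do not fit within the tolerance.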

Subject(s)

Protein Structure, Tertiary , Algorithms , Amino Acid Sequence , Animals , Computers , Globins/chemistry , Models, Molecular , Molecular Sequence Data , Sequence Alignment , Software

Finding flexible patterns in a text: an application to three-dimensional molecular matching.

Sagot, M F; Viari, A; Pothier, J; Soldano, H.

Comput Appl Biosci ; 11(1): 59-70, 1995 Feb.

Article in English | MEDLINE | ID: mdl-7796276

ABSTRACT

Finding certain regularities in a text is an important problem in many areas, e.g. in the analysis of biological molecules such as nucleic acids or proteins. In the latter case, the text may be sequences of amino acids or a linear coding of three-dimensional structures, and the regularities then correspond to lexical or structural motifs common to two, or more, proteins. We first recall an earlier algorithm that found these regularities in a flexible way. Then we introduce a generalized version of this algorithm designed for the particular case of protein three-dimensional structures, since these structures present a few peculiarities that make them computationally harder to process. Finally, we give some applications of our new algorithm on concrete examples.

Subject(s)

Algorithms , Pattern Recognition, Automated , Proteins/chemistry , Cytochrome P-450 Enzyme System/chemistry , Databases, Factual , Models, Molecular , Models, Statistical , Molecular Structure , Protein Conformation , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data

A distance-based block searching algorithm.

Sagot, M F; Viari, A; Soldano, H.

Proc Int Conf Intell Syst Mol Biol ; 3: 322-31, 1995.

Article in English | MEDLINE | ID: mdl-7584455

ABSTRACT

We present in this paper an algorithm for the multiple comparison of a set of protein sequences. Our approach is that of peptide matching and consists of looking for all the words that occur approximately in at least q of the sequences in the set, where q is a parameter. Words are compared by using a reference object called a model, that is itself a word over the alphabet of the amino acids, and the comparison between a model and a word is based on w-length words instead of single symbols. This idea is similar to the one used in the Blast program in the case of pairwise comparisons. Two w-length words are considered to be related if an alignment without gaps of the two using a similarity matrix has a score greater than a certain threshold value t. In our case, we say that a k-length word u is an occurrence of a model m of the same length if every w-length subword of u is related to the corresponding subword of m in the sense given above. If a model m has occurrences in at least q of the sequences of the set, m is said to occur in the set. In percentage terms, the value of q may correspond to something as small as 5% of the sequences (search for recurrent words in a set of non-homologous proteins) or as high as 70-100% (establishment of a list of all similar words as a first step in a multiple alignment program). The algorithm presented here is an efficient and exact way of looking for all the models, of a fixed length k or of the greatest possible length kmax, that occur in a set of sequences. It can work with any kind of scoring matrix and an extension of the algorithm allows for the introduction of gaps between a model and its occurrences.
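The definitions in this abstract (related w-length words, occurrence of a model, a model occurring in at least q sequences) translate directly into code. The sketch below only checks a given model against a set of sequences; it does not reproduce the paper's efficient enumeration of all models. The toy scoring function `pair_score` is an assumption standing in for a real similarity matrix, and all function names are hypothetical.

```python
HYDROPHOBIC = set("ILVM")


def pair_score(a, b):
    """Toy similarity score (assumption): identity 2, both hydrophobic 1, else -1."""
    if a == b:
        return 2
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        return 1
    return -1


def related(x, y, t):
    """Two w-length words are related if their ungapped alignment scores above t."""
    return len(x) == len(y) and sum(pair_score(a, b) for a, b in zip(x, y)) > t


def is_occurrence(u, m, w, t):
    """u is an occurrence of model m if every w-length subword of u is
    related to the corresponding subword of m."""
    if len(u) != len(m) or len(m) < w:
        return False
    return all(related(u[i:i + w], m[i:i + w], t) for i in range(len(m) - w + 1))


def occurs_in_set(m, sequences, w, t, q):
    """m occurs in the set if it has an occurrence in at least q sequences."""
    k = len(m)
    hits = sum(
        any(is_occurrence(s[i:i + k], m, w, t) for i in range(len(s) - k + 1))
        for s in sequences
    )
    return hits >= q


if __name__ == "__main__":
    seqs = ["GGILVEKGG", "AAAAAAA", "LIVEK"]
    print(occurs_in_set("LIVEK", seqs, w=3, t=2, q=2))
```

Comparing whole w-length windows rather than single symbols lets a weak position be compensated by its neighbours, which is the property the abstract attributes to the Blast-like comparison.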

Subject(s)

Algorithms , Proteins/chemistry , Sequence Homology, Amino Acid , Amino Acid Sequence , Animals , Computer Simulation , Humans , Models, Theoretical , Molecular Sequence Data , Software

From data banks to data bases.

Danchin, A; Médigue, C; Gascuel, O; Soldano, H; Hénaut, A.

Res Microbiol ; 142(7-8): 913-6, 1991.

Article in English | MEDLINE | ID: mdl-1784830

ABSTRACT

The information collected in national and international libraries on nucleotide and protein sequences cannot be handled directly by existing software. Therefore we evaluated the feasibility of constructing a data base for Escherichia coli using the data present in the banks. The know-how thus acquired was applied to Bacillus subtilis. Specific examples of the general procedure are given.

Subject(s)

Bacillus subtilis/ultrastructure , Chromosomes, Bacterial/ultrastructure , DNA, Bacterial/ultrastructure , Databases, Factual , Escherichia coli/ultrastructure , Bacillus subtilis/genetics , Base Sequence/genetics , DNA, Bacterial/genetics , Database Management Systems , Databases, Bibliographic , Escherichia coli/genetics , In Vitro Techniques , Molecular Sequence Data

'Multifrequency' location and clustering of sequence patterns from proteins.

Ollivier, E; Soldano, H; Viari, A.

Comput Appl Biosci ; 7(1): 31-8, 1991 Jan.

Article in English | MEDLINE | ID: mdl-2004272

ABSTRACT

In previous work, we have shown that a set of characteristics, defined as (code frequency) pairs, can be derived from a protein family by the use of a signal-processing method. This method enables the location and extraction of sequence patterns by taking into account each (code frequency) pair individually. In the present paper, we propose to extend this method in order to detect and visualize patterns by taking into account several pairs simultaneously. Two 'multifrequency' methods are described. The first one is based on a rewriting of the sequences with new symbols which summarize the frequency information. The second method is based on a clustering of the patterns associated with each pair. Both methods lead to the definition of significant consensus sequences. Some results obtained with calcium-binding proteins and serine proteases are also discussed.

Subject(s)

Calcium-Binding Proteins/genetics , Serine Endopeptidases/genetics , Amino Acid Sequence , Cluster Analysis , Humans , Molecular Sequence Data , Software

A scale-independent signal processing method for sequence analysis.

Viari, A; Soldano, H; Ollivier, E.

Comput Appl Biosci ; 6(2): 71-80, 1990 Apr.

Article in English | MEDLINE | ID: mdl-2361187

ABSTRACT

In this paper, we present methods to detect and localize patterns in biologically related protein sequences (family). The patterns common to the sequences of the family are detected by using Fourier analysis. No previous scales (codes) are needed; they are actually produced as a result of the analysis procedure, together with the frequencies of the Fourier decompositions. Characteristic features of the family are thus expressed as (code-frequency) pairs. Various tools are proposed in order to localize the patterns, to compare the codes, and to evaluate the proximity of an arbitrary sequence to the investigated family. The general strategy is illustrated on a family composed of calcium-binding proteins.
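As a rough illustration of Fourier analysis on a coded sequence, the sketch below maps each residue to a number and computes a power spectrum. It assumes a fixed, user-supplied code, whereas the method described above derives the codes from the analysis itself; that derivation is not reproduced here. The names `power_spectrum`, `dominant_frequency` and the example code are hypothetical.

```python
import cmath


def power_spectrum(seq, code):
    """Power of the mean-centred coded sequence at frequencies k/N, k = 1..N//2.

    `code` maps residue letters to numbers; unknown letters count as 0.0.
    """
    x = [code.get(ch, 0.0) for ch in seq]
    n = len(x)
    mean = sum(x) / n
    x = [v - mean for v in x]  # remove the zero-frequency component
    spectrum = {}
    for k in range(1, n // 2 + 1):
        f = k / n
        coeff = sum(v * cmath.exp(-2j * cmath.pi * f * pos) for pos, v in enumerate(x))
        spectrum[f] = abs(coeff) ** 2 / n
    return spectrum


def dominant_frequency(seq, code):
    """Frequency (in cycles per residue) carrying the most power."""
    spectrum = power_spectrum(seq, code)
    return max(spectrum, key=spectrum.get)


if __name__ == "__main__":
    toy_code = {"L": 1.0, "A": 0.0}  # assumed code, for illustration only
    print(dominant_frequency("LLAA" * 8, toy_code))
```

A pattern repeating every four residues shows up as a peak at frequency 0.25, which is the sense in which a (code, frequency) pair characterizes a periodic feature of the sequences.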

Subject(s)

Amino Acid Sequence , Signal Processing, Computer-Assisted , Fourier Analysis , Pattern Recognition, Automated , Proteins

Sequence analysis of cell cycle control (cdc2) protein kinases among protein serine/threonine kinases.

Guerrucci, M A; Soldano, H; Bellé, R.

Biol Cell ; 70(1-2): 1-8, 1990.

Article in English | MEDLINE | ID: mdl-2150765

ABSTRACT

Among protein serine/threonine kinases, the CDC2 proteins are both well characterized as protein serine/threonine kinases and functionally involved in the control of cell division. Protein serine/threonine kinase sequences were analysed using Fourier transform of the coded sequences. Characteristic code/frequency pairs were extracted from a set of well defined protein serine/threonine kinases. The characteristic frequencies 0.179, 0.250 and 0.408 distinguished protein serine/threonine kinases from proteins which did not have the biological activity. Pertinent patterns in the sequence, responsible for the detection of the code/frequency pairs, were searched for and found to be correlated with the putative catalytic domain of the proteins. Protein serine/threonine kinases involved in cell division control, CDC2 protein kinases, were compared to the other protein serine/threonine kinases. Specific code/frequency pairs were extracted from the sequences and could be related to the function or regulation of the kinases in cell division. Two CDC2 related proteins, CDC2(Mm) from mice and CDC2(Gg) from chicken, were shown to fit well with the CDC2 proteins, whereas KIN28, PHO85 and PSKJ3, which share sequence homology but not functional activity with the CDC2 proteins, were clearly excluded from the CDC2 proteins by the characteristic code/frequency pairs. Pertinent patterns in the CDC2 proteins were analysed and mapped on the CDC2 related protein sequences. Four patterns were correlated with the code/frequency detection and therefore could be associated with the regulation of the CDC2-related proteins.

Subject(s)

CDC2 Protein Kinase/genetics , Amino Acid Sequence , Animals , Cell Cycle/genetics , Fourier Analysis , Molecular Sequence Data , Protein Kinases/genetics , Protein Serine-Threonine Kinases , Sequence Homology, Nucleic Acid

Statistico-syntactic learning techniques.

Soldano, H; Moisy, J L.

Biochimie ; 67(5): 493-8, 1985 May.

Article in English | MEDLINE | ID: mdl-3839691

ABSTRACT

The methods of "learning from examples" enable the solving of classification problems: discrimination between two classes of objects, or assignment of an object to a class of objects representing a property. They are used in situations where no decision procedure is known a priori, but a sufficient number of examples is available. After a learning stage with the examples, a procedure to solve the problem is built. In the methodology presented, the description of an object is a list of attributes, and the acquired knowledge consists of sets of "rules" considered as arguments in favour of a particular decision.

Subject(s)

Computers , Learning , Software , Discrimination Learning , Logic , Mathematics , Statistics as Topic
