Results 1 - 13 of 13
1.
Mech Ageing Dev ; 126(1): 193-208, 2005 Jan.
Article in English | MEDLINE | ID: mdl-15610779

ABSTRACT

The diverse nature of cancer- and aging-related genes presents a challenge for large-scale studies based on molecular sequence and profiling data. An underexplored source of data for modeling and analysis is the textual descriptions and annotations present in curated gene-centered biomedical corpora. Here, 450 genes designated by surveys of the scientific literature as being associated with cancer and aging were analyzed using two complementary approaches. The first, ensemble attribute profile clustering, is a recently formulated, text-based, semi-automated data interpretation strategy that exploits ideas from statistical information retrieval to discover and characterize groups of genes with common structural and functional properties. Groups of genes with shared and unique Gene Ontology terms and protein domains were defined and examined. Human homologs of a group of known Drosophila aging-related genes are candidates for genes that may influence lifespan (hep/MAPK2K7, bsk/MAPK8, puc/LOC285193). These JNK pathway-associated proteins may specify a molecular hub that coordinates and integrates multiple intra- and extracellular processes via space- and time-dependent interactions with proteins in other pathways. The second approach, a qualitative examination of the chromosomal locations of 311 human cancer- and aging-related genes, provides anecdotal evidence for a "phenotype position effect": genes that are proximal in the linear genome often encode proteins involved in the same phenomenon. Comparative genomics was employed to enhance understanding of several genes, including open reading frames, identified as new candidates for genes with roles in aging or cancer. Overall, the results highlight fundamental molecular and mechanistic connections between progenitor/stem cell lineage determination, embryonic morphogenesis, cancer, and aging. Despite diversity in the nature of the molecular and cellular processes associated with these phenomena, they seem related to the architectural hub of tissue polarity and a need to generate and control this property in a timely manner.


Subject(s)
Aging/genetics , Algorithms , Databases, Genetic , Genes , Neoplasms/genetics , Proteins/genetics , Computational Biology/methods
2.
J Comput Biol ; 11(6): 1073-89, 2004.
Article in English | MEDLINE | ID: mdl-15662199

ABSTRACT

Molecular profiling studies can generate abundance measurements for thousands of transcripts, proteins, metabolites, or other species in, for example, normal and tumor tissue samples. Treating such measurements as features and the samples as labeled data points, sparse hyperplanes provide a statistical methodology for classifying data points into one of two categories (classification and prediction) and defining a small subset of discriminatory features (relevant feature identification). However, this and other extant classification methods address only implicitly the issue of observed data being a combination of underlying signals and noise. Recently, robust optimization has emerged as a powerful framework for handling uncertain data explicitly. Here, ideas from this field are exploited to develop robust sparse hyperplanes, i.e., classification and relevant feature identification algorithms that are resilient to variation in the data. Specifically, each data point is associated with an explicit data uncertainty model in the form of an ellipsoid parameterized by a center and covariance matrix. The task of learning a robust sparse hyperplane from such data is formulated as a second-order cone program (SOCP). Gaussian and distribution-free data uncertainty models are shown to yield SOCPs that are equivalent to the SOCP based on ellipsoidal uncertainty. The real-world utility of robust sparse hyperplanes is demonstrated via retrospective analysis of breast cancer related transcript profiles. Data-dependent heuristics are used to compute the parameters of each ellipsoidal data uncertainty model. The generalization performance of a specific implementation, designated "robust LIKNON," is better than its nominal counterpart. Finally, the strengths and limitations of robust sparse hyperplanes are discussed.
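
As a rough illustration of the ellipsoidal uncertainty idea in this abstract, the sketch below computes the worst-case signed margin of a hyperplane over one data ellipsoid. It assumes the ellipsoid is the set {center + cov^(1/2) u : ||u|| <= 1}, for which the minimum of w·x is w·center minus sqrt(w^T cov w); this square-root term is what gives rise to a second-order cone constraint. The function name, argument layout, and unit ellipsoid radius are assumptions made for illustration and are not taken from the paper or from robust LIKNON.

```python
import math


def worst_case_margin(w, b, center, cov, label):
    """Smallest signed margin of the hyperplane (w, b) over one data ellipsoid.

    Assumed uncertainty set: {center + cov^(1/2) u : ||u|| <= 1}.
    label is +1 or -1; cov is a symmetric positive semidefinite matrix
    given as a list of lists.
    """
    # Margin at the ellipsoid center (the nominal, noise-free data point).
    nominal = label * (sum(wi * ci for wi, ci in zip(w, center)) + b)
    # Quadratic form w^T cov w, written out to avoid external libraries.
    n = len(w)
    quad = sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))
    # The worst point of the ellipsoid lowers the margin by ||cov^(1/2) w||.
    return nominal - math.sqrt(quad)


# A point is robustly classified with margin 1 if the worst case is still >= 1.
margin = worst_case_margin([1.0, 0.0], 0.0, [2.0, 0.0], [[1.0, 0.0], [0.0, 4.0]], +1)
print(margin >= 1.0)
```

In a full formulation, a constraint of this kind would be imposed for every sample while a sparsity-inducing penalty on w is minimized; that optimization step is omitted here.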


Subject(s)
Computational Biology , Sequence Analysis, DNA/statistics & numerical data , Sequence Analysis, Protein/statistics & numerical data , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Data Interpretation, Statistical , Female , Genes, BRCA1 , Genes, BRCA2 , Humans
3.
Mech Ageing Dev ; 124(1): 109-14, 2003 Jan.
Article in English | MEDLINE | ID: mdl-12618013

ABSTRACT

Transcript profiling can be used to elucidate the molecular and cellular mechanisms involved in ageing and cancer. A recent study of human gastrointestinal stromal tumours (GISTs) with mutations in the KIT gene (Cancer Res. 61 (2001) 8624) exemplifies a common type of investigation. cDNA microarrays were used to generate measurements for 1987 clones in two types of tissues: 13 KIT mutation-positive GISTs and 6 spindle cell tumours from locations outside the gastrointestinal tract. Statistical problems associated with such two-class, high-dimensional profiling data include simultaneous classification and relevant feature identification, probabilistic clustering and protein sequence family modelling. Here, the GIST data were reexamined using specific solutions to these problems, namely sparse hyperplanes, naïve Bayes models and profile hidden Markov models, respectively. The integrated analysis of molecular profiling and sequence data highlighted 6 clones that may be of clinical and experimental interest. The protein encoded by one of these putative biomarkers defined a novel protein family present in diverse eucarya. The family may be involved in chromosome segregation and/or stability. One family member is a potential biomarker identified recently from a retrospective analysis of transcript profiles for sporadic breast cancer samples from patients with poor and good prognosis (Signal Process., in press).


Subject(s)
Gene Expression Profiling/statistics & numerical data , Sequence Analysis, Protein/statistics & numerical data , Amino Acid Sequence , Animals , Bayes Theorem , Carcinoma/genetics , Cluster Analysis , Data Interpretation, Statistical , Gastrointestinal Neoplasms/genetics , Humans , Markov Chains , Models, Statistical , Molecular Sequence Data , Mutation , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Proto-Oncogene Proteins c-kit/genetics , Sequence Homology, Amino Acid , Transcription, Genetic
4.
Pac Symp Biocomput ; : 263-74, 2001.
Article in English | MEDLINE | ID: mdl-11262946

ABSTRACT

Computer-aided sequence analysis is a critical aspect of current biological research. Sequence information from the genome sequencing projects fills databases so quickly that humans cannot examine it all. Hence there is a heavy reliance on computer algorithms to point out the few important nuggets for human examination. Sequence search algorithms range from simple to complex, as does the representation of the biological data. Typically, though, simple algorithms are used on the simplest of data representations because of the large computational demands of anything more complex. This leads to missed hits because the simple search techniques are often not sufficiently sensitive. Here we describe the implementation of several sensitive sequence analysis algorithms on the Kestrel parallel processor, a single-instruction multiple-data (SIMD) processor developed and built at UCSC. Performance of the Smith-Waterman and Hidden Markov Model algorithms, with both Viterbi and Expectation Maximization methods, ranges from 6 to 20 times faster than standard computers.
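
For readers unfamiliar with the Smith-Waterman algorithm named in this abstract, the following is a minimal serial sketch of its scoring recurrence. It is not the Kestrel SIMD implementation; the function name, the linear gap penalty, and the scoring values (match 2, mismatch -1, gap -1) are assumptions chosen only to keep the example small.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Uses the Smith-Waterman recurrence with a linear gap penalty and
    keeps only two rows of the dynamic programming table.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Each cell depends only on its left, upper, and upper-left neighbours, so cells along an anti-diagonal are independent of one another; this is the property that parallel hardware can exploit.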


Subject(s)
Algorithms , Computers , Sequence Analysis/statistics & numerical data , Databases, Factual , Markov Chains
5.
Proteins ; Suppl 5: 86-91, 2001.
Article in English | MEDLINE | ID: mdl-11835485

ABSTRACT

This article presents results of blind predictions submitted to the CASP4 protein structure prediction experiment. We made two sets of predictions: one using the fully automated SAM-T99 server and one using the improved SAM-T2K method with human intervention. Both methods use iterative hidden Markov model-based methods for constructing protein family profiles, using only sequence information. Although the SAM-T99 method is purely sequence based, the SAM-T2K method uses the predicted secondary structure of the target sequence and the known secondary structure of the templates to improve fold recognition and alignment. In this article, we try to determine what aspects of the SAM-T2K method were responsible for its significantly better performance in the CASP4 experiment in the hopes of producing a better automatic prediction server. The use of secondary structure prediction seems to be the most valuable single improvement, though the combined total of various human interventions is probably at least as important.


Subject(s)
DNA-Binding Proteins , Models, Molecular , Protein Conformation , Adenosine Triphosphatases/chemistry , Bacterial Proteins/chemistry , Computer Simulation , Endodeoxyribonucleases/chemistry , Escherichia coli Proteins/chemistry , Lyases/chemistry , MutS DNA Mismatch-Binding Protein , Neural Networks, Computer , Protein Structure, Tertiary , Repressor Proteins/chemistry , Research Design , Sequence Alignment , Sequence Analysis, Protein
6.
Bioinformatics ; 16(5): 478-81, 2000 May.
Article in English | MEDLINE | ID: mdl-10871270

ABSTRACT

MOTIVATION: The antizymes (AZ) are proteins that regulate cellular polyamine pools in metazoa. To search for remote homologs in single-celled eukaryotes, we used computer software based on hidden Markov models. The most divergent homolog detected was that of the fission yeast Schizosaccharomyces pombe. Sequence identities between S. pombe AZ and known AZs are as low as 18-22% in the most conserved C-terminal regions. The authenticity of the S. pombe AZ is validated by the presence of a conserved nucleotide sequence that, in metazoa, promotes a +1 programmed ribosomal frameshift required for AZ expression. However, no homolog was detected in the completed genome of the budding yeast Saccharomyces cerevisiae. Procedural details and supplementary information can be found at http://itsa.ucsf.edu/~czhu/AZ.


Subject(s)
Fungal Proteins/genetics , Proteins/genetics , Saccharomyces cerevisiae/genetics , Schizosaccharomyces/genetics , Amino Acid Sequence , Animals , Conserved Sequence , Enzyme Inhibitors/chemistry , Genome, Fungal , Humans , Molecular Sequence Data , Ornithine Decarboxylase Inhibitors , Sequence Homology, Amino Acid , Species Specificity
7.
Nucleic Acids Res ; 28(8): 1700-6, 2000 Apr 15.
Article in English | MEDLINE | ID: mdl-10734188

ABSTRACT

Correct identification of all introns is necessary to discern the protein-coding potential of a eukaryotic genome. The existence of most of the spliceosomal introns predicted in the genome of Saccharomyces cerevisiae remains unsupported by molecular evidence. We tested the intron predictions for 87 introns predicted to be present in non-ribosomal protein genes, more than a third of all known or suspected introns in the yeast genome. Evidence supporting 61 of these predictions was obtained, 20 predicted intron sequences were not spliced and six predictions identified an intron-containing region but failed to specify the correct splice sites, yielding a successful prediction rate of <80%. Alternative splicing has not been previously described for this organism, and we identified two genes (YKL186C/MTR2 and YML034W) which encode alternatively spliced mRNAs; YKL186C/MTR2 produces at least five different spliced mRNAs. One gene (YGR225W/SPO70) has an intron whose removal is activated during meiosis under control of the MER1 gene. We found eight new introns, suggesting that numerous introns still remain to be discovered. The results show that correct prediction of introns remains a significant barrier to understanding the structure, function and coding capacity of eukaryotic genomes, even in a supposedly simple system like yeast.


Subject(s)
Alternative Splicing , Gene Expression Regulation, Fungal , Introns , Meiosis/genetics , RNA, Messenger/genetics , Saccharomyces cerevisiae Proteins , Saccharomyces cerevisiae/genetics , Fungal Proteins/metabolism , RNA-Binding Proteins/metabolism , Saccharomyces cerevisiae/cytology
8.
Proteins ; Suppl 3: 121-5, 1999.
Article in English | MEDLINE | ID: mdl-10526360

ABSTRACT

This paper presents results of blind predictions submitted to the CASP3 protein structure prediction experiment. We made predictions using the SAM-T98 method, an iterative hidden Markov model-based method for constructing protein family profiles. The method is purely sequence-based, using no structural information, and yet was able to predict structures as well as all but five of the structure-based methods in CASP3.


Subject(s)
Proteins/chemistry , Algorithms , Amino Acid Sequence , Markov Chains , Molecular Sequence Data , Protein Structure, Secondary , Sequence Alignment
10.
RNA ; 5(2): 221-34, 1999 Feb.
Article in English | MEDLINE | ID: mdl-10024174

ABSTRACT

Introns have typically been discovered in an ad hoc fashion: introns are found as a gene is characterized for other reasons. As complete eukaryotic genome sequences become available, better methods for predicting RNA processing signals in raw sequence will be necessary in order to discover genes and predict their expression. Here we present a catalog of 228 yeast introns, arrived at through a combination of bioinformatic and molecular analysis. Introns annotated in the Saccharomyces Genome Database (SGD) were evaluated, questionable introns were removed after failing a test for splicing in vivo, and known introns absent from the SGD annotation were added. A novel branchpoint sequence, AAUUAAC, was identified within an annotated intron that lacks a six-of-seven match to the highly conserved branchpoint consensus UACUAAC. Analysis of the database corroborates many conclusions about pre-mRNA substrate requirements for splicing derived from experimental studies, but indicates that splicing in yeast may not be as rigidly determined by splice-site conservation as had previously been thought. Using this database and a molecular technique that directly displays the lariat intron products of spliced transcripts (intron display), we suggest that the current set of 228 introns is still not complete, and that additional intron-containing genes remain to be discovered in yeast. The database can be accessed at http://www.cse.ucsc.edu/research/compbio/yeast_introns.html.
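
The "six-of-seven match" criterion mentioned in this abstract can be made concrete with a small sketch that counts position-by-position agreement with the consensus UACUAAC. The function names and the sliding-window search are illustrative assumptions, not the authors' pipeline. Under this count, the novel sequence AAUUAAC agrees with the consensus at five of seven positions.

```python
CONSENSUS = "UACUAAC"  # branchpoint consensus quoted in the abstract


def consensus_matches(heptamer, consensus=CONSENSUS):
    """Count positions at which a candidate agrees with the consensus."""
    if len(heptamer) != len(consensus):
        raise ValueError("candidate and consensus must have the same length")
    return sum(x == y for x, y in zip(heptamer, consensus))


def best_branchpoint(seq, consensus=CONSENSUS):
    """Return (matches, offset, window) for the best-matching window in seq.

    DNA input is accepted: the sequence is upper-cased and T is read as U.
    Ties are resolved in favour of the leftmost window.
    """
    seq = seq.upper().replace("T", "U")
    k = len(consensus)
    windows = [
        (consensus_matches(seq[i:i + k], consensus), i, seq[i:i + k])
        for i in range(len(seq) - k + 1)
    ]
    if not windows:
        raise ValueError("sequence is shorter than the consensus")
    return max(windows, key=lambda w: (w[0], -w[1]))


print(consensus_matches("AAUUAAC"))
print(best_branchpoint("GGGTACTAACGG"))
```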


Subject(s)
Computational Biology , Genome, Fungal , Introns/genetics , Saccharomyces cerevisiae/genetics , Databases as Topic , Markov Chains , RNA Precursors/genetics , RNA Splicing/genetics , RNA, Small Nuclear/genetics , Ribosomal Proteins/genetics , Spliceosomes/genetics
11.
J Acquir Immune Defic Syndr Hum Retrovirol ; 17(5): 398-403, 1998 Apr 15.
Article in English | MEDLINE | ID: mdl-9562041

ABSTRACT

It has been proposed on the basis of sequence analysis that HIV-1 encodes a protein containing the amino acid selenocysteine (Sec). Selenocysteine is known to be incorporated into protein in response to a specific RNA secondary structure motif within the mRNA that is being translated. This RNA motif, the selenocysteine insertion sequence (SECIS) element, has not yet been identified in the HIV genome by either biological or computational methods. This report uses computer-based sequence analysis to identify those locations in HIV-1 strain HXB2 where the current model of the SECIS element could exist. One particularly good match to the SECIS element occurs in an interesting location, spanning the end of env and the start of nef, in a position theoretically capable of directing the previously proposed Sec incorporation.


Subject(s)
HIV-1/genetics , Selenocysteine/genetics , Algorithms , Base Sequence , Computational Biology , DNA Transposable Elements/genetics , Databases, Factual , Genome, Viral , HIV-1/chemistry , Humans , Molecular Sequence Data , Nucleic Acid Conformation , RNA, Viral/chemistry , RNA, Viral/genetics , Selenocysteine/chemistry , Sequence Alignment , Sequence Homology, Nucleic Acid
12.
Article in English | MEDLINE | ID: mdl-7584430

ABSTRACT

We have developed a method for predicting the common secondary structure of large RNA multiple alignments using only the information in the alignment. It uses a series of progressively more sensitive searches of the data in an iterative manner to discover regions of base pairing; the first pass examines the entire multiple alignment. The searching uses two methods to find base pairings. Mutual information is used to measure covariation between pairs of columns in the multiple alignment and a minimum length encoding method is used to detect column pairs with high potential to base pair. Dynamic programming is used to recover the optimal tree made up of the best potential base pairs and to create a stochastic context-free grammar. The information in the tree guides the next iteration of searching. The method is similar to the traditional comparative sequence analysis technique. The method correctly identifies most of the common secondary structure in 16S and 23S rRNA.
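
The covariation measure named in this abstract, mutual information between two alignment columns, can be sketched as follows. This is the textbook definition in bits; how the authors treated gaps, ambiguous bases, or small-sample effects is not stated in the abstract, so this sketch simply treats every character, gaps included, as an ordinary symbol. The function name is hypothetical.

```python
import math
from collections import Counter


def column_mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns.

    col_i and col_j hold one character per sequence, in the same
    sequence order. Every character is treated as an ordinary symbol.
    """
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    pair_counts = Counter(zip(col_i, col_j))
    i_counts = Counter(col_i)
    j_counts = Counter(col_j)
    mi = 0.0
    for (x, y), count in pair_counts.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), rewritten in terms of raw counts.
        ratio = count * n / (i_counts[x] * j_counts[y])
        mi += p_xy * math.log2(ratio)
    return mi


# Perfectly covarying Watson-Crick columns carry the maximum of 2 bits.
print(column_mutual_information("ACGU", "UGCA"))
```

Columns that vary together while preserving the ability to pair score high, whereas a fully conserved column scores zero regardless of its partner, which is one reason a second detection method is useful alongside mutual information.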


Subject(s)
Markov Chains , Nucleic Acid Conformation , RNA/chemistry , Stochastic Processes , Algorithms , Base Composition , Base Sequence , Escherichia coli/genetics , Models, Structural , Molecular Sequence Data , RNA, Bacterial/chemistry , RNA, Ribosomal, 16S/chemistry , RNA, Ribosomal, 23S/chemistry
13.
Article in English | MEDLINE | ID: mdl-7584383

ABSTRACT

A new method of discovering the common secondary structure of a family of homologous RNA sequences using Gibbs sampling and stochastic context-free grammars is proposed. Given an unaligned set of sequences, a Gibbs sampling step simultaneously estimates the secondary structure of each sequence and a set of statistical parameters describing the common secondary structure of the set as a whole. These parameters describe a statistical model of the family. After the Gibbs sampling has produced a crude statistical model for the family, this model is translated into a stochastic context-free grammar, which is then refined by an Expectation Maximization (EM) procedure to produce a more complete model. A prototype implementation of the method is tested on tRNA, pieces of 16S rRNA and on U5 snRNA with good results.


Subject(s)
Models, Theoretical , RNA/analysis , Animals , Base Sequence , Humans , Molecular Sequence Data , Molecular Structure , Stochastic Processes