Results 1 - 15 of 15
1.
BMC Bioinformatics ; 15 Suppl 7: S1, 2014.
Article in English | MEDLINE | ID: mdl-25079667

ABSTRACT

BACKGROUND: Drug discovery, disease detection, and personalized medicine are fast-growing areas of genomic research. With the advancement of next-generation sequencing techniques, researchers can obtain an abundance of data for many different biological assays in a short period of time. When this data is error-free, the result is a high-quality base-pair resolution picture of the genome. However, when the data is lossy, the heuristic algorithms currently used when aligning next-generation sequences cause the corresponding accuracy to drop. RESULTS: This paper describes a program, ADaM (APF DNA Mapper), which significantly increases final alignment accuracy. ADaM works by first using an existing program to align "easy" sequences, and then using an algorithm with accuracy guarantees (the APF) to align the remaining sequences. The final result is a technique that increases the mapping accuracy from only 60% to over 90% for harder-to-align sequences.


Subject(s)
Algorithms , Artificial Intelligence , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Animals , Base Sequence , Genome , High-Throughput Nucleotide Sequencing/methods , Humans
2.
BMC Syst Biol ; 7 Suppl 4: S13, 2013.
Article in English | MEDLINE | ID: mdl-24565058

ABSTRACT

BACKGROUND: The analysis of RNA sequences, once a small niche field for a small collection of scientists whose primary emphasis was the structure and function of a few RNA molecules, has grown most significantly with the realizations that 1) RNA is implicated in many more functions within the cell, and 2) the analysis of ribosomal RNA sequences is revealing more about the microbial ecology within all biological and environmental systems. The accurate and rapid alignment of these RNA sequences is essential to decipher the maximum amount of information from this data. METHODS: Two computer systems that utilize the Gutell lab's RNA Comparative Analysis Database (rCAD) were developed to align sequences to an existing template alignment available at the Gutell lab's Comparative RNA Web (CRW) Site. Multiple dimensions of cross-indexed information are contained within the relational database--rCAD, including sequence alignments, the NCBI phylogenetic tree, and comparative secondary structure information for each aligned sequence. The first program, CRWAlign-1 creates a phylogenetic-based sequence profile for each column in the alignment. The second program, CRWAlign-2 creates a profile based on phylogenetic, secondary structure, and sequence information. Both programs utilize their profiles to align new sequences into the template alignment. RESULTS: The accuracies of the two CRWAlign programs were compared with the best template-based rRNA alignment programs and the best de-novo alignment programs. We have compared our programs with a total of eight alternative alignment methods on different sets of 16S rRNA alignments with sequence percent identities ranging from 50% to 100%. Both CRWAlign programs were superior to these other programs in accuracy and speed. CONCLUSIONS: Both CRWAlign programs can be used to align the very extensive amount of RNA sequencing that is generated due to the rapid next-generation sequencing technology. This latter technology is augmenting the new paradigm that RNA is intimately implicated in a significant number of functions within the cell. In addition, the use of bacterial 16S rRNA sequencing in the identification of the microbiome in many different environmental systems creates a need for rapid and highly accurate alignment of bacterial 16S rRNA sequences.


Subject(s)
Computational Biology/methods , Phylogeny , RNA/genetics , Sequence Alignment/methods , Templates, Genetic , Base Sequence , Databases, Genetic , RNA, Bacterial/genetics , RNA, Ribosomal, 16S/genetics
3.
Article in English | MEDLINE | ID: mdl-24772376

ABSTRACT

The rapid determination of nucleic acid sequences is increasing the number of sequences that are available. Inherent in a template or seed alignment is the culmination of structural and functional constraints that are selecting those mutations that are viable during the evolution of the RNA. While we might not understand these structural and functional constraints, template-based alignment programs utilize the patterns of sequence conservation to encapsulate the characteristics of viable RNA sequences that are aligned properly. We have developed a program that utilizes the different dimensions of information in rCAD, a large RNA informatics resource, to establish a profile for each position in an alignment. The most significant include sequence identity and column composition in different phylogenetic taxa. We have compared our methods with a maximum of eight alternative alignment methods on different sets of 16S and 23S rRNA sequences with sequence percent identities ranging from 50% to 100%. The results showed that CRWAlign outperformed the other alignment methods in both speed and accuracy. A web-based alignment server is available at http://www.rna.ccbb.utexas.edu/SAE/2F/CRWAlign.

4.
J Biomed Semantics ; 2 Suppl 1: S3, 2011 Mar 07.
Article in English | MEDLINE | ID: mdl-21388572

ABSTRACT

BACKGROUND: Ontologies are commonly used in biomedicine to organize concepts to describe domains such as anatomies, environments, experiments, taxonomies, etc. NCBO BioPortal currently hosts about 180 different biomedical ontologies. These ontologies have been mainly expressed in either the Open Biomedical Ontology (OBO) format or the Web Ontology Language (OWL). OBO emerged from the Gene Ontology, and supports most of the biomedical ontology content. In comparison, OWL is a Semantic Web language, and is supported by the World Wide Web Consortium together with integral query languages, rule languages and distributed infrastructure for information interchange. These features are highly desirable for the OBO content as well. A convenient method for leveraging these features for OBO ontologies is by transforming OBO ontologies to OWL. RESULTS: We have developed a methodology for translating OBO ontologies to OWL using the organization of the Semantic Web itself to guide the work. The approach reveals that the constructs of OBO can be grouped together to form a similar layer cake. Thus we were able to decompose the problem into two parts. Most OBO constructs have easy and obvious equivalence to a construct in OWL. A small subset of OBO constructs requires deeper consideration. We have defined transformations for all constructs in an effort to foster a standard common mapping between OBO and OWL. Our mapping produces OWL-DL, a Description Logics based subset of OWL with desirable computational properties for efficiency and correctness. Our Java implementation of the mapping is part of the official Gene Ontology project source. CONCLUSIONS: Our transformation system provides a lossless roundtrip mapping for OBO ontologies, i.e. an OBO ontology may be translated to OWL and back without loss of knowledge. In addition, it provides a roadmap for bridging the gap between the two ontology languages in order to enable the use of ontology content in a language independent manner.

5.
Article in English | MEDLINE | ID: mdl-24772375

ABSTRACT

We present a fast pairwise RNA sequence alignment method using structural information, named R-PASS (RNA Pairwise Alignment of Structure and Sequence), which shows good accuracy on sequences with low sequence identity and is significantly faster than alternative methods. The method begins by representing RNA secondary structure as a set of structure motifs. The motifs from two RNAs are then used as input into a bipartite graph-matching algorithm, which determines the structure matches. The matches are then used as constraints in a constrained dynamic programming sequence alignment procedure. The R-PASS method has an O(nm) complexity. We compare our method with two other structure-based alignment methods, LARA and ExpaLoc, and with a sequence-based alignment method, MAFFT, across three benchmarks and obtain favorable results in accuracy, with speeds that are orders of magnitude faster.

6.
Bioinformatics ; 25(22): 2955-61, 2009 Nov 15.
Article in English | MEDLINE | ID: mdl-19633097

ABSTRACT

MOTIVATION: High-throughput protein identification experiments based on tandem mass spectrometry (MS/MS) often suffer from low sensitivity and low-confidence protein identifications. In a typical shotgun proteomics experiment, it is assumed that all proteins are equally likely to be present. However, there is often other evidence to suggest that a protein is present and confidence in individual protein identification can be updated accordingly. RESULTS: We develop a method that analyzes MS/MS experiments in the larger context of the biological processes active in a cell. Our method, MSNet, improves protein identification in shotgun proteomics experiments by considering information on functional associations from a gene functional network. MSNet substantially increases the number of proteins identified in the sample at a given error rate. We identify 8-29% more proteins than the original MS experiment when applied to yeast grown in different experimental conditions analyzed on different MS/MS instruments, and 37% more proteins in a human sample. We validate up to 94% of our identifications in yeast by presence in ground-truth reference sets. AVAILABILITY AND IMPLEMENTATION: Software and datasets are available at http://aug.csres.utexas.edu/msnet


Subject(s)
Computational Biology/methods , Gene Regulatory Networks , Proteins/chemistry , Proteome/analysis , Proteomics/methods , Tandem Mass Spectrometry/methods , Databases, Protein
7.
Bioinformatics ; 25(11): 1397-403, 2009 Jun 01.
Article in English | MEDLINE | ID: mdl-19318424

ABSTRACT

MOTIVATION: Tandem mass spectrometry (MS/MS) offers fast and reliable characterization of complex protein mixtures, but suffers from low sensitivity in protein identification. In a typical shotgun proteomics experiment, it is assumed that all proteins are equally likely to be present. However, there is often other information available, e.g. the probability of a protein's presence is likely to correlate with its mRNA concentration. RESULTS: We develop a Bayesian score that estimates the posterior probability of a protein's presence in the sample given its identification in an MS/MS experiment and its mRNA concentration measured under similar experimental conditions. Our method, MSpresso, substantially increases the number of proteins identified in an MS/MS experiment at the same error rate, e.g. in yeast, MSpresso increases the number of proteins identified by approximately 40%. We apply MSpresso to data from different MS/MS instruments, experimental conditions and organisms (Escherichia coli, human), and predict 19-63% more proteins across the different datasets. MSpresso demonstrates that incorporating prior knowledge of protein presence into shotgun proteomics experiments can substantially improve protein identification scores. AVAILABILITY AND IMPLEMENTATION: Software is available upon request from the authors. Mass spectrometry datasets and supplementary information are available from (http://www.marcottelab.org/MSpresso/).


Subject(s)
Proteins/chemistry , Proteomics/methods , RNA, Messenger/metabolism , Bayes Theorem , Databases, Protein , Humans , Proteome/analysis , Proteome/genetics , Proteome/metabolism , Software , Tandem Mass Spectrometry/methods , User-Computer Interface
8.
J Biomed Inform ; 41(5): 730-8, 2008 Oct.
Article in English | MEDLINE | ID: mdl-18599379

ABSTRACT

Life science identifier (LSID) is a global unique identifier standard intended to help rationalize the unique archival requirements of biological data. We describe an LSID implementation architecture such that data managed by a relational database management system may be integrated with the LSID protocol as an add-on layer. The approach requires a database administrator (DBA) to specify an export schema detailing the content and structure of the archived data, and a mapping of the existing database to that schema. This specification can be expressed using SQL view syntax. In effect, we define a SQL-like language for implementing LSIDs. We describe the mapping of the view definition to an implementation as a set of database triggers and a fixed runtime library. Thus a compiler for a domain-specific language could be written that would reduce the implementation of LSIDs to the task of writing SQL view-like definitions.


Subject(s)
Computational Biology/methods , Database Management Systems/organization & administration , Natural Language Processing , Software Design , Animals , Databases, Factual , Humans , Online Systems/organization & administration , Programming Languages
9.
BMC Bioinformatics ; 8: 445, 2007 Nov 15.
Article in English | MEDLINE | ID: mdl-18005433

ABSTRACT

BACKGROUND: Cis-acting transcriptional regulatory elements in mammalian genomes typically contain specific combinations of binding sites for various transcription factors. Although some cis-regulatory elements have been well studied, the combinations of transcription factors that regulate normal expression levels for the vast majority of the 20,000 genes in the human genome are unknown. We hypothesized that it should be possible to discover transcription factor combinations that regulate gene expression in concert by identifying over-represented combinations of sequence motifs that occur together in the genome. In order to detect combinations of transcription factor binding motifs, we developed a data mining approach based on the use of association rules, which are typically used in market basket analysis. We scored each segment of the genome for the presence or absence of each of 83 transcription factor binding motifs, then used association rule mining algorithms to mine this dataset, thus identifying frequently occurring pairs of distinct motifs within a segment. RESULTS: Support for most pairs of transcription factor binding motifs was highly correlated across different chromosomes although pair significance varied. Known true positive motif pairs showed higher association rule support, confidence, and significance than background. Our subsets of high-confidence, high-significance mined pairs of transcription factors showed enrichment for co-citation in PubMed abstracts relative to all pairs, and the predicted associations were often readily verifiable in the literature. CONCLUSION: Functional elements in the genome where transcription factors bind to regulate expression in a combinatorial manner are more likely to be predicted by identifying statistically and biologically significant combinations of transcription factor binding motifs than by simply scanning the genome for the occurrence of binding sites for a single transcription factor.


Subject(s)
Algorithms , Electronic Data Processing/methods , Genome, Human , Regulatory Elements, Transcriptional , Transcription Factors/metabolism , Binding Sites , Databases, Factual , Forecasting/methods , Gene Expression Profiling , Humans , Microarray Analysis , Models, Biological , Protein Binding
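
The abstract above scores motif pairs by association-rule support and confidence. As a rough illustration of those two measures only, here is a minimal Python sketch. It assumes each genome segment has already been reduced to the set of motifs present in it; the function name `pair_rules`, the motif names, and the toy data are hypothetical and are not taken from the paper, and the paper's significance test is not reproduced.

```python
from collections import Counter
from itertools import combinations


def pair_rules(segments, min_support=0.0):
    """Compute support and confidence for every ordered motif pair.

    segments: list of sets, each holding the motifs present in one segment.
    Returns {(a, b): (support, confidence)} for the rule "a implies b".
    """
    n = len(segments)
    single = Counter()  # number of segments containing each motif
    pair = Counter()    # number of segments containing both motifs of a pair
    for motifs in segments:
        single.update(motifs)
        pair.update(combinations(sorted(motifs), 2))

    rules = {}
    for (a, b), count in pair.items():
        support = count / n  # fraction of segments containing both motifs
        if support < min_support:
            continue
        # Confidence of "a implies b" is P(b present | a present).
        rules[(a, b)] = (support, count / single[a])
        rules[(b, a)] = (support, count / single[b])
    return rules


# Hypothetical toy input: four segments, three motif names.
segments = [{"SP1", "NFY"}, {"SP1", "NFY", "E2F"}, {"SP1"}, {"E2F"}]
rules = pair_rules(segments)
```

Support is symmetric in the two motifs, while confidence is directional, which is why each unordered pair yields two rules.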
10.
Bioinformatics ; 23(24): 3289-96, 2007 Dec 15.
Article in English | MEDLINE | ID: mdl-17921494

ABSTRACT

MOTIVATION: Biclustering is a clustering method that simultaneously clusters both the domain and range of a relation. A challenge in multiple sequence alignment (MSA) is that the alignment of sequences is often intended to reveal groups of conserved functional subsequences. Simultaneously, the grouping of the sequences can impact the alignment; precisely the kind of dual situation biclustering is intended to address. RESULTS: We define a representation of the MSA problem enabling the application of biclustering algorithms. We develop a computer program for local MSA, BlockMSA, that combines biclustering with divide-and-conquer. BlockMSA simultaneously finds groups of similar sequences and locally aligns subsequences within them. Further alignment is accomplished by dividing both the set of sequences and their contents. The net result is both a multiple sequence alignment and a hierarchical clustering of the sequences. BlockMSA was tested on the subsets of the BRAliBase 2.1 benchmark suite that display high variability and on an extension to that suite to larger problem sizes. Also, alignments were evaluated of two large datasets of current biological interest, T box sequences and Group IC1 Introns. The results were compared with alignments computed by ClustalW, MAFFT, MUSCLE and PROBCONS alignment programs using Sum of Pairs (SPS) and Consensus Count. Results for the benchmark suite are sensitive to problem size. On problems of 15 or more sequences, BlockMSA is consistently the best. On none of the problems in the test suite are there appreciable differences in scores among BlockMSA, MAFFT and PROBCONS. On the T box sequences, BlockMSA does the most faithful job of reproducing known annotations. MAFFT and PROBCONS do not. On the Intron sequences, BlockMSA, MAFFT and MUSCLE are comparable at identifying conserved regions. AVAILABILITY: BlockMSA is implemented in Java. Source code and supplementary datasets are available at http://aug.csres.utexas.edu/msa/


Subject(s)
Algorithms , Artificial Intelligence , Cluster Analysis , Pattern Recognition, Automated/methods , Sequence Alignment/methods , Sequence Analysis, RNA/methods , Base Sequence , Molecular Sequence Data , Sequence Homology, Nucleic Acid
11.
Bioinformatics ; 22(12): 1524-31, 2006 Jun 15.
Article in English | MEDLINE | ID: mdl-16585069

ABSTRACT

MOTIVATION: We reformulate the problem of comparing mass-spectra by mapping spectra to a vector space model. Our search method leverages a metric space indexing algorithm to produce an initial candidate set, which can be followed by any fine ranking scheme. RESULTS: We consider three distance measures integrated into a multi-vantage point index structure. Of these, a semi-metric fuzzy-cosine distance using peptide precursor mass constraints performs the best. The index acts as a coarse, lossless filter with respect to the SEQUEST and ProFound scoring schemes, reducing the number of distance computations and returned candidates for fine filtering to about 0.5% and 0.02% of the database respectively. The fuzzy cosine distance term improves specificity over a peptide precursor mass filter, reducing the number of returned candidates by an order of magnitude. Run time measurements suggest proportional speedups in overall search times. Using an implementation of ProFound's Bayesian score as an example of a fine filter on a test set of Escherichia coli protein fragmentation spectra, the top results of our sample system are consistent with those of SEQUEST.


Subject(s)
Mass Spectrometry/methods , Peptide Mapping/methods , Peptides/chemistry , Proteomics/methods , Algorithms , Databases, Protein , Escherichia coli/metabolism , Genetic Vectors , Programming Languages , Proteins/chemistry , Sequence Analysis, Protein/methods , Software
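
The abstract above maps spectra to a vector space and compares them with a cosine-based distance. The sketch below shows only the plain version of that idea in Python: peaks are binned by m/z into a sparse vector, and the distance is one minus the cosine similarity. The "fuzzy" matching and the precursor mass constraint described in the abstract are not reproduced, and the function names and the 1.0 bin width are assumptions made for illustration.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Map (m/z, intensity) peaks to a sparse vector keyed by bin index."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz // bin_width)] += intensity
    return dict(vec)


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an empty spectrum as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical peak lists: a and b fall into the same bins, c shares none.
a = bin_spectrum([(100.2, 3.0), (200.7, 4.0)])
b = bin_spectrum([(100.9, 3.0), (200.1, 4.0)])
c = bin_spectrum([(300.0, 1.0)])
```

Hard binning means two peaks just either side of a bin boundary do not match at all, which is one motivation for a tolerance-based ("fuzzy") comparison.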
12.
Article in English | MEDLINE | ID: mdl-16447992

ABSTRACT

Similarity search leveraging distance-based index structures is increasingly being used for both multimedia and biological database applications. We consider distance-based indexing for three important biological data types: protein k-mers with the metric PAM model, DNA k-mers with Hamming distance and peptide fragmentation spectra with a pseudo-metric derived from cosine distance. To date, the primary driver of this research has been multimedia applications, where similarity functions are often Euclidean norms on high dimensional feature vectors. We develop results showing that the character of these biological workloads is different from multimedia workloads. In particular, they are not intrinsically very high dimensional, and deserve different optimization heuristics. Based on MVP-trees, we develop a pivot selection heuristic seeking centers and show it outperforms the most widely used corner seeking heuristic. Similarly, we develop a data partitioning approach sensitive to the actual data distribution in lieu of median splits.


Subject(s)
Algorithms , Artificial Intelligence , Database Management Systems , Databases, Genetic , Information Storage and Retrieval/methods , Sequence Alignment/methods , Sequence Analysis/methods , Pattern Recognition, Automated/methods , Sequence Homology , User-Computer Interface
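
Distance-based indexing, as discussed in the abstract above, relies on the triangle inequality to discard candidates without computing their distance to the query. The Python sketch below illustrates that pruning step with DNA k-mers under Hamming distance. It is a flat, single-pivot table rather than an MVP-tree, and the class name `PivotIndex`, the choice of pivot, and the sample k-mers are all hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("k-mers must have equal length")
    return sum(x != y for x, y in zip(a, b))


class PivotIndex:
    """Single-pivot index that stores each item's distance to one pivot."""

    def __init__(self, items, pivot, dist=hamming):
        self.pivot = pivot
        self.dist = dist
        self.table = [(item, dist(pivot, item)) for item in items]

    def range_query(self, query, radius):
        """Return (matches, number of distance evaluations performed)."""
        dq = self.dist(self.pivot, query)
        evaluations = 1
        hits = []
        for item, dp in self.table:
            # Triangle inequality: d(query, item) >= |d(pivot, query) - d(pivot, item)|,
            # so an item whose lower bound exceeds the radius cannot match.
            if abs(dq - dp) > radius:
                continue
            evaluations += 1
            if self.dist(query, item) <= radius:
                hits.append(item)
        return hits, evaluations


# Hypothetical 4-mers; the pivot is chosen arbitrarily for the example.
kmers = ["AAAA", "AAAT", "TTTT", "TTTA", "ACGT"]
index = PivotIndex(kmers, pivot="AAAA")
hits, evaluations = index.range_query("AAAT", 1)
```

How much work is saved depends on the pivot, which is why pivot selection heuristics of the kind the abstract compares matter in practice.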
13.
Bioinformatics ; 20 Suppl 1: i355-62, 2004 Aug 04.
Article in English | MEDLINE | ID: mdl-15262820

ABSTRACT

MOTIVATION: For the purpose of identifying evolutionary reticulation events in flowering plants, we determine a large number of paired, conserved DNA oligomers that may be used as primers to amplify orthologous DNA regions using the polymerase chain reaction (PCR). RESULTS: We develop an initial candidate set by comparing the Arabidopsis and rice genomes using MoBIoS (Molecular Biological Information System). MoBIoS is a metric-space database management system targeting life science data. Through the use of metric-space indexing techniques, two genomes can be compared in O(m log n), where m and n are the lengths of the genomes, versus O(mn) for BLAST-based analysis. The filtering of low-complexity regions may also be accomplished by directly assessing the uniqueness of the region. We describe mSQL, a SQL extension being developed for MoBIoS that encapsulates the algorithmic details in a common database programming language, shielding end-users from esoteric programming. AVAILABILITY: Available upon request from authors.


Subject(s)
Arabidopsis/genetics , Chromosome Mapping/methods , Conserved Sequence/genetics , DNA Primers/genetics , Genome, Plant/genetics , Oryza/genetics , Polymerase Chain Reaction/methods , Sequence Analysis, DNA/methods , Sequence Homology, Nucleic Acid , Software
14.
Bioinformatics ; 20(8): 1214-21, 2004 May 22.
Article in English | MEDLINE | ID: mdl-14871874

ABSTRACT

MOTIVATION: We address the question of whether there exists an effective evolutionary model of amino-acid substitution that forms a metric-distance function. There is always a trade-off between speed and sensitivity among competing computational methods of determining sequence homology. A metric model of evolution is a prerequisite for the development of an entire class of fast sequence analysis algorithms that are both scalable, O(log n), and sensitive. RESULTS: We have reworked the mathematics of the point accepted mutation model (PAM) by calculating the expected time between accepted mutations in lieu of calculating log-odds probabilities. The resulting substitution matrix (mPAM) forms a metric. We validate the application of the mPAM evolutionary model for sequence homology by executing sequence queries from a controlled yeast protein homology search benchmark. We compare the accuracy of the results of mPAM and PAM similarity matrices as well as three prior metric models. The experiment shows that mPAM significantly outperforms the other three metrics and sufficiently approaches the sensitivity of PAM250 to make it applicable to the management of protein sequence databases.


Subject(s)
Amino Acid Substitution , Models, Molecular , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Computer Simulation , Evolution, Molecular , Molecular Sequence Data , Mutation , Proteins/analysis , Reproducibility of Results , Sensitivity and Specificity , Sequence Homology, Amino Acid
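
The abstract above turns on whether a substitution matrix forms a metric. For a finite alphabet that property can be checked directly against the metric axioms, as in the Python sketch below. The helper names `symmetric` and `is_metric` and the three-letter toy matrices are hypothetical; none of the values come from mPAM or PAM.

```python
from itertools import product


def symmetric(pairs, symbols):
    """Build a full distance table from unordered pair distances."""
    d = {(s, s): 0.0 for s in symbols}
    for (a, b), value in pairs.items():
        d[(a, b)] = d[(b, a)] = value
    return d


def is_metric(symbols, d, tol=1e-9):
    """Check the metric axioms for d[(a, b)] over a finite alphabet."""
    for a, b in product(symbols, repeat=2):
        if d[(a, b)] < -tol:
            return False  # non-negativity
        if (a == b) != (abs(d[(a, b)]) <= tol):
            return False  # zero distance exactly when the symbols are equal
        if abs(d[(a, b)] - d[(b, a)]) > tol:
            return False  # symmetry
    for a, b, c in product(symbols, repeat=3):
        if d[(a, c)] > d[(a, b)] + d[(b, c)] + tol:
            return False  # triangle inequality
    return True


# Hypothetical toy matrices over a three-letter alphabet.
good = symmetric({("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 2.0}, "ABC")
bad = symmetric({("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 3.0}, "ABC")
```

In the second matrix the direct A-to-C distance exceeds the path through B, so the triangle inequality fails and index pruning based on it would no longer be safe.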
15.
OMICS ; 7(1): 57-60, 2003.
Article in English | MEDLINE | ID: mdl-12838940

ABSTRACT

Biochemical databases will be best served by the development of new specialized database management systems whose storage managers are based on metric-space indexing techniques and the development of database query languages that embody semantics derived from biochemical models of similarity and evolution. Important biochemical data types cannot be effectively mapped to low dimensional coordinate systems on which O(log n) indexing methods rely. It is clear from an abundance of bioinformatic discoveries that biochemical data is not random and exhibits interesting structure with respect to clustering. Metric-space indexing exploits a data set's intrinsic clustering to speed the execution of similarity queries, even when the data cannot be mapped to a coordinate system. Database management systems that seamlessly integrate semantically rich query languages with a metric-storage and retrieval mechanism will allow biologists to simply and concisely develop informatic studies that have traditionally been large and labor intensive.


Subject(s)
Computational Biology , Programming Languages