Results 1 - 20 of 43
1.
BMC Bioinformatics ; 14: 352, 2013 Dec 04.
Article in English | MEDLINE | ID: mdl-24305467

ABSTRACT

BACKGROUND: The ever on-going technical developments in Next Generation Sequencing have led to an increase in detected disease related mutations. Many bioinformatics approaches exist to analyse these variants, and of those the methods that use 3D structure information generally outperform those that do not use this information. 3D structure information today is available for about twenty percent of the human exome, and homology modelling can double that fraction. This percentage is rapidly increasing so that we can expect to analyse the majority of all human exome variants in the near future using protein structure information. RESULTS: We collected a test dataset of well-described mutations in proteins for which 3D-structure information is available. This test dataset was used to analyse the possibilities and the limitations of methods based on sequence information alone, hybrid methods, machine learning based methods, and structure based methods. CONCLUSIONS: Our analysis shows that the use of structural features improves the classification of mutations. This study suggests strategies for future analyses of disease causing mutations, and it suggests which bioinformatics approaches should be developed to make progress in this field.


Subject(s)
Computational Biology/methods , Genetic Variation , Molecular Sequence Annotation/methods , Proteins/genetics , Artificial Intelligence , Cluster Analysis , Conserved Sequence/genetics , Databases, Genetic , Exome/genetics , Genome, Human/genetics , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/trends , Humans , Mutation/genetics , Polymorphism, Single Nucleotide/genetics , Proteins/chemistry , Sequence Alignment/trends , Sequence Homology, Amino Acid
3.
Brief Bioinform ; 14(1): 56-66, 2013 Jan.
Article in English | MEDLINE | ID: mdl-22492192

ABSTRACT

UNLABELLED: Error correction is important for most next-generation sequencing applications because highly accurate sequenced reads will likely lead to higher quality results. Many techniques for error correction of sequencing data from next-gen platforms have been developed in recent years. However, compared with the fast development of sequencing technologies, there is a lack of a standardized evaluation procedure for different error-correction methods, making it difficult to assess their relative merits and demerits. In this article, we provide a comprehensive review of many error-correction methods, and establish a common set of benchmark data and evaluation criteria to provide a comparative assessment. We present experimental results on quality, run-time, memory usage and scalability of several error-correction methods. Apart from providing explicit recommendations useful to practitioners, the review serves to identify the current state of the art and promising directions for future research. AVAILABILITY: All error-correction programs used in this article are downloaded from hosting websites. The evaluation tool kit is publicly available at: http://aluru-sun.ece.iastate.edu/doku.php?id=ecr.


Subject(s)
Sequence Analysis, DNA/trends , Software , Algorithms , Animals , Chromosome Mapping/statistics & numerical data , Chromosome Mapping/trends , Computational Biology , Databases, Genetic/statistics & numerical data , Databases, Genetic/trends , Forecasting , Humans , Sequence Alignment/statistics & numerical data , Sequence Alignment/trends , Sequence Analysis, DNA/statistics & numerical data
4.
J Genet Genomics ; 38(3): 95-109, 2011 Mar 20.
Article in English | MEDLINE | ID: mdl-21477781

ABSTRACT

This article reviews basic concepts, general applications, and the potential impact of next-generation sequencing (NGS) technologies on genomics, with particular reference to currently available and possible future platforms and bioinformatics. NGS technologies have demonstrated the capacity to sequence DNA at unprecedented speed, thereby enabling previously unimaginable scientific achievements and novel biological applications. However, the massive volume of data produced by NGS also presents a significant challenge for data storage, analyses, and management solutions. Advanced bioinformatic tools are essential for the successful application of NGS technology. As evidenced throughout this review, NGS technologies will have a striking impact on genomic research and the entire biological field. With its ability to tackle the unsolved challenges unconquered by previous genomic technologies, NGS is likely to unravel the complexity of the human genome in terms of genetic variations, some of which may be confined to susceptible loci for some common human conditions. The impact of NGS technologies on genomics will be far reaching and likely change the field for years to come.


Subject(s)
Genomics/instrumentation , Genomics/methods , Precision Medicine/trends , Sequence Analysis, DNA/instrumentation , Sequence Analysis, DNA/methods , Computational Biology , Genetic Variation , Genomics/economics , Humans , Nanotechnology , Sequence Alignment/methods , Sequence Alignment/trends , Sequence Analysis, DNA/economics , Software/trends
5.
J Hered ; 102(1): 130-8, 2011.
Article in English | MEDLINE | ID: mdl-20696667

ABSTRACT

The acquisition of large multilocus sequence data is providing researchers with an unprecedented amount of information to resolve difficult phylogenetic problems. With these large quantities of data comes the increasing challenge regarding the best methods of analysis. We review the current trends in molecular phylogenetic analysis, focusing specifically on the topics of multiple sequence alignment and methods of tree reconstruction. We suggest that traditional methods are inadequate for these highly heterogeneous data sets and that researchers employ newer, more sophisticated search algorithms in their analyses. If we are to best extract the information present in these data sets, a sound understanding of basic phylogenetic principles combined with modern methodological techniques is necessary.


Subject(s)
Databases, Genetic , Phylogeny , Sequence Alignment/trends , Sequence Analysis, DNA/trends , Algorithms , Evolution, Molecular , Genetic Loci , Models, Biological , Species Specificity
6.
Adv Exp Med Biol ; 680: 693-700, 2010.
Article in English | MEDLINE | ID: mdl-20865556

ABSTRACT

Next Generation Sequencing technologies are limited by the lack of standard bioinformatics infrastructures that can reduce data storage, increase data processing performance, and integrate diverse information. HDF technologies address these requirements and have a long history of use in data-intensive science communities. They include general data file formats, libraries, and tools for working with the data. Compared to emerging standards, such as the SAM/BAM formats, HDF5-based systems demonstrate significantly better scalability, can support multiple indexes, store multiple data types, and are self-describing. For these reasons, HDF5 and its BioHDF extension are well suited for implementing data models to support the next generation of bioinformatics applications.


Subject(s)
Sequence Alignment/statistics & numerical data , Sequence Analysis/statistics & numerical data , Computational Biology , Computer Simulation , Database Management Systems , Databases, Genetic , Sequence Alignment/standards , Sequence Alignment/trends , Sequence Analysis/standards , Sequence Analysis/trends , Software/standards , Software/trends , Software Design , User-Computer Interface
7.
RNA ; 15(9): 1623-31, 2009 Sep.
Article in English | MEDLINE | ID: mdl-19622678

ABSTRACT

Multiple sequence alignments are powerful tools for understanding the structures, functions, and evolutionary histories of linear biological macromolecules (DNA, RNA, and proteins), and for finding homologs in sequence databases. We address several ontological issues related to RNA sequence alignments that are informed by structure. Multiple sequence alignments are usually shown as two-dimensional (2D) matrices, with rows representing individual sequences, and columns identifying nucleotides from different sequences that correspond structurally, functionally, and/or evolutionarily. However, the requirement that sequences and structures correspond nucleotide-by-nucleotide is unrealistic and hinders representation of important biological relationships. High-throughput sequencing efforts are also rapidly making 2D alignments unmanageable because of vertical and horizontal expansion as more sequences are added. Solving the shortcomings of traditional RNA sequence alignments requires explicit annotation of the meaning of each relationship within the alignment. We introduce the notion of "correspondence," which is an equivalence relation between RNA elements in sets of sequences as the basis of an RNA alignment ontology. The purpose of this ontology is twofold: first, to enable the development of new representations of RNA data and of software tools that resolve the expansion problems with current RNA sequence alignments, and second, to facilitate the integration of sequence data with secondary and three-dimensional structural information, as well as other experimental information, to create simultaneously more accurate and more exploitable RNA alignments.


Subject(s)
RNA/analysis , Sequence Alignment/methods , Software , Animals , Base Sequence , Humans , Models, Biological , Molecular Sequence Data , Nucleic Acid Conformation , Phylogeny , RNA/chemistry , Sequence Alignment/trends , Sequence Analysis, RNA/methods , Sequence Homology, Nucleic Acid
8.
N Biotechnol ; 25(4): 195-203, 2009 Apr.
Article in English | MEDLINE | ID: mdl-19429539

ABSTRACT

Next-generation high-throughput DNA sequencing techniques are opening fascinating opportunities in the life sciences. Novel fields and applications in biology and medicine are becoming a reality, beyond the genomic sequencing which was the original development goal and application. Serving as examples are: personal genomics with detailed analysis of individual genome stretches; precise analysis of RNA transcripts for gene expression, surpassing and replacing in several respects analysis by various microarray platforms, for instance in reliable and precise quantification of transcripts and as a tool for identification and analysis of DNA regions interacting with regulatory proteins in functional regulation of gene expression. The next-generation sequencing technologies offer novel and rapid ways for genome-wide characterisation and profiling of mRNAs, small RNAs, transcription factor regions, structure of chromatin and DNA methylation patterns, microbiology and metagenomics. In this article, development of commercial sequencing devices is reviewed and some European contributions to the field are mentioned. Presently commercially available very high-throughput DNA sequencing platforms, as well as techniques under development, are described and their applications in bio-medical fields discussed.


Subject(s)
Chromosome Mapping/instrumentation , Chromosome Mapping/methods , Sequence Alignment/instrumentation , Sequence Alignment/methods , Sequence Analysis, DNA/instrumentation , Sequence Analysis, DNA/methods , Chromosome Mapping/trends , Sequence Alignment/trends , Sequence Analysis, DNA/trends
9.
Nat Methods ; 5(12): 989, 2008 Dec.
Article in English | MEDLINE | ID: mdl-19054852

ABSTRACT

Sequencing technology is now advanced enough to decode individual human genomes. Will it prove to be better than existing methods for discovering the genetic basis of human phenotypic variation?


Subject(s)
Chromosome Mapping/trends , Genetic Variation/genetics , Linkage Disequilibrium/genetics , Polymorphism, Single Nucleotide/genetics , Sequence Alignment/trends , Sequence Analysis, DNA/trends
10.
BMC Bioinformatics ; 9: 554, 2008 Dec 22.
Article in English | MEDLINE | ID: mdl-19102758

ABSTRACT

BACKGROUND: Multiple sequence alignments are a fundamental tool for the comparative analysis of proteins and nucleic acids. However, large data sets are no longer manageable for visualization and investigation using the traditional stacked sequence alignment representation. RESULTS: We introduce ProfileGrids that represent a multiple sequence alignment as a matrix color-coded according to the residue frequency occurring at each column position. JProfileGrid is a Java application for computing and analyzing ProfileGrids. A dynamic interaction with the alignment information is achieved by changing the ProfileGrid color scheme, by extracting sequence subsets at selected residues of interest, and by relating alignment information to residue physical properties. Conserved family motifs can be identified by the overlay of similarity plot calculations on a ProfileGrid. Figures suitable for publication can be generated from the saved spreadsheet output of the colored matrices as well as by the export of conservation information for use in the PyMOL molecular visualization program. We demonstrate the utility of ProfileGrids on 300 bacterial homologs of the RecA family - a universally conserved protein involved in DNA recombination and repair. Careful attention was paid to curating the collected RecA sequences since ProfileGrids allow the easy identification of rare residues in an alignment. We relate the RecA alignment sequence conservation to the following three topics: the recently identified DNA binding residues, the unexplored MAW motif, and a unique Bacillus subtilis RecA homolog sequence feature. CONCLUSION: ProfileGrids allow large protein families to be visualized more effectively than the traditional stacked sequence alignment form. This new graphical representation facilitates the determination of the sequence conservation at residue positions of interest, enables the examination of structural patterns by using residue physical properties, and permits the display of rare sequence features within the context of an entire alignment. JProfileGrid is free for non-commercial use and is available from http://www.profilegrid.org. Furthermore, we present a curated RecA protein collection that is more diverse than previous data sets; and, therefore, this RecA ProfileGrid is a rich source of information for nanoanatomy analysis.
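The core quantity behind a ProfileGrid, the residue frequency at each alignment column, can be illustrated with a minimal Python sketch. This is not JProfileGrid's code: the function name `profile_grid` and the four toy sequences are hypothetical, and gaps are simply counted as another symbol, which is an assumption.

```python
from collections import Counter


def profile_grid(alignment):
    """Return, for each column, a dict mapping residue -> frequency.

    `alignment` is a list of equal-length aligned sequences (hypothetical
    input format; gap characters are counted like any other symbol).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    grid = []
    for col in range(length):
        # Count how often each residue occurs in this column.
        counts = Counter(seq[col] for seq in alignment)
        grid.append({res: count / n for res, count in counts.items()})
    return grid


# Invented toy alignment, not RecA data.
toy_alignment = ["MAWK", "MAWR", "MSWK", "M-WK"]
grid = profile_grid(toy_alignment)
for position, frequencies in enumerate(grid, start=1):
    print(position, frequencies)
```

A color-coded display would then map each frequency to a color; rare residues show up as low-frequency entries in a column regardless of how many sequences the alignment contains.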


Subject(s)
Bacterial Proteins/chemistry , Multigene Family , Rec A Recombinases/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Software , Amino Acid Sequence , Molecular Sequence Data , Sequence Alignment/trends , Sequence Analysis, Protein/trends , Software/trends
11.
Nat Biotechnol ; 26(10): 1135-45, 2008 Oct.
Article in English | MEDLINE | ID: mdl-18846087

ABSTRACT

DNA sequence represents a single format onto which a broad range of biological phenomena can be projected for high-throughput data collection. Over the past three years, massively parallel DNA sequencing platforms have become widely available, reducing the cost of DNA sequencing by over two orders of magnitude, and democratizing the field by putting the sequencing capacity of a major genome center in the hands of individual investigators. These new technologies are rapidly evolving, and near-term challenges include the development of robust protocols for generating sequencing libraries, building effective new approaches to data-analysis, and often a rethinking of experimental design. Next-generation DNA sequencing has the potential to dramatically accelerate biological and biomedical research, by enabling the comprehensive analysis of genomes, transcriptomes and interactomes to become inexpensive, routine and widespread, rather than requiring significant production-scale efforts.


Subject(s)
Chromosome Mapping/trends , Forecasting , Genomics/trends , Sequence Alignment/trends , Sequence Analysis, DNA/trends
12.
Nat Biotechnol ; 26(10): 1146-53, 2008 Oct.
Article in English | MEDLINE | ID: mdl-18846088

ABSTRACT

A nanopore-based device provides single-molecule detection and analytical capabilities that are achieved by electrophoretically driving molecules in solution through a nano-scale pore. The nanopore provides a highly confined space within which single nucleic acid polymers can be analyzed at high throughput by one of a variety of means, and the perfect processivity that can be enforced in a narrow pore ensures that the native order of the nucleobases in a polynucleotide is reflected in the sequence of signals that is detected. Kilobase length polymers (single-stranded genomic DNA or RNA) or small molecules (e.g., nucleosides) can be identified and characterized without amplification or labeling, a unique analytical capability that makes inexpensive, rapid DNA sequencing a possibility. Further research and development to overcome current challenges to nanopore identification of each successive nucleotide in a DNA strand offers the prospect of 'third generation' instruments that will sequence a diploid mammalian genome for approximately $1,000 in approximately 24 h.


Subject(s)
Chromosome Mapping/trends , DNA/genetics , Forecasting , Nanostructures/chemistry , Nanotechnology/trends , Sequence Alignment/trends , Sequence Analysis, DNA/trends , DNA/chemistry , Genomics/trends , Nanostructures/ultrastructure
13.
Brief Bioinform ; 9(3): 210-9, 2008 May.
Article in English | MEDLINE | ID: mdl-18344544

ABSTRACT

Classifications of proteins into groups of related sequences are in some respects like a periodic table for biology, allowing us to understand the underlying molecular biology of any organism. Pfam is a large collection of protein domains and families. Its scientific goal is to provide a complete and accurate classification of protein families and domains. The next release of the database will contain over 10,000 entries, which leads us to reflect on how far we are from completing this work. Currently Pfam matches 72% of known protein sequences, but for proteins with known structure Pfam matches 95%, which we believe represents the likely upper bound. Based on our analysis a further 28,000 families would be required to achieve this level of coverage for the current sequence database. We also show that as more sequences are added to the sequence databases the fraction of sequences that Pfam matches is reduced, suggesting that continued addition of new families is essential to maintain its relevance.


Subject(s)
Database Management Systems/trends , Databases, Protein/trends , Information Storage and Retrieval/trends , Proteins/chemistry , Proteins/classification , Sequence Alignment/trends , Sequence Analysis, Protein/trends
14.
Brief Bioinform ; 9(4): 286-98, 2008 Jul.
Article in English | MEDLINE | ID: mdl-18372315

ABSTRACT

The accuracy and scalability of multiple sequence alignment (MSA) of DNAs and proteins have long been and are still important issues in bioinformatics. To rapidly construct a reasonable MSA, we developed the initial version of the MAFFT program in 2002. MSA software is now facing greater challenges in both scalability and accuracy than those of 5 years ago. As increasing amounts of sequence data are being generated by large-scale sequencing projects, scalability is now critical in many situations. The requirement of accuracy has also entered a new stage since the discovery of functional noncoding RNAs (ncRNAs); the secondary structure should be considered for constructing a high-quality alignment of distantly related ncRNAs. To deal with these problems, in 2007, we updated MAFFT to Version 6 with two new techniques: the PartTree algorithm and the Four-way consistency objective function. The former improved the scalability of progressive alignment and the latter improved the accuracy of ncRNA alignment. We review these and other techniques that MAFFT uses and suggest possible future directions of MSA software as a basis of comparative analyses. MAFFT is available at http://align.bmr.kyushu-u.ac.jp/mafft/software/.


Subject(s)
Algorithms , Artificial Intelligence , Pattern Recognition, Automated/trends , Sequence Alignment/trends , Sequence Analysis/trends , Software/trends , Pattern Recognition, Automated/methods , Sequence Alignment/methods , Sequence Analysis/methods
16.
J Comput Biol ; 14(5): 564-77, 2007 Jun.
Article in English | MEDLINE | ID: mdl-17683261

ABSTRACT

This paper proposes a parameterized polynomial time approximation scheme (PTAS) for aligning two protein structures, in the case where one protein structure is represented by a contact map graph and the other by a contact map graph or a distance matrix. If the sequential order of alignment is not required, the time complexity is polynomial in the protein size and exponential with respect to two parameters D(u)/D(l) and D(c)/D(l), which usually can be treated as constants. In particular, D(u) is the distance threshold determining if two residues are in contact or not, D(c) is the maximally allowed distance between two matched residues after two proteins are superimposed, and D(l) is the minimum inter-residue distance in a typical protein. This result clearly demonstrates that the computational hardness of the contact map based protein structure alignment problem is related not to protein size but to several parameters modeling the problem. The result is achieved by decomposing the protein structure using tree decomposition and discretizing the rigid-body transformation space. Preliminary experimental results indicate that on a Linux PC, it takes from ten minutes to one hour to align two proteins with approximately 100 residues.
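As a rough illustration of the contact map representation mentioned above, the following Python sketch marks two residues as in contact when their distance is at most a threshold playing the role of D(u). The function name `contact_map`, the use of one coordinate per residue, the inclusive comparison, and the toy coordinates are all assumptions; the PTAS itself is not reproduced here.

```python
import math


def contact_map(coords, d_u):
    """Return the set of residue index pairs (i, j), i < j, within d_u.

    `coords` holds one 3D point per residue (hypothetical simplification,
    e.g. one representative atom per residue).
    """
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= d_u:
                contacts.add((i, j))
    return contacts


# Invented coordinates, in arbitrary length units.
toy_coords = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (3, 4, 0)]
toy_contacts = contact_map(toy_coords, d_u=4.5)
print(sorted(toy_contacts))
```

The resulting set of pairs is the edge list of the contact map graph; a distance matrix would instead keep every pairwise distance without thresholding.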


Subject(s)
Algorithms , Computational Biology/methods , Sequence Alignment/methods , Structural Homology, Protein , Animals , Computational Biology/trends , Flavodoxin/chemistry , Humans , Protein Folding , Sequence Alignment/trends
17.
J Comput Biol ; 14(5): 594-614, 2007 Jun.
Article in English | MEDLINE | ID: mdl-17683263

ABSTRACT

We present a novel approach to managing redundancy in sequence databanks such as GenBank. We store clusters of near-identical sequences as a representative union-sequence and a set of corresponding edits to that sequence. During search, the query is compared to only the union-sequences representing each cluster; cluster members are then only reconstructed and aligned if the union-sequence achieves a sufficiently high score. Using this approach with BLAST results in a 27% reduction in collection size and a corresponding 22% decrease in search time with no significant change in accuracy. We also describe our method for clustering that uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in Information Retrieval. Our clustering approach is ten times faster on the GenBank nonredundant protein database than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new Open Source version of BLAST (available from http://www.fsa-blast.org/). As a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy.
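The idea of storing a cluster as one representative sequence plus per-member edits can be sketched in Python as follows. This is a deliberately simplified illustration, not the FSA-BLAST encoding: it assumes every member has the same length as the representative and differs only by substitutions, and the function names and toy sequences are hypothetical.

```python
def encode_member(representative, member):
    """Return the substitutions needed to turn `representative` into `member`.

    Each edit is a (position, residue) pair. Insertions and deletions are
    not handled in this simplified sketch.
    """
    if len(representative) != len(member):
        raise ValueError("this sketch only supports equal-length sequences")
    return [(i, b) for i, (a, b) in enumerate(zip(representative, member)) if a != b]


def decode_member(representative, edits):
    """Reconstruct a cluster member from the representative and its edits."""
    residues = list(representative)
    for position, residue in edits:
        residues[position] = residue
    return "".join(residues)


# Invented toy sequences.
representative = "MKTAYIAK"
member = "MKTGYIAR"
edits = encode_member(representative, member)
restored = decode_member(representative, edits)
print(edits, restored)
```

In a search setting, only the representative would be scored against the query first, and members would be reconstructed this way only when that score is high enough.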


Subject(s)
Databases, Protein , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Sequence Homology, Amino Acid , Amino Acid Sequence , Animals , Databases, Protein/trends , Humans , Molecular Sequence Data , Sequence Alignment/trends , Sequence Analysis, Protein/trends
18.
J Comput Biol ; 14(5): 637-54, 2007 Jun.
Article in English | MEDLINE | ID: mdl-17683265

ABSTRACT

Aligning proteins based on their structural similarity is a fundamental problem in molecular biology with applications in many settings, including structure classification, database search, function prediction, and assessment of folding prediction methods. Structural alignment can be done via several methods, including contact map overlap (CMO) maximization that aligns proteins in a way that maximizes the number of common residue contacts. In this paper, we develop a reduction-based exact algorithm for the CMO problem. Our approach solves CMO directly rather than after transformation to other combinatorial optimization problems. We exploit the mathematical structure of the problem in order to develop a number of efficient lower bounding, upper bounding, and reduction schemes. Computational experiments demonstrate that our algorithm runs significantly faster than existing exact algorithms and solves some hard CMO instances that were not solved in the past. In addition, the algorithm produces protein clusters that are in excellent agreement with the SCOP classification. An implementation of our algorithm is accessible as an on-line server at http://eudoxus.scs.uiuc.edu/cmos/cmos.html.
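To make the objective concrete, the Python sketch below counts the common residue contacts for one given alignment of two contact maps. It only scores a fixed alignment; it does not search for the maximizing alignment and is not the reduction-based exact algorithm described in the abstract. The function name, the dict-based alignment format, and the toy contact lists are assumptions.

```python
def contact_overlap(contacts_a, contacts_b, mapping):
    """Count contacts of protein A that map onto contacts of protein B.

    `contacts_a` and `contacts_b` are iterables of residue index pairs;
    `mapping` is a dict sending aligned residues of A to residues of B
    (hypothetical representation of an alignment).
    """
    # Store B's contacts as unordered pairs so (i, j) and (j, i) match.
    contacts_b_set = {frozenset(pair) for pair in contacts_b}
    shared = 0
    for i, j in contacts_a:
        if i in mapping and j in mapping:
            if frozenset((mapping[i], mapping[j])) in contacts_b_set:
                shared += 1
    return shared


# Invented toy contact maps and alignment.
contacts_a = [(0, 2), (1, 3), (0, 3)]
contacts_b = [(0, 2), (1, 4), (2, 4)]
alignment = {0: 0, 1: 1, 2: 2, 3: 4}
overlap = contact_overlap(contacts_a, contacts_b, alignment)
print(overlap)
```

CMO maximization asks for the order-preserving mapping that makes this count as large as possible, which is what makes the problem combinatorially hard.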


Subject(s)
Algorithms , Sequence Alignment , Sequence Analysis, Protein , Structural Homology, Protein , Animals , Bacterial Proteins/chemistry , Bacterial Proteins/genetics , Computational Biology/trends , Models, Chemical , Sequence Alignment/methods , Sequence Alignment/trends , Sequence Analysis, Protein/methods , Sequence Analysis, Protein/trends
19.
J Comput Biol ; 14(5): 655-68, 2007 Jun.
Article in English | MEDLINE | ID: mdl-17683266

ABSTRACT

Long-range correlations in genomic base composition are a ubiquitous statistical feature among many eukaryotic genomes. In this article, these correlations are shown to substantially influence the statistics of sequence alignment scores. Using a Gaussian approximation to model the correlated score landscape, we calculate the corrections to the scale parameter lambda of the extreme value distribution of alignment scores. Our approximate analytic results are supported by a detailed numerical study based on a simple algorithm to efficiently generate long-range correlated random sequences. We find both the mean and the exponential tail of the score distribution for long-range correlated sequences to be substantially shifted compared to random sequences with independent nucleotides. The significance of measured alignment scores will therefore change upon incorporation of the correlations in the null model. We discuss the magnitude of this effect in a biological context.
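For orientation, the extreme value (Gumbel) form commonly used for local alignment score statistics under a null model of independent letters can be written as below. This is the standard Karlin-Altschul expression, given here as background rather than as a formula quoted from the article; the symbols K (a prefactor), m, and n (the two sequence lengths) are introduced for illustration and do not appear in the abstract.

```latex
P(S \ge x) \;\approx\; 1 - \exp\!\left(-K\, m\, n\, e^{-\lambda x}\right)
```

Because the scale parameter \lambda sits in the exponent, even a small correction to \lambda changes the estimated significance of a high score considerably, which is consistent with the abstract's point that correlations in the null model matter.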


Subject(s)
Computer Simulation , Models, Genetic , Models, Statistical , Sequence Alignment/statistics & numerical data , Sequence Analysis, DNA/statistics & numerical data , Sequence Homology, Nucleic Acid , Animals , Humans , Sequence Alignment/methods , Sequence Alignment/trends , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/trends
20.
BMC Bioinformatics ; 8: 298, 2007 Aug 09.
Article in English | MEDLINE | ID: mdl-17688688

ABSTRACT

BACKGROUND: Approximately 5% of Pfam families are enzymatic, but only a small fraction of the sequences within these families (<0.5%) have had the residues responsible for catalysis determined. To increase the active site annotations in the Pfam database, we have developed a strict set of rules, chosen to reduce the rate of false positives, which enable the transfer of experimentally determined active site residue data to other sequences within the same Pfam family. DESCRIPTION: We have created a large database of predicted active site residues. On comparing our active site predictions to those found in UniProtKB, Catalytic Site Atlas, PROSITE and MEROPS we find that we make many novel predictions. On investigating the small subset of predictions made by these databases that are not predicted by us, we found these sequences did not meet our strict criteria for prediction. We assessed the sensitivity and specificity of our methodology and estimate that only 3% of our predicted sequences are false positives. CONCLUSION: We have predicted 606110 active site residues, of which 94% are not found in UniProtKB, and have increased the active site annotations in Pfam by more than 200 fold. Although implemented for Pfam, the tool we have developed for transferring the data can be applied to any alignment with associated experimental active site data and is available for download. Our active site predictions are re-calculated at each Pfam release to ensure they are comprehensive and up to date. They provide one of the largest available databases of active site annotation.


Subject(s)
Databases, Protein , Amino Acid Sequence , Binding Sites , Databases, Protein/trends , Molecular Sequence Data , Predictive Value of Tests , Sequence Alignment/methods , Sequence Alignment/trends , Sequence Homology, Amino Acid , Software Design