Results 1 - 17 of 17
1.
BMC Bioinformatics ; 18(Suppl 16): 575, 2017 12 28.
Article in English | MEDLINE | ID: mdl-29297307

ABSTRACT

BACKGROUND: In current statistical methods for calling differentially expressed genes in RNA-Seq experiments, the assumption is that an adjusted observed gene count represents an unknown true gene count. This adjustment usually consists of a normalization step to account for heterogeneous sample library sizes, and then the resulting normalized gene counts are used as input for parametric or non-parametric differential gene expression tests. A distribution of true gene counts, each with a different probability, can result in the same observed gene count. Importantly, sequencing coverage information is currently not explicitly incorporated into any of the statistical models used for RNA-Seq analysis. RESULTS: We developed a fast Bayesian method which uses the sequencing coverage information determined from the concentration of an RNA sample to estimate the posterior distribution of a true gene count. Our method performs better than or comparably to NOISeq and GFOLD, according to the results from simulations and experiments with real unreplicated data. We incorporated a previously unused sequencing coverage parameter into a procedure for differential gene expression analysis with RNA-Seq data. CONCLUSIONS: Our results suggest that our method can be used to overcome analytical bottlenecks in experiments with a limited number of replicates and low sequencing coverage. The method is implemented in CORNAS (Coverage-dependent RNA-Seq), and is available at https://github.com/joel-lzb/CORNAS.


Subject(s)
Databases, Genetic , Gene Expression Regulation , Sequence Analysis, RNA/methods , Area Under Curve , Bayes Theorem , Computer Simulation , Gene Expression Profiling , Humans , Organ Specificity/genetics , Polymerase Chain Reaction , Predictive Value of Tests , RNA/genetics , ROC Curve , Reproducibility of Results
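The Bayesian idea in this abstract, that many true gene counts can produce the same observed count, can be illustrated with a minimal sketch. It assumes, purely for illustration, that the observed count is a binomial sample of the true count with success probability equal to the sequencing coverage, and that the prior on the true count is flat. The CORNAS model itself may differ; the function name and the `n_max` truncation are hypothetical.

```python
import math

def true_count_posterior(observed, coverage, n_max):
    """Posterior over the true count N given an observed count.

    Assumed model (illustrative only): observed ~ Binomial(N, coverage),
    flat prior on N over [observed, n_max]. coverage must lie in (0, 1).
    """
    support = range(observed, n_max + 1)
    log_weights = []
    for n in support:
        # log of C(n, observed), computed with lgamma to avoid overflow
        log_binom = (math.lgamma(n + 1) - math.lgamma(observed + 1)
                     - math.lgamma(n - observed + 1))
        log_weights.append(log_binom
                           + observed * math.log(coverage)
                           + (n - observed) * math.log1p(-coverage))
    peak = max(log_weights)
    weights = [math.exp(v - peak) for v in log_weights]
    total = sum(weights)
    return {n: w / total for n, w in zip(support, weights)}

def posterior_mean(posterior):
    """Expected true count under the posterior."""
    return sum(n * p for n, p in posterior.items())
```

Under these assumptions a lower coverage spreads the posterior over a wider range of true counts, which is why two samples with the same observed count but different coverage need not be treated alike.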
2.
PLoS One ; 4(6): e5861, 2009 Jun 10.
Article in English | MEDLINE | ID: mdl-19516900

ABSTRACT

Allergy is a major health problem in industrialized countries. The number of transgenic food crops is growing rapidly, creating the need for allergenicity assessment before they are introduced into the human food chain. While existing bioinformatic methods have achieved good accuracies for highly conserved sequences, the discrimination of allergens and non-allergens from allergen-like non-allergen sequences remains difficult. We describe AllerHunter, a web-based computational system for the assessment of potential allergenicity and allergic cross-reactivity in proteins. It combines an iterative pairwise sequence similarity encoding scheme with SVM as the discriminating engine. The pairwise vectorization framework allows the system to model essential features in allergens that are involved in cross-reactivity, but not limited to distinct sets of physicochemical properties. The system was rigorously trained and tested using 1,356 known allergen and 13,449 putative non-allergen sequences. Extensive testing was performed for validation of the prediction models. The system is effective for distinguishing allergens and non-allergens from allergen-like non-allergen sequences. Testing results showed that AllerHunter, with a sensitivity of 83.4% and specificity of 96.4% (accuracy = 95.3%, area under the receiver operating characteristic curve AROC = 0.928 ± 0.004 and Matthews correlation coefficient MCC = 0.738), performs significantly better than a number of existing methods using an independent dataset of 1,443 protein sequences. AllerHunter is available at http://tiger.dbs.nus.edu.sg/AllerHunter.


Subject(s)
Allergens/chemistry , Computational Biology/methods , Hypersensitivity/diagnosis , Hypersensitivity/genetics , Algorithms , Databases, Protein , Humans , Models, Statistical , Protein Folding , ROC Curve , Reproducibility of Results , Sensitivity and Specificity , Sequence Analysis, Protein , Software
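The performance figures quoted in this abstract (sensitivity, specificity, accuracy, Matthews correlation coefficient) all derive from a 2x2 confusion matrix. The sketch below shows the standard definitions; the function name is hypothetical, and the counts in any call are the caller's own, since the abstract does not report the underlying confusion matrix.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # fraction of allergens detected
    specificity = tn / (tn + fp)            # fraction of non-allergens rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty;
    # returning 0.0 in that case is a common convention, not a rule.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

On a class-imbalanced test set, accuracy is dominated by the majority class, which is one reason to report MCC alongside it.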
3.
J Pharmacol Exp Ther ; 330(1): 304-15, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19357322

ABSTRACT

Low target discovery rate has been linked to inadequate consideration of multiple factors that collectively contribute to druggability. These factors include sequence, structural, physicochemical, and systems profiles. Methods individually exploring each of these profiles for target identification have been developed, but they have not been collectively used. We evaluated the collective capability of these methods in identifying promising targets from 1019 research targets based on the multiple profiles of up to 348 successful targets. The collective method combining at least three profiles identified 50, 25, 10, and 4% of the 30, 84, 41, and 864 phase III, II, I, and nonclinical trial targets, respectively, as promising, including eight to nine targets with positive phase III results. This method dropped 89% of the 19 discontinued clinical trial targets and 97% of the 65 targets that failed in high-throughput screening or knockout studies. Collective consideration of multiple profiles demonstrated promising potential in identifying innovative targets.


Subject(s)
Chemical Phenomena/drug effects , Drug Delivery Systems/methods , Drug Discovery/methods , Gene Targeting/methods , Animals , Clinical Trials as Topic/methods , Clinical Trials as Topic/trends , Drug Delivery Systems/trends , Drug Design , Drug Discovery/trends , Gene Targeting/trends , Humans , Structure-Activity Relationship
4.
BMC Bioinformatics ; 10: 80, 2009 Mar 06.
Article in English | MEDLINE | ID: mdl-19267900

ABSTRACT

BACKGROUND: DNA copy number variation (CNV) has been recognized as an important source of genetic variation. Array comparative genomic hybridization (aCGH) is commonly used for CNV detection, but the microarray platform has a number of inherent limitations. RESULTS: Here, we describe a method to detect copy number variation using shotgun sequencing, CNV-seq. The method is based on a robust statistical model that describes the complete analysis procedure and allows the computation of essential confidence values for detection of CNV. Our results show that the number of reads, not the length of the reads, is the key factor determining the resolution of detection. This favors the next-generation sequencing methods that rapidly produce large amounts of short reads. CONCLUSION: Simulation of various sequencing methods with coverage between 0.1x and 8x shows overall specificity between 91.7% and 99.9%, and sensitivity between 72.2% and 96.5%. We also show the results for assessment of CNV between two individual human genomes.


Subject(s)
Algorithms , Genetic Variation/genetics , Sequence Analysis, DNA/methods , DNA/chemistry , Gene Dosage , Genome, Human , Genomics/methods , Humans
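The read-count comparison that underlies this kind of CNV detection can be sketched as follows. This is a simplified illustration, not the CNV-seq statistical model: it assumes fixed, non-overlapping windows, normalizes only by total read count, and omits the confidence values the abstract mentions. The function and parameter names are hypothetical.

```python
import math
from collections import Counter

def window_log2_ratios(test_positions, ref_positions, window):
    """log2 ratio of test to reference read counts per fixed-size window.

    Positions are read start coordinates on one chromosome. Windows with
    no reads in either sample are skipped.
    """
    test_counts = Counter(p // window for p in test_positions)
    ref_counts = Counter(p // window for p in ref_positions)
    # Rescale so that samples of different sequencing depth are comparable.
    scale = len(ref_positions) / len(test_positions)
    ratios = {}
    for w in sorted(set(test_counts) & set(ref_counts)):
        ratios[w] = math.log2(test_counts[w] * scale / ref_counts[w])
    return ratios
```

Because each window's ratio is estimated from read counts, its variance shrinks as more reads fall in the window, which is consistent with the abstract's point that read number rather than read length limits resolution.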
5.
Bioinformatics ; 25(7): 979-80, 2009 Apr 01.
Article in English | MEDLINE | ID: mdl-19213741

ABSTRACT

SUMMARY: A variety of specialist databases have been developed to facilitate the study of allergens. However, these databases either contain different subsets of allergen data or are deficient in tools for assessing potential allergenicity of proteins. Here, we describe Allergen Atlas, a comprehensive repository of experimentally validated allergen sequences collected from in-house laboratory, online data submission, literature reports and all existing general-purpose and specialist databases. Each entry was manually verified, classified and hyperlinked to major databases including Swiss-Prot, Protein Data Bank (PDB), Gene Ontology (GO), Pfam and PubMed. The database is integrated with analysis tools that include: (i) keyword search, (ii) BLAST, (iii) position-specific iterative BLAST (PSI-BLAST), (iv) FAO/WHO criteria search, (v) graphical representation of allergen information network and (vi) online data submission. The latest version contains information of 1593 allergen sequences (496 IUIS allergens, 978 experimentally verified allergens and 119 new sequences), 56 IgE epitope sequences, 679 links to PDB structures and 155 links to Pfam domains. AVAILABILITY: Allergen Atlas is freely available at http://tiger.dbs.nus.edu.sg/ATLAS/.


Subject(s)
Allergens/chemistry , Databases, Protein , Proteins/immunology , Computational Biology , Information Storage and Retrieval , Internet , Proteins/chemistry
6.
BMC Bioinformatics ; 9 Suppl 12: S21, 2008 Dec 12.
Article in English | MEDLINE | ID: mdl-19091021

ABSTRACT

BACKGROUND: Bioinformatics tools are commonly used for assessing potential protein allergenicity. While these methods have achieved good accuracies for highly conserved sequences, they are less effective when the overall similarity is low. In this study, we assessed the feasibility of using position-specific scoring matrices as a basis for predicting potential allergenicity in proteins. RESULTS: Two simple methods for predicting potential allergenicity in proteins, based on general and group-specific allergen profiles, are presented. Testing results indicate that the performances of both methods are comparable to the best results of other methods. The group-specific profile approach, with a sensitivity of 84.04% and specificity of 96.52%, gives similar results as those obtained using the general profile approach (sensitivity = 82.45%, specificity = 96.92%). CONCLUSION: We show that position-specific scoring matrices are highly promising for constructing computational models suitable for allergenicity assessment. These data suggest it may be possible to apply a targeted approach for allergenicity assessment based on the profiles of allergens of interest.


Subject(s)
Computational Biology/methods , Proteins/chemistry , Proteins/immunology , Algorithms , Allergens , Animals , Databases, Factual , Databases, Protein , False Positive Reactions , Humans , Hypersensitivity, Immediate/diagnosis , Hypersensitivity, Immediate/immunology , Models, Statistical , Predictive Value of Tests , Reproducibility of Results , Sequence Analysis, Protein , Software
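A position-specific scoring matrix of the kind this abstract builds on can be sketched in a few lines. The version below is a textbook log-odds construction under simplifying assumptions (ungapped alignment, uniform background frequencies, a flat pseudocount, no sequence weighting); it is not the authors' profile-building procedure, and the function names are hypothetical.

```python
import math

def build_pssm(aligned, alphabet, pseudocount=1.0):
    """Log-odds PSSM from equal-length, ungapped aligned sequences."""
    background = 1.0 / len(alphabet)    # assumption: uniform background
    pssm = []
    for col in range(len(aligned[0])):
        counts = {a: pseudocount for a in alphabet}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({a: math.log2(counts[a] / total / background)
                     for a in alphabet})
    return pssm

def score(pssm, seq):
    """Sum of per-position log-odds scores for a candidate sequence."""
    return sum(column[residue] for column, residue in zip(pssm, seq))
```

A candidate whose residues match the conserved columns of an allergen profile scores high even when its overall sequence identity to any single allergen is low, which is the situation the abstract targets.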
7.
Front Biosci ; 13: 6072-8, 2008 May 01.
Article in English | MEDLINE | ID: mdl-18508644

ABSTRACT

The constant increase in atopic allergy and other hypersensitivity reactions has intensified the need for successful therapeutic approaches. Existing bioinformatic tools for predicting allergenic potential are primarily based on sequence similarity searches along the entire protein sequence and do not address the dual issues of conformational and overlapping B-cell epitope recognition sites. In this study, we report AllerPred, a computational system that is capable of capturing multiple overlapping continuous and discontinuous B-cell epitope binding patterns in allergenic proteins using SVM as its prediction engine. A novel representation of local protein sequence descriptors enables the system to model multiple overlapping continuous and discontinuous B-cell epitope binding patterns within a protein sequence. The model was rigorously trained and tested using 669 IUIS allergens and 1237 non-allergens. Testing results showed that the area under the receiver operating curve (AROC) of SVM models is 0.81, with 76% sensitivity at a specificity of 76%. This approach consistently outperforms existing allergenicity prediction systems using a standardized testing dataset of experimentally validated allergens and non-allergen sequences.


Subject(s)
Allergens , Models, Immunological , Proteins/chemistry , Proteins/immunology , Algorithms , Amino Acid Sequence , Computational Biology , Epitopes/chemistry , Epitopes/immunology , Predictive Value of Tests
8.
BMC Genomics ; 8: 391, 2007 Oct 26.
Article in English | MEDLINE | ID: mdl-17963481

ABSTRACT

BACKGROUND: Repeats are present in all genomes, and often have important functions. However, in large genome sequencing projects, many repetitive regions remain uncharacterized. The genome of the protozoan parasite Trypanosoma cruzi consists of more than 50% repeats. These repeats include surface molecule genes, and several other gene families. In the T. cruzi genome sequencing project, it was clear that not all copies of repetitive genes were present in the assembly, due to collapse of nearly identical repeats. However, at the time of publication of the T. cruzi genome, it was not clear to what extent this had occurred. RESULTS: We have developed a pipeline to estimate the genomic repeat content, where shotgun reads are aligned to the genomic sequence and the gene copy number is estimated using the average shotgun coverage. This method was applied to the genome of T. cruzi and copy numbers of all protein coding sequences and pseudogenes were estimated. The 22,640 results were stored in a database available online. 18% of all protein coding sequences and pseudogenes were estimated to exist in 14 or more copies in the T. cruzi CL Brener genome. The average coverage of the annotated protein coding sequences and pseudogenes indicates a total gene copy number, including allelic gene variants, of over 40,000. CONCLUSION: Our results indicate that the number of protein coding sequences and pseudogenes in the T. cruzi genome may be twice the previous estimate. We have constructed a database of the T. cruzi gene repeat data that is available as a resource to the community. The main purpose of the database is to enable biologists interested in repeated, unfinished regions to closely examine and resolve these regions themselves using all available shotgun data, instead of having to rely on annotated consensus sequences that often are erroneous and possibly misleading. Five repetitive genes were studied in more detail, in order to illustrate how the database can be used to analyze and extract information about gene repeats with different characteristics in Trypanosoma cruzi.


Subject(s)
Databases, Genetic , Genetic Variation , Repetitive Sequences, Nucleic Acid , Trypanosoma cruzi/genetics , Amino Acid Sequence , Animals , Antigens, Surface/genetics , Conserved Sequence , DNA, Protozoan , Gene Amplification , Gene Dosage , Genes, Protozoan/physiology , Genome, Protozoan , Membrane Proteins/genetics , Models, Biological , Molecular Sequence Data , Sequence Homology, Amino Acid
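The coverage-based copy-number estimate described in this abstract reduces to a ratio, sketched below. The sketch assumes a known baseline coverage for single-copy sequence and ignores partial read overlaps and mapping ambiguity; the function and parameter names are hypothetical, and the authors' pipeline may handle these details differently.

```python
def estimate_copy_number(aligned_bases_per_read, gene_length,
                         single_copy_coverage):
    """Estimate gene copy number from shotgun coverage.

    aligned_bases_per_read: bases of each read that align within the gene.
    single_copy_coverage: average coverage expected for a single-copy region.
    """
    gene_coverage = sum(aligned_bases_per_read) / gene_length
    return gene_coverage / single_copy_coverage
```

For example, a 1,000 bp gene covered by 40 reads of 500 aligned bases each has 20x coverage; against a 10x single-copy baseline this suggests about two copies collapsed into one assembled locus.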
9.
Comput Methods Programs Biomed ; 86(1): 87-92, 2007 Apr.
Article in English | MEDLINE | ID: mdl-17292508

ABSTRACT

Modern alignment methods designed to work rapidly and efficiently with large datasets often do so at the cost of method sensitivity. To overcome this, we have developed a novel alignment program, GRAT, built to accurately align short, highly similar DNA sequences. The program runs rapidly and requires no more memory and CPU power than a desktop computer. In addition, specificity is ensured by statistically separating the true alignments from spurious matches using phred quality values. An efficient separation is especially important when searching large datasets and whenever there are repeats present in the dataset. Results are superior in comparison to widely used existing software, and analysis of two large genomic datasets shows the usefulness and scalability of the algorithm.


Subject(s)
Sequence Alignment/instrumentation , Sequence Analysis, DNA , Software Design , Algorithms , Animals , Chickens
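The phred quality values mentioned in this abstract map to base-call error probabilities through the standard relation P = 10^(-Q/10). The sketch below applies that relation to the mismatching positions of an alignment. Treating mismatches as independent and multiplying their error probabilities is an illustrative assumption, not GRAT's documented statistic, and the function names are hypothetical.

```python
def phred_to_error_prob(quality):
    """Base-call error probability for a phred quality score."""
    return 10 ** (-quality / 10)

def prob_all_mismatches_are_errors(mismatch_qualities):
    """Probability that every mismatch is a sequencing error.

    Assumes errors at different positions are independent. A low value
    suggests the mismatches are real differences, for example between
    repeat copies, rather than noise.
    """
    prob = 1.0
    for q in mismatch_qualities:
        prob *= phred_to_error_prob(q)
    return prob
```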
10.
Bioinformatics ; 23(4): 504-6, 2007 Feb 15.
Article in English | MEDLINE | ID: mdl-17150996

ABSTRACT

UNLABELLED: Assessment of potential allergenicity and patterns of cross-reactivity is necessary whenever novel proteins are introduced into the human food chain. Current bioinformatic methods in allergology focus mainly on the prediction of allergenic proteins, with no information on cross-reactivity patterns among known allergens. In this study, we present AllerTool, a web server with essential tools for the assessment of predicted as well as published cross-reactivity patterns of allergens. The analysis tools include graphical representation of allergen cross-reactivity information; a local sequence comparison tool that displays information of known cross-reactive allergens; a sequence similarity search tool for assessment of cross-reactivity in accordance with FAO/WHO Codex Alimentarius guidelines; and a method based on support vector machine (SVM). Ten-fold cross-validation results showed that the area under the receiver operating curve (A(ROC)) of SVM models is 0.90, with 86.00% sensitivity (SE) at a specificity (SP) of 86.00%. AVAILABILITY: AllerTool is freely available at http://research.i2r.a-star.edu.sg/AllerTool/.


Subject(s)
Allergens/chemistry , Allergens/immunology , Cross Reactions/immunology , Proteins/chemistry , Proteins/immunology , Sequence Analysis, Protein/methods , Software , Algorithms , Amino Acid Sequence , Databases, Protein , Molecular Sequence Data , User-Computer Interface
11.
BMC Bioinformatics ; 7: 155, 2006 Mar 20.
Article in English | MEDLINE | ID: mdl-16549006

ABSTRACT

BACKGROUND: Many genome projects are left unfinished due to complex, repeated regions. Finishing is the most time consuming step in sequencing and current finishing tools are not designed with particular attention to the repeat problem. RESULTS: We have developed DNPTrapper, a shotgun sequence finishing tool, specifically designed to address the problems posed by the presence of repeated regions in the target sequence. The program detects and visualizes single base differences between nearly identical repeat copies, and offers the overview and flexibility needed to rapidly resolve complex regions within a working session. The use of a database allows large amounts of data to be stored and handled, and allows viewing of mammalian size genomes. The program is available under an Open Source license. CONCLUSION: With DNPTrapper, it is possible to separate repeated regions that previously were considered impossible to resolve, and finishing tasks that previously took days or weeks can be resolved within hours or even minutes.


Subject(s)
Algorithms , DNA/genetics , Documentation/methods , Repetitive Sequences, Nucleic Acid/genetics , Sequence Analysis, DNA/methods , Software , User-Computer Interface , Base Sequence , DNA/analysis , DNA/chemistry , Molecular Sequence Data
12.
BMC Bioinformatics ; 7 Suppl 5: S20, 2006 Dec 18.
Article in English | MEDLINE | ID: mdl-17254305

ABSTRACT

BACKGROUND: The accurate prediction of a comprehensive set of messenger RNAs (targets) regulated by animal microRNAs (miRNAs) remains an open problem. In particular, the prediction of targets that do not possess evolutionarily conserved complementarity to their miRNA regulators is not adequately addressed by current tools. RESULTS: We have developed MicroTar, an animal miRNA target prediction tool based on miRNA-target complementarity and thermodynamic data. The algorithm uses predicted free energies of unbound mRNA and putative mRNA-miRNA heterodimers, implicitly addressing the accessibility of the mRNA 3' untranslated region. MicroTar does not rely on evolutionary conservation to discern functional targets, and is able to predict both conserved and non-conserved targets. MicroTar source code and predictions are accessible at http://tiger.dbs.nus.edu.sg/microtar/, where both serial and parallel versions of the program can be downloaded under an open-source licence. CONCLUSION: MicroTar achieves better sensitivity than previously reported predictions when tested on three distinct datasets of experimentally-verified miRNA-target interactions in C. elegans, Drosophila, and mouse.


Subject(s)
MicroRNAs/chemistry , Nucleic Acid Heteroduplexes/chemistry , RNA, Messenger/chemistry , Software , Algorithms , Animals , Caenorhabditis elegans , Drosophila melanogaster , Mice , Nucleic Acid Conformation
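The complementarity half of the approach in this abstract can be illustrated with a seed-match search. The sketch below assumes the commonly used convention that the seed is miRNA nucleotides 2 to 8 and allows only Watson-Crick pairs; it omits the free-energy calculations that the abstract describes as central to MicroTar, so it is a starting point for intuition rather than a rendering of that algorithm. All names are hypothetical.

```python
# Watson-Crick complements for RNA; G:U wobble pairs are not considered.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_sites(mirna, utr, seed_start=1, seed_end=8):
    """Return 0-based UTR positions where a perfect seed match begins.

    The default slice [1:8] selects miRNA nucleotides 2-8 (an assumed
    seed definition). The target site is the reverse complement of the
    seed, because the miRNA pairs antiparallel to the mRNA.
    """
    seed = mirna[seed_start:seed_end]
    site = "".join(COMPLEMENT[base] for base in reversed(seed))
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]
```

A thermodynamic method would then rank such candidate sites by the predicted free energy of the miRNA-mRNA duplex relative to the unbound mRNA, which is the part left out here.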
13.
Proc Natl Acad Sci U S A ; 102(36): 12891-6, 2005 Sep 06.
Article in English | MEDLINE | ID: mdl-16118271

ABSTRACT

The identification of new virus species is a key issue for the study of infectious disease but is technically very difficult. We developed a system for large-scale molecular virus screening of clinical samples based on host DNA depletion, random PCR amplification, large-scale sequencing, and bioinformatics. The technology was applied to pooled human respiratory tract samples. The first experiments detected seven human virus species without the use of any specific reagent. Among the detected viruses were one coronavirus and one parvovirus, both of which were at that time uncharacterized. The parvovirus, provisionally named human bocavirus, was in a retrospective clinical study detected in 17 additional patients and associated with lower respiratory tract infections in children. The molecular virus screening procedure provides a general culture-independent solution to the problem of detecting unknown virus species in single or pooled samples. We suggest that a systematic exploration of the viruses that infect humans, "the human virome," can be initiated.


Subject(s)
Parvovirus/genetics , Parvovirus/isolation & purification , Respiratory Tract Diseases/virology , Child, Preschool , Cloning, Molecular , Female , Genome, Viral , Humans , Incidence , Infant , Male , Molecular Sequence Data , Nasopharynx/virology , Parvovirus/physiology , Phylogeny
14.
Bioinformatics ; 20(5): 803-4, 2004 Mar 22.
Article in English | MEDLINE | ID: mdl-14751967

ABSTRACT

UNLABELLED: Finishing, i.e. gap closure and editing, is the most time-consuming part of genome sequencing. Repeated sequences together with sequencing errors complicate the assembly and often result in misassemblies that are difficult to correct. Repeat Discrepancy Tagger (ReDiT) is a tool designed to aid in the finishing step. This software processes assembly results produced by any fragment assembly program that outputs ace files. The input sequences are analyzed to determine possible differences between repeated sequences. The output is written as tags in an ace file that can be viewed by, e.g. the Consed sequence editor. AVAILABILITY: The ReDiT program is freely available at http://web.cgb.ki.se/redit


Subject(s)
Chromosome Mapping/methods , Documentation/methods , Expressed Sequence Tags , Repetitive Sequences, Nucleic Acid/genetics , Sequence Analysis, DNA/methods , Software , User-Computer Interface , Algorithms , Base Sequence , Computer Graphics , Gene Expression Profiling , Genome , Molecular Sequence Data , Sequence Alignment/methods , Word Processing/methods
15.
Nucleic Acids Res ; 31(15): 4663-72, 2003 Aug 01.
Article in English | MEDLINE | ID: mdl-12888528

ABSTRACT

Sequencing errors in combination with repeated regions cause major problems in shotgun sequencing, mainly due to the failure of assembly programs to distinguish single base differences between repeat copies from erroneous base calls. In this paper, a new strategy designed to correct errors in shotgun sequence data using defined nucleotide positions, DNPs, is presented. The method distinguishes single base differences from sequencing errors by analyzing multiple alignments consisting of a read and all its overlaps with other reads. The construction of multiple alignments is performed using a novel pattern matching algorithm, which takes advantage of the symmetry between indices that can be computed for similar words of the same length. This allows for rapid construction of multiple alignments, with no previous pair-wise matching of sequence reads required. Results from a C++ implementation of this method show that up to 99% of sequencing errors can be corrected, while up to 87% of the single base differences remain and up to 80% of the corrected reads contain at most one error. The results also show that the method outperforms the error correction method used in the EULER assembler. The prototype software, MisEd, is freely available from the authors for academic use.


Subject(s)
Sequence Analysis, DNA/methods , Algorithms , Genome , Repetitive Sequences, Nucleic Acid , Sequence Alignment/methods , Software , Time Factors
16.
Comput Methods Programs Biomed ; 70(1): 47-59, 2003 Jan.
Article in English | MEDLINE | ID: mdl-12468126

ABSTRACT

The software commonly used for assembly of shotgun sequence data has several limitations. One such limitation becomes obvious when repetitive sequences are encountered. Shotgun assembly is a difficult task, even for non-repetitive regions, but the use of quality assessments of the data and efficient matching algorithms have made it possible to assemble most sequences efficiently. In the case of highly repetitive sequences, however, these algorithms fail to distinguish between sequencing errors and single base differences in regions containing nearly identical repeats. None of the currently available fragment assembly programs are able to correctly assemble highly similar repetitive data, and we, therefore, present a novel shotgun assembly program, Tandem Repeat Assembly Program (TRAP). The main feature of this program is the ability to separate long repetitive regions from each other by distinguishing single base substitutions as well as insertions/deletions from sequencing errors. This is accomplished by using a novel multiple-alignment based analysis method. Since repeats are a common complication in most sequencing projects, this software should be of use for the whole sequencing community.


Subject(s)
Tandem Repeat Sequences , Algorithms , Software
17.
Bioinformatics ; 18(3): 379-88, 2002 Mar.
Article in English | MEDLINE | ID: mdl-11934736

ABSTRACT

An increasingly important problem in genome sequencing is the failure of the commonly used shotgun assembly programs to correctly assemble repetitive sequences. The assembly of non-repetitive regions or regions containing repeats considerably shorter than the average read length is in practice easy to solve, while longer repeats have been a difficult problem. We here present a statistical method to separate arbitrarily long, almost identical repeats, which makes it possible to correctly assemble complex repetitive sequence regions. The differences between repeat units may be as low as 1% and the sequencing error may be up to ten times higher. The method is based on the realization that a comparison of only a part of all overlapping sequences at a time in a data set does not generate enough information for a conclusive analysis. Our method uses optimal multi-alignments consisting of all the overlaps of each read. This makes it possible to determine defined nucleotide positions, DNPs, which constitute the differences between the repeat units. Differences between repeats are distinguished from sequencing errors using statistical methods, where the probabilities of obtaining certain combinations of candidate DNPs are calculated using the information from the multi-alignments. The use of DNPs and combinations of DNPs will allow for optimal and rapid assemblies of repeated regions. This method can solve repeats that differ in only two positions in a read length, which is the theoretical limit for repeat separation. We predict that this method will be highly useful in shotgun sequencing in the future.


Subject(s)
Computational Biology/methods , Computer Simulation , Models, Statistical , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/statistics & numerical data , Algorithms , Base Sequence , Cluster Analysis , Deoxyribonucleoproteins/genetics , Feasibility Studies , Models, Genetic , Molecular Sequence Data , Repetitive Sequences, Nucleic Acid/genetics , Sensitivity and Specificity , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data
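The statistical core of this abstract, deciding whether a deviating base in a multi-alignment column is a repeat difference or a sequencing error, can be approximated with a binomial tail test. The sketch below is a single-column simplification: the abstract describes computing probabilities for combinations of candidate DNPs across columns, which is not reproduced here. The error model (independent errors, each specific wrong base equally likely) and all names are assumptions.

```python
import math
from collections import Counter

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k, n + 1))

def is_candidate_dnp(column, error_rate, alpha=1e-3):
    """Flag an alignment column as a candidate defined nucleotide position.

    column: string of bases, one per overlapping read at this position.
    A column is flagged when the second most common base occurs more
    often than sequencing errors alone would plausibly explain.
    """
    counts = Counter(column).most_common()
    if len(counts) < 2:
        return False                      # all reads agree
    minority_count = counts[1][1]
    p_specific = error_rate / 3           # one particular wrong base
    return binomial_tail(len(column), minority_count, p_specific) < alpha
```

A single deviating read in a deep column is usually consistent with error, whereas several reads sharing the same deviation are not; requiring agreement across several such columns is what gives the published method its power.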