Results 1 - 20 of 44
1.
PLoS One ; 18(6): e0286074, 2023.
Article in English | MEDLINE | ID: mdl-37279196

ABSTRACT

Compression as an accelerant of computation is increasingly recognized as an important component in engineering fast real-world machine learning methods for big data; cf. its impact on genome-scale approximate string matching. Previous work showed that compression can accelerate algorithms for Hidden Markov Models (HMM) with discrete observations, both for the classical frequentist HMM algorithms - Forward Filtering, Backward Smoothing and Viterbi - and for Gibbs sampling in Bayesian HMM. For Bayesian HMM with continuous-valued observations, compression was shown to greatly accelerate computations for specific types of data. For instance, data from large-scale experiments interrogating structural genetic variation can be assumed to be piecewise constant with noise or, equivalently, generated by an HMM with dominant self-transition probabilities. Here we extend the compressive computation approach to the classical frequentist HMM algorithms on continuous-valued observations, providing the first compressive approach for this problem. In a large-scale simulation study, we demonstrate empirically that in many settings compressed HMM algorithms very clearly outperform the classical algorithms with no, or only an insignificant, effect on the computed probabilities and inferred maximum-likelihood state paths. This provides an efficient approach to big data computations with HMM. An open-source implementation of the method is available from https://github.com/lucabello/wavelet-hmms.
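To make the core trick concrete, here is a minimal sketch (ours, not the authors' implementation) of a forward pass over a run-length compressed observation sequence, assuming Gaussian emissions: within a block of identical values, many recursion steps collapse into one matrix power, so work scales with the number of blocks rather than the sequence length. All names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def forward_compressed(blocks, A, mu, sigma, pi):
    """Forward algorithm over a compressed observation sequence.

    blocks : list of (value, width) pairs, e.g. piecewise-constant
             segments produced by a wavelet compression of the signal.
    A      : (S, S) transition matrix.
    mu, sigma : per-state Gaussian emission parameters (length S).
    pi     : initial state distribution.
    Returns the log-likelihood of the decompressed sequence.
    """
    loglik = 0.0
    alpha = None
    for k, (value, width) in enumerate(blocks):
        b = norm.pdf(value, loc=mu, scale=sigma)  # emission density per state
        M = A * b[None, :]                        # one forward step: alpha @ M
        if k == 0:
            alpha, width = pi * b, width - 1      # first observation uses pi
        if width > 0:
            # width identical observations collapse into one matrix power:
            # O(S^3 log width) instead of O(S^2 width) work.
            alpha = alpha @ np.linalg.matrix_power(M, width)
        z = alpha.sum()                           # rescale to limit underflow
        loglik += np.log(z)
        alpha = alpha / z
    return loglik
```

Per-block rescaling keeps alpha normalized; a production version would also have to guard the matrix power itself against underflow for very wide blocks.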


Subjects
Algorithms, Markov Chains, Bayes Theorem, Probability, Computer Simulation
2.
Mol Inform ; 41(12): e2200043, 2022 12.
Article in English | MEDLINE | ID: mdl-35732584

ABSTRACT

Computer-aided synthesis planning, suggesting synthetic routes for molecules of interest, is a rapidly growing field. The machine learning methods used often depend on access to large datasets for training, but finite experimental budgets limit how much data can be obtained from experiments. This suggests data-collection schemes such as active learning, which identifies the data points of highest impact for model accuracy and has been used with success in recent studies. However, little has been done to explore the robustness of reaction-yield prediction methods when used together with active learning to reduce the amount of experimental data needed for training. This study investigates the influence of machine learning algorithms and the number of initial data points on reaction yield prediction for two public high-throughput experimentation datasets. Our results show that active learning based on output margin reached a pre-defined AUROC faster than random sampling on both datasets. Analysis of the feature importance of the trained machine learning models suggests active learning had a larger influence on model accuracy when only a few features were important for the model prediction.
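As a concrete illustration of output-margin active learning (a generic sketch in scikit-learn style, not the study's code): score each unlabeled candidate by the gap between its two highest predicted class probabilities and query the smallest gaps first.

```python
import numpy as np

def margin_query(model, X_pool, batch_size=10):
    """Select the pool points with the smallest output margin, i.e. the
    smallest gap between the two highest predicted class probabilities --
    the candidates the current model is least sure about."""
    proba = model.predict_proba(X_pool)     # (n_samples, n_classes)
    ordered = np.sort(proba, axis=1)
    margin = ordered[:, -1] - ordered[:, -2]  # top-1 minus top-2
    return np.argsort(margin)[:batch_size]    # most ambiguous first
```

Each round then consists of fitting on the labeled set, querying a batch, running those experiments, and repeating until the target AUROC is reached.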


Subjects
Machine Learning
3.
BMC Bioinformatics ; 22(1): 487, 2021 Oct 09.
Article in English | MEDLINE | ID: mdl-34627154

ABSTRACT

BACKGROUND: Alignment-free methods are a popular approach for comparing biological sequences, including complete genomes. The methods range from probability distributions of sequence composition to first- and higher-order Markov chains, where a k-th order Markov chain over DNA has [Formula: see text] formal parameters. To circumvent this exponential growth in parameters, variable-length Markov chains (VLMCs) have gained popularity for applications in molecular biology and other areas. VLMCs adapt the depth depending on sequence context and thus curtail excesses in the number of parameters. The scarcity of fast, or even parallel, software tools prompted the development of a parallel implementation using lazy suffix trees and a hash-based alternative. RESULTS: An extensive evaluation was performed on genomes ranging from 12 Mbp to 22 Gbp. Relevant learning parameters were chosen guided by the Bayesian Information Criterion (BIC) to avoid over-fitting. Our implementation greatly improves upon the state-of-the-art even in serial execution. It exhibits very good parallel scaling, with speed-ups for long sequences close to the optimum indicated by Amdahl's law: about 3 for 4 threads and about 6 for 16 threads. CONCLUSIONS: Our parallel implementation, released as open source under the GPLv3 license, provides a practically useful alternative to the state-of-the-art, allowing the construction of VLMCs even for very large genomes significantly faster than previously possible. Additionally, our parameter selection based on BIC gives guidance to end-users comparing genomes.
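A toy illustration of the defining VLMC operation - variable-depth context lookup - may help; this dict-based version is ours and stands in for the lazy-suffix-tree and hash-based structures the paper actually builds.

```python
def vlmc_next_prob(counts, context, symbol, alphabet="ACGT", max_depth=8):
    """Return P(symbol | context) under a VLMC stored as a dict mapping
    each retained context string to its symbol counts. We walk from the
    longest allowed suffix of `context` down to the empty context and use
    the longest one kept during training -- this is what lets a VLMC vary
    its memory depth with sequence context."""
    for d in range(min(max_depth, len(context)), -1, -1):
        suffix = context[len(context) - d:]
        if suffix in counts:
            c = counts[suffix]
            total = sum(c.get(a, 0) for a in alphabet)
            # add-one smoothing so unseen symbols keep some mass
            return (c.get(symbol, 0) + 1) / (total + len(alphabet))
    raise ValueError("empty context missing from model")
```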


Subjects
Genome, Software, Bayes Theorem, DNA, Markov Chains
4.
Alzheimers Res Ther ; 13(1): 151, 2021 09 06.
Article in English | MEDLINE | ID: mdl-34488882

ABSTRACT

BACKGROUND: In Alzheimer's disease, amyloid-β (Aβ) peptides aggregate in the brain, lowering CSF amyloid levels - a key pathological hallmark of the disease. However, lowered CSF amyloid levels may also be present in cognitively unimpaired elderly individuals. Therefore, it is of great value to explain the variance in disease progression among patients with Aβ pathology. METHODS: A cohort of n=2293 participants, of whom n=749 were Aβ positive, was selected from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database to study heterogeneity in disease progression for individuals with Aβ pathology. The analysis used baseline clinical variables including demographics, genetic markers, and neuropsychological data to predict how the cognitive ability and AD diagnosis of subjects progressed, using statistical models and machine learning. Due to the relatively low prevalence of Aβ pathology, models fit only to Aβ-positive subjects were compared to models fit to an extended cohort including subjects without established Aβ pathology, adjusting for covariate differences between the cohorts. RESULTS: Aβ pathology status was determined based on the Aβ42/Aβ40 ratio. The best predictive model of change in cognitive test scores for Aβ-positive subjects at the 2-year follow-up achieved an R2 score of 0.388, while the best model predicting adverse changes in diagnosis achieved a weighted F1 score of 0.791. Aβ-positive subjects declined faster on average than those without Aβ pathology, but the specific level of CSF Aβ was not predictive of progression rate. When predicting cognitive score change 4 years after baseline, the best model achieved an R2 score of 0.325, and fitting models to the extended cohort improved performance. Moreover, using all clinical variables outperformed the best model based only on a suite of cognitive test scores, which achieved an R2 score of 0.228. CONCLUSION: Our analysis shows that CSF levels of Aβ are not strong predictors of the rate of cognitive decline in Aβ-positive subjects when adjusting for other variables. Baseline assessments of cognitive function account for the majority of the variance explained in the prediction of 2-year decline but are insufficient for achieving optimal results in longer-term predictions. Predicting changes both in cognitive test scores and in diagnosis provides multiple perspectives on the progression of potential AD subjects.


Subjects
Alzheimer Disease, Cognitive Dysfunction, Aged, Alzheimer Disease/complications, Amyloid beta-Peptides, Biomarkers, Cognition, Cognitive Dysfunction/diagnosis, Disease Progression, Humans, Neuropsychological Tests, tau Proteins
5.
PeerJ Comput Sci ; 7: e397, 2021.
Article in English | MEDLINE | ID: mdl-33817043

ABSTRACT

The Alternating Direction Method of Multipliers (ADMM) is a popular and promising distributed framework for solving large-scale machine learning problems. We consider decentralized consensus-based ADMM, in which nodes may only communicate with one-hop neighbors; this may cause slow convergence. We investigate the impact of network topology on the performance of ADMM-based learning of a Support Vector Machine using expander and mean-degree graphs, as well as some common modern network topologies. In particular, we investigate to what degree the expansion property of the network influences convergence in terms of iterations, training time, and communication time, and we suggest which topologies are preferable. Additionally, we provide an implementation that makes these theoretical advances easily available. The results show that the convergence of decentralized ADMM-based learning of SVMs improves on graphs with large spectral gaps and high, homogeneous degrees.
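For readers unfamiliar with the decentralized setting, below is one standard form of the per-node consensus-ADMM updates, taken from the general distributed-optimization literature; the paper's exact SVM variant may differ. Node i holds a local objective f_i (hinge loss plus regularization for an SVM), exchanges its iterate only with its one-hop neighbors N(i), and keeps a dual variable lambda_i that accumulates disagreement:

```latex
\begin{aligned}
x_i^{k+1} &= \operatorname*{arg\,min}_{x}\;
  f_i(x) + x^{\top}\lambda_i^{k}
  + \rho \sum_{j \in \mathcal{N}(i)}
    \left\lVert x - \tfrac{1}{2}\bigl(x_i^{k} + x_j^{k}\bigr) \right\rVert_2^2 \\
\lambda_i^{k+1} &= \lambda_i^{k}
  + \rho \sum_{j \in \mathcal{N}(i)} \bigl(x_i^{k+1} - x_j^{k+1}\bigr)
\end{aligned}
```

The spectral gap of the communication graph controls how quickly local disagreements are averaged away, which is why sparse but well-connected expander graphs can converge in fewer iterations.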

6.
Proc Natl Acad Sci U S A ; 117(22): 12411-12418, 2020 06 02.
Article in English | MEDLINE | ID: mdl-32430323

ABSTRACT

Genetic factors and socioeconomic status (SES) inequalities play a large role in educational attainment, and both have been associated with variations in brain structure and cognition. However, genetics and SES are correlated, and no prior study has assessed their neural associations independently. Here we used a polygenic score for educational attainment (EduYears-PGS), as well as SES, in a longitudinal study of 551 adolescents to tease apart genetic and environmental associations with brain development and cognition. Subjects received a structural MRI scan at ages 14 and 19. At both time points, they performed three working memory (WM) tasks. SES and EduYears-PGS were correlated (r = 0.27) and had both common and independent associations with brain structure and cognition. Specifically, lower SES was related to less total cortical surface area and lower WM. EduYears-PGS was also related to total cortical surface area, but in addition had a regional association with surface area in the right parietal lobe, a region related to nonverbal cognitive functions, including mathematics, spatial cognition, and WM. SES, but not EduYears-PGS, was related to a change in total cortical surface area from age 14 to 19. This study demonstrates a regional association of EduYears-PGS and the independent prediction of SES with cognitive function and brain development. It suggests that SES inequalities, in particular parental education, are related to global aspects of cortical development and exert a persistent influence on brain development during adolescence.


Subjects
Brain/growth & development, Cognition, Educational Status, Academic Success, Adolescent, Adult, Brain/diagnostic imaging, Brain/physiology, Female, Humans, Longitudinal Studies, Magnetic Resonance Imaging, Male, Memory, Short-Term, Multifactorial Inheritance, Social Class, Young Adult
7.
PeerJ ; 8: e8225, 2020.
Article in English | MEDLINE | ID: mdl-32025365

ABSTRACT

Natural history museums are unique spaces for interdisciplinary research and educational innovation. Through extensive exhibits and public programming and by hosting rich communities of amateurs, students, and researchers at all stages of their careers, they can provide a place-based window to focus on integration of science and discovery, as well as a locus for community engagement. At the same time, like a synthesis radio telescope, when joined together through emerging digital resources, the global community of museums (the 'Global Museum') is more than the sum of its parts, allowing insights and answers to diverse biological, environmental, and societal questions at the global scale, across eons of time, and spanning vast diversity across the Tree of Life. We argue that, whereas natural history collections and museums began with a focus on describing the diversity and peculiarities of species on Earth, they are now increasingly leveraged in new ways that significantly expand their impact and relevance. These new directions include the possibility to ask new, often interdisciplinary questions in basic and applied science, such as in biomimetic design, and by contributing to solutions to climate change, global health and food security challenges. As institutions, they have long been incubators for cutting-edge research in biology while simultaneously providing core infrastructure for research on present and future societal needs. Here we explore how the intersection between pressing issues in environmental and human health and rapid technological innovation have reinforced the relevance of museum collections. We do this by providing examples as food for thought for both the broader academic community and museum scientists on the evolving role of museums. We also identify challenges to the realization of the full potential of natural history collections and the Global Museum to science and society and discuss the critical need to grow these collections. We then focus on mapping and modelling of museum data (including place-based approaches and discovery), and explore the main projects, platforms and databases enabling this growth. Finally, we aim to improve relevant protocols for the long-term storage of specimens and tissues, ensuring proper connection with tomorrow's technologies and hence further increasing the relevance of natural history museums.

8.
J Healthc Inform Res ; 4(1): 1-18, 2020 Mar.
Article in English | MEDLINE | ID: mdl-35415439

ABSTRACT

Many factors affect blood glucose levels in type 1 diabetics, several of which vary widely in both the magnitude and the delay of their effect. Modern rapid-acting insulins generally peak 60-90 min after administration, while carbohydrate intake can affect blood glucose levels more rapidly for high glycemic index foods, or more slowly for other carbohydrate sources. It is important to have good estimates of the development of glucose levels in the near future, both for diabetic patients managing their insulin distribution manually and for closed-loop systems making decisions about the distribution. Modern continuous glucose monitoring systems provide excellent sources of data to train machine learning models to predict future glucose levels. In this paper, we present an approach for predicting blood glucose levels for diabetics up to 1 h into the future. The approach is based on recurrent neural networks trained in an end-to-end fashion, requiring nothing but the glucose level history for the patient. Our approach obtains results that are comparable to the state of the art on the Ohio T1DM dataset for blood glucose level prediction. In addition to predicting the future glucose value, our model provides an estimate of its certainty, helping users to interpret the predicted levels. This is realized by training the recurrent neural network to parameterize a univariate Gaussian distribution over the output. The approach needs no feature engineering or data preprocessing and is computationally inexpensive. We evaluate our method using the standard root-mean-squared error (RMSE) metric, along with a blood glucose-specific metric called the surveillance error grid (SEG). We further study the properties of the distribution that is learned by the model, using experiments that determine the nature of the certainty estimate that the model is able to capture.
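A minimal sketch of the modeling idea (our PyTorch rendering; layer sizes and names are assumptions, not the paper's architecture): the network emits a mean and a log standard deviation, and training minimizes the Gaussian negative log-likelihood, so the certainty estimate is learned jointly with the point prediction.

```python
import torch
import torch.nn as nn

class GlucoseRNN(nn.Module):
    """Recurrent model that outputs the parameters of a univariate
    Gaussian over the future glucose value."""
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)       # -> (mu, log sigma)

    def forward(self, x):                      # x: (batch, time, 1)
        h, _ = self.rnn(x)
        mu, log_sigma = self.head(h[:, -1]).unbind(dim=-1)
        return mu, log_sigma

def gaussian_nll(mu, log_sigma, y):
    """Negative log-likelihood of y under N(mu, sigma^2), up to an
    additive constant; minimizing it trains prediction and certainty
    together."""
    return (log_sigma + 0.5 * ((y - mu) / log_sigma.exp()) ** 2).mean()
```

At prediction time the model returns (mu, sigma) pairs rather than bare point estimates, which is what lets users judge how much to trust each forecast.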

9.
Algorithms Mol Biol ; 14: 20, 2019.
Article in English | MEDLINE | ID: mdl-31572486

ABSTRACT

BACKGROUND: Full Bayesian inference for detecting copy number variants (CNV) from whole-genome sequencing (WGS) data is still largely infeasible due to computational demands. A recently introduced approach to performing Forward-Backward Gibbs sampling using dynamic Haar wavelet compression has alleviated issues of convergence and, to some extent, speed. Yet, the problem remains challenging in practice. RESULTS: In this paper, we propose an improved algorithmic framework for this approach. We provide new space-efficient data structures to query sufficient statistics in logarithmic time, based on a linear-time, in-place transform of the data, which also improves on the compression ratio. We also propose a new approach to efficiently store and update marginal state counts obtained from the Gibbs sampler. CONCLUSIONS: Using this approach, we discover several CNV candidates in two rat populations divergently selected for tame and aggressive behavior, consistent with earlier results concerning the domestication syndrome as well as experimental observations. Computationally, we observe a 29.5-fold decrease in memory usage, an average 5.8-fold speedup, and a 191-fold decrease in minor page faults. We also observe that these metrics varied greatly in the old implementation, but not in the new one. We conjecture that this is due to the better compression scheme. The fully Bayesian segmentation of the entire WGS data set required 3.5 min and 1.24 GB of memory, and can hence be performed on a commodity laptop.
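For intuition about the transform such data structures build on, here is a toy in-place Haar transform (ours, not the paper's construction): linear time, constant extra space, with averages and detail coefficients stored interleaved in the input array.

```python
def haar_inplace(x):
    """Un-normalized in-place Haar transform of a length-2^k array.
    After the call, x[0] holds the overall average and the remaining
    slots hold detail (difference) coefficients at all scales.
    Total work is n/2 + n/4 + ... = n - 1 pair updates, i.e. linear."""
    n = len(x)
    step = 1
    while step < n:
        for i in range(0, n, 2 * step):
            a, b = x[i], x[i + step]
            x[i], x[i + step] = (a + b) / 2.0, (a - b) / 2.0  # average, detail
        step *= 2
```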

10.
PeerJ ; 7: e6399, 2019.
Article in English | MEDLINE | ID: mdl-30783571

ABSTRACT

Building the Tree of Life (ToL) is a major challenge of modern biology, requiring advances in cyberinfrastructure, data collection, theory, and more. Here, we argue that phylogenomics stands to benefit by embracing the many heterogeneous genomic signals emerging from the first decade of large-scale phylogenetic analysis spawned by high-throughput sequencing (HTS). Such signals include those most commonly encountered in phylogenomic datasets, such as incomplete lineage sorting, but also reticulate processes emerging with greater frequency, such as recombination and introgression. Here we focus specifically on how phylogenetic methods can accommodate the heterogeneity incurred by such population genetic processes; we do not discuss phylogenetic methods that ignore such processes, such as concatenation or supermatrix approaches or supertrees. We suggest that methods of data acquisition and the types of markers used in phylogenomics will remain restricted until a posteriori methods of marker choice are made possible with routine whole-genome sequencing of taxa of interest. We discuss limitations and potential extensions of a model supporting innovation in phylogenomics today, the multispecies coalescent model (MSC). Macroevolutionary models that use phylogenies, such as character mapping, often ignore the heterogeneity on which building phylogenies increasingly relies, and we suggest that assimilating such heterogeneity is an important goal moving forward. Finally, we argue that an integrative cyberinfrastructure linking all steps of the process of building the ToL, from specimen acquisition in the field to publication and tracking of phylogenomic data, as well as a culture that values contributors at each step, are essential for progress.

11.
Drug Discov Today Technol ; 32-33: 65-72, 2019 Dec.
Article in English | MEDLINE | ID: mdl-33386096

ABSTRACT

Application of AI technologies to synthesis prediction has developed very rapidly in recent years. We attempt here to give a comprehensive summary of the latest advances in retro-synthesis planning, forward synthesis prediction, and quantum chemistry-based reaction prediction models. Besides an introduction to the AI/ML models for addressing various synthesis-related problems, the sources of the reaction datasets used in model building are also covered. In addition to the predictive models, robotics-based high-throughput experimentation technology will be another crucial factor for conducting synthesis in an automated fashion. Some state-of-the-art high-throughput experimentation practices carried out in the pharmaceutical industry are highlighted in this chapter to give the reader a sense of how future chemistry will be conducted to make compounds faster and cheaper.


Subjects
Artificial Intelligence, Computer-Aided Design, Synthetic Drugs/chemistry, Humans
12.
Methods Mol Biol ; 1833: 83-93, 2018.
Article in English | MEDLINE | ID: mdl-30039365

ABSTRACT

CNV detection requires a high-quality segmentation of genomic data. In many WGS experiments, sample and control are sequenced together in a multiplexed fashion, using DNA barcoding for economic reasons. Using the differential read depth of these two conditions cancels out systematic additive errors. Due to this detrending, the resulting data is appropriate for inference using a hidden Markov model (HMM), arguably one of the principal models for labeled segmentation. However, while the usual frequentist approaches such as Baum-Welch are problematic for several reasons, they are often preferred to Bayesian HMM inference, which normally requires prohibitively long running times and exceeds a typical user's computational resources on genome-scale data. HaMMLET solves this problem using a dynamic wavelet compression scheme, which makes Bayesian segmentation of WGS data feasible on standard consumer hardware.
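A sketch of the detrending step as described (illustrative preprocessing of our own, not HaMMLET's code): normalize both barcoded coverage tracks and subtract, so additive biases shared by sample and control cancel.

```python
import numpy as np

def differential_read_depth(sample, control):
    """Bin-wise differential read depth for a multiplexed sample/control
    pair: scale both coverage tracks to equal library size, then take
    their difference. Systematic additive biases hitting both barcoded
    libraries alike (e.g. mappability or GC trends) cancel, leaving a
    detrended signal suitable for HMM segmentation."""
    sample = np.asarray(sample, dtype=float)
    control = np.asarray(control, dtype=float)
    return sample / sample.sum() - control / control.sum()
```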


Subjects
DNA Barcoding, Taxonomic/methods, High-Throughput Nucleotide Sequencing, Sequence Analysis, DNA/methods, Markov Chains
13.
PLoS Comput Biol ; 12(5): e1004871, 2016 05.
Article in English | MEDLINE | ID: mdl-27177143

ABSTRACT

By integrating Haar wavelets with Hidden Markov Models, we achieve drastically reduced running times for Bayesian inference using Forward-Backward Gibbs sampling. We show that this improves detection of genomic copy number variants (CNV) in array CGH experiments compared to the state-of-the-art, including standard Gibbs sampling. The method concentrates computational effort on chromosomal segments which are difficult to call, by dynamically and adaptively recomputing consecutive blocks of observations likely to share a copy number. This makes routine diagnostic use and re-analysis of legacy data collections feasible; to this end, we also propose an effective automatic prior. An open source software implementation of our method is available at http://schlieplab.org/Software/HaMMLET/ (DOI: 10.5281/zenodo.46262). This paper was selected for oral presentation at RECOMB 2016, and an abstract is published in the conference proceedings.
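To illustrate how a wavelet compression yields the blocks such a sampler operates on, here is a crude stand-in (assuming the PyWavelets package; HaMMLET's dynamic, in-sampler block recomputation is far more refined): threshold the Haar detail coefficients, reconstruct, and merge equal neighbors into (value, width) blocks.

```python
import numpy as np
import pywt

def wavelet_blocks(y, threshold):
    """Adaptive piecewise-constant compression: zero Haar detail
    coefficients below `threshold`, reconstruct, and merge runs of
    equal reconstructed values into (value, width) blocks. A smaller
    threshold keeps more blocks, concentrating effort where the
    signal changes."""
    coeffs = pywt.wavedec(y, "haar")
    coeffs = [coeffs[0]] + [np.where(np.abs(c) >= threshold, c, 0.0)
                            for c in coeffs[1:]]
    smooth = pywt.waverec(coeffs, "haar")[: len(y)]
    blocks, start = [], 0
    for i in range(1, len(y) + 1):
        if i == len(y) or not np.isclose(smooth[i], smooth[start]):
            blocks.append((float(smooth[start]), i - start))
            start = i
    return blocks
```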


Subjects
Comparative Genomic Hybridization/statistics & numerical data, DNA Copy Number Variations, Models, Genetic, Bayes Theorem, Breast Neoplasms/genetics, Cell Line, Computational Biology, Computer Simulation, Data Compression, Female, Genome, Human, Humans, Markov Chains, Software
14.
BMC Bioinformatics ; 17(1): 224, 2016 May 28.
Article in English | MEDLINE | ID: mdl-27233515

ABSTRACT

BACKGROUND: Discovery of microRNAs (miRNAs) relies on predictive models for characteristic features from miRNA precursors (pre-miRNAs). The short length of miRNA genes and the lack of pronounced sequence features complicate this task. To accommodate the peculiarities of plant and animal miRNA systems, tools for the two systems have evolved differently. However, these tools are biased towards the species for which they were primarily developed and, consequently, their predictive performance on data sets from other species of the same kingdom might be lower. While these biases are intrinsic to the species, their characterization can lead to computational approaches capable of diminishing their negative effect on the accuracy of pre-miRNA predictive models. We investigate in this study how 45 predictive models, induced for data sets from 45 species distributed across eight subphyla/classes, perform when applied to a species different from the one used in their induction. RESULTS: Our computational experiments show that the separability of pre-miRNA and pseudo pre-miRNA instances is species-dependent and that no feature set performs well for all species, even within the same subphylum/class. Mitigating this species dependency, we show that an ensemble of classifiers reduced the classification errors for all 45 species. As the ensemble members were obtained using meaningful, yet computationally viable, feature sets, the ensembles also have a lower computational cost than individual classifiers that rely on energy stability parameters, which are prohibitively expensive in large-scale applications. CONCLUSION: In this study, the combination of multiple pre-miRNA feature sets and multiple learning biases enhanced the predictive accuracy of pre-miRNA classifiers for 45 species. This is certainly a promising approach to be incorporated into miRNA discovery tools towards more accurate and less species-dependent tools. The material to reproduce the results from this paper can be downloaded from http://dx.doi.org/10.5281/zenodo.49754.
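A sketch of the ensemble recipe in scikit-learn terms (our rendering; the column indices standing in for feature sets are hypothetical placeholders):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def member(name, columns, clf):
    """One ensemble member: its own feature subset plus its own learning
    bias, emulating 'multiple feature sets x multiple learning biases'."""
    select = ColumnTransformer([(name, "passthrough", columns)])
    return Pipeline([("features", select), ("clf", clf)])

# Column indices stand in for two different pre-miRNA feature sets.
ensemble = VotingClassifier(
    estimators=[
        ("topology", member("topology", [0, 1, 2],
                            RandomForestClassifier(n_estimators=200))),
        ("composition", member("composition", [3, 4, 5, 6],
                               LogisticRegression(max_iter=1000))),
    ],
    voting="soft",  # average class probabilities across members
)
# ensemble.fit(X_train, y_train); ensemble.predict(X_other_species)
```

Because each member sees only a cheap feature subset, the ensemble can stay less expensive than a single classifier built on costly energy-stability features.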


Subjects
Algorithms, Computational Biology/methods, MicroRNAs/genetics, RNA Precursors/genetics, Animals, Humans, MicroRNAs/chemistry, RNA Precursors/chemistry, Software, Species Specificity
15.
BMC Bioinformatics ; 15: 124, 2014 May 02.
Article in English | MEDLINE | ID: mdl-24884650

ABSTRACT

BACKGROUND: Computational discovery of microRNAs (miRNAs) is based on pre-determined sets of features from miRNA precursors (pre-miRNAs). Some feature sets are composed of sequence-structure patterns commonly found in pre-miRNAs, while others are a combination of more sophisticated RNA features. In this work, we analyze the discriminant power of seven feature sets, which are used in six pre-miRNA prediction tools. The analysis is based on the classification performance achieved with these feature sets for the training algorithms used in these tools. We also evaluate feature discrimination through the F-score and feature importance in the induction of random forests. RESULTS: Small or non-significant differences were found among the estimated classification performances of classifiers induced using sets with diverse features, despite the wide differences in their dimension. Inspired by these results, we obtained a lower-dimensional feature set which achieved a sensitivity of 90% and a specificity of 95%. These estimates are within 0.1% of the maximal values obtained with any feature set (SELECT, Section "Results and discussion"), while our set is 34 times faster to compute than FS2 (see Section "Results and discussion"), the computationally least expensive of the feature sets from the literature that perform within 0.1% of the maximal values. Among the six tools used as references in our experiments, five showed lower sensitivity or specificity. CONCLUSION: In miRNA discovery the number of putative miRNA loci is on the order of millions. Analysis of putative pre-miRNAs using a computationally expensive feature set would be wasteful or even unfeasible for large genomes. In this work, we propose a relatively inexpensive feature set and explore most of the learning aspects implemented in current ab initio pre-miRNA prediction tools, which may lead to the development of efficient ab initio pre-miRNA discovery tools. The material to reproduce the main results from this paper can be downloaded from http://bioinformatics.rutgers.edu/Static/Software/discriminant.tar.gz.
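For reference, the F-score used for feature discrimination is typically computed as below (the standard definition from the feature-selection literature; our code, not the paper's):

```python
import numpy as np

def f_score(X, y):
    """Per-feature F-score: between-class separation of the feature
    means divided by the within-class variance. Higher values mean the
    feature separates positives from negatives better.
    X: (n_samples, n_features) matrix, y: binary labels in {0, 1}."""
    pos, neg = X[y == 1], X[y == 0]
    num = (pos.mean(0) - X.mean(0)) ** 2 + (neg.mean(0) - X.mean(0)) ** 2
    den = pos.var(0, ddof=1) + neg.var(0, ddof=1)
    return num / den
```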


Subjects
MicroRNAs/chemistry, RNA Precursors/chemistry, Algorithms, Artificial Intelligence, Base Composition, Computational Biology/methods, Humans, Software
16.
Sci Rep ; 4: 4780, 2014 Apr 24.
Article in English | MEDLINE | ID: mdl-24759094

ABSTRACT

A broad swath of eukaryotic microbial biodiversity cannot be cultivated in the lab and is therefore inaccessible to conventional genome-wide comparative methods. One promising approach to study these lineages is single cell genomics (SCG), whereby an individual cell is captured from nature and genome data are produced from the amplified total DNA. Here we tested the efficacy of SCG to generate a draft genome assembly from a single sample, in this case a cell belonging to the broadly distributed MAST-4 uncultured marine stramenopiles. Using de novo gene prediction, we identified 6,996 protein-encoding genes in the MAST-4 genome. This genetic inventory was sufficient to place the cell within the tree of life using multigene phylogenetics and provided preliminary insights into the complex evolutionary history of horizontal gene transfer (HGT) in the MAST-4 lineage.


Subjects
Genomics, Single-Cell Analysis, Stramenopiles/genetics, Biodiversity, Computational Biology, Phylogeny, Proteome, Proteomics, RNA, Ribosomal, 18S/genetics, Seawater, Stramenopiles/classification, Stramenopiles/metabolism
17.
Bioinformatics ; 30(14): 1950-7, 2014 Jul 15.
Article in English | MEDLINE | ID: mdl-24618471

ABSTRACT

MOTIVATION: Counting the frequencies of k-mers in read libraries is often a first step in the analysis of high-throughput sequencing data. Infrequent k-mers are assumed to be a result of sequencing errors. The frequent k-mers constitute a reduced but error-free representation of the experiment, which can inform read error correction or serve as the input to de novo assembly methods. Ideally, the memory requirement for counting should be linear in the number of frequent k-mers and not in the, typically much larger, total number of k-mers in the read library. RESULTS: We present a novel method that balances time, space, and accuracy requirements to efficiently extract frequent k-mers even for high-coverage libraries and large genomes such as human. Our method is designed to minimize cache misses by using a pattern-blocked Bloom filter to remove infrequent k-mers from consideration, in combination with a novel sort-and-compact scheme, instead of a hash, for the actual counting. Although this increases theoretical complexity, the savings in cache misses reduce the empirical running times. A variant of the method can resort to a counting Bloom filter for even larger savings in memory, at the expense of false-negative rates in addition to the false-positive rates common to all Bloom filter-based approaches. A comparison with the state-of-the-art shows reduced memory requirements and running times. AVAILABILITY AND IMPLEMENTATION: The tools are freely available for download at http://bioinformatics.rutgers.edu/Software/Turtle and http://figshare.com/articles/Turtle/791582.
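A stripped-down sketch of the filtering idea (ours; the paper's pattern-blocked layout and sort-and-compact counting are deliberately omitted, with a plain dict standing in): a Bloom filter absorbs first sightings, so k-mers seen only once never enter the counting structure.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; real implementations pack the probed bits
    into one cache line ('pattern-blocking'), which we skip here."""
    def __init__(self, m_bits, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.blake2b(item.encode(), salt=bytes([i])).digest()
            yield int.from_bytes(h[:8], "little") % self.m

    def add_and_test(self, item):
        """Insert item; return True if it was (probably) seen before."""
        seen = True
        for p in self._positions(item):
            byte, bit = divmod(p, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

def count_frequent_kmers(reads, k, bloom):
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if bloom.add_and_test(kmer):          # second or later sighting
                counts[kmer] = counts.get(kmer, 0) + 1
            # first sightings stay out of the counting structure entirely
    return counts
```

Note that `counts` holds (true count - 1) for each frequent k-mer, and Bloom false positives can let a few singletons slip through; these are the accuracy trade-offs the abstract alludes to.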


Subjects
Algorithms, High-Throughput Nucleotide Sequencing/methods, Sequence Analysis, DNA/methods, Software, Genome, Human, Humans
18.
J Comput Biol ; 19(10): 1162-75, 2012 Oct.
Article in English | MEDLINE | ID: mdl-23057825

ABSTRACT

Scaffolding is an important subproblem in de novo genome assembly, in which mate pair data are used to construct a linear sequence of contigs separated by gaps. Here we present SLIQ, a set of simple linear inequalities derived from the geometry of contigs on the line that can be used to predict the relative positions and orientations of contigs from individual mate pair reads and thus produce a contig digraph. The SLIQ inequalities can also filter out unreliable mate pairs and can be used as a preprocessing step for any scaffolding algorithm. We tested the SLIQ inequalities on five real data sets ranging in complexity from simple bacterial genomes to complex mammalian genomes and compared the results to the majority voting procedure used by many other scaffolding algorithms. SLIQ predicted the relative positions and orientations of the contigs with high accuracy in all cases and gave more accurate position predictions than majority voting for complex genomes, in particular the human genome. Finally, we present a simple scaffolding algorithm that produces linear scaffolds given a contig digraph. We show that our algorithm is very efficient compared to other scaffolding algorithms while maintaining high accuracy in predicting both contig positions and orientations for real data sets.
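One example of the kind of constraint meant by "geometry of contigs on the line" (an illustrative inequality of our own construction, not SLIQ's published set):

```python
def consistent_order(ins_max, len_a, pos_a, pos_b):
    """Illustrative geometric constraint: for an innie mate pair with
    maximum insert size ins_max, read 1 at pos_a on contig A (length
    len_a) and read 2 at pos_b on contig B, placing A before B with a
    non-negative gap g requires
        (len_a - pos_a) + g + pos_b <= ins_max,  with g >= 0,
    hence (len_a - pos_a) + pos_b <= ins_max. Mate pairs violating this
    cannot support the order A -> B and can be filtered as unreliable."""
    return (len_a - pos_a) + pos_b <= ins_max
```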


Subjects
Algorithms, Contig Mapping/methods, Genome, Human, Sequence Analysis, DNA/methods, Humans
19.
Bioinformatics ; 28(22): 2875-82, 2012 Nov 15.
Article in English | MEDLINE | ID: mdl-23060616

ABSTRACT

MOTIVATION: Next-generation sequencing techniques have facilitated a large-scale analysis of human genetic variation. Despite the advances in sequencing speed, the computational discovery of structural variants is not yet standard. It is likely that many variants have remained undiscovered in most sequenced individuals. RESULTS: Here, we present a novel internal-segment-size-based approach, which organizes all reads, including concordant ones, into a read alignment graph, where max-cliques represent maximal contradiction-free groups of alignments. A novel algorithm then enumerates all max-cliques and statistically evaluates them for their potential to reflect insertions or deletions. For the first time in the literature, we compare a large range of state-of-the-art approaches using simulated Illumina reads from a fully annotated genome and present relevant performance statistics. We achieve superior performance, in particular, for deletions or insertions (indels) of length 20-100 nt. This has been previously identified as a remaining major challenge in structural variation discovery, in particular for insert-size-based approaches. In this size range, we even outperform split-read aligners. We achieve competitive results also on biological data, where our method is the only one to make a substantial amount of correct predictions, which, additionally, are disjoint from those by split-read aligners. AVAILABILITY: CLEVER is open source (GPL) and available from http://clever-sv.googlecode.com. CONTACT: as@cwi.nl or tm@cwi.nl. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
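To make "contradiction-free" concrete, here is a simplified pairwise compatibility test of our own devising (CLEVER's actual criterion is a proper statistical test over insert-size distributions):

```python
def compatible(seg1, seg2, std, z=3.0):
    """Illustrative edge predicate for a read-alignment graph. seg1 and
    seg2 are the internal segments (start, end) implied by two mate-pair
    alignments; padding each by the insert-size tolerance z*std and
    checking for overlap asks whether both alignments could reflect the
    same underlying allele. Max-cliques of this relation are then
    maximal contradiction-free groups."""
    pad = z * std
    a1, b1 = seg1[0] - pad, seg1[1] + pad
    a2, b2 = seg2[0] - pad, seg2[1] + pad
    return max(a1, a2) <= min(b1, b2)
```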


Subjects
Algorithms, Genetic Variation, Genome, Human, Computer Simulation, Humans, INDEL Mutation
20.
Bioinformatics ; 28(18): i325-i332, 2012 Sep 15.
Article in English | MEDLINE | ID: mdl-22962448

ABSTRACT

MOTIVATION: Mapping billions of reads from next-generation sequencing experiments to reference genomes is a crucial task, which can require hundreds of hours of running time on a single CPU even for the fastest known implementations. Traditional approaches have difficulties dealing with matches of large edit distance, particularly in the presence of frequent or large insertions and deletions (indels). This is a serious obstacle both in determining the spectrum and abundance of genetic variations and in personal genomics. RESULTS: For the first time, we adopt the approximate string matching paradigm of geometric embedding for read mapping, thus rephrasing it as nearest-neighbor queries in a q-gram frequency vector space. Using the L1 distance between frequency vectors has the benefit of providing lower bounds for an edit distance with affine gap costs. Using a cache-oblivious kd-tree, we realize running times which match the state-of-the-art. Additionally, running time and memory requirements are about constant for read lengths between 100 and 1000 bp. We provide a first proof of concept that geometric embedding is a promising paradigm for read mapping and that the L1 distance might serve to detect structural variations. TreQ, our initial implementation of that concept, performs more accurately than many popular read mappers over a wide range of structural variants. AVAILABILITY AND IMPLEMENTATION: TreQ will be released under the GNU Public License (GPL), and precomputed genome indices will be provided for download at http://treq.sf.net. CONTACT: pavelm@cs.rutgers.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
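The lower-bounding property rests on the classical q-gram lemma; here is a sketch of its unit-cost form (the paper works with affine gap costs, which we do not reproduce):

```python
from collections import Counter

def qgram_profile(s, q):
    """Sparse q-gram frequency vector of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def edit_distance_lower_bound(s, t, q):
    """q-gram lemma, unit-cost version: a single edit changes at most q
    q-grams of a string, so ed(s, t) >= ||Q(s) - Q(t)||_1 / (2q).
    Filtering with this bound is what makes L1 nearest-neighbor queries
    over q-gram vectors meaningful for read mapping."""
    p, r = qgram_profile(s, q), qgram_profile(t, q)
    l1 = sum(abs(p[g] - r[g]) for g in set(p) | set(r))
    return l1 / (2 * q)
```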


Subjects
High-Throughput Nucleotide Sequencing/methods, INDEL Mutation, Sequence Analysis, DNA/methods, Chromosome Mapping, Genetic Variation, Genome, Human, Genomics/methods, Humans, Nucleotides/chemistry