1.
BMC Bioinformatics ; 23(1): 25, 2022 Jan 06.
Article in English | MEDLINE | ID: mdl-34991450

ABSTRACT

BACKGROUND: Sequencing technologies are prone to errors, making error correction (EC) necessary for downstream applications. EC tools need to be manually configured for optimal performance. We find that the optimal parameters (e.g., k-mer size) are both tool- and dataset-dependent. Moreover, evaluating the performance (i.e., Alignment-rate or Gain) of a given tool usually relies on a reference genome, but quality reference genomes are not always available. We introduce Lerna for the automated configuration of k-mer-based EC tools. Lerna first creates a language model (LM) of the uncorrected genomic reads, and then, based on this LM, calculates a metric called the perplexity metric to evaluate the corrected reads for different parameter choices. Next, it finds the one that produces the highest alignment rate without using a reference genome. The fundamental intuition of our approach is that the perplexity metric is inversely correlated with the quality of the assembly after error correction. Therefore, Lerna leverages the perplexity metric for automated tuning of k-mer sizes without needing a reference genome. RESULTS: First, we show that the best k-mer value can vary for different datasets, even for the same EC tool. This motivates our design that automates k-mer size selection without using a reference genome. Second, we show the gains of our LM using its component attention-based transformers. We show the model's estimation of the perplexity metric before and after error correction. The lower the perplexity after correction, the better the k-mer size. We also show that the alignment rate and assembly quality computed for the corrected reads are strongly negatively correlated with the perplexity, enabling the automated selection of k-mer values for better error correction, and hence, improved assembly quality. We validate our approach on both short and long reads. 
Additionally, we show that our attention-based models yield a significant runtime improvement for the entire pipeline, 18× faster than previous works, owing to parallelization of the attention mechanism and the use of JIT compilation for GPU inferencing. CONCLUSION: Lerna improves de novo genome assembly by optimizing EC tools. Our code is made available in a public repository at: https://github.com/icanforce/lerna-genomics.
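
The selection step described in this abstract can be sketched as follows. This is a minimal illustration and not Lerna's implementation: it swaps the attention-based transformer for an add-one-smoothed character n-gram model, and the names `train_ngram`, `perplexity`, and `pick_k`, as well as the `corrected_by_k` mapping (k-mer size to the reads an EC tool produced with that k), are assumptions made for the example.

```python
import math
from collections import Counter


def train_ngram(reads, n=3):
    """Count n-grams and their (n-1)-base contexts over the uncorrected reads."""
    ngrams, contexts = Counter(), Counter()
    for read in reads:
        for i in range(len(read) - n + 1):
            ngrams[read[i:i + n]] += 1
            contexts[read[i:i + n - 1]] += 1
    return ngrams, contexts


def perplexity(reads, model, n=3, alphabet=4):
    """Perplexity of reads under the add-one-smoothed n-gram model."""
    ngrams, contexts = model
    log_sum, count = 0.0, 0
    for read in reads:
        for i in range(len(read) - n + 1):
            gram = read[i:i + n]
            p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + alphabet)
            log_sum += math.log(p)
            count += 1
    return math.exp(-log_sum / count) if count else float("inf")


def pick_k(uncorrected, corrected_by_k, n=3):
    """Return the k whose corrected reads have the lowest perplexity, plus all scores."""
    model = train_ngram(uncorrected, n)
    scores = {k: perplexity(reads, model, n) for k, reads in corrected_by_k.items()}
    return min(scores, key=scores.get), scores
```

No reference genome appears anywhere in the sketch; the only inputs are the uncorrected reads and the candidate corrected read sets.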


Subjects
Algorithms; High-Throughput Nucleotide Sequencing; Base Sequence; Genomics; Sequence Analysis, DNA; Software
2.
Sci Rep ; 11(1): 22459, 2021 11 17.
Article in English | MEDLINE | ID: mdl-34789789

ABSTRACT

Data transmission accounts for significant energy consumption in wireless sensor networks where streaming data is generated by the sensors. This impedes their use in many settings, including livestock monitoring over large pastures (which forms our target application). We present Ambrosia, a lightweight protocol that utilizes a window-based timeseries forecasting mechanism for data reduction. Ambrosia employs a configurable error threshold to ensure that the accuracy of end applications is unaffected by the data transfer reduction. Experimental evaluations using LoRa and BLE on a real livestock monitoring deployment demonstrate a 60% reduction in data transmission and a 2× increase in battery lifetime.
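
The idea of forecast-based data reduction with an error threshold can be sketched as below. This is an illustration under stated assumptions, not Ambrosia's protocol: the forecaster is a plain moving average over the last `window` estimates, and the function name `reduce_stream` and its parameters are hypothetical. Sensor and sink are assumed to run the same forecaster, so a reading is transmitted only when the shared forecast would miss it by more than `threshold`.

```python
from collections import deque


def reduce_stream(readings, window=3, threshold=0.5):
    """Return (sent, reconstructed): the transmitted (index, value) pairs and
    the series the sink rebuilds from transmissions plus its own forecasts."""
    history = deque(maxlen=window)  # state kept identically on sensor and sink
    sent, reconstructed = [], []
    for t, value in enumerate(readings):
        forecast = sum(history) / len(history) if history else None
        if forecast is None or abs(value - forecast) > threshold:
            sent.append((t, value))  # forecast too far off: transmit the reading
            estimate = value
        else:
            estimate = forecast      # within threshold: the sink uses its forecast
        history.append(estimate)
        reconstructed.append(estimate)
    return sent, reconstructed
```

By construction every reconstructed value differs from the true reading by at most `threshold`, which is the property a configurable error bound is meant to provide.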

3.
BMC Bioinformatics ; 22(1): 237, 2021 May 10.
Article in English | MEDLINE | ID: mdl-33971820

ABSTRACT

BACKGROUND: MicroRNAs (miRNAs) function in post-transcriptional regulation of gene expression by binding to target messenger RNAs (mRNAs). Because of the key part that miRNAs play, understanding the correct regulatory role of miRNAs in diverse patho-physiological conditions is of great interest. Although it is known that miRNAs act combinatorially to regulate genes, precise identification of miRNA-gene interactions and their specific functional roles in regulatory comodules remains a challenge. We developed THEIA, an effective method for simultaneously predicting miRNA-gene interactions and regulatory comodules, which group functionally related miRNAs and genes via non-negative matrix factorization (NMF). RESULTS: We apply THEIA to RNA sequencing data from breast invasive carcinoma samples and demonstrate its effectiveness in discovering biologically significant regulatory comodules that are significantly enriched in spatial miRNA clusters, biological pathways, and various cancers. CONCLUSIONS: THEIA is a theoretically rigorous optimization algorithm that simultaneously predicts the strength and direction (i.e., up-regulation or down-regulation) of the effect of modules of miRNAs on a gene. We posit that if THEIA is capable of recovering known clusters of genes and miRNAs, then the clusters found by our method that were not previously identified in the literature are also likely to have biological significance. We believe that these novel regulatory comodules found by our method will be a springboard for further research into the specific functional roles of these new functional ensembles of miRNAs and genes, especially those related to diseases like breast cancer.


Subjects
Gene Regulatory Networks; MicroRNAs; Gene Expression Profiling; Gene Expression Regulation; Humans; MicroRNAs/genetics; RNA, Messenger; Sequence Analysis, RNA
4.
Sci Adv ; 6(21): eaaz5913, 2020 05.
Article in English | MEDLINE | ID: mdl-32494742

ABSTRACT

Despite great progress in biomaterial design strategies for replacing damaged articular cartilage, prevention of stem cell-derived chondrocyte hypertrophy and resulting inferior tissue formation is still a critical challenge. Here, by using engineered biomaterials and a high-throughput system for screening of combinatorial cues in cartilage microenvironments, we demonstrate that biomaterial cross-linking density that regulates matrix degradation and stiffness, together with defined presentation of growth factors, mechanical stimulation, and arginine-glycine-aspartic acid (RGD) peptides, can guide human mesenchymal stem cell (hMSC) differentiation into articular or hypertrophic cartilage phenotypes. Faster-degrading, soft matrices promoted articular cartilage tissue formation of hMSCs by inducing their proliferation and maturation, while slower-degrading, stiff matrices promoted cells to differentiate into hypertrophic chondrocytes through Yes-associated protein (YAP)-dependent mechanotransduction. In vitro and in vivo chondrogenesis studies also suggest that down-regulation of the Wingless and INT-1 (WNT) signaling pathway is required for better quality articular cartilage-like tissue production.


Subjects
Cartilage, Articular; Mesenchymal Stem Cells; Biocompatible Materials/metabolism; Cartilage, Articular/metabolism; Cell Differentiation; Mechanotransduction, Cellular/physiology; Mesenchymal Stem Cells/metabolism; Phenotype; Stem Cells; Tissue Engineering/methods
5.
Sci Rep ; 10(1): 2390, 2020 Feb 06.
Article in English | MEDLINE | ID: mdl-32024907

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

6.
Sci Rep ; 9(1): 16157, 2019 11 06.
Article in English | MEDLINE | ID: mdl-31695060

ABSTRACT

The performance of most error-correction (EC) algorithms that operate on genomics reads is dependent on the proper choice of its configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction and consequently improve genome assembly. We perform this in an adaptive manner, adapted to different datasets and to EC tools, due to the observation that different configuration parameters are optimal for different datasets, i.e., from different platforms and species, and vary with the EC algorithm being applied. We use language modeling techniques from the Natural Language Processing (NLP) domain in our algorithmic suite, Athena, to automatically tune the performance-sensitive configuration parameters. Through the use of N-Gram and Recurrent Neural Network (RNN) language modeling, we validate the intuition that the EC performance can be computed quantitatively and efficiently using the "perplexity" metric, repurposed from NLP. After training the language model, we show that the perplexity metric calculated from a sample of the test (or production) data has a strong negative correlation with the quality of error correction of erroneous NGS reads. Therefore, we use the perplexity metric to guide a hill climbing-based search, converging toward the best configuration parameter value. Our approach is suitable for both de novo and comparative sequencing (resequencing), eliminating the need for a reference genome to serve as the ground truth. We find that Athena can automatically find the optimal value of k with a very high accuracy for 7 real datasets and using 3 different k-mer based EC algorithms, Lighter, Blue, and Racer. 
The inverse relation between the perplexity metric and alignment rate exists under all our tested conditions: for real and synthetic datasets, for all kinds of sequencing errors (insertion, deletion, and substitution), and for high and low error rates. The absolute value of that correlation is at least 73%. In our experiments, the best value of k found by Athena achieves an alignment rate within 0.53% of the oracle best value of k found through brute force searching (i.e., scanning through the entire range of k values). Athena's selected value of k lies within the top-3 best k values using N-Gram models and the top-5 best k values using RNN models. With best parameter selection by Athena, the assembly quality (NG50) is improved by a geometric mean of 4.72X across the 7 real datasets.
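
A perplexity-guided hill-climbing search over k, as described in this abstract, might look like the sketch below. It is an assumption-laden illustration rather than Athena's code: `score` is a hypothetical callable that runs an EC tool with the given k and returns the perplexity of the corrected reads, and `step`, `k_min`, and `k_max` are example parameters. Because lower perplexity tracks better correction, the search moves toward lower scores.

```python
def hill_climb_k(score, k_start, k_min, k_max, step=2):
    """Move k to the better neighbor until no neighbor lowers the score.

    Returns (k, score(k)) for the local optimum that was reached.
    """
    k = k_start
    best = score(k)
    while True:
        # Candidate neighbors, kept inside the allowed range of k.
        neighbors = [c for c in (k - step, k + step) if k_min <= c <= k_max]
        if not neighbors:
            return k, best
        candidate = min(neighbors, key=score)
        candidate_score = score(candidate)
        if candidate_score >= best:
            return k, best  # no neighbor improves: stop here
        k, best = candidate, candidate_score
```

Compared with brute-force scanning of every k, the search evaluates only the values along its path, at the cost of possibly stopping at a local optimum.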


Subjects
Algorithms; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Natural Language Processing; Oligonucleotides/genetics; Automation; Base Sequence; Datasets as Topic; Neural Networks, Computer; Sequence Alignment
7.
Sci Rep ; 9(1): 14882, 2019 10 16.
Article in English | MEDLINE | ID: mdl-31619717

ABSTRACT

Remarkable advancements in high-throughput gene sequencing technologies have led to an exponential growth in the number of sequenced genomes. However, the unavailability of highly parallel and scalable de novo assembly algorithms has hindered biologists attempting to swiftly assemble high-quality complex genomes. Popular de Bruijn graph assemblers, such as IDBA-UD, generate high-quality assemblies by iterating over a set of k-values used in the construction of de Bruijn graphs (DBG). However, this process of sequentially iterating from small to large k-values slows down the process of assembly. In this paper, we propose ScalaDBG, which metamorphoses this sequential process, building DBGs for each distinct k-value in parallel. We develop an innovative mechanism to "patch" a higher k-valued graph with contigs generated from a lower k-valued graph. Moreover, ScalaDBG leverages multi-level parallelism, by both scaling up on all cores of a node, and scaling out to multiple nodes simultaneously. We demonstrate that ScalaDBG completes assembling the genome faster than IDBA-UD, but with similar accuracy on a variety of datasets (6.8X faster for one of the most complex genomes in our dataset).
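
The observation that the graph for each k-value can be built independently can be sketched as follows. This is a toy illustration, not ScalaDBG: the function names `build_dbg` and `build_all` are hypothetical, the "patch" step between graphs is omitted, and a thread pool is used only for portability (CPU-bound Python threads do not run in parallel, so a process pool or a multi-node scheduler would be needed for real speedups).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_dbg(reads, k):
    """De Bruijn graph: nodes are (k-1)-mers, with one edge per distinct k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def build_all(reads, k_values):
    """Build one graph per k concurrently; no graph depends on another."""
    k_values = list(k_values)
    with ThreadPoolExecutor() as pool:
        graphs = pool.map(lambda k: build_dbg(reads, k), k_values)
        return dict(zip(k_values, graphs))
```

A sequential assembler would instead loop over `k_values` from small to large, feeding each result into the next iteration.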


Subjects
Algorithms; Contig Mapping/methods; Genome; Sequence Analysis, DNA/statistics & numerical data; Software; Base Sequence; Benchmarking; Datasets as Topic; Escherichia coli/genetics; High-Throughput Nucleotide Sequencing; Humans; Staphylococcus aureus/genetics
8.
BMC Bioinformatics ; 20(1): 488, 2019 Oct 07.
Article in English | MEDLINE | ID: mdl-31590652

ABSTRACT

BACKGROUND: The data deluge can leverage sophisticated ML techniques for functionally annotating the regulatory non-coding genome. The challenge lies in selecting the appropriate classifier for the specific functional annotation problem, within the bounds of the hardware constraints and the model's complexity. In our system AIKYATAN, we annotate distal epigenomic regulatory sites, e.g., enhancers. Specifically, we develop a binary classifier that classifies genome sequences as distal regulatory regions or not, given their histone modifications' combinatorial signatures. This problem is challenging because the regulatory regions are distal to the genes, with diverse signatures across classes (e.g., enhancers and insulators) and even within each class (e.g., different enhancer sub-classes). RESULTS: We develop a suite of ML models, under the banner AIKYATAN, including SVM models, random forest variants, and deep learning architectures, for distal regulatory element (DRE) detection. We demonstrate, with strong empirical evidence, that deep learning approaches have a computational advantage. In addition, convolutional neural networks (CNN) provide the best-in-class accuracy, superior to the vanilla variant. With the human embryonic cell line H1, CNN achieves an accuracy of 97.9% and an order of magnitude lower runtime than the kernel SVM. Running on a GPU, the training time is sped up 21x and 30x (over CPU) for DNN and CNN, respectively. Finally, our CNN model enjoys superior prediction performance vis-à-vis the competition. Specifically, AIKYATAN-CNN achieved 40% higher validation rate versus CSIANN and the same accuracy as RFECS. CONCLUSIONS: Our exhaustive experiments using an array of ML tools validate the need for a model that is not only expressive but can scale with increasing data volumes and diversity. In addition, a subset of these datasets have image-like properties and benefit from spatial pooling of features.
Our AIKYATAN suite leverages diverse epigenomic datasets that can then be modeled using CNNs with optimized activation and pooling functions. The goal is to capture the salient features of the integrated epigenomic datasets for deciphering the distal (non-coding) regulatory elements, which have been found to be associated with functional variants. Our source code will be made publicly available at: https://bitbucket.org/cellsandmachines/aikyatan.


Subjects
Chromosome Mapping/methods; Deep Learning; Epigenomics/methods; Regulatory Sequences, Nucleic Acid; Software; Cell Line; Humans
9.
Brief Bioinform ; 20(1): 235-244, 2019 01 18.
Article in English | MEDLINE | ID: mdl-28968781

ABSTRACT

Federation is a popular concept in building distributed cyberinfrastructures, whereby computational resources are provided by multiple organizations through a unified portal, decreasing the complexity of moving data back and forth among multiple organizations. Federation has been used in bioinformatics only to a limited extent, namely, federation of datastores, e.g. SBGrid Consortium for structural biology and Gene Expression Omnibus (GEO) for functional genomics. Here, we posit that it is important to federate both computational resources (CPU, GPU, FPGA, etc.) and datastores to support popular bioinformatics portals, with fast-increasing data volumes and increasing processing requirements. A prime example, and one that we discuss here, is in genomics and metagenomics. It is critical that the processing of the data be done without having to transport the data across large network distances. We exemplify our design and development through our experience with metagenomics-RAST (MG-RAST), the most popular metagenomics analysis pipeline. Currently, it is hosted completely at Argonne National Laboratory. However, through a recently started collaborative National Institutes of Health project, we are taking steps toward federating this infrastructure. Being a widely used resource, we have to move toward federation without disrupting 50 K annual users. In this article, we describe the computational tools that will be useful for federating a bioinformatics infrastructure and the open research challenges that we see in federating such infrastructures. It is hoped that our manuscript can serve to spur greater federation of bioinformatics infrastructures by showing the steps involved, and thus, allow them to scale to support larger user bases.


Subjects
Genomics/statistics & numerical data; Information Dissemination/methods; Big Data; Computational Biology/methods; Confidentiality; Databases, Genetic/statistics & numerical data; Genetic Privacy; Humans; Metagenomics/statistics & numerical data; Software; United States
10.
Brief Bioinform ; 20(4): 1151-1159, 2019 07 19.
Article in English | MEDLINE | ID: mdl-29028869

ABSTRACT

As technologies change, MG-RAST is adapting. Newly available software is being included to improve accuracy and performance. As a computational service constantly running large volume scientific workflows, MG-RAST is the right location to perform benchmarking and implement algorithmic or platform improvements, in many cases involving trade-offs between specificity, sensitivity and run-time cost. The work in [Glass EM, Dribinsky Y, Yilmaz P, et al. ISME J 2014;8:1-3] is an example; we use existing well-studied data sets as gold standards representing different environments and different technologies to evaluate any changes to the pipeline. Currently, we use well-understood data sets in MG-RAST as a platform for benchmarking. The use of artificial data sets for pipeline performance optimization has not added value, as these data sets do not present the same challenges as real-world data sets. In addition, the MG-RAST team welcomes suggestions for improvements of the workflow. We are currently working on versions 4.02 and 4.1, both of which contain significant input from the community and our partners that will enable double barcoding, stronger inferences supported by longer-read technologies, and will increase throughput while maintaining sensitivity by using Diamond and SortMeRNA. On the technical platform side, the MG-RAST team intends to support the Common Workflow Language as a standard to specify bioinformatics workflows, both to facilitate development and efficient high-performance implementation of the community's data analysis tasks.


Subjects
High-Throughput Nucleotide Sequencing/methods; Metagenome; Metagenomics/methods; Software; Algorithms; Budgets; Computational Biology/methods; High-Throughput Nucleotide Sequencing/economics; High-Throughput Nucleotide Sequencing/statistics & numerical data; Internet; Metagenomics/economics; Metagenomics/statistics & numerical data; Sequence Analysis, DNA/economics; Sequence Analysis, DNA/methods; Sequence Analysis, DNA/statistics & numerical data; User-Computer Interface; Workflow
11.
IEEE/ACM Trans Comput Biol Bioinform ; 15(4): 1037-1051, 2018.
Article in English | MEDLINE | ID: mdl-29993641

ABSTRACT

BACKGROUND: MicroRNAs (miRNAs) are approximately 22-nucleotide long regulatory RNA that mediate RNA interference by binding to cognate mRNA target regions. Here, we present a distributed kernel SVM-based binary classification scheme to predict miRNA targets. It captures the spatial profile of miRNA-mRNA interactions via smooth B-spline curves. This is accomplished separately for various input features, such as thermodynamic and sequence-based features. Further, we use a principled approach to uniformly model both canonical and non-canonical seed matches, using a novel seed enrichment metric. Finally, we verify our miRNA-mRNA pairings using an Elastic Net-based regression model on TCGA expression data for four cancer types to estimate the miRNAs that together regulate any given mRNA. RESULTS: We present a suite of algorithms for miRNA target prediction, under the banner Avishkar, with superior prediction performance over the competition. Specifically, our final kernel SVM model, with an Apache Spark backend, achieves an average true positive rate (TPR) of more than 75 percent, when keeping the false positive rate of 20 percent, for non-canonical human miRNA target sites. This is an improvement of over 150 percent in the TPR for non-canonical sites, over the best-in-class algorithm. We are able to achieve such superior performance by representing the thermodynamic and sequence profiles of miRNA-mRNA interaction as curves, devising a novel seed enrichment metric, and learning an ensemble of miRNA family-specific kernel SVM classifiers. We provide an easy-to-use system for large-scale interactive analysis and prediction of miRNA targets. All operations in our system, namely candidate set generation, feature generation and transformation, training, prediction, and computing performance metrics are fully distributed and are scalable. 
CONCLUSIONS: We have developed an efficient SVM-based model for miRNA target prediction using recent CLIP-seq data, demonstrating superior performance, evaluated using ROC curves for different species (human or mouse), or different target types (canonical or non-canonical). We analyzed the agreement between the target pairings using CLIP-seq data and using expression data from four cancer types. To the best of our knowledge, we provide the first distributed framework for miRNA target prediction based on Apache Hadoop and Spark. AVAILABILITY: All source code and sample data are publicly available at https://bitbucket.org/cellsandmachines/avishkar. Our scalable implementation of kernel SVM using Apache Spark, which can be used to solve large-scale non-linear binary classification problems, is available at https://bitbucket.org/cellsandmachines/kernelsvmspark.
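
The headline figure in this abstract is a true positive rate at a fixed false positive rate. A minimal sketch of how such an operating point can be read off classifier scores is given below; the function name `tpr_at_fpr` and the toy inputs are assumptions for illustration and are not taken from the Avishkar code base.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.2):
    """Highest TPR reachable by a score threshold whose FPR stays <= max_fpr.

    scores: classifier scores (higher means more likely a target site).
    labels: 1 for true targets, 0 for non-targets.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    best = 0.0
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / neg <= max_fpr:
            best = max(best, tp / pos)
    return best
```

Sweeping `max_fpr` from 0 to 1 with the same routine traces out the ROC curve that the abstract uses for evaluation.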


Subjects
Computational Biology/methods; Gene Expression Profiling/methods; MicroRNAs/genetics; Algorithms; Databases, Genetic; Humans; MicroRNAs/analysis; MicroRNAs/metabolism; ROC Curve; Reproducibility of Results; Sequence Alignment/methods; Sequence Analysis, RNA/methods; Support Vector Machine
12.
Theranostics ; 8(1): 277-291, 2018.
Article in English | MEDLINE | ID: mdl-29290807

ABSTRACT

MicroRNAs (miRNAs) are short non-coding RNAs that regulate expression of target messenger RNAs (mRNAs) post-transcriptionally. Understanding the precise regulatory role of miRNAs is of great interest since miRNAs have been shown to play an important role in development, diseases, and other biological processes. Early work on miRNA target prediction has focused on static sequence-driven miRNA-mRNA complementarity. However, recent research also utilizes expression-level data to study context-dependent regulation effects in a more dynamic, physiologically-relevant setting. Methods: We propose a novel artificial neural network (ANN) based method, named Tiresias, to predict such targets in a context-dependent manner by combining sequence and expression data. In order to predict the interacting pairs among miRNAs and mRNAs and their regulatory weights, we develop a two-stage ANN and present how to train it appropriately. Tiresias is designed to study various regulation models, ranging from a simple linear model to a complex non-linear model. Tiresias has a single hyper-parameter to control the sparsity of miRNA-mRNA interactions, which we optimize using Bayesian optimization. Results: Tiresias performs better than existing computational methods such as GenMiR++, Elastic Net, and PIMiM, achieving an F1 score of >0.8 for a certain level of regulation strength. For the TCGA breast invasive carcinoma dataset, Tiresias results in the rate of up to 82% in detecting the experimentally-validated interactions between miRNAs and mRNAs, even if we assume that true regulations may result in a low level of regulation strength. Conclusion: Tiresias is a two-stage ANN, computational method that deciphers context-dependent microRNA regulatory interactions. Experiment results demonstrate that Tiresias outperforms existing solutions and can achieve a high F1 score. Source code of Tiresias is available at https://bitbucket.org/cellsandmachines/.


Subjects
Computational Biology/methods; MicroRNAs/metabolism; Neural Networks, Computer; Animals; Bayes Theorem; Gene Expression Regulation, Neoplastic; Humans; Machine Learning; RNA, Messenger/metabolism
13.
Biomaterials ; 155: 13-24, 2018 Feb.
Article in English | MEDLINE | ID: mdl-29156422

ABSTRACT

The cells of the vascular system are highly sensitive to biophysical cues from their local cellular microenvironment. To engineer improved materials for vascular devices and delivery of cell therapies, a key challenge is to understand the mechanisms that cells use to sense biophysical cues from their environment. Syndecans are heparan sulfate proteoglycans (HSPGs) that consist of a protein core modified with heparan sulfate glycosaminoglycan chains. Due to their presence on the cell surface and their interaction with cytoskeletal and focal adhesion associated molecules, cell surface proteoglycans are well poised to serve as mechanosensors of the cellular microenvironment. Nanotopological cues have become recognized as major regulators of cell growth, migration and phenotype. We hypothesized that syndecan-1 could serve as a mechanosensor for nanotopological cues and can mediate the responsiveness of vascular smooth muscle cells to nanoengineered materials. We created engineered substrates made of polyurethane acrylate with nanogrooves using ultraviolet-assisted capillary force lithography. We cultured vascular smooth muscle cells with knockout of syndecan-1 on engineered substrates with varying compliance and nanotopology. We found that knockout of syndecan-1 reduced alignment of vascular smooth muscle cells to the nanogrooves under inflammatory treatments. In addition, we found that loss of syndecan-1 increased nuclear localization of Yap/Taz and phospho-Smad2/3 in response to nanogrooves. Syndecan-1 knockout vascular smooth muscle cells also had elevated levels of Rho-associated protein kinase-1 (Rock1), leading to increased cell stiffness and an enhanced contractile state in the cells. Together, our findings support that syndecan-1 knockout leads to alterations in mechanosensing of nanotopographical cues through alterations in Rho-associated signaling pathways, cell mechanics and mediators of the Hippo and TGF-β signaling pathways.


Subjects
Syndecan-1/chemistry; Biosensing Techniques; Heparan Sulfate Proteoglycans/chemistry; Muscle, Smooth, Vascular/metabolism; Signal Transduction; Syndecan-1/metabolism; Transforming Growth Factor beta/chemistry; Transforming Growth Factor beta/metabolism
14.
Theranostics ; 7(18): 4445-4469, 2017.
Article in English | MEDLINE | ID: mdl-29158838

ABSTRACT

The emergence of targeted and efficient genome editing technologies, such as repurposed bacterial programmable nucleases (e.g., CRISPR-Cas systems), has abetted the development of cell engineering approaches. Lessons learned from the development of RNA-interference (RNA-i) therapies can spur the translation of genome editing, such as those enabling the translation of human pluripotent stem cell engineering. In this review, we discuss the opportunities and the challenges of repurposing bacterial nucleases for genome editing, while appreciating their roles, primarily at the epigenomic granularity. First, we discuss the evolution of high-precision, genome editing technologies, highlighting CRISPR-Cas9. They exist in the form of programmable nucleases, engineered with sequence-specific localizing domains, and with the ability to revolutionize human stem cell technologies through precision targeting with greater on-target activities. Next, we highlight the major challenges that need to be met prior to bench-to-bedside translation, often learning from the path-to-clinic of complementary technologies, such as RNA-i. Finally, we suggest potential bioinformatics developments and CRISPR delivery vehicles that can be deployed to circumvent some of the challenges confronting genome editing technologies en route to the clinic.


Subjects
CRISPR-Cas Systems/genetics; Clustered Regularly Interspaced Short Palindromic Repeats/genetics; Pluripotent Stem Cells/physiology; Animals; Bacteria/genetics; Gene Editing/methods; Genetic Engineering/methods; Humans
15.
Sci Rep ; 6: 38433, 2016 12 08.
Article in English | MEDLINE | ID: mdl-27929098

ABSTRACT

We present EP-DNN, a protocol for predicting enhancers based on chromatin features, in different cell types. Specifically, we use a deep neural network (DNN)-based architecture to extract enhancer signatures in a representative human embryonic stem cell type (H1) and a differentiated lung cell type (IMR90). We train EP-DNN using p300 binding sites, as enhancers, and TSS and random non-DHS sites, as non-enhancers. We perform same-cell and cross-cell predictions to quantify the validation rate and compare against two state-of-the-art methods, DEEP-ENCODE and RFECS. We find that EP-DNN has superior accuracy with a validation rate of 91.6%, relative to 85.3% for DEEP-ENCODE and 85.5% for RFECS, for a given number of enhancer predictions and also scales better for a larger number of enhancer predictions. Moreover, our H1 → IMR90 predictions turn out to be more accurate than IMR90 → IMR90, potentially because H1 exhibits a richer signature set and our EP-DNN model is expressive enough to extract these subtleties. Our work shows how to leverage the full expressivity of deep learning models, using multiple hidden layers, while avoiding overfitting on the training data. We also lay the foundation for exploration of cross-cell enhancer predictions, potentially reducing the need for expensive experimentation.


Subjects
Chromatin/genetics; Computational Biology; Enhancer Elements, Genetic/genetics; Neural Networks, Computer; Algorithms; Human Embryonic Stem Cells/cytology; Humans; Lung/cytology
16.
BMC Syst Biol ; 10 Suppl 2: 54, 2016 08 01.
Article in English | MEDLINE | ID: mdl-27490187

ABSTRACT

BACKGROUND: Gene expression is mediated by specialized cis-regulatory modules (CRMs), the most prominent of which are called enhancers. Early experiments indicated that enhancers located far from the gene promoters are often responsible for mediating gene transcription. Knowing their properties, regulatory activity, and genomic targets is crucial to the functional understanding of cellular events, ranging from cellular homeostasis to differentiation. Recent genome-wide investigation of epigenomic marks has indicated that enhancer elements could be enriched for certain epigenomic marks, such as, combinatorial patterns of histone modifications. METHODS: Our efforts in this paper are motivated by these recent advances in epigenomic profiling methods, which have uncovered enhancer-associated chromatin features in different cell types and organisms. Specifically, in this paper, we use recent state-of-the-art Deep Learning methods and develop a deep neural network (DNN)-based architecture, called EP-DNN, to predict the presence and types of enhancers in the human genome. It uses as features, the expression levels of the histone modifications at the peaks of the functional sites as well as in its adjacent regions. We apply EP-DNN to four different cell types: H1, IMR90, HepG2, and HeLa S3. We train EP-DNN using p300 binding sites as enhancers, and TSS and random non-DHS sites as non-enhancers. We perform EP-DNN predictions to quantify the validation rate for different levels of confidence in the predictions and also perform comparisons against two state-of-the-art computational models for enhancer predictions, DEEP-ENCODE and RFECS. RESULTS: We find that EP-DNN has superior accuracy and takes less time to make predictions. Next, we develop methods to make EP-DNN interpretable by computing the importance of each input feature in the classification task. 
This analysis indicates that the important histone modifications were distinct for different cell types, with some overlaps, e.g., H3K27ac was important in cell type H1 but less so in HeLa S3, while H3K4me1 was relatively important in all four cell types. We finally use the feature importance analysis to reduce the number of input features needed to train the DNN, thus reducing training time, which is often the computational bottleneck in the use of a DNN. CONCLUSIONS: In this paper, we developed EP-DNN, which has high accuracy of prediction, with validation rates above 90% for the operational region of enhancer prediction for all four cell lines that we studied, outperforming DEEP-ENCODE and RFECS. Then, we developed a method to analyze a trained DNN and determine which histone modifications are important, and within that, which features proximal or distal to the enhancer site are important.
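
The abstract above does not say how the importance of each input feature is computed. As an illustration only, the following Python sketch uses permutation importance, a generic model-agnostic technique that is not claimed to be the EP-DNN method: it shuffles one feature column at a time and records the resulting drop in accuracy. The classifier `toy_predict`, the two-feature data set, and all function names are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean accuracy drop observed when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only; every other feature keeps its original value.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in for a trained classifier: it looks only at feature 0.
def toy_predict(row):
    return 1 if row[0] > 0.5 else 0


# Toy data (assumption): feature 0 is informative, feature 1 is pure noise.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [toy_predict(row) for row in rows]

importances = permutation_importance(toy_predict, rows, labels)
print(importances)
```

In this toy setup, shuffling the noise feature leaves accuracy unchanged, so its importance is zero, while shuffling the informative feature lowers accuracy. Features with near-zero importance are the natural candidates for removal when reducing the input set.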


Subjects
Computational Biology/methods , Enhancer Elements, Genetic/genetics , Neural Networks, Computer , Cell Line, Tumor , Gene Expression Regulation , Histones/metabolism , Humans
18.
Nucleic Acids Res ; 44(D1): D590-4, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26656948

ABSTRACT

MG-RAST (http://metagenomics.anl.gov) is an open-submission data portal for processing, analyzing, sharing and disseminating metagenomic datasets. The system currently hosts over 200,000 datasets and is continuously updated. The volume of submissions has increased 4-fold over the past 24 months, now averaging 4 terabasepairs per month. In addition to several new features, we report changes to the analysis workflow and the technologies used to scale the pipeline up to the required throughput levels. To show possible uses for the data from MG-RAST, we present several examples integrating data and analyses from MG-RAST into popular third-party analysis tools or sequence alignment tools.


Subjects
Databases, Nucleic Acid , Metagenomics , Internet , Sequence Alignment
19.
BMC Genomics ; 16: 999, 2015 Nov 25.
Article in English | MEDLINE | ID: mdl-26608597

ABSTRACT

BACKGROUND: MicroRNAs (miRNAs) are small regulatory RNAs that mediate RNA interference by binding to various mRNA target regions. There have been several computational methods for the identification of target mRNAs for miRNAs. However, these have considered all contributory features as scalar representations, primarily as thermodynamic or sequence-based features. Further, a majority of these methods solely target canonical sites, which are sites with "seed" complementarity. Here, we present a machine-learning classification scheme, titled Avishkar, which captures the spatial profile of miRNA-mRNA interactions via smooth B-spline curves, separately for various input features, such as thermodynamic and sequence features. Further, we use a principled approach to uniformly model canonical and non-canonical seed matches, using a novel seed enrichment metric. RESULTS: We demonstrate that a large number of seed-match patterns have high enrichment values, conserved across species, and that the majority of miRNA binding sites involve non-canonical matches, corroborating recent findings. Using spatial curves and popular categorical features, such as target site length and location, we train a linear SVM model, utilizing experimental CLIP-seq data. Our model significantly outperforms all established methods, for both canonical and non-canonical sites. We achieve this while using a much larger candidate miRNA-mRNA interaction set than prior work. CONCLUSIONS: We have developed an efficient SVM-based model for miRNA target prediction using recent CLIP-seq data, demonstrating superior performance, evaluated using ROC curves, specifically about 20% better than the state-of-the-art, for different species (human or mouse), or different target types (canonical or non-canonical). To the best of our knowledge, we provide the first distributed framework for microRNA target prediction based on Apache Hadoop and Spark. 
AVAILABILITY: All source code and data is publicly available at https://bitbucket.org/cellsandmachines/avishkar.
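
The abstract describes capturing the spatial profile of miRNA-mRNA interactions with smooth B-spline curves but does not give the fitting procedure. The Python sketch below is one possible reading under stated assumptions: per-position feature values (hypothetical numbers) are treated as control points of a uniform cubic B-spline, which is then sampled at a fixed number of points so that target sites of different lengths yield vectors of equal length. The function name `cubic_bspline_profile` and its parameters are illustrative and are not taken from the Avishkar code base.

```python
def cubic_bspline_profile(values, n_samples=10):
    """Sample a uniform cubic B-spline whose control points are `values`.

    Returns `n_samples` evenly spaced points along the curve, giving a
    fixed-length, smoothed summary of a variable-length positional profile.
    """
    if len(values) < 4:
        raise ValueError("need at least 4 control points for a cubic B-spline")
    if n_samples < 2:
        raise ValueError("need at least 2 samples")
    n_segments = len(values) - 3
    samples = []
    for i in range(n_samples):
        # Map sample i onto the curve parameter u in [0, n_segments].
        u = n_segments * i / (n_samples - 1)
        seg = min(int(u), n_segments - 1)
        t = u - seg  # local parameter in [0, 1] within the segment
        p0, p1, p2, p3 = values[seg:seg + 4]
        # Uniform cubic B-spline basis functions; they sum to 6 for any t.
        b0 = (1 - t) ** 3
        b1 = 3 * t**3 - 6 * t**2 + 4
        b2 = -3 * t**3 + 3 * t**2 + 3 * t + 1
        b3 = t**3
        samples.append((b0 * p0 + b1 * p1 + b2 * p2 + b3 * p3) / 6)
    return samples


# Hypothetical per-position feature values along a candidate target site.
flat = cubic_bspline_profile([2.0] * 6, n_samples=5)
spike = cubic_bspline_profile([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0], n_samples=9)
print(flat)
print(spike)
```

Because the basis functions are non-negative and sum to one after normalization, a constant profile stays constant and an isolated spike is spread over its neighbors rather than reproduced exactly. The resulting fixed-length vectors could then be concatenated with categorical features and passed to a linear classifier of the reader's choice.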


Subjects
Binding Sites , Computational Biology/methods , MicroRNAs/chemistry , MicroRNAs/genetics , RNA Interference , RNA, Messenger/chemistry , RNA, Messenger/genetics , Thermodynamics , 3' Untranslated Regions , 5' Untranslated Regions , Animals , Humans , Mice , ROC Curve , Reproducibility of Results , Sequence Analysis, RNA , Support Vector Machine
20.
Tissue Eng Part A ; 20(15-16): 2115-26, 2014 Aug.
Article in English | MEDLINE | ID: mdl-24694244

ABSTRACT

Vascular smooth muscle cells (vSMCs) retain the ability to undergo modulation in their phenotypic continuum, ranging from a mature contractile state to a proliferative, secretory state. vSMC differentiation is modulated by a complex array of microenvironmental cues, which include the biochemical milieu of the cells and the architecture and stiffness of the extracellular matrix. In this study, we demonstrate that by using UV-assisted capillary force lithography (CFL) to engineer a polyurethane substratum of defined nanotopography and stiffness, we can facilitate the differentiation of cultured vSMCs, reduce their inflammatory signature, and potentially promote the optimal functioning of the vSMC contractile and cytoskeletal machinery. Specifically, we found that the combination of medial tissue-like stiffness (11 MPa) and anisotropic nanotopography (ridge width_groove width_ridge height of 800_800_600 nm) resulted in significant upregulation of calponin, desmin, and smoothelin, in addition to the downregulation of intercellular adhesion molecule-1, tissue factor, interleukin-6, and monocyte chemoattractant protein-1. Further, our results allude to the mechanistic role of the RhoA/ROCK pathway and caveolin-1 in altered cellular mechanotransduction pathways via differential matrix nanotopography and stiffness. Notably, the nanopatterning of the stiffer substrata (1.1 GPa) resulted in the significant upregulation of RhoA, ROCK1, and ROCK2. This indicates that nanopatterning an 800_800_600 nm pattern on a stiff substratum may trigger the mechanical plasticity of vSMCs resulting in a hypercontractile vSMC phenotype, as observed in diabetes or hypertension. Given that matrix stiffness is an independent risk factor for cardiovascular disease and that CFL can create different matrix nanotopographic patterns with high pattern fidelity, we are poised to create a combinatorial library of arterial test beds, whether they are healthy, diseased, injured, or aged. 
Such high-throughput testing environments will pave the way for the evolution of the next generation of vascular scaffolds that can effectively crosstalk with the scaffold microenvironment and result in improved clinical outcomes.


Subjects
Extracellular Matrix/chemistry , Muscle, Smooth, Vascular/cytology , Myocytes, Smooth Muscle/physiology , Nanotechnology/methods , Actins/metabolism , Anisotropy , Biomechanical Phenomena/drug effects , Cell Differentiation/drug effects , Cell Polarity/drug effects , Cell Shape/drug effects , Cells, Cultured , Cytoskeleton/drug effects , Cytoskeleton/metabolism , Elastic Modulus/drug effects , Extracellular Matrix/drug effects , Humans , Myocytes, Smooth Muscle/cytology , Myocytes, Smooth Muscle/drug effects , Phenotype , Polyurethanes/pharmacology , Real-Time Polymerase Chain Reaction , Stress Fibers/drug effects , Stress Fibers/metabolism , Umbilical Arteries/cytology , rhoA GTP-Binding Protein/metabolism