1.
PLoS Comput Biol ; 20(5): e1012061, 2024 May.
Article in English | MEDLINE | ID: mdl-38701099

ABSTRACT

Optimizing proteins for particular traits holds great promise for industrial and pharmaceutical purposes. Machine learning is increasingly applied in this field to predict properties of proteins, thereby guiding the experimental optimization process. A natural question is: How much progress are we making with such predictions, and how important is the choice of regressor and representation? In this paper, we demonstrate that different assessment criteria for regressor performance can lead to dramatically different conclusions, depending on the choice of metric and on how one defines generalization. We highlight the fundamental issues of sample bias in typical regression scenarios and how this can lead to misleading conclusions about regressor performance. Finally, we make the case for the importance of calibrated uncertainty in this domain.


Subjects
Computational Biology, Machine Learning, Protein Engineering, Protein Engineering/methods, Regression Analysis, Computational Biology/methods, Proteins/chemistry, Algorithms
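
The point in the abstract above, that the choice of metric can change which regressor looks better, can be illustrated with a minimal sketch. The data and the two "models" below are invented for illustration and are not taken from the paper: one predictor stays close to the mean (low squared error, poor ranking), the other ranks every protein correctly but is offset by a constant (high squared error, perfect rank correlation).

```python
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def ranks(values):
    """1-based ranks, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical measurements and predictions (illustrative only).
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
model_a = [3.1, 2.9, 3.0, 3.2, 2.8]       # hugs the mean, ranking scrambled
model_b = [11.0, 12.0, 13.0, 14.0, 15.0]  # constant offset, ranking perfect

for name, pred in [("A", model_a), ("B", model_b)]:
    print(name, "MSE:", mse(y_true, pred), "Spearman:", spearman(y_true, pred))
```

Under squared error model A looks far better; under rank correlation model B does. Which one is "better" depends on whether the downstream task needs accurate values or only a correct ordering of candidates.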
2.
Sci Rep ; 12(1): 17882, 2022 10 25.
Article in English | MEDLINE | ID: mdl-36284144

ABSTRACT

The mining of genomes from non-cultivated microorganisms using metagenomics is a powerful tool to discover novel proteins and other valuable biomolecules. However, function-based metagenome searches are often limited by the time-consuming expression of the active proteins in various heterologous host systems. We here report the initial characterization of a novel single-subunit bacteriophage RNA polymerase, EM1 RNAP, identified from a metagenome data set obtained from an elephant dung microbiome. EM1 RNAP and its promoter sequence are distantly related to T7 RNA polymerase. Using EM1 RNAP and a translation-competent Escherichia coli extract, we have developed an efficient medium-throughput pipeline and protocol allowing the expression of metagenome-derived genes; the production of proteins in the cell-free system is sufficient for the initial testing of the predicted activities. Here, we have successfully identified and verified 12 enzymes acting on bis(2-hydroxyethyl) terephthalate (BHET) in a completely clone-free approach and proposed an in vitro high-throughput metagenomic screening method.


Subjects
Metagenome, Viral Replicase Complex Proteins, Cell-Free System/metabolism, Viral RNA/metabolism, DNA-Directed RNA Polymerases/genetics, DNA-Directed RNA Polymerases/metabolism, Metagenomics/methods, Escherichia coli/genetics, Escherichia coli/metabolism
3.
Bioinformatics ; 38(4): 941-946, 2022 01 27.
Article in English | MEDLINE | ID: mdl-35088833

ABSTRACT

MOTIVATION: Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. RESULTS: In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model NetSolP is based on deep learning protein language models called transformers and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. As we find current methods are built on biased datasets, we curate existing datasets by using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences. AVAILABILITY AND IMPLEMENTATION: The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Escherichia coli, Language, Proteins, Software, Solubility
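
The idea of sequence-identity partitioning mentioned in the abstract above can be sketched as follows. This is not the NetSolP pipeline: the similarity function is a crude stand-in based on Python's `difflib` (real workflows typically use alignment-based identity from a dedicated clustering tool), and the sequences, threshold, and function names are hypothetical. The sketch only shows the principle that similar sequences are clustered first and whole clusters are then assigned to a partition, so near-duplicates never end up on both sides of a train/test split.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster(seqs, threshold):
    """Single-linkage clustering: link any pair at or above the threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def partition(seqs, n_folds, threshold):
    """Assign whole clusters to folds, largest first, to the smallest fold."""
    folds = [[] for _ in range(n_folds)]
    for group in sorted(cluster(seqs, threshold), key=len, reverse=True):
        min(folds, key=len).extend(group)
    return folds


# Hypothetical toy sequences: two near-duplicate pairs and one singleton.
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GSSHHHHHHS", "GSSHHHHHHT", "PLVWDERNCQ"]
folds = partition(seqs, n_folds=2, threshold=0.8)
print(folds)
```

Because clusters are never split, a model evaluated on one fold has not seen a close homolog of any test sequence during training, which is what makes the resulting performance estimate a test of extrapolation rather than memorization.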
4.
Comput Biol Chem ; 95: 107596, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34775287

ABSTRACT

A crucial process in the production of industrial enzymes is recombinant gene expression, which aims to induce enzyme overexpression of the genes in a host microbe. Current approaches for securing overexpression rely on molecular tools such as adjusting the recombinant expression vector, adjusting cultivation conditions, or performing codon optimizations. However, such strategies are time-consuming, and an alternative strategy would be to select genes for better compatibility with the recombinant host. Several methods for predicting soluble expression are available; however, they are all optimized for the expression host Escherichia coli and do not consider the possibility of an expressed protein not being soluble. We show that these tools are not suited for predicting expression potential in the industrially important host Bacillus subtilis. Instead, we build a B. subtilis-specific machine learning model for expressibility prediction. Given millions of unlabeled proteins and a small labeled dataset, we can successfully train such a predictive model. The unlabeled proteins provide a performance boost relative to using amino acid frequencies of the labeled proteins as input. On average, we obtain a modest performance of 0.64 area-under-the-curve (AUC) and 0.2 Matthews correlation coefficient (MCC). However, we find that this is sufficient for the prioritization of expression candidates for high-throughput studies. Moreover, the predicted class probabilities are correlated with expression levels. A number of features related to protein expression, including base frequencies and solubility, are captured by the model.


Subjects
Bacillus subtilis/genetics, Bacterial Proteins/genetics, Machine Learning, Gene Expression Regulation, Recombinant Proteins/genetics
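
The two metrics reported in the abstract above, AUC and MCC, can be computed from first principles as in the following minimal sketch. The labels and scores are invented for illustration and are unrelated to the paper's data; the 0.5 decision threshold is likewise an assumption. AUC is computed here as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counting as half.

```python
from math import sqrt


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mcc(labels, preds):
    """Matthews correlation coefficient from binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical labels (1 = expressed) and predicted class probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print("AUC:", roc_auc(labels, scores))
print("MCC:", mcc(labels, preds))
```

AUC depends only on how the scores rank the examples, whereas MCC depends on the chosen threshold, so the two numbers can diverge: a model can be useful for prioritizing candidates by score even when its thresholded predictions yield a modest MCC.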
5.
PLoS One ; 9(9): e106707, 2014.
Article in English | MEDLINE | ID: mdl-25208077

ABSTRACT

A phylogenetic and metagenomic study of elephant feces samples (derived from a three-week-old and a six-year-old Asian elephant) was conducted in order to describe the microbiota inhabiting this large land-living animal. The microbial diversity was examined via 16S rRNA gene analysis. We generated more than 44,000 GS-FLX+454 reads for each animal. For the baby elephant, 380 operational taxonomic units (OTUs) were identified at 97% sequence identity level; in the six-year-old animal, close to 3,000 OTUs were identified, suggesting high microbial diversity in the older animal. In both animals most OTUs belonged to Bacteroidetes and Firmicutes. Additionally, for the baby elephant a high number of Proteobacteria was detected. A metagenomic sequencing approach using Illumina technology resulted in the generation of 1.1 Gbp assembled DNA in contigs with a maximum size of 0.6 Mbp. A KEGG pathway analysis suggested high metabolic diversity regarding the use of polymers and aromatic and non-aromatic compounds. In line with the high phylogenetic diversity, a surprising and not previously described biodiversity of glycoside hydrolase (GH) genes was found. Enzymes of 84 GH families were detected. Polysaccharide utilization loci (PULs), which are found in Bacteroidetes, were highly abundant in the dataset; some of these comprised cellulase genes. Furthermore, the highest coverage for GH5 and GH9 family enzymes was detected for Bacteroidetes, suggesting that bacteria of this phylum are mainly responsible for the degradation of cellulose in the Asian elephant. Altogether, this study delivers insight into the biomass conversion by one of the largest plant-fed and land-living animals.


Subjects
Breast Feeding, Elephants/microbiology, Feces/microbiology, Glycoside Hydrolases/metabolism, Metagenomics, Microbiota, Plants, Animals, Biomass, Data Collection, Female, Glycoside Hydrolases/genetics, Male, Phylogeny
6.
Methods ; 50(4): S6-9, 2010 Apr.
Article in English | MEDLINE | ID: mdl-20215018

ABSTRACT

MicroRNAs are small regulatory RNAs that are currently emerging as new biomarkers for cancer and other diseases. In order for biomarkers to be useful in clinical settings, they should be accurately and reliably detected in clinical samples such as formalin-fixed paraffin-embedded (FFPE) sections and blood serum or plasma. These types of samples represent a challenge in terms of microRNA quantification. A newly developed method for microRNA qPCR using Locked Nucleic Acid (LNA)-enhanced primers enables accurate and reproducible quantification of microRNAs in scarce clinical samples. Here we show that LNA-based microRNA qPCR enables biomarker screening using very low amounts of total RNA from FFPE samples and the results are compared to microarray analysis data. We also present evidence that the addition of a small carrier RNA prior to total RNA extraction improves microRNA quantification in blood plasma and laser capture microdissected (LCM) sections of FFPE samples.


Subjects
MicroRNAs/analysis, Polymerase Chain Reaction/methods, Fixatives, Formaldehyde, Humans, Lasers, MicroRNAs/blood, MicroRNAs/genetics, Oligonucleotide Array Sequence Analysis/methods, Paraffin Embedding
7.
RNA ; 15(11): 2028-34, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19745027

ABSTRACT

Recently, next-generation sequencing has been introduced as a promising, new platform for assessing the copy number of transcripts, while the existing microarray technology is considered less reliable for absolute, quantitative expression measurements. Nonetheless, so far, results from the two technologies have only been compared based on biological data, leading to the conclusion that, although they are somewhat correlated, expression values differ significantly. Here, we use synthetic RNA samples, resembling human microRNA samples, to find that microarray expression measures actually correlate better with sample RNA content than expression measures obtained from sequencing data. In addition, microarrays appear highly sensitive and perform equivalently to next-generation sequencing in terms of reproducibility and relative ratio quantification.


Subjects
Gene Expression, MicroRNAs/analysis, Oligonucleotide Array Sequence Analysis/methods, RNA Sequence Analysis/methods, MicroRNAs/chemical synthesis, MicroRNAs/genetics, Reproducibility of Results
8.
Expert Opin Drug Discov ; 2(1): 19-35, 2007 Jan.
Article in English | MEDLINE | ID: mdl-23496035

ABSTRACT

Over time, functional immunology has accumulated vast amounts of quantitative and qualitative data relevant to the design and discovery of vaccines. Such data includes, but is not limited to, components of the host and pathogen genome (including antigens and virulence factors), T- and B-cell epitopes and other components of the antigen presentation pathway and allergens. In this review the authors discuss a range of databases that archive such data. Built on such information, increasingly sophisticated data mining techniques have developed that create predictive models of utilitarian value. With special reference to epitope data, the authors discuss the strengths and weaknesses of the available techniques and how they can help computer-aided vaccine design deliver added value for vaccinology.

9.
BMC Bioinformatics ; 7: 501, 2006 Nov 14.
Article in English | MEDLINE | ID: mdl-17105666

ABSTRACT

BACKGROUND: Modelling the interaction between potentially antigenic peptides and Major Histocompatibility Complex (MHC) molecules is a key step in identifying potential T-cell epitopes. For Class II MHC alleles, the binding groove is open at both ends, causing ambiguity in the positional alignment between the groove and peptide, as well as creating uncertainty as to what parts of the peptide interact with the MHC. Moreover, the antigenic peptides have variable lengths, making naive modelling methods difficult to apply. This paper introduces a kernel method that can handle variable length peptides effectively by quantifying similarities between peptide sequences and integrating these into the kernel. RESULTS: The kernel approach presented here shows increased prediction accuracy with a significantly higher number of true positives and negatives on multiple MHC class II alleles, when testing data sets from MHCPEP [1], MHCBN [2], and MHCBench [3]. Evaluation by cross validation, when segregating binders and non-binders, produced an average of 0.824 AROC for the MHCBench data sets (up from 0.756), and an average of 0.96 AROC for multiple alleles of the MHCPEP database. CONCLUSION: The method improves performance over existing state-of-the-art methods of MHC class II peptide binding predictions by using a custom, knowledge-based representation of peptides. Similarity scores, in contrast to a fixed-length, pocket-specific representation of amino acids, provide a flexible and powerful way of modelling MHC binding, and can easily be applied to other dynamic sequence problems.


Subjects
Computational Biology, Epitope Mapping, Histocompatibility Antigens Class II/metabolism, Peptides/metabolism, Binding Sites, Genetic Databases, HLA-A Antigens/chemistry, HLA-A Antigens/metabolism, HLA-DR Antigens/chemistry, HLA-DR Antigens/metabolism, HLA-DRB1 Chains, Histocompatibility Antigens Class II/chemistry, Humans, Peptides/chemistry, Protein Binding, Protein Conformation, ROC Curve, Reproducibility of Results, Sequence Alignment, Protein Sequence Analysis, Amino Acid Sequence Homology
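
To illustrate how a kernel can compare peptides of different lengths without forcing them into a fixed-length representation, here is a minimal sketch using a k-mer spectrum kernel. This is a generic, well-known string kernel swapped in for illustration; it is not the similarity-based kernel described in the abstract above, and the peptide strings and the choice of k = 3 are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(peptide, k):
    """Count every overlapping substring of length k in the peptide."""
    return Counter(peptide[i:i + k] for i in range(len(peptide) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the two k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(ca[m] * cb[m] for m in ca.keys() & cb.keys())


def normalized_kernel(a, b, k=3):
    """Cosine-normalized kernel in [0, 1]; 1.0 for identical k-mer profiles."""
    denom = sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


# Hypothetical peptides of different lengths (13 and 8 residues).
long_peptide = "AKFVAAWTLKAAA"
short_peptide = "FVAAWTLK"

print(normalized_kernel(long_peptide, short_peptide))
print(normalized_kernel(long_peptide, long_peptide))
```

Because the kernel is defined on shared substrings rather than aligned positions, it needs no decision about where the peptide sits in the binding groove. A kernel of this kind could be passed as a precomputed Gram matrix to any kernel-based classifier.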