1.
Sci Rep ; 12(1): 16047, 2022 09 26.
Article in English | MEDLINE | ID: mdl-36163232

ABSTRACT

Self-supervised language modeling is a rapidly developing approach for the analysis of protein sequence data. However, work in this area is heterogeneous and diverse, making comparison of models and methods difficult. Moreover, models are often evaluated only on one or two downstream tasks, making it unclear whether the models capture generally useful properties. We introduce the ProteinGLUE benchmark for the evaluation of protein representations: a set of seven per-amino-acid tasks for evaluating learned protein representations. We also offer reference code, and we provide two baseline models with hyperparameters specifically trained for these benchmarks. Pre-training was done on two tasks, masked symbol prediction and next sentence prediction. We show that pre-training yields higher performance on a variety of downstream tasks such as secondary structure and protein interaction interface prediction, compared to no pre-training. However, the larger base model does not outperform the smaller medium model. We expect the ProteinGLUE benchmark dataset introduced here, together with the two baseline pre-trained models and their performance evaluations, to be of great value to the field of protein sequence-based property prediction. Availability: code and datasets are available from https://github.com/ibivu/protein-glue.
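
Illustrative note: the masked symbol prediction task named in the abstract can be sketched as follows. This is a minimal sketch, not the ProteinGLUE reference code; the function name `mask_sequence`, the mask symbol `#`, and the 15% default masking rate (a common convention in BERT-style pre-training) are assumptions that do not come from the abstract.

```python
import random

MASK = "#"  # hypothetical mask symbol, chosen only for this sketch


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Hide a random subset of residues and return (masked_seq, targets).

    targets maps each masked position to the residue a model would be
    trained to predict at that position.
    """
    rng = rng or random.Random()
    tokens = list(seq)
    targets = {}
    for i, residue in enumerate(tokens):
        # Each position is masked independently with probability mask_rate.
        if rng.random() < mask_rate:
            targets[i] = residue
            tokens[i] = MASK
    return "".join(tokens), targets
```

For example, `mask_sequence("MKTAYIAKQR", rng=random.Random(0))` returns a string of the same length in which some residues are replaced by `#`, together with the hidden residues keyed by position.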


Subject(s)
Benchmarking , Proteins , Amino Acid Sequence , Amino Acids/chemistry , Natural Language Processing
2.
Bioinformatics ; 35(24): 5315-5317, 2019 12 15.
Article in English | MEDLINE | ID: mdl-31368486

ABSTRACT

SUMMARY: PRALINE 2 is a toolkit for custom multiple sequence alignment workflows. It can be used to incorporate sequence annotations, such as secondary structure or (DNA) motifs, into the alignment scoring, as well as to customize many other aspects of a progressive multiple alignment workflow. AVAILABILITY AND IMPLEMENTATION: PRALINE 2 is implemented in Python and available as open source software on GitHub: https://github.com/ibivu/PRALINE/.


Subject(s)
Software , DNA , Protein Structure, Secondary , Sequence Alignment
3.
PLoS Comput Biol ; 14(11): e1006547, 2018 11.
Article in English | MEDLINE | ID: mdl-30383764

ABSTRACT

Protein or DNA motifs are sequence regions which possess biological importance. These regions are often highly conserved among homologous sequences. The generation of multiple sequence alignments (MSAs) with a correct alignment of the conserved sequence motifs is still difficult to achieve, due to the fact that the contribution of these typically short fragments is overshadowed by the rest of the sequence. Here we extended the PRALINE multiple sequence alignment program with a novel motif-aware MSA algorithm in order to address this shortcoming. This method can incorporate explicit information about the presence of externally provided sequence motifs, which is then used in the dynamic programming step by boosting the amino acid substitution matrix towards the motif. The strength of the boost is controlled by a parameter, α. Using a benchmark set of alignments we confirm that a good compromise can be found that improves the matching of motif regions while not significantly reducing the overall alignment quality. By estimating α on an unrelated set of reference alignments we find there is indeed a strong conservation signal for motifs. A number of typical but difficult MSA use cases are explored to exemplify the problems in correctly aligning functional sequence motifs and how the motif-aware alignment method can be employed to alleviate these problems.
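
Illustrative note: the abstract states only that the substitution score is boosted towards the motif with a strength controlled by α; it does not give the formula. The sketch below assumes one plausible form, a convex combination of the ordinary substitution score and a motif-match term. The function `motif_aware_score` and its parameters are hypothetical names for illustration and are not PRALINE's API.

```python
def motif_aware_score(sub_score, in_motif_a, in_motif_b, alpha,
                      motif_match=1.0, motif_mismatch=0.0):
    """Blend a substitution score with a motif term (assumed form).

    sub_score:   ordinary substitution-matrix score for the residue pair
    in_motif_a:  True if the residue in sequence A lies in an annotated motif
    in_motif_b:  True if the residue in sequence B lies in an annotated motif
    alpha:       boost strength in [0, 1]; 0 disables the motif term
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # The motif term rewards aligning two positions that are both in a motif.
    motif_term = motif_match if (in_motif_a and in_motif_b) else motif_mismatch
    return (1.0 - alpha) * sub_score + alpha * motif_term
```

Under this assumed form, α = 0 reproduces the plain substitution score, and larger values of α increasingly favour pairing motif positions with each other, which matches the trade-off between motif matching and overall alignment quality described in the abstract.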


Subject(s)
Amino Acid Motifs , DNA/chemistry , Proteins/chemistry , Sequence Alignment/standards , Algorithms , Amino Acid Sequence , Conserved Sequence , HIV-1/chemistry , Sequence Homology, Amino Acid , env Gene Products, Human Immunodeficiency Virus/chemistry
4.
Methods Mol Biol ; 1525: 167-189, 2017.
Article in English | MEDLINE | ID: mdl-27896722

ABSTRACT

The increasing importance of Next Generation Sequencing (NGS) techniques has highlighted the key role of multiple sequence alignment (MSA) in comparative structure and function analysis of biological sequences. MSA often leads to fundamental biological insight into sequence-structure-function relationships of nucleotide or protein sequence families. Significant advances have been achieved in this field, and many useful tools have been developed for constructing alignments, although many biological and methodological issues are still open. This chapter first provides some background information and considerations associated with MSA techniques, concentrating on the alignment of protein sequences. Then, a practical overview of currently available methods and a description of their specific advantages and limitations are given, to serve as a helpful guide or starting point for researchers who aim to construct a reliable MSA.


Subject(s)
Proteins/chemistry , Sequence Alignment/methods , Algorithms , High-Throughput Nucleotide Sequencing/methods , Phylogeny , Proteins/genetics , Sequence Analysis, Protein , Software
5.
Nucleic Acids Res ; 44(8): e72, 2016 05 05.
Article in English | MEDLINE | ID: mdl-26721389

ABSTRACT

Eukaryotic gene expression is regulated by transcription factors (TFs) binding to promoters as well as distal enhancers. TFs recognize short but specific binding sites (TFBSs) that are located within the promoter and enhancer regions. Functionally relevant TFBSs are often highly conserved during evolution, leaving a strong phylogenetic signal. While multiple sequence alignment (MSA) is a potent tool to detect the phylogenetic signal, the current MSA implementations are optimized to align the maximum number of identical nucleotides. This approach might result in the omission of conserved motifs that contain interchangeable nucleotides such as the ETS motif (IUPAC code: GGAW). Here, we introduce ConBind, a novel method to enhance alignment of short motifs, even if their mutual sequence similarity is only partial. ConBind improves the identification of conserved TFBSs by improving the alignment accuracy of TFBS families within orthologous DNA sequences. Functional validation of the Gfi1b +13 enhancer reveals that ConBind identifies additional functionally important ETS binding sites that were missed by all other tested alignment tools. In addition to the analysis of known regulatory regions, our web tool is useful for the analysis of TFBSs in so far unknown DNA regions identified through ChIP-sequencing.
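
Illustrative note: the IUPAC code GGAW mentioned in the abstract stands for GGAA or GGAT, because W denotes A or T. The sketch below shows how such a degenerate motif can be located in a DNA string. It is not ConBind's method; the function name `find_motif` is hypothetical, and only the forward strand is scanned.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_motif(seq, motif):
    """Return 0-based start positions of an IUPAC motif on the forward strand."""
    body = "".join(IUPAC[code] for code in motif.upper())
    # A lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=" + body + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]
```

For example, `find_motif("ACGGAAGTGGATCC", "GGAW")` reports both the GGAA site and the GGAT site, whereas an exact-match search for either string alone would find only one of them.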


Subject(s)
Computational Biology/methods , DNA-Binding Proteins/metabolism , Enhancer Elements, Genetic/genetics , Promoter Regions, Genetic/genetics , Sequence Alignment/methods , Transcription Factors/metabolism , Animals , Base Sequence , Binding Sites/genetics , Gene Expression Regulation/genetics , Humans , Sequence Analysis, DNA
6.
PLoS Comput Biol ; 11(10): e1004435, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26505754

ABSTRACT

It has been recently shown that the coarse-graining of the structures of polypeptide chains as self-avoiding tubes can provide an effective representation of the conformational space of proteins. In order to fully exploit the opportunities offered by such a 'tube model' approach, we present here a strategy to combine it with molecular dynamics simulations. This strategy is based on the incorporation of the 'CamTube' force field into the Gromacs molecular dynamics package. By considering the case of a 60-residue polyvaline chain, we show that CamTube molecular dynamics simulations can comprehensively explore the conformational space of proteins. We obtain this result by a 20 µs metadynamics simulation of the polyvaline chain that recapitulates the currently known protein fold universe. We further show that, if residue-specific interaction potentials are added to the CamTube force field, it is possible to fold a protein into a topology close to that of its native state. These results illustrate how the CamTube force field can be used to explore efficiently the universe of protein folds with good accuracy and very limited computational cost.


Subject(s)
Algorithms , Models, Chemical , Molecular Dynamics Simulation , Protein Folding , Proteins/chemistry , Proteins/ultrastructure , Programming Languages , Protein Conformation , Software , Stress, Mechanical
7.
PLoS One ; 10(9): e0138141, 2015.
Article in English | MEDLINE | ID: mdl-26375816

ABSTRACT

BACKGROUND: Cancer is caused by somatic DNA alterations such as gene point mutations, DNA copy number aberrations (CNA) and structural variants (SVs). Genome-wide analyses of SVs in large sample series with well-documented clinical information are still scarce. Consequently, the impact of SVs on carcinogenesis and patient outcome remains poorly understood. This study aimed to perform a systematic analysis of genes that are affected by CNA-associated chromosomal breaks in colorectal cancer (CRC) and to determine the clinical relevance of recurrent breakpoint genes. METHODS: Primary CRC samples of patients with metastatic disease from CAIRO and CAIRO2 clinical trials were previously characterized by array-comparative genomic hybridization. These data were now used to determine the prevalence of CNA-associated chromosomal breaks within genes across 352 CRC samples. In addition, mutation status of the commonly affected APC, TP53, KRAS, PIK3CA, FBXW7, SMAD4, BRAF and NRAS genes was determined for 204 CRC samples by targeted massive parallel sequencing. Clinical relevance was assessed upon stratification of patients based on gene mutations and gene breakpoints that were observed in >3% of CRC cases. RESULTS: In total, 748 genes were identified that were recurrently affected by chromosomal breaks (FDR <0.1). MACROD2 was affected in 41% of CRC samples and another 169 genes showed breakpoints in >3% of cases, indicating that prevalence of gene breakpoints is comparable to the prevalence of well-known gene point mutations. Patient stratification based on gene breakpoints and point mutations revealed one CRC subtype with very poor prognosis. CONCLUSIONS: We conclude that CNA-associated chromosomal breaks within genes represent a highly prevalent and clinically relevant subset of SVs in CRC.


Subject(s)
Biomarkers, Tumor/genetics , Chromosome Breakage , Colorectal Neoplasms/genetics , Genome-Wide Association Study , Mutation/genetics , Clinical Trials, Phase III as Topic , Colorectal Neoplasms/pathology , Humans , Multicenter Studies as Topic , Prognosis , Randomized Controlled Trials as Topic