Results 1 - 20 of 42
1.
Int J Mol Sci ; 25(9)2024 May 03.
Article in English | MEDLINE | ID: mdl-38732207

ABSTRACT

Prediction of binding sites for transcription factors is important to understand how the latter regulate gene expression and how this regulation can be modulated for therapeutic purposes. A considerable number of studies address this issue with different approaches, Machine Learning being one of the most successful. Nevertheless, we note that many such approaches fail to propose a robust and meaningful method to embed the genetic data under analysis. We try to overcome this problem by proposing a bidirectional transformer-based encoder, empowered by bidirectional long short-term memory layers and with a capsule layer responsible for the final prediction. To evaluate the efficiency of the proposed approach, we use benchmark ChIP-seq datasets of five cell lines available in the ENCODE repository (A549, GM12878, Hep-G2, H1-hESC, and HeLa). The results show that the proposed method can predict TFBS within the five different cell lines very well; moreover, cross-cell predictions provide satisfactory results as well. Experiments conducted across cell lines are reinforced by the analysis of five additional lines used only to test the model trained using the others. The results confirm that prediction across cell lines remains very high, allowing an extensive cross-transcription factor analysis to be performed from which several indications of interest for molecular biology may be drawn.


Subject(s)
Deep Learning , Transcription Factors , Humans , Transcription Factors/metabolism , Transcription Factors/genetics , Binding Sites , Computational Biology/methods , HeLa Cells , Protein Binding , Chromatin Immunoprecipitation Sequencing/methods , Cell Line
2.
Gene ; 922: 148556, 2024 Sep 05.
Article in English | MEDLINE | ID: mdl-38754568

ABSTRACT

COVID-19 emergency has pushed the international scientific community to use every resource to combat the spread of the virus, to understand its biology and predict its possible evolution in terms of new variants. Since the first SARS-CoV-2 virus nucleotide and amino acid sequences were made available, information theory was used to study how viral information content was changing over time and then trace the evolution of its mutational landscape. In this work we analyzed SARS-CoV-2 sequences collected mainly in the USA in a period from March 2020 until December 2022 and computed mutation profiles of viral proteins over time through an entropy-based approach using Shannon Entropy and Hellinger distance. This representation allows an at-a-glance view of the mutational landscape of viral proteins over time and can provide new insights on the evolution of the virus from different points of view. Non-structural proteins typically showed flat mutation profiles, characterized by a very low Average mutation Entropy, while accessory and structural proteins showed mostly non uniform and high mutation profiles, often coupled with the predominance of variants. Interestingly NSP2 protein, whose function is currently still debated, falls in the same branch of NSP14 and NSP10 in the phylogenetic tree of mutations constructed through correlations of mutation profiles, suggesting a co-evolution of those proteins and a possible functional link with each other. To the best of our knowledge this is the first study based on a massive amount of data (n = 107,939,973) that analyzes from an entropy point of view the mutational landscape of SARS-CoV-2 over time and depicts a mutational temporal profile of each protein of the virus.
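
The two measures named in this abstract, Shannon entropy and Hellinger distance, can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the toy sequences, the month names, and the function names are assumptions made only for the example.

```python
from collections import Counter
from math import log2, sqrt

def column_distribution(sequences, position):
    """Relative frequency of each residue at one position of aligned sequences."""
    counts = Counter(seq[position] for seq in sequences)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

def shannon_entropy(dist):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return sqrt(0.5 * sum((sqrt(p.get(k, 0.0)) - sqrt(q.get(k, 0.0))) ** 2 for k in keys))

# Hypothetical toy data: the same protein fragment sampled in two months.
march = ["MFVFL", "MFVFL", "MFVFL", "MFVFL"]
april = ["MFVFL", "MFLFL", "MFLFL", "MFVFL"]

p = column_distribution(march, 2)  # position 2 is fully conserved in March
q = column_distribution(april, 2)  # position 2 is split between V and L in April

print(shannon_entropy(p), shannon_entropy(q), hellinger(p, q))
```

Entropy summarizes how variable a position is within one time window, while the Hellinger distance compares the residue distributions of two windows.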


Subject(s)
COVID-19 , Entropy , Mutation , SARS-CoV-2 , SARS-CoV-2/genetics , COVID-19/virology , COVID-19/genetics , Humans , United States , Evolution, Molecular , Viral Proteins/genetics , Viral Nonstructural Proteins/genetics , Spike Glycoprotein, Coronavirus/genetics
3.
J Comput Biol ; 31(5): 416-428, 2024 05.
Article in English | MEDLINE | ID: mdl-38687334

ABSTRACT

A Coding DNA Sequence (CDS) is a fraction of DNA whose nucleotides are grouped into consecutive triplets called codons, each one encoding an amino acid. Because most amino acids can be encoded by more than one codon, the same amino acid chain can be obtained by a very large number of different CDSs. These synonymous CDSs show different features that, also depending on the organism the transcript is expressed in, could affect translational efficiency and yield. The identification of optimal CDSs with respect to given transcript indicators is in general a challenging task, but it has been observed in recent literature that integer linear programming (ILP) can be a very flexible and efficient way to achieve it. In this article, we add evidence to this observation by proposing a new ILP model that simultaneously optimizes different well-grounded indicators. With this model, we efficiently find solutions that dominate those returned by six existing codon optimization heuristics.
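
To give a feel for how large the space of synonymous CDSs is, the sketch below counts and enumerates them under the standard genetic code. It does not reproduce the ILP model of the article; the function names are illustrative assumptions.

```python
from itertools import product
from math import prod

# Standard genetic code, codons ordered with bases T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Invert the table: amino acid -> list of synonymous codons.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)

def count_synonymous_cds(protein):
    """Number of distinct CDSs encoding the given amino acid chain."""
    return prod(len(SYNONYMS[aa]) for aa in protein)

def synonymous_cds(protein):
    """Enumerate every synonymous CDS (only feasible for very short chains)."""
    return ["".join(codons) for codons in product(*(SYNONYMS[aa] for aa in protein))]

print(count_synonymous_cds("MLR"))  # one codon for M, six each for L and R
print(synonymous_cds("MF"))
```

Because the count is a product over positions, it grows exponentially with chain length, which is why exhaustive enumeration is replaced by optimization techniques such as ILP.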


Subject(s)
Algorithms , Codon , Models, Genetic , Programming, Linear , Codon/genetics , Base Sequence/genetics , DNA/genetics , Computational Biology/methods
4.
Phys Rev E ; 108(5-1): 054130, 2023 Nov.
Article in English | MEDLINE | ID: mdl-38115426

ABSTRACT

Homophily is the principle whereby "similarity breeds connections." We give a quantitative formulation of this principle within networks. Given a network and a labeled partition of its vertices, the vector indexed by each class of the partition, whose entries are the number of edges of the subgraphs induced by the corresponding classes, is viewed as the observed outcome of the random vector described by picking labeled partitions at random among labeled partitions whose classes have the same cardinalities as the given one. This is the recently introduced random coloring model for network homophily. In this perspective, the value of any homophily score Θ, namely, a nondecreasing real-valued function in the sizes of subgraphs induced by the classes of the partition, evaluated at the observed outcome, can be thought of as the observed value of a random variable. Consequently, according to the score Θ, the input network is homophilic at the significance level α whenever the one-sided tail probability of observing a value of Θ at least as extreme as the observed one is smaller than α. Since, as we show, even approximating α is an NP-hard problem, we resort to classical tail inequalities to bound α from above. These upper bounds, obtained by specializing Θ, yield a class of quantifiers of network homophily. Computing the upper bounds requires the knowledge of the covariance matrix of the random vector, which was not previously known within the random coloring model. In this paper we close this gap. Interestingly, the matrix depends on the input partition only through the cardinalities of its classes and depends on the network only through its degrees. Furthermore all the covariances have the same sign, and this sign is a graph invariant. Plugging this structure into the bounds yields a meaningful, easy to compute class of indices for measuring network homophily. 
As demonstrated in real-world network applications, these indices are effective and reliable, and may lead to discoveries that cannot be captured by the current state of the art.
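
The random coloring model described above can be illustrated with a naive sampling sketch: class labels are shuffled over the vertices, which keeps the class cardinalities fixed, and the score is recomputed each time. This is not the bound-based method of the paper and gives no approximation guarantee; the toy graph, the choice of score, and all names are assumptions.

```python
import random

def induced_edge_counts(edges, labels):
    """Number of edges whose two endpoints fall in the same class, per class."""
    counts = {c: 0 for c in set(labels.values())}
    for u, v in edges:
        if labels[u] == labels[v]:
            counts[labels[u]] += 1
    return counts

def random_coloring_tail(edges, labels, score, trials=2000, seed=0):
    """Empirical frequency of score >= observed score over shuffled labelings."""
    rng = random.Random(seed)
    nodes = list(labels)
    colors = [labels[n] for n in nodes]
    observed = score(induced_edge_counts(edges, labels))
    hits = 0
    for _ in range(trials):
        rng.shuffle(colors)  # permuting labels preserves class sizes
        shuffled = dict(zip(nodes, colors))
        if score(induced_edge_counts(edges, shuffled)) >= observed:
            hits += 1
    return hits / trials

# Hypothetical toy network: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

def total_within(counts):
    """A simple nondecreasing score: total number of within-class edges."""
    return sum(counts.values())

observed_counts = induced_edge_counts(edges, labels)
tail = random_coloring_tail(edges, labels, total_within)
print(observed_counts, tail)
```

A small empirical tail suggests that the observed labeling places more edges inside classes than most random labelings with the same class sizes.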

5.
Viruses ; 15(5)2023 05 18.
Article in English | MEDLINE | ID: mdl-37243274

ABSTRACT

SARS-CoV-2 and its many variants have caused a worldwide emergency. Host cells colonised by SARS-CoV-2 present a significantly different gene expression landscape. As expected, this is particularly true for genes that directly interact with virus proteins. Thus, understanding the role that transcription factors can play in driving differential regulation in patients affected by COVID-19 is a focal point to unveil virus infection. In this regard, we have identified 19 transcription factors which are predicted to target human proteins interacting with Spike glycoprotein of SARS-CoV-2. Transcriptomics RNA-Seq data derived from 13 human organs are used to analyse expression correlation between identified transcription factors and related target genes in both COVID-19 patients and healthy individuals. This resulted in the identification of transcription factors showing the most relevant impact in terms of most evident differential correlation between COVID-19 patients and healthy individuals. This analysis has also identified five organs such as the blood, heart, lung, nasopharynx and respiratory tract in which a major effect of differential regulation mediated by transcription factors is observed. These organs are also known to be affected by COVID-19, thereby providing consistency to our analysis. Furthermore, 31 key human genes differentially regulated by the transcription factors in the five organs are identified and the corresponding KEGG pathways and GO enrichment are also reported. Finally, the drugs targeting those 31 genes are also put forth. This in silico study explores the effects of transcription factors on human genes interacting with Spike glycoprotein of SARS-CoV-2 and intends to provide new insights to inhibit the virus infection.


Subject(s)
COVID-19 , Humans , COVID-19/genetics , SARS-CoV-2 , Transcription Factors/genetics , Transcription Factors/metabolism , Gene Expression Regulation , Glycoproteins/genetics
6.
J Immunol Methods ; 517: 113474, 2023 06.
Article in English | MEDLINE | ID: mdl-37068621

ABSTRACT

BACKGROUND: Class I Major Histocompatibility Complex plays a critical role in the adaptive immune response by binding to peptides processed by Proteasome and Transporter associated with antigen processing complex and presenting them on the cell surface to cytotoxic T-cells. Understanding the process of peptide presentation and studying how presented peptides are distributed in the huge space of all potential epitopes could have a dramatic impact in the context of vaccine design, transplantation, autoimmunity, and cancer development. METHODS: In the present work we propose a graph-driven approach to investigate the landscape of both self (human) and viral (254 organisms) peptides presented on cell surface through class I Major Histocompatibility Complex considering specific HLAs. For each considered HLA (N = 89) we designed a network, namely Peptide Hamming Graph, where nodes are peptides predicted to be presented by a given HLA and an edge is set when the Hamming distance between two peptides is equal or smaller than 2 (i.e. the same amino acid occurs in at least 7 positions of the two sequences). RESULTS: Through the analysis of Peptide Hamming Graphs we studied how predicted presented peptides are distributed in the whole configurational space for different HLAs, identifying sets of viral peptides that can constitute a potential target for the immune system. In particular we selected connected components of the graph made exclusively of viral peptides and sets of viral peptides with high node degree interacting exclusively with viral neighbours. CONCLUSIONS: This work constitutes an innovative approach to study potential cytotoxic T-cell epitopes relying on a network approach, overcoming the classical paradigm based on the identification of potential epitopes only considering their features as single peptides. 
T-cell cross-reactivity plays a focal role for the efficacy of this strategy increasing the probability of recognition, and consequently a stronger immune response, of presented peptides far from self, sharing a common pattern in terms of sequence similarity.
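
The Peptide Hamming Graph construction described in the METHODS can be sketched as below: peptides are nodes, and two peptides are linked when their Hamming distance is at most 2. The peptide strings and their self/viral labels are hypothetical placeholders, and the quadratic pairwise comparison is only suitable for small sets.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length peptides differ."""
    if len(a) != len(b):
        raise ValueError("peptides must have the same length")
    return sum(x != y for x, y in zip(a, b))

def peptide_hamming_graph(peptides, max_distance=2):
    """Adjacency sets linking peptides whose Hamming distance is <= max_distance."""
    graph = {p: set() for p in peptides}
    for a, b in combinations(peptides, 2):
        if hamming(a, b) <= max_distance:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def connected_components(graph):
    """Connected components found with an iterative depth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

# Hypothetical 9-mers with made-up origin labels.
origin = {
    "SLYNTVATL": "viral", "SLYNTIATL": "viral", "SLFNTIATL": "viral",
    "GLCTLVAML": "viral", "GLCTLVAMM": "self",
}
graph = peptide_hamming_graph(list(origin))
components = connected_components(graph)
viral_only = [c for c in components if all(origin[p] == "viral" for p in c)]
print(viral_only)
```

Components made only of viral peptides correspond to the kind of candidate sets that the RESULTS describe selecting.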


Subject(s)
HLA Antigens , Peptides , Humans , Antigen Presentation , Histocompatibility Antigens , Epitopes, T-Lymphocyte
7.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36440918

ABSTRACT

SUMMARY: A typical behavior inspired by the general principle 'similarity breeds connections' has been observed in different kinds of networks, such as social or biological ones. These networks are defined as homophilic as nodes belonging to the same class preferentially interact with each other. In this work, we present HONTO (HOmophily Network TOol), a user-friendly open-source Python3 package designed to evaluate and analyze homophily in complex networks. The tool takes as input the network along with a partition of its nodes into classes and yields a matrix whose entries are the homophily/heterophily z-score values. To complement the analysis, the tool also provides z-score values of nodes that do not interact with any other node of the same class. Homophily/heterophily z-score values are presented as a heatmap allowing a visual at-a-glance interpretation of results. AVAILABILITY AND IMPLEMENTATION: Tool's source code is available at https://github.com/cumbof/honto under the MIT license, installable as a package from PyPI (pip install honto) and conda-forge (conda install -c conda-forge honto), and has a wrapper for the Galaxy platform available on the official Galaxy ToolShed (Blankenberg et al., 2014) at https://toolshed.g2.bx.psu.edu/view/fabio/honto.


Subject(s)
Software , Humans
8.
Sci Rep ; 12(1): 9757, 2022 06 13.
Article in English | MEDLINE | ID: mdl-35697749

ABSTRACT

We present a new method for assessing and measuring homophily in networks whose nodes have categorical attributes, namely when the nodes of networks come partitioned into classes (colors). We probe this method in two different classes of networks: (i) protein-protein interaction (PPI) networks, where nodes correspond to proteins, partitioned according to their functional role, and edges represent functional interactions between proteins; (ii) the Pokec on-line social network, where nodes correspond to users, partitioned according to their age, and edges represent friendship between users. Similarly to other classical and well consolidated approaches, our method compares the relative edge density of the subgraphs induced by each class with the corresponding expected relative edge density under a null model. The novelty of our approach consists in prescribing an endogenous null model, namely, the sample space of the null model is built on the input network itself. This allows us to give exact explicit expressions for the z-score of the relative edge density of each class as well as other related statistics. The z-scores directly quantify the statistical significance of the observed homophily via Chebyshev's inequality. The network structure enters the expression of each z-score through basic combinatorial invariants such as the number of subgraphs with two spanning edges. Each z-score is computed in [Formula: see text] time for a network with n nodes and m edges. This leads to an overall efficient computational method for assessing homophily. We complement the analysis of homophily/heterophily by considering z-scores of the number of isolated nodes in the subgraphs induced by each class, that are computed in O(nm) time. 
Theoretical results are then exploited to show that, as expected, both the analyzed network classes are significantly homophilic with respect to the considered node properties.
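
Two ingredients mentioned in this abstract can be sketched in a few lines: the relative edge density of the subgraph induced by a class, and the tail bound that Chebyshev's inequality attaches to a z-score. The exact mean and variance under the endogenous null model are not given in the abstract and are not reproduced here; the function names and the toy network are assumptions.

```python
def relative_edge_density(edges, labels, cls):
    """Edges inside class `cls` divided by the number of node pairs in that class."""
    size = sum(1 for c in labels.values() if c == cls)
    inside = sum(1 for u, v in edges if labels[u] == cls and labels[v] == cls)
    pairs = size * (size - 1) // 2
    return inside / pairs if pairs else 0.0

def chebyshev_tail_bound(z):
    """Chebyshev upper bound on P(|X - mean| >= |z| * std), capped at 1."""
    return 1.0 if z == 0 else min(1.0, 1.0 / (z * z))

# Hypothetical toy network: a triangle of class "a" plus one pendant node of class "b".
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
labels = {0: "a", 1: "a", 2: "a", 3: "b"}

print(relative_edge_density(edges, labels, "a"))
print(chebyshev_tail_bound(4.0))
```

With a z-score of 4, for instance, the bound says that a deviation this large has probability at most 1/16 under the null model, whatever the underlying distribution.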

9.
Infect Genet Evol ; 101: 105294, 2022 07.
Article in English | MEDLINE | ID: mdl-35513162

ABSTRACT

This study aimed at updating previous data on HIV-1 integrase variability, by using effective bioinformatics methods combining different statistical instruments from simple entropy and mutation rate to more specific approaches such as Hellinger distance. A total of 2133 HIV-1 integrase sequences were analyzed in: i) 1460 samples from drug-naïve [DN] individuals; ii) 386 samples from drug-experienced but INI-naïve [IN] individuals; iii) 287 samples from INI-experienced [IE] individuals. Within the three groups, 76 amino acid positions were highly conserved (≤0.2% variation, Hellinger distance: <0.25%), with 35 fully invariant positions; while, 80 positions were conserved (>0.2% to <1% variation, Hellinger distance: <1%). The H12-H16-C40-C43 and D64-D116-E152 motifs were all well conserved. Some residues were affected by dramatic changes in their mutation distributions, especially between DN and IE samples (Hellinger distance ≥1%). In particular, 15 positions (D6, S24, V31, S39, L74, A91, S119, T122, T124, T125, V126, K160, N222, S230, C280) showed a significant decrease of mutation rate in IN and/or IE samples compared to DN samples. Conversely, 8 positions showed significantly higher mutation rate in samples from treated individuals (IN and/or IE) compared to DN. Some of these positions, such as E92, T97, G140, Y143, Q148 and N155, were already known to be associated with resistance to integrase inhibitors; other positions including S24, M154, V165 and D270 are not yet documented to be associated with resistance. Our study confirms the high conservation of HIV-1 integrase and identified highly invariant positions using robust and innovative methods. The role of novel mutations located in the critical region of HIV-1 integrase deserves further investigation.


Subject(s)
HIV Infections , HIV Integrase Inhibitors , HIV Integrase , HIV-1 , Drug Resistance, Viral/genetics , HIV Infections/drug therapy , HIV Integrase/chemistry , HIV Integrase Inhibitors/pharmacology , HIV-1/genetics , Humans , Mutation
10.
Virus Res ; 317: 198814, 2022 08.
Article in English | MEDLINE | ID: mdl-35588940

ABSTRACT

Adaptive immune response is triggered when specific pathogen peptides called epitopes are recognised as exogenous according to the paradigm of self/non-self. To be recognized by immune cells, epitopes have to be exposed (presented) on the surface of the cell. Predicting if a peptide is exposed is important to shed light on the rules that govern immune response and, thus, identify potential targets and design vaccines and drugs. We focused on peptides exposed on cell surface and made accessible to immune system through the MHC Class I complex. Before this can happen, three successive selection steps have to take place: a) Proteasome cleavage, b) TAP Transport, and c) binding to MHC-class I. Starting from a set of 211 host human reference viruses, we computed the set of unique peptides occurring in the corresponding proteomes. Then, we obtained the probability values of Proteasome Cleavage, TAP Transport and Binding to MHC Class I associated to those peptides through established prediction software tools. Such values were analysed in conjunction with two other features that could play a major role: the distance from self, strictly linked to the concept of nullomers, and the sequence entropy, measuring the complexity of the peptide amino acid composition. The analysis confirmed and extended previous results on a larger, more significant and consistent data set; we showed that the higher the distances from self, the higher the score of TAP Transport and binding to MHC class I; no significant association was instead found between distance from self and Proteasome Cleavage. Additionally, amino acid peptide composition entropy was significantly associated with the other features. In particular, higher entropies were linked with higher scores of Proteasome Cleavage, TAP Transport, Binding to MHC Class I, and higher distance from self. 
The relationship among the three selection steps provided evidence of a tight inter-correlation, clearly suggesting it could be the product of a co-evolutive process. We believe that these results give new insights on the complex processes that regulate peptide presentation through MHC class I, and unveil the mechanisms that allow the immune system to distinguish self and viral non-self peptides.
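
The sequence entropy mentioned above, which measures the complexity of a peptide's amino acid composition, can be sketched as the Shannon entropy of residue frequencies within a single peptide. Whether the authors used exactly this definition or this logarithm base is an assumption; the peptides are made up.

```python
from collections import Counter
from math import log2

def composition_entropy(peptide):
    """Shannon entropy (bits) of the amino acid composition of one peptide."""
    n = len(peptide)
    return -sum((c / n) * log2(c / n) for c in Counter(peptide).values())

low = composition_entropy("AAAAAAAAA")   # a single residue type: minimal complexity
high = composition_entropy("ACDEFGHIK")  # nine distinct residues: maximal for a 9-mer
print(low, high)
```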


Subject(s)
Proteasome Endopeptidase Complex , Viruses , ATP-Binding Cassette Transporters/genetics , Amino Acids , Antigen Presentation , Entropy , Epitopes , Histocompatibility Antigens Class I/metabolism , Humans , Peptides , Proteasome Endopeptidase Complex/metabolism , Viruses/metabolism
11.
Infect Genet Evol ; 97: 105154, 2022 01.
Article in English | MEDLINE | ID: mdl-34808395

ABSTRACT

The pandemic of COVID-19 has been haunting us for almost the past two years. Although the vaccination drive is in full swing throughout the world, different mutations of the SARS-CoV-2 virus are making it very difficult to put an end to the pandemic. The second wave in India, one of the worst sufferers of this pandemic, can be mainly attributed to the Delta variant i.e. B.1.617.2. Thus, it is very important to analyse and understand the mutational trajectory of SARS-CoV-2 through the study of the 26 virus proteins. In this regard, more than 17,000 protein sequences of Indian SARS-CoV-2 genomes are analysed using an entropy-based approach in order to find the monthly mutational trajectory. Furthermore, Hellinger distance is also used to show the difference of the mutation events between the consecutive months for each of the 26 SARS-CoV-2 proteins. The results show that the mutation rates and the mutation events of the viral proteins, though changing in the initial months, start stabilizing later on for mainly the four structural proteins, while the non-structural proteins mostly exhibit a more constant trend. As a consequence, it can be inferred that the evolution of the new mutative configurations will eventually reduce.


Subject(s)
COVID-19/epidemiology , Genome, Viral , Mutation Rate , SARS-CoV-2/genetics , Spike Glycoprotein, Coronavirus/genetics , Viral Nonstructural Proteins/genetics , Viral Structural Proteins/genetics , COVID-19/virology , Entropy , Epidemiological Monitoring , Evolution, Molecular , Gene Expression , Humans , India/epidemiology , Phylogeny , SARS-CoV-2/classification , SARS-CoV-2/pathogenicity , Spike Glycoprotein, Coronavirus/metabolism , Viral Nonstructural Proteins/classification , Viral Nonstructural Proteins/metabolism , Viral Structural Proteins/classification , Viral Structural Proteins/metabolism
12.
J Theor Biol ; 526: 110806, 2021 10 07.
Article in English | MEDLINE | ID: mdl-34111456

ABSTRACT

The genetic code consists of a set of rules used by living organisms to translate genomic information, contained in genes, into proteins; every amino acid is coded by a set of nucleotide triplets or codons. We refer to codon choice as the choice of a given codon, among the synonymous available ones, to code a given amino acid occurrence. The aim of this work is to shed light on the pivotal role that codon choice plays in regulating the timing of translation process, through patterns of low and high translation efficiency codons. A translation efficiency value, namely codon score, was associated to each codon through a formula based on the number of tRNA gene copies able to translate the given codon. By using codon scores, those k-mers of the proteome of Saccharomyces cerevisiae, showing low and high average scores associated to the corresponding codons, were computed. The analysis of distribution of both low and high average score k-mers clearly showed that, in particular for higher k-mer size, they occur much more than expected, strongly suggesting a functional role. Moreover, the analysis highlighted that significant k-mers preferentially occur in some protein folding classes, such as those containing alpha helices, and in some functional classes mainly involved in transcription process, while codon choice seems to have a very low impact in proteins associated to energy production and metabolism. The relationship between secondary structures and significant k-mers was investigated, revealing that low score k-mers tend to preferentially occur in coil or close to coil regions and almost never in beta sheets, while high score k-mers preferentially occur in alpha helices, avoiding beta sheets, and close to coil regions for high k-mer sizes. 
Finally the analysis of distribution of significant codon patterns along the proteins highlighted a relevant enrichment of low average score k-mers at the 5' end of protein-coding sequences in the region from 5th to 25th amino acid.
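
The idea of scoring k-mers by the average score of their codons can be sketched with a sliding window over a coding sequence. The abstract does not give the codon score formula, so the score values below are invented placeholders, and the function names are assumptions.

```python
def split_codons(cds):
    """Split a coding sequence into consecutive codons (length must be a multiple of 3)."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

def window_average_scores(codons, scores, k):
    """Average codon score of every window of k consecutive codons."""
    values = [scores[c] for c in codons]
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]

# Hypothetical scores; real values would come from tRNA gene copy numbers.
scores = {"AAA": 1.0, "AAG": 3.0}
codons = split_codons("AAAAAGAAAAAG")
averages = window_average_scores(codons, scores, k=2)
print(codons, averages)
```

Windows with unusually low or high averages would then be the candidates for the low and high score k-mers discussed in the abstract.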


Subject(s)
Proteins , Saccharomyces cerevisiae , Codon/genetics , Protein Biosynthesis/genetics , Protein Folding , Protein Structure, Secondary , Proteins/genetics , Saccharomyces cerevisiae/genetics
13.
Comput Biol Chem ; 92: 107480, 2021 Jun.
Article in English | MEDLINE | ID: mdl-33826970

ABSTRACT

Epigenetics and DNA methylation play a pivotal role in many processes of the cell and we often observe that an aberrant methylation pattern characterizes pathologies. In this work we investigate the role that the flanking sequences of CGs play in the methylation process in humans. We built four different CG datasets: methylated, unmethylated, and two randomly extracted ones. We evaluated features associated to the flanking sequences of those CG sets, for different window sizes around the CG, through five measures accounting for different aspects of sequence composition complexity and structure. The analysis performed through those measures revealed evidently different behaviors between methylated and unmethylated probe sets. Major differences were observed for GC content and CG dinucleotide frequency in a window size of 300-400 bp and for CG self-attraction in 3K bp. Remarkably, the effect of a methylated CG extends much farther from the CG than expected.
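
Two of the measures named in this abstract, GC content and CG dinucleotide frequency in a window around a CG, can be sketched as follows. The windowing convention (a fixed number of bases on each side of the CG, clipped at the sequence ends) and the toy sequence are assumptions.

```python
def flank(sequence, cg_index, half_window):
    """Window holding the CG at cg_index plus half_window bases on each side."""
    start = max(0, cg_index - half_window)
    end = min(len(sequence), cg_index + 2 + half_window)
    return sequence[start:end]

def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0

def cg_dinucleotide_frequency(seq):
    """Fraction of overlapping dinucleotides equal to CG."""
    if len(seq) < 2:
        return 0.0
    hits = sum(seq[i:i + 2] == "CG" for i in range(len(seq) - 1))
    return hits / (len(seq) - 1)

sequence = "ATATCGATCGAT"          # hypothetical toy sequence; a CG starts at index 4
window = flank(sequence, 4, 3)
print(window, gc_content(window), cg_dinucleotide_frequency(window))
```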


Subject(s)
CpG Islands/genetics , DNA/genetics , DNA/metabolism , DNA Methylation/genetics , Entropy , Humans
14.
PLoS One ; 15(12): e0243285, 2020.
Article in English | MEDLINE | ID: mdl-33284846

ABSTRACT

More than twenty years ago the reverse vaccinology paradigm came to light trying to design new vaccines based on the analysis of genomic information in order to select those pathogen peptides able to trigger an immune response. In this context, focusing on the proteome of Trypanosoma cruzi, we investigated the link between the probabilities for pathogen peptides to be presented on a cell surface and their distance from human self. We found a reasonable but, as far as we know, undiscovered property: the farther the distance between a peptide and the human-self the higher the probability for that peptide to be presented on a cell surface. We also found that the most distant peptides from human self bind, on average, a broader collection of HLAs than expected, implying a potential immunological role in a large portion of individuals. Finally, introducing a novel quantitative indicator for a peptide to measure its potential immunological role, we proposed a pool of peptides that could be potential epitopes and that can be suitable for experimental testing. The software to compute peptide classes according to the distance from human self is freely available at http://www.iasi.cnr.it/~dsantoni/nullomers.


Subject(s)
Chagas Disease/immunology , Histocompatibility Antigens Class I/immunology , Peptides/immunology , Protozoan Proteins/immunology , Trypanosoma cruzi/immunology , Amino Acid Sequence , Epitopes/chemistry , Epitopes/immunology , Humans , Peptides/chemistry , Proteome/chemistry , Proteome/immunology , Protozoan Proteins/chemistry , Trypanosoma cruzi/chemistry
15.
J Immunol Methods ; 481-482: 112787, 2020.
Article in English | MEDLINE | ID: mdl-32335161

ABSTRACT

Alarms periodically emerge for viral pneumonia infections due to coronavirus. In all cases, these are zoonoses passing the barrier between species and infect humans. The legitimate concern of the international community is due to the fact that the new identified coronavirus, named SARS-CoV-2 (previously called 2019-nCoV), has a quite high mortality rate, around 2%, and a strong ability to spread, with an estimated reproduction number higher than 2. Even though all countries are doing their utmost to stop the pandemic, the only reliable solution to tackle the infection is the rapid development of a vaccine. For this purpose, the means of bioinformatics, applied in the context of reverse-vaccinology paradigm, can be of fundamental help to select the most promising peptides able to trigger an effective immune response. In this short report, using the concept of nullomer and introducing a distance from human self, we provide a list of peptides that could deserve experimental investigation in the view of a potential vaccine for SARS-CoV-2.


Subject(s)
Betacoronavirus/immunology , Computational Biology , Epitopes/immunology , COVID-19 , COVID-19 Vaccines , Coronavirus Infections/immunology , Coronavirus Infections/prevention & control , Genes, MHC Class I , Humans , Pandemics , Peptides/immunology , Pneumonia, Viral , SARS-CoV-2 , Software , Viral Proteins/immunology , Viral Vaccines/immunology
16.
Genomics ; 111(6): 1620-1628, 2019 12.
Article in English | MEDLINE | ID: mdl-30453062

ABSTRACT

Nucleosomes are not uniformly distributed along DNA and their positioning (termed "nucleosomal landscape") can be derived using data available for several genomes. In this study we analyzed DNA helical rise profiles through a tetranucleotide code, and we defined the nucleosomal landscape of several sequences forming dinucleosomes and of the sequences of huntingtin, myotonic dystrophy type 1 and fragile mental retardation 2 genes, which contained several repeated sequences. We also analyzed the profiles of some sequences interacting with transcription factors or with RNA polymerase II. In the genomes of Caenorhabditis elegans, Mus musculus and Homo sapiens we found profiles with extremely low helical rise values, characteristic of nucleosome free regions. We defined these regions as "holes" and found that their presence correlates with lamina associated domains sequences. Altogether, this study shows that DNA helical rise profile may have a role in gene expression modulation and in shaping chromosomal structure.


Subject(s)
Caenorhabditis elegans Proteins/genetics , Caenorhabditis elegans/genetics , DNA, Helminth/genetics , RNA Polymerase II/genetics , Transcription Factors/genetics , Animals , Humans , Mice
17.
J Integr Bioinform ; 15(4)2018 Oct 26.
Article in English | MEDLINE | ID: mdl-30367805

ABSTRACT

Finding similarities and differences between metagenomic samples within large repositories has been rather a significant issue for researchers. Over the recent years, content-based retrieval has been suggested by various studies from different perspectives. In this study, a content-based retrieval framework for identifying relevant metagenomic samples is developed. The framework consists of feature extraction, selection methods and similarity measures for whole metagenome sequencing samples. Performance of the developed framework was evaluated on given samples. A ground truth was used to evaluate the system performance such that if the system retrieves patients with the same disease, -called positive samples-, they are labeled as relevant samples otherwise irrelevant. The experimental results show that relevant experiments can be detected by using different fingerprinting approaches. We observed that Latent Semantic Analysis (LSA) Method is a promising fingerprinting approach for representing metagenomic samples and finding relevance among them. Source codes and executable files are available at www.baskent.edu.tr/∼hogul/WMS_retrieval.rar.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Metagenome , Microbiota , Sequence Analysis, DNA/methods , Software , Algorithms , Humans
18.
J Immunol Methods ; 459: 35-43, 2018 08.
Article in English | MEDLINE | ID: mdl-29800577

ABSTRACT

Identification of peptides binding to MHC class I complex can play a crucial role in retrieving potential targets able to trigger an immune response. Affinity binding of viral peptides can be estimated through effective computational methods that in most cases are based on machine learning approaches. Achieving a better insight into peptide features that impact on the affinity binding rate is a challenging issue. In the present work we focused on 9-mer peptides of Human immunodeficiency virus type 1 and Human herpes simplex virus 1, studying their binding to MHC class I. Viral 9-mers were partitioned into different classes, where each class is characterized by how far (in terms of mutation steps) the peptides belonging to that class are from human 9-mers. We showed that the overall binding probability significantly differs among classes, and it typically increases as the distance, computed in terms of number of mutation steps from the human set of 9-mers, increases. The binding probability is particularly high when considering viral 9-mers that are more than three mutation steps away from all human 9-mers. Further evidence, providing significance to those special viral peptides and suggesting a potential role they can play, comes from the analysis of their distribution along viral genomes, as it revealed they are not randomly located, but they preferentially occur in specific genes.
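
The partition of viral 9-mers by their distance from the human set can be sketched as a brute-force minimum over single-residue substitutions. Treating a mutation step as one substitution, grouping everything beyond three steps into a single class, and the placeholder peptides are all assumptions; a real human 9-mer set would need an indexed search rather than a linear scan.

```python
def hamming(a, b):
    """Number of substitutions separating two equal-length peptides."""
    return sum(x != y for x, y in zip(a, b))

def distance_from_self(peptide, self_peptides, cap=4):
    """Minimum number of mutation steps to any self peptide, capped at `cap`."""
    best = cap
    for candidate in self_peptides:
        best = min(best, hamming(peptide, candidate))
        if best == 0:
            break
    return best

def partition_by_distance(viral_peptides, self_peptides, cap=4):
    """Group viral peptides by their capped distance from self."""
    classes = {}
    for peptide in viral_peptides:
        distance = distance_from_self(peptide, self_peptides, cap)
        classes.setdefault(distance, []).append(peptide)
    return classes

# Hypothetical placeholder peptides, not real human or viral sequences.
human = {"AAAAAAAAA", "AAAAAAAAC"}
viral = ["AAAAAAAAA", "AAAAAAADD", "DDDDDDDDD"]
classes = partition_by_distance(viral, human)
print(classes)
```

Class 4 in this sketch collects the peptides that are more than three mutation steps from every human 9-mer.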


Subject(s)
Histocompatibility Antigens Class I/immunology , Oligopeptides/immunology , Viral Proteins/immunology , Epitopes, T-Lymphocyte/immunology , Genome, Viral , HIV-1/immunology , Herpesvirus 1, Human/immunology , Humans , Probability , Protein Binding
19.
DNA Res ; 25(1): 103-112, 2018 Feb 01.
Article in English | MEDLINE | ID: mdl-29069301

ABSTRACT

Proteins are the core and the engine of every process in cells; thus the study of the mechanisms that drive the regulation of protein expression is essential. Transcription factors play a central role in this extremely complex task, and they cooperate synergistically to fine-tune protein expression. In the present study, we designed a mathematically well-founded procedure to investigate the mutual positioning of the transcription factor binding sites of a given pair of transcription factors, in order to evaluate the possible association between them. We obtained a list of highly related transcription factor pairs whose binding site occurrences significantly group together for a given set of gene promoters, and identified the biological contexts in which the pairs are involved and the processes they are expected to help regulate.

20.
PLoS One ; 11(12): e0164540, 2016.
Article in English | MEDLINE | ID: mdl-27906971

ABSTRACT

A nullomer is an oligomer that does not occur as a subsequence in a given DNA sequence, i.e. it is an absent word of that sequence. The importance of nullomers in several applications, from drug discovery to forensic practice, is now debated in the literature. Here, we investigated the nature of nullomers: whether their absence in genomes has just a statistical explanation or is a peculiar feature of genomic sequences. We introduced an extension of the notion of nullomer, namely high order nullomers, which are nullomers whose mutated sequences are still nullomers. We studied different aspects of them: comparison with nullomers of random sequences, CpG distribution, and mean helical rise. In agreement with previous results, we found that the number of nullomers in the human genome is much larger than expected by chance. Nevertheless, antithetical results were found when considering a random DNA sequence preserving dinucleotide frequencies. The analysis of CpG frequencies in nullomers and high order nullomers revealed, as expected, a high CpG content, but it also highlighted a strong dependence of CpG frequencies on the dinucleotide position, suggesting that nullomers have their own peculiar structure and are not simply sequences whose CpG frequency is biased. Furthermore, phylogenetic trees were built for eleven species based on both the similarities between the dinucleotide frequencies and the number of nullomers two species share, showing that nullomers are fairly conserved among close species. Finally, the study of the mean helical rise of nullomer sequences revealed significantly high mean rise values, reinforcing the hypothesis that those sequences have some peculiar structural features. The obtained results show that nullomers are the consequence of the peculiar structure of DNA (also including biased CpG frequency and CpG islands), so that the hypermutability model, even when taking CpG islands into account, seems insufficient to explain the nullomer phenomenon. Finally, high order nullomers could emphasize those features that already make simple nullomers useful in several applications.
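The two definitions above (nullomer and high order nullomer) can be made concrete with a minimal sketch. It assumes that "mutated sequences" means all single-nucleotide substitutions of a word, and it scans one strand only, ignoring reverse complements; the abstract does not specify either point. The function names are hypothetical.

```python
from itertools import product


def kmers_present(seq, k):
    """Set of all length-k words that occur in the sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def nullomers(seq, k, alphabet="ACGT"):
    """All length-k words over the alphabet that are absent from the sequence."""
    present = kmers_present(seq, k)
    words = ("".join(p) for p in product(alphabet, repeat=k))
    return sorted(w for w in words if w not in present)


def single_mutants(word, alphabet="ACGT"):
    """Yield every word that differs from `word` by one substitution."""
    for i, c in enumerate(word):
        for a in alphabet:
            if a != c:
                yield word[:i] + a + word[i + 1:]


def high_order_nullomers(seq, k, alphabet="ACGT"):
    """Nullomers whose single-substitution mutants are all nullomers too (assumed definition)."""
    present = kmers_present(seq, k)
    return [
        w for w in nullomers(seq, k, alphabet)
        if all(m not in present for m in single_mutants(w, alphabet))
    ]


# Toy example: in "AAAA" the only 2-mer present is "AA".
absent = nullomers("AAAA", 2)
robust = high_order_nullomers("AAAA", 2)
```

Enumerating all 4^k words is feasible only for short k; for genome-scale analyses a more compact k-mer index would be needed.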


Subject(s)
Base Sequence/genetics , Computational Biology , DNA/genetics , Oligonucleotides/genetics , CpG Islands/genetics , Genome, Human , Humans , Nucleic Acid Conformation