Results 1 - 17 of 17
1.
NAR Cancer ; 3(2): zcab017, 2021 Jun.
Article in English | MEDLINE | ID: mdl-34027407

ABSTRACT

Cancer evolves through the accumulation of somatic mutations over time. Although several methods have been developed to characterize mutational processes in cancers, these have not been specifically designed to identify mutational patterns that predict patient prognosis. Here we present CLICnet, a method that utilizes mutational data to cluster patients by survival rate. CLICnet employs Restricted Boltzmann Machines, a type of generative neural network, which allows for the capture of complex mutational patterns associated with patient survival in different cancer types. For some cancer types, clustering produced by CLICnet also predicts benefit from anti-PD1 immune checkpoint blockade therapy, whereas for other cancer types, the mutational processes associated with survival are different from those associated with the improved anti-PD1 survival benefit. Thus, CLICnet has the ability to systematically identify and catalogue combinations of mutations that predict cancer survival, unveiling intricate associations between mutations, survival, and immunotherapy benefit.

2.
Int J Mol Sci ; 22(6), 2021 Mar 12.
Article in English | MEDLINE | ID: mdl-33809353

ABSTRACT

The exponential growth of biomedical data in recent years has urged the application of numerous machine learning techniques to address emerging problems in biology and clinical research. By enabling the automatic feature extraction, selection, and generation of predictive models, these methods can be used to efficiently study complex biological systems. Machine learning techniques are frequently integrated with bioinformatic methods, as well as curated databases and biological networks, to enhance training and validation, identify the best interpretable features, and enable feature and model investigation. Here, we review recently developed methods that incorporate machine learning within the same framework with techniques from molecular evolution, protein structure analysis, systems biology, and disease genomics. We outline the challenges posed for machine learning, and, in particular, deep learning in biomedicine, and suggest unique opportunities for machine learning techniques integrated with established bioinformatics approaches to overcome some of these challenges.


Subject(s)
Computational Biology/trends , Databases, Factual/trends , Machine Learning/trends , Systems Biology/trends , Algorithms , Humans
3.
Nat Commun ; 12(1): 1504, 2021 03 08.
Article in English | MEDLINE | ID: mdl-33686085

ABSTRACT

Elucidating functionality in non-coding regions is a key challenge in human genomics. It has been shown that intolerance to variation of coding and proximal non-coding sequence is a strong predictor of human disease relevance. Here, we integrate intolerance to variation, functional genomic annotations and primary genomic sequence to build JARVIS: a comprehensive deep learning model to prioritize non-coding regions, outperforming other human lineage-specific scores. Despite being agnostic to evolutionary conservation, JARVIS performs comparably or outperforms conservation-based scores in classifying pathogenic single-nucleotide and structural variants. In constructing JARVIS, we introduce the genome-wide residual variation intolerance score (gwRVIS), applying a sliding-window approach to whole genome sequencing data from 62,784 individuals. gwRVIS distinguishes Mendelian disease genes from more tolerant CCDS regions and highlights ultra-conserved non-coding elements as the most intolerant regions in the human genome. Both JARVIS and gwRVIS capture previously inaccessible human-lineage constraint information and will enhance our understanding of the non-coding genome.


Subject(s)
Deep Learning , Genome, Human , Genomics , DNA, Intergenic , Genetic Variation , Humans , Sequence Analysis, DNA , Whole Genome Sequencing
4.
Microbiome ; 9(1): 78, 2021 03 29.
Article in English | MEDLINE | ID: mdl-33781338

ABSTRACT

BACKGROUND: Double-stranded DNA bacteriophages (dsDNA phages) play pivotal roles in structuring human gut microbiomes; yet, the gut virome is far from being fully characterized, and additional groups of phages, including highly abundant ones, continue to be discovered by metagenome mining. A multilevel framework for taxonomic classification of viruses was recently adopted, facilitating the classification of phages into evolutionary informative taxonomic units based on hallmark genes. Together with advanced approaches for sequence assembly and powerful methods of sequence analysis, this revised framework offers the opportunity to discover and classify unknown phage taxa in the human gut. RESULTS: A search of human gut metagenomes for circular contigs encoding phage hallmark genes resulted in the identification of 3738 apparently complete phage genomes that represent 451 putative genera. Several of these phage genera are only distantly related to previously identified phages and are likely to found new families. Two of the candidate families, "Flandersviridae" and "Quimbyviridae", include some of the most common and abundant members of the human gut virome that infect Bacteroides, Parabacteroides, and Prevotella. The third proposed family, "Gratiaviridae," consists of less abundant phages that are distantly related to the families Autographiviridae, Drexlerviridae, and Chaseviridae. Analysis of CRISPR spacers indicates that phages of all three putative families infect bacteria of the phylum Bacteroidetes. Comparative genomic analysis of the three candidate phage families revealed features without precedent in phage genomes. Some "Quimbyviridae" phages possess Diversity-Generating Retroelements (DGRs) that generate hypervariable target genes nested within defense-related genes, whereas the previously known targets of phage-encoded DGRs are structural genes. 
Several "Flandersviridae" phages encode enzymes of the isoprenoid pathway, a lipid biosynthesis pathway that so far has not been known to be manipulated by phages. The "Gratiaviridae" phages encode a HipA-family protein kinase and glycosyltransferase, suggesting these phages modify the host cell wall, preventing superinfection by other phages. Hundreds of phages in these three and other families are shown to encode catalases and iron-sequestering enzymes that can be predicted to enhance cellular tolerance to reactive oxygen species. CONCLUSIONS: Analysis of phage genomes identified in whole-community human gut metagenomes resulted in the delineation of at least three new candidate families of Caudovirales and revealed diverse putative mechanisms underlying phage-host interactions in the human gut. Addition of these phylogenetically classified, diverse, and distinct phages to public databases will facilitate taxonomic decomposition and functional characterization of human gut viromes. Video abstract.


Subject(s)
Bacteriophages , Gastrointestinal Microbiome , Microbiota , Bacteria/genetics , Bacteriophages/genetics , Gastrointestinal Microbiome/genetics , Genome, Viral/genetics , Humans , Metagenome , Phylogeny
5.
BMC Biol ; 18(1): 186, 2020 11 30.
Article in English | MEDLINE | ID: mdl-33256718

ABSTRACT

BACKGROUND: A crucial factor in mitigating respiratory viral outbreaks is early determination of the duration of the incubation period and, accordingly, the required quarantine time for potentially exposed individuals. At the time of the COVID-19 pandemic, optimization of quarantine regimes becomes paramount for public health, societal well-being, and global economy. However, biological factors that determine the duration of the virus incubation period remain poorly understood. RESULTS: We demonstrate a strong positive correlation between the length of the incubation period and disease severity for a wide range of human pathogenic viruses. Using a machine learning approach, we develop a predictive model that accurately estimates, solely from several virus genome features, in particular, the number of protein-coding genes and the GC content, the incubation time ranges for diverse human pathogenic RNA viruses including SARS-CoV-2. The predictive approach described here can directly help in establishing the appropriate quarantine durations and thus facilitate controlling future outbreaks. CONCLUSIONS: The length of the incubation period in viral diseases strongly correlates with disease severity, emphasizing the biological and epidemiological importance of the incubation period. Perhaps, surprisingly, incubation times of pathogenic RNA viruses can be accurately predicted solely from generic features of virus genomes. Elucidation of the biological underpinnings of the connections between these features and disease progression can be expected to reveal key aspects of virus pathogenesis.


Subject(s)
COVID-19/pathology , COVID-19/virology , Infectious Disease Incubation Period , SARS-CoV-2/genetics , Computer Simulation , Genome, Viral , Humans , Models, Biological , Mutation , Quarantine
6.
Nucleic Acids Res ; 48(21): e121, 2020 12 02.
Article in English | MEDLINE | ID: mdl-33045744

ABSTRACT

Recent advances in metagenomic sequencing have enabled discovery of diverse, distinct microbes and viruses. Bacteriophages, the most abundant biological entity on Earth, evolve rapidly, and therefore, detection of unknown bacteriophages in sequence datasets is a challenge. Most of the existing detection methods rely on sequence similarity to known bacteriophage sequences, impeding the identification and characterization of distinct, highly divergent bacteriophage families. Here we present Seeker, a deep-learning tool for alignment-free identification of phage sequences. Seeker allows rapid detection of phages in sequence datasets and differentiation of phage sequences from bacterial ones, even when those phages exhibit little sequence similarity to established phage families. We comprehensively validate Seeker's ability to identify previously unidentified phages, and employ this method to detect unknown phages, some of which are highly divergent from the known phage families. We provide a web portal (seeker.pythonanywhere.com) and a user-friendly Python package (github.com/gussow/seeker) allowing researchers to easily apply Seeker in metagenomic studies, for the detection of diverse unknown bacteriophages.


Subject(s)
Bacteria/virology , Bacteriophages/genetics , DNA, Viral/genetics , Genome, Viral , Metagenome , Software , Bacteria/genetics , Bacteriophages/classification , Biological Evolution , Deep Learning , Humans , Metagenomics/methods , Phylogeny , Sequence Analysis, DNA
7.
Nucleic Acids Res ; 48(16): 8828-8847, 2020 09 18.
Article in English | MEDLINE | ID: mdl-32735657

ABSTRACT

CRISPR-associated Rossmann Fold (CARF) and SMODS-associated and fused to various effector domains (SAVED) are key components of cyclic oligonucleotide-based antiphage signaling systems (CBASS) that sense cyclic oligonucleotides and transmit the signal to an effector inducing cell dormancy or death. Most of the CARFs are components of a CBASS built into type III CRISPR-Cas systems, where the CARF domain binds cyclic oligoA (cOA) synthesized by Cas10 polymerase-cyclase and allosterically activates the effector, typically a promiscuous ribonuclease. Additionally, this signaling pathway includes a ring nuclease, often also a CARF domain (either the sensor itself or a specialized enzyme) that cleaves cOA and mitigates dormancy or death induction. We present a comprehensive census of CARF and SAVED domains in bacteria and archaea, and their sequence- and structure-based classification. There are 10 major families of CARF domains and multiple smaller groups that differ in structural features, association with distinct effectors, and presence or absence of the ring nuclease activity. By comparative genome analysis, we predict specific functions of CARF and SAVED domains and partition the CARF domains into those with both sensor and ring nuclease functions, and sensor-only ones. Several families of ring nucleases functionally associated with sensor-only CARF domains are also predicted.


Subject(s)
Archaea/genetics , Archaeal Proteins/genetics , Bacteria/genetics , Bacterial Proteins/genetics , CRISPR-Cas Systems , Protein Domains , Archaea/enzymology , Archaeal Proteins/chemistry , Bacteria/enzymology , Bacterial Proteins/chemistry , Evolution, Molecular
8.
Nat Commun ; 11(1): 3784, 2020 07 29.
Article in English | MEDLINE | ID: mdl-32728052

ABSTRACT

The CRISPR-Cas are adaptive bacterial and archaeal immunity systems that have been harnessed for the development of powerful genome editing and engineering tools. In the incessant host-parasite arms race, viruses evolved multiple anti-defense mechanisms including diverse anti-CRISPR proteins (Acrs) that specifically inhibit CRISPR-Cas and therefore have enormous potential for application as modulators of genome editing tools. Most Acrs are small and highly variable proteins which makes their bioinformatic prediction a formidable task. We present a machine-learning approach for comprehensive Acr prediction. The model shows high predictive power when tested against an unseen test set and was employed to predict 2,500 candidate Acr families. Experimental validation of top candidates revealed two unknown Acrs (AcrIC9, IC10) and three other top candidates were coincidentally identified and found to possess anti-CRISPR activity. These results substantially expand the repertoire of predicted Acrs and provide a resource for experimental Acr discovery.


Subject(s)
Bacteriophages/genetics , CRISPR-Associated Protein 9/antagonists & inhibitors , Machine Learning , Sequence Analysis, Protein/methods , Viral Proteins/genetics , Archaea/genetics , Archaea/virology , Bacteria/genetics , Bacteria/virology , CRISPR-Associated Protein 9/genetics , CRISPR-Cas Systems/genetics , Computational Biology/methods , Datasets as Topic , Gene Editing/methods , Host-Parasite Interactions/genetics , Sequence Homology, Amino Acid
9.
bioRxiv ; 2020 Apr 09.
Article in English | MEDLINE | ID: mdl-32511301

ABSTRACT

SARS-CoV-2 poses an immediate, major threat to public health across the globe. Here we report an in-depth molecular analysis to reconstruct the evolutionary origins of the enhanced pathogenicity of SARS-CoV-2 and other coronaviruses that are severe human pathogens. Using integrated comparative genomics and machine learning techniques, we identify key genomic features that differentiate SARS-CoV-2 and the viruses behind the two previous deadly coronavirus outbreaks, SARS-CoV and MERS-CoV, from less pathogenic coronaviruses. These features include enhancement of the nuclear localization signals in the nucleocapsid protein and distinct inserts in the spike glycoprotein that appear to be associated with high case fatality rate of these coronaviruses as well as the host switch from animals to humans. The identified features could be crucial elements of coronavirus pathogenicity and possible targets for diagnostics, prognostication and interventions.

10.
Proc Natl Acad Sci U S A ; 117(26): 15193-15199, 2020 06 30.
Article in English | MEDLINE | ID: mdl-32522874

ABSTRACT

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) poses an immediate, major threat to public health across the globe. Here we report an in-depth molecular analysis to reconstruct the evolutionary origins of the enhanced pathogenicity of SARS-CoV-2 and other coronaviruses that are severe human pathogens. Using integrated comparative genomics and machine learning techniques, we identify key genomic features that differentiate SARS-CoV-2 and the viruses behind the two previous deadly coronavirus outbreaks, SARS-CoV and Middle East respiratory syndrome coronavirus (MERS-CoV), from less pathogenic coronaviruses. These features include enhancement of the nuclear localization signals in the nucleocapsid protein and distinct inserts in the spike glycoprotein that appear to be associated with high case fatality rate of these coronaviruses as well as the host switch from animals to humans. The identified features could be crucial contributors to coronavirus pathogenicity and possible targets for diagnostics, prognostication, and interventions.


Subject(s)
Betacoronavirus/genetics , Evolution, Molecular , Genome, Viral , Nucleocapsid Proteins/genetics , Spike Glycoprotein, Coronavirus/genetics , Animals , Betacoronavirus/classification , Betacoronavirus/pathogenicity , Host Specificity , Humans , Machine Learning , Middle East Respiratory Syndrome Coronavirus/classification , Middle East Respiratory Syndrome Coronavirus/genetics , Middle East Respiratory Syndrome Coronavirus/pathogenicity , Mutagenesis, Insertional , Nuclear Localization Signals/genetics , Nucleocapsid Proteins/chemistry , Phylogeny , SARS-CoV-2 , Sequence Homology , Spike Glycoprotein, Coronavirus/chemistry , Virulence/genetics
12.
Nat Microbiol ; 3(1): 38-46, 2018 Jan.
Article in English | MEDLINE | ID: mdl-29133882

ABSTRACT

Metagenomic sequence analysis is rapidly becoming the primary source of virus discovery [1-3]. A substantial majority of the currently available virus genomes come from metagenomics, and some of these represent extremely abundant viruses, even if never grown in the laboratory. A particularly striking case of a virus discovered via metagenomics is crAssphage, which is by far the most abundant human-associated virus known, comprising up to 90% of sequences in the gut virome [4]. Over 80% of the predicted proteins encoded in the approximately 100 kilobase crAssphage genome showed no significant similarity to available protein sequences, precluding classification of this virus and hampering further study. Here we combine a comprehensive search of genomic and metagenomic databases with sensitive methods for protein sequence analysis to identify an expansive, diverse group of bacteriophages related to crAssphage and predict the functions of the majority of phage proteins, in particular those that comprise the structural, replication and expression modules. Most, if not all, of the crAss-like phages appear to be associated with diverse bacteria from the phylum Bacteroidetes, which includes some of the most abundant bacteria in the human gut microbiome and that are also common in various other habitats. These findings provide for experimental characterization of the most abundant but poorly understood members of the human-associated virome.


Subject(s)
Bacteriophages/classification , Bacteriophages/genetics , Gastrointestinal Microbiome/genetics , Genomics , Metagenomics , Bacteroidetes/virology , Databases, Protein , Genome, Viral/genetics , Humans , Models, Genetic , Molecular Sequence Data , Phylogeny , Sequence Analysis, Protein , Viral Proteins/chemistry , Viral Proteins/genetics
13.
PLoS One ; 12(8): e0181604, 2017.
Article in English | MEDLINE | ID: mdl-28797091

ABSTRACT

There is broad agreement that genetic mutations occurring outside of the protein-coding regions play a key role in human disease. Despite this consensus, we are not yet capable of discerning which portions of non-coding sequence are important in the context of human disease. Here, we present Orion, an approach that detects regions of the non-coding genome that are depleted of variation, suggesting that the regions are intolerant of mutations and subject to purifying selection in the human lineage. We show that Orion is highly correlated with known intolerant regions as well as regions that harbor putatively pathogenic variation. This approach provides a mechanism to identify pathogenic variation in the human non-coding genome and will have immediate utility in the diagnostic interpretation of patient genomes and in large case control studies using whole-genome sequences.


Subject(s)
Genetic Variation , Genome, Human , Genetic Predisposition to Disease , Genetics, Population , Humans , Models, Genetic , Mutation , Open Reading Frames , Selection, Genetic
14.
Genome Res ; 26(10): 1411-1416, 2016 10.
Article in English | MEDLINE | ID: mdl-27516621

ABSTRACT

Cultured neuronal networks monitored with microelectrode arrays (MEAs) have been used widely to evaluate pharmaceutical compounds for potential neurotoxic effects. A newer application of MEAs has been in the development of in vitro models of neurological disease. Here, we directly evaluated the utility of MEAs to recapitulate in vivo phenotypes of mature microRNA-128 (miR-128) deficiency, which causes fatal seizures in mice. We show that inhibition of miR-128 results in significantly increased neuronal activity in cultured neuronal networks derived from primary mouse cortical neurons. These results support the utility of MEAs in developing in vitro models of neuroexcitability disorders, such as epilepsy, and further suggest that MEAs provide an effective tool for the rapid identification of microRNAs that promote seizures when dysregulated.


Subject(s)
Action Potentials , MicroRNAs/genetics , Neurons/physiology , Patch-Clamp Techniques/methods , Seizures/genetics , Tissue Array Analysis/methods , Animals , Cells, Cultured , Cerebral Cortex/cytology , Mice , Mice, Inbred C57BL , Neurons/metabolism , Seizures/physiopathology
15.
Genome Biol ; 17: 9, 2016 Jan 18.
Article in English | MEDLINE | ID: mdl-26781712

ABSTRACT

Ranking human genes based on their tolerance to functional genetic variation can greatly facilitate patient genome interpretation. It is well established, however, that different parts of proteins can have different functions, suggesting that it will ultimately be more informative to focus attention on functionally distinct portions of genes. Here we evaluate the intolerance of genic sub-regions using two biological sub-region classifications. We show that the intolerance scores of these sub-regions significantly correlate with reported pathogenic mutations. This observation extends the utility of intolerance scores to indicating where pathogenic mutations are most likely to fall within genes.


Subject(s)
Genetic Variation , Genome, Human , Protein Structure, Tertiary/genetics , Exons/genetics , Humans , Mutation , Open Reading Frames/genetics
16.
PLoS Genet ; 11(9): e1005492, 2015 Sep.
Article in English | MEDLINE | ID: mdl-26332131

ABSTRACT

Noncoding sequence contains pathogenic mutations. Yet, compared with mutations in protein-coding sequence, pathogenic regulatory mutations are notoriously difficult to recognize. Most fundamentally, we are not yet adept at recognizing the sequence stretches in the human genome that are most important in regulating the expression of genes. For this reason, it is difficult to apply to the regulatory regions the same kinds of analytical paradigms that are being successfully applied to identify mutations among protein-coding regions that influence risk. To determine whether dosage sensitive genes have distinct patterns among their noncoding sequence, we present two primary approaches that focus solely on a gene's proximal noncoding regulatory sequence. The first approach is a regulatory sequence analogue of the recently introduced residual variation intolerance score (RVIS), termed noncoding RVIS, or ncRVIS. The ncRVIS compares observed and predicted levels of standing variation in the regulatory sequence of human genes. The second approach, termed ncGERP, reflects the phylogenetic conservation of a gene's regulatory sequence using GERP++. We assess how well these two approaches correlate with four gene lists that use different ways to identify genes known or likely to cause disease through changes in expression: 1) genes that are known to cause disease through haploinsufficiency, 2) genes curated as dosage sensitive in ClinGen's Genome Dosage Map, 3) genes judged likely to be under purifying selection for mutations that change expression levels because they are statistically depleted of loss-of-function variants in the general population, and 4) genes judged unlikely to cause disease based on the presence of copy number variants in the general population. We find that both noncoding scores are highly predictive of dosage sensitivity using any of these criteria. 
In a similar way to ncGERP, we assess two ensemble-based predictors of regional noncoding importance, ncCADD and ncGWAVA, and find both scores are significantly predictive of human dosage sensitive genes and appear to carry information beyond conservation, as assessed by ncGERP. These results highlight that the intolerance of noncoding sequence stretches in the human genome can provide a critical complementary tool to other genome annotation approaches to help identify the parts of the human genome increasingly likely to harbor mutations that influence risk of disease.


Subject(s)
Gene Dosage , Genetic Variation , Regulatory Sequences, Nucleic Acid , DNA Copy Number Variations , Haploinsufficiency , Humans , Mental Disorders/genetics , Mutation , Nervous System Diseases/genetics
17.
Am J Hum Genet ; 89(4): 572-9, 2011 Oct 07.
Article in English | MEDLINE | ID: mdl-21963259

ABSTRACT

XX female gonadal dysgenesis (XX-GD) is a rare, genetically heterogeneous disorder characterized by lack of spontaneous pubertal development, primary amenorrhea, uterine hypoplasia, and hypergonadotropic hypogonadism as a result of streak gonads. Most cases are unexplained but thought to be autosomal recessive. We elucidated the genetic basis of XX-GD in a highly consanguineous Palestinian family by using homozygosity mapping and candidate-gene and whole-exome sequencing. Affected females were homozygous for a 3 bp deletion (NM_016556.2, c.600_602del) in the PSMC3IP gene, leading to deletion of a glutamic acid residue (p.Glu201del) in the highly conserved C-terminal acidic domain. Proteasome 26S subunit, ATPase, 3-Interacting Protein (PSMC3IP)/Tat Binding Protein Interacting Protein (TBPIP) is a nuclear, tissue-specific protein with multiple functions. It is critical for meiotic recombination as indicated by the known role of its yeast ortholog, Hop2. Through the C terminus (not present in yeast), PSMC3IP also coactivates ligand-driven transcription mediated by estrogen, androgen, glucocorticoid, progesterone, and thyroid nuclear receptors. In cell lines, the p.Glu201del mutation abolished PSMC3IP activation of estrogen-driven transcription. Impaired estrogenic signaling can lead to ovarian dysgenesis both by affecting the size of the follicular pool created during fetal development and by failing to counteract follicular atresia during puberty. PSMC3IP joins previous genes known to be mutated in XX-GD, the FSH receptor, and BMP15, highlighting the importance of hormonal signaling in ovarian development and maintenance and suggesting a common pathway perturbed in isolated XX-GD. By analogy to other XX-GD genes, PSMC3IP is also a candidate gene for premature ovarian failure, and its role in folliculogenesis should be further investigated.


Subject(s)
Chromosomes, Human, X , Estrogens/metabolism , Gonadal Dysgenesis/genetics , Nuclear Proteins/genetics , Trans-Activators/genetics , Consanguinity , Female , Gene Deletion , Genetic Markers , Genotype , Gonadal Dysgenesis, 46,XX/genetics , Haplotypes , Hearing Loss, Sensorineural/genetics , Homozygote , Humans , Male , Pedigree , Proteasome Endopeptidase Complex/metabolism , Transcription, Genetic