Results 1 - 11 of 11
1.
Bioinformatics ; 39(11), 2023 Nov 01.
Article in English | MEDLINE | ID: mdl-37878789

ABSTRACT

MOTIVATION: Whole genome alignment of eukaryote species remains an important method for the determination of sequence and structural variations and can also be used to ascertain the representative non-redundant core-genome sequence of a population. Many whole genome alignment tools were first developed for the more mature analysis of prokaryote species, and few current tools can process the larger genomes of eukaryotes or the genomes of more divergent species. In addition, these tools become computationally prohibitive because of the significant compute resources needed to handle larger genomes. RESULTS: In this research, we present CoreDetector, an easy-to-use general-purpose program that can align the core-genome sequences for a range of genome sizes and divergence levels. To illustrate the flexibility of CoreDetector, we conducted alignments of a large set of closely related fungal pathogen and hexaploid wheat cultivar genomes as well as more divergent fly and rodent species genomes. Compared to existing multiple genome alignment tools, CoreDetector exhibited improved flexibility and efficiency, and competitive accuracy, in all tested cases. AVAILABILITY AND IMPLEMENTATION: CoreDetector was developed in Java, a cross-platform and easily deployable language. A packaged pipeline is readily executable in a bash terminal without requiring external Perl or Python environments. Installation, example data, and usage instructions for CoreDetector are freely available from https://github.com/mfruzan/CoreDetector.


Subject(s)
Genomics , Software , Genomics/methods , Algorithms , Sequence Alignment , Genome
2.
Plant Methods ; 19(1): 96, 2023 Sep 02.
Article in English | MEDLINE | ID: mdl-37660084

ABSTRACT

BACKGROUND: Genomic prediction has become a powerful modelling tool for assessing line performance in plant and livestock breeding programmes. Among the genomic prediction modelling approaches, linear based models have proven to provide accurate predictions even when the number of genetic markers exceeds the number of data samples. However, breeding programmes are now compiling data from large numbers of lines and test environments for analyses, rendering these approaches computationally prohibitive. Machine learning (ML) now offers a solution to this problem through the construction of fully connected deep learning architectures and high parallelisation of the predictive task. However, the fully connected nature of these architectures immediately generates an over-parameterisation of the network that needs addressing for efficient and accurate predictions. RESULTS: In this research we explore the use of an ML architecture governed by variational Bayesian sparsity in its initial layers that we have called VBS-ML. The use of VBS-ML provides a mechanism for feature selection of important markers linked to the trait, immediately reducing the network over-parameterisation. Selected markers then propagate to the remaining fully connected feed-forward components of the ML network to form the final genomic prediction. We illustrated the approach with four large Australian wheat breeding data sets that range from 2665 lines to 10375 lines genotyped across a large set of markers. For all data sets, the use of the VBS-ML architecture improved genomic prediction accuracy over legacy linear based modelling approaches. CONCLUSIONS: An ML architecture governed under a variational Bayesian paradigm was shown to improve genomic prediction accuracy over legacy modelling approaches. 
This VBS-ML approach can be used to dramatically decrease the parameter burden on the network and provide a computationally feasible approach for improving genomic prediction conducted with large breeding population numbers and genetic markers.

3.
Front Plant Sci ; 14: 1096225, 2023.
Article in English | MEDLINE | ID: mdl-36818880

ABSTRACT

Despite frequent co-occurrence of drought and heat stress, the molecular mechanisms governing plant responses to these stresses in combination have not often been studied. This is particularly evident in non-model, perennial plants. We conducted large scale physiological and transcriptome analyses to identify genes and pathways associated with grapevine response to drought and/or heat stress during stress progression and recovery. We identified gene clusters with expression correlated to leaf temperature and water stress and five hub genes for the combined stress co-expression network. Several differentially expressed genes were common to the individual and combined stresses, but the majority were unique to the individual or combined stress treatments. These included heat-shock proteins, mitogen-activated kinases, sugar metabolizing enzymes, and transcription factors, while phenylpropanoid biosynthesis and histone modifying genes were unique to the combined stress treatment. Following physiological recovery, differentially expressed genes were found only in plants under heat stress, both alone and combined with drought. Taken collectively, our results suggest that the effect of the combined stress on physiology and gene expression is more severe than that of individual stresses, but not simply additive, and that epigenetic chromatin modifications may play an important role in grapevine responses to combined drought and heat stress.

4.
Gigascience ; 11, 2022 May 17.
Article in English | MEDLINE | ID: mdl-35579550

ABSTRACT

BACKGROUND: In diploid organisms, whole-genome haplotype assembly relies on the accurate identification and assignment of heterozygous single-nucleotide polymorphism alleles to the correct homologous chromosomes. This appropriate phasing of these alleles ensures that combinations of single-nucleotide polymorphisms on any chromosome, called haplotypes, can then be used in downstream genetic analysis approaches including determining their potential association with important phenotypic traits. A number of statistical algorithms and complementary computational software tools have been developed for whole-genome haplotype construction from genomic sequence data. However, many algorithms lack the ability to phase long haplotype blocks and simultaneously achieve a competitive accuracy. RESULTS: In this research we present HaploMaker, a novel reference-based haplotype assembly algorithm capable of accurately and efficiently phasing long haplotypes using paired-end short reads and longer Pacific Biosciences reads from diploid genomic sequences. To achieve this we frame the problem as a directed acyclic graph with edges weighted on read evidence and use efficient path traversal and minimization techniques to optimally phase haplotypes. We compared the HaploMaker algorithm with 3 other common reference-based haplotype assembly tools using public haplotype data of human individuals from the Platinum Genome project. With short-read sequences, the HaploMaker algorithm maintained a competitively low switch error rate across all haplotype lengths and was superior in phasing longer genomic regions. For longer Pacific Biosciences reads, the phasing accuracy of HaploMaker remained competitive for all block lengths and generated substantially longer block lengths than the competing algorithms. CONCLUSIONS: HaploMaker provides an improved haplotype assembly algorithm for diploid genomic sequences by accurately phasing longer haplotypes. 
The computationally efficient and portable nature of the Java implementation of the algorithm will ensure that it has maximal impact in reference-sequence-based haplotype assembly applications.


Subject(s)
Algorithms , Genomics , Humans , Alleles , Haplotypes , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/methods
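
The HaploMaker abstract above frames phasing as a path problem on a directed acyclic graph whose edges are weighted by read evidence. The following is a minimal Python sketch of that general idea, not HaploMaker's actual algorithm: it assumes a simple layered graph with two phase states per heterozygous site, links only adjacent sites, and uses the number of contradicting reads as the edge cost. The function name `phase_sites` and the read encoding (a dict mapping site index to allele 0 or 1) are illustrative assumptions.

```python
def phase_sites(num_sites, reads):
    """Toy minimum-cost path through a layered DAG of phase states.

    Assumptions (not from the paper): each heterozygous site has two
    states (allele 0 or 1 on haplotype A), only adjacent sites are
    linked, and an edge costs the number of reads that contradict it.
    reads: list of dicts mapping site index -> observed allele (0 or 1).
    """
    if num_sites < 1:
        raise ValueError("need at least one site")

    # Count reads that link adjacent sites with equal or different alleles.
    same = [0] * (num_sites - 1)
    diff = [0] * (num_sites - 1)
    for read in reads:
        for i in range(num_sites - 1):
            if i in read and i + 1 in read:
                if read[i] == read[i + 1]:
                    same[i] += 1
                else:
                    diff[i] += 1

    inf = float("inf")
    best = [0, inf]  # site 0 is pinned to state 0 to break the A/B symmetry
    back = []        # back[i][state] = best predecessor state at site i
    for i in range(num_sites - 1):
        new_best = [inf, inf]
        pointers = [0, 0]
        for prev_state in (0, 1):
            if best[prev_state] == inf:
                continue
            for next_state in (0, 1):
                # Keeping the state is contradicted by "diff" reads,
                # switching is contradicted by "same" reads.
                edge = diff[i] if prev_state == next_state else same[i]
                candidate = best[prev_state] + edge
                if candidate < new_best[next_state]:
                    new_best[next_state] = candidate
                    pointers[next_state] = prev_state
        back.append(pointers)
        best = new_best

    # Trace the cheapest path back from the last site.
    state = 0 if best[0] <= best[1] else 1
    total_cost = best[state]
    haplotype = [state]
    for pointers in reversed(back):
        state = pointers[state]
        haplotype.append(state)
    haplotype.reverse()
    return haplotype, total_cost


reads = [
    {0: 0, 1: 0}, {0: 1, 1: 1},                # sites 0 and 1 agree
    {1: 0, 2: 1}, {1: 1, 2: 0}, {1: 0, 2: 0},  # sites 1 and 2 mostly differ
]
haplotype_a, cost = phase_sites(3, reads)
print(haplotype_a, cost)
```

Because every edge runs from one site to the next, processing sites in order is already a topological traversal, so a single dynamic-programming pass finds the minimum-cost path. A realistic phaser would also have to handle reads spanning non-adjacent sites, base qualities, and breaks between haplotype blocks.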
5.
Front Plant Sci ; 10: 1244, 2019.
Article in English | MEDLINE | ID: mdl-31649706

ABSTRACT

Seed mutagenesis is one strategy to create a population with thousands of useful mutations for the direct selection of desirable traits, to introduce diversity into varietal improvement programs, or to generate a mutant collection to support gene functional analysis. However, phenotyping such large collections, where each individual may carry many mutations, is a bottleneck for downstream analysis. Targeting Induced Local Lesions in Genomes (TILLinG), when coupled with next-generation sequencing, allows high-throughput mutation discovery and selection by genotyping. We mutagenized an advanced durum breeding line, UAD0951096_F2:5, and performed short-read (2x125 bp) Illumina sequencing of the exome of 100 lines using an available exome capture platform. To improve variant calling, we generated a consolidated exome reference using the recently available genome sequences of the cultivars Svevo and Kronos to facilitate the alignment of reads from the UAD0951096_F2:5 derived mutants. The resulting exome reference was 484.4 Mbp. We also developed a user-friendly, searchable database and bioinformatic analysis pipeline that allowed us to predict the zygosity of the mutations discovered and to extract flanking sequences for rapid marker development. Here, we present these tools with the aim of allowing researchers fast and accurate downstream selection of mutations discovered by TILLinG by sequencing to support functional annotation of the durum wheat genome.

6.
Zebrafish ; 14(5): 492-494, 2017 Oct.
Article in English | MEDLINE | ID: mdl-28873048

ABSTRACT

Gene Ontology (GO) analysis is a powerful tool in systems biology, which uses a defined nomenclature to annotate genes/proteins within three categories: "Molecular Function," "Biological Process," and "Cellular Component." GO analysis can assist in revealing functional mechanisms underlying observed patterns in transcriptomic, genomic, and proteomic data. The already extensive and increasing use of zebrafish for modeling genetic and other diseases highlights the need to develop a GO analytical tool for this organism. The web tool Comparative GO was originally developed for GO analysis of bacterial data in 2013 ( www.comparativego.com ). We have now upgraded and elaborated this web tool for analysis of zebrafish genetic data using GOs and annotations from the Gene Ontology Consortium.


Subject(s)
Gene Ontology , Internet , Software , Zebrafish/genetics , Animals , Gene Expression Profiling , Genomics , Proteomics
7.
PLoS One ; 12(2): e0170486, 2017.
Article in English | MEDLINE | ID: mdl-28199395

ABSTRACT

Gene Ontology (GO) classification of statistically significantly differentially expressed genes is commonly used to interpret transcriptomics data as a part of functional genomic analysis. In this approach, all significantly expressed genes contribute equally to the final GO classification regardless of their actual expression levels. Gene expression levels can significantly affect protein production and hence should be reflected in GO term enrichment. Genes with low expression levels can also participate in GO term enrichment through cumulative effects. In this report, we introduce a new GO enrichment method, suitable for multiple-sample and time-series experiments, that uses a statistical outlier test to detect GO categories with special patterns of variation that can potentially identify candidate biological mechanisms. To demonstrate the value of our approach, we have performed two case studies. Whole transcriptome expression profiles of Salmonella enteritidis and Alzheimer's disease (AD) were analysed in order to determine GO term enrichment across the entire transcriptome instead of a subset of differentially expressed genes used in traditional GO analysis. Our results highlight the key role of inflammation related functional groups in AD pathology, as granulocyte colony-stimulating factor receptor binding, neuromedin U binding, and interleukin were remarkably upregulated in AD brain when using all of the gene expression data in the transcriptome. Mitochondrial components and the molybdopterin synthase complex were identified as potential key cellular components involved in AD pathology.


Subject(s)
Alzheimer Disease/genetics , Databases, Nucleic Acid , Gene Expression Regulation, Bacterial , Gene Ontology , Salmonella enteritidis/genetics , Transcriptome , Alzheimer Disease/metabolism , Humans , Salmonella enteritidis/metabolism
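
The abstract above describes letting every gene contribute to GO term enrichment according to its expression level, then using an outlier test to find GO categories that vary unusually across samples. The Python sketch below illustrates one way that could look, under stated assumptions: the abstract does not name the outlier test, so a median/MAD robust z-score is swapped in here, and the function names, the 3.5 threshold, and the GO identifiers in the usage example are all hypothetical.

```python
from statistics import median


def go_term_totals(expression, annotations):
    """Sum expression per GO term for each sample.

    expression:  {gene: [level in sample 0, level in sample 1, ...]}
    annotations: {gene: [GO term ids]}
    Every annotated gene contributes in proportion to its level, so
    weakly expressed genes still add up cumulatively.
    """
    totals = {}
    for gene, levels in expression.items():
        for term in annotations.get(gene, ()):
            acc = totals.setdefault(term, [0.0] * len(levels))
            for i, level in enumerate(levels):
                acc[i] += level
    return totals


def outlier_terms(totals, threshold=3.5):
    """Flag GO terms whose range across samples is a robust outlier.

    Assumption: a median/MAD z-score stands in for the (unspecified)
    statistical outlier test mentioned in the abstract.
    """
    spread = {term: max(v) - min(v) for term, v in totals.items()}
    values = list(spread.values())
    med = median(values)
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        return []  # no variation in spreads, nothing can be flagged
    return sorted(
        term for term, s in spread.items()
        if 0.6745 * abs(s - med) / mad > threshold
    )


# Made-up identifiers and values, for illustration only.
expression = {"g1": [1.0, 2.0], "g2": [3.0, 4.0]}
annotations = {"g1": ["GO:1"], "g2": ["GO:1", "GO:2"]}
print(go_term_totals(expression, annotations))
```

Aggregating over the whole transcriptome, rather than over a pre-filtered list of differentially expressed genes, is the point the abstract emphasises. The choice of spread statistic and outlier rule here is only a placeholder.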
9.
BMC Genomics ; 15: 769, 2014 Sep 08.
Article in English | MEDLINE | ID: mdl-25196724

ABSTRACT

BACKGROUND: Streptococcus pneumoniae (the pneumococcus) is the world's foremost microbial pathogen, killing more people each year than HIV, TB or malaria. The capacity to penetrate deeper host tissues contributes substantially to the ability of this organism to cause disease. Here we investigated, for the first time, functional genomics modulation of 3 pneumococcal strains (serotype 2 [D39], serotype 4 [WCH43] and serotype 6A [WCH16]) during transition from the nasopharynx to lungs to blood and to brain of mice at both promoter and domain activation levels. RESULTS: We found 7 highly activated transcription factors (TFs) [argR, codY, hup, rpoD, rr02, scrR and smrC] capable of binding to a large number of up-regulated genes, potentially constituting the regulatory backbone of pneumococcal pathogenesis. Strain D39 showed a distinct profile in employing a large number of TFs during blood infection. Interestingly, the same highly activated TFs used by D39 in blood are also used by WCH16 and WCH43 during brain infection. This indicates that different pneumococcal strains might activate a similar set of TFs and regulatory elements depending on the final site of infection. Hierarchical clustering analysis showed that all the highly activated TFs, except rpoD, clustered together with a high level of similarity in all 3 strains, which might suggest redundancy in the regulatory roles of these TFs during infection. Discriminant function analysis of the TFs in various niches highlights differential regulatory backgrounds of the 3 strains, and pathogenesis data confirms codY as the most significant predictor discriminating between these strains in various niches, particularly in the blood. Moreover, the predicted TF and domain activation profiles of the 3 strains correspond with their distinct pathogenicity characteristics. 
CONCLUSIONS: Our findings suggest that the pneumococcus changes the short binding sites in the promoter regions of genes in a niche-specific manner to enhance its ability to disseminate from one host niche to another. This study provides a framework for an improved understanding of the dynamics of pneumococcal pathogenesis, and opens a new avenue into similar investigations in other pathogenic bacteria.


Subject(s)
Bacterial Proteins/genetics , Genomics , Pneumococcal Infections/microbiology , Streptococcus pneumoniae/genetics , Streptococcus pneumoniae/metabolism , Transcription Factors/genetics , Animals , Bacterial Proteins/metabolism , Binding Sites , Cluster Analysis , Female , Gene Expression Regulation, Bacterial , Gene Regulatory Networks , Genetic Fitness , Mice , Promoter Regions, Genetic , Protein Binding , Protein Interaction Domains and Motifs , Proteomics , Streptococcus pneumoniae/pathogenicity , Transcription Factors/chemistry , Transcription Factors/metabolism
10.
PLoS One ; 8(10): e76042, 2013.
Article in English | MEDLINE | ID: mdl-24124532

ABSTRACT

MOTIVATION: Predicting the part of speech (POS) tag of an unknown word in a sentence is a significant challenge. This is particularly difficult in biomedicine, where POS tags serve as an input to training sophisticated literature summarization techniques, such as those based on Hidden Markov Models (HMM). Different approaches have been taken to deal with the POS tagger challenge, but with one exception, the TnT POS tagger, previous publications on POS tagging have omitted details of the suffix analysis used for handling unknown words. The suffix of an English word is a strong predictor of a POS tag for that word. As a pre-requisite for an accurate HMM POS tagger for biomedical publications, we present an efficient suffix prediction method for integration into a POS tagger. RESULTS: We have implemented a fully functional HMM POS tagger using experimentally optimised suffix based prediction. Our simple suffix analysis method significantly outperformed the probability interpolation based TnT method. We have also shown how important suffix analysis can be for probability estimation of a known word (in the training corpus) with an unseen POS tag, a common scenario with a small training corpus. We then integrated this simple method in our POS tagger and determined an optimised parameter set for both methods, which can help developers to optimise their current algorithm, based on our results. We also introduce the concept of counting methods in maximum likelihood estimation for the first time and show how counting methods can affect the prediction result. Finally, we describe how machine-learning techniques were applied to identify words for which prediction of POS tags was always incorrect, and propose a method to handle words of this type. AVAILABILITY AND IMPLEMENTATION: Java source code, binaries and setup instructions are freely available at http://genomes.sapac.edu.au/text_mining/pos_tagger.zip.
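
The abstract above rests on the observation that the suffix of an English word is a strong predictor of its POS tag. The Python sketch below shows that idea in its simplest form: count tags per suffix in a tagged training list and give an unknown word the most frequent tag of its longest known suffix. It is not the paper's optimised method or TnT's probability interpolation. The function names, the maximum suffix length of 4, the default tag, and the tiny training list are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_suffix_model(tagged_words, max_suffix_len=4):
    """Count how often each tag occurs with each word-final suffix."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        w = word.lower()
        for n in range(1, min(max_suffix_len, len(w)) + 1):
            counts[w[-n:]][tag] += 1
    return counts


def predict_tag(word, counts, max_suffix_len=4, default="NN"):
    """Return the maximum-likelihood tag of the longest known suffix.

    Falls back to `default` (an assumed choice) when no suffix of the
    word was seen in training.
    """
    w = word.lower()
    for n in range(min(max_suffix_len, len(w)), 0, -1):
        tag_counts = counts.get(w[-n:])
        if tag_counts:
            return tag_counts.most_common(1)[0][0]
    return default


# Tiny hand-made training list, for illustration only.
training = [
    ("phosphorylation", "NN"), ("activation", "NN"),
    ("rapidly", "RB"), ("significantly", "RB"),
    ("binding", "VBG"), ("signaling", "VBG"),
]
model = train_suffix_model(training)
print(predict_tag("methylation", model))  # matched via the suffix "tion"
print(predict_tag("selectively", model))  # matched via the suffix "ly"
```

Backing off from the longest suffix to shorter ones is only one design choice. The abstract compares its own method against interpolating probabilities across suffix lengths, and notes that how occurrences are counted in the maximum likelihood estimate can change the prediction.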

11.
PLoS One ; 8(3): e58759, 2013.
Article in English | MEDLINE | ID: mdl-23536820

ABSTRACT

The primary means of classifying new functions for genes and proteins relies on Gene Ontology (GO), which defines genes/proteins using a controlled vocabulary in terms of their Molecular Function, Biological Process and Cellular Component. The challenge is to present this information to researchers to compare and discover patterns in multiple datasets using visually comprehensible and user-friendly statistical reports. Importantly, while there are many GO resources available for eukaryotes, there are none suitable for simultaneous, graphical and statistical comparison between multiple datasets. In addition, none of them supports comprehensive resources for bacteria. By using Streptococcus pneumoniae as a model, we identified and collected GO resources including genes, proteins, taxonomy and GO relationships from NCBI, UniProt and GO organisations. Then, we designed database tables in a PostgreSQL database server and developed a Java application to extract data from source files and load it into the database automatically. We developed a PHP web application based on Model-View-Controller architecture, and used a specific data structure as well as current and novel algorithms to estimate GO graph parameters. We designed different navigation and visualization methods on the graphs and integrated these into graphical reports. This tool is particularly significant when comparing GO groups between multiple samples (including those of pathogenic bacteria) from different sources simultaneously. Comparing GO protein distribution among up- or down-regulated genes from different samples can improve understanding of biological pathways, and mechanism(s) of infection. It can also aid in the discovery of genes associated with specific function(s) for investigation as novel vaccine or therapeutic targets. AVAILABILITY: http://turing.ersa.edu.au/BacteriaGO.


Subject(s)
Bacteria/genetics , Bacteria/metabolism , Gene Ontology , Internet , Software , Algorithms , Computational Biology/methods , Databases, Genetic , User-Computer Interface