ABSTRACT
Genomic selection (GS) is revolutionizing plant breeding. However, because it is a predictive methodology, a basic understanding of statistical machine-learning methods is necessary for its successful implementation. This methodology uses a reference population containing both the phenotypic and genotypic information of genotypes to train a statistical machine-learning method. After optimization, this method is used to make predictions for candidate lines for which only genotypic information is available. However, due to a lack of time and appropriate training, it is difficult for breeders and scientists in related fields to learn all the fundamentals of prediction algorithms. With smart or highly automated software, these professionals can appropriately apply any state-of-the-art statistical machine-learning method to their collected data without an exhaustive understanding of statistical machine learning or programming. For this reason, we introduce state-of-the-art statistical machine-learning methods using the Sparse Kernel Methods (SKM) R library, with complete guidelines on how to implement the seven statistical machine-learning methods available in this library for genomic prediction (random forest, Bayesian models, support vector machine, gradient boosted machine, generalized linear models, partial least squares, and feed-forward artificial neural networks). This guide details the functions required to implement each of the methods, as well as those for easily implementing different tuning strategies, cross-validation strategies, metrics to evaluate prediction performance, and the summary functions that compute it. A toy dataset illustrates how to implement these statistical machine-learning methods, facilitating their use by professionals who do not possess a strong background in machine learning and programming.
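The train-then-predict workflow described above can be sketched in a few lines. The following example is illustrative only: SKM itself is an R library, this sketch is in Python, ridge regression stands in for any of the seven methods, and all data are synthetic.

```python
import numpy as np

# Toy illustration (not the SKM API) of the genomic-selection workflow:
# a reference population with genotypes X and phenotypes y trains a
# predictor, which then scores candidate lines with genotypes only.
rng = np.random.default_rng(0)

n_train, n_cand, n_markers = 50, 10, 200
X_train = rng.integers(0, 3, size=(n_train, n_markers)).astype(float)  # 0/1/2 marker codes
true_effects = rng.normal(0, 0.1, n_markers)
y_train = X_train @ true_effects + rng.normal(0, 0.5, n_train)

# Ridge regression (closed form), a stand-in for any statistical
# machine-learning method trained on the reference population.
lam = 1.0
A = X_train.T @ X_train + lam * np.eye(n_markers)
beta = np.linalg.solve(A, X_train.T @ y_train)

# Candidate lines: genotypic information only; phenotypes are predicted.
X_cand = rng.integers(0, 3, size=(n_cand, n_markers)).astype(float)
y_pred = X_cand @ beta
print(y_pred.shape)  # one prediction per candidate line
```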
Subject(s)
Plant Breeding , Software , Bayes Theorem , Genomics/methods , Machine Learning
ABSTRACT
Background: Genome skimming is a popular method in plant phylogenomics that does not include a biased enrichment step, relying instead on random shallow sequencing of total genomic DNA. From these data, the plastome is usually readily assembled and constitutes the bulk of the phylogenetic information generated in these studies. Despite a few attempts to use genome skims to recover low-copy nuclear loci for direct phylogenetic use, such an endeavor remains neglected. Causes might include the trade-off between libraries with few reads and species with large genomes (i.e., missing data caused by low coverage), but might also relate to the lack of pipelines for data assembly. Methods: A pipeline and its companion R package, designed to automate the recovery of low-copy nuclear markers from genome skimming libraries, are presented, along with a series of analyses evaluating the impact of key assembly parameters, reference selection, and missing data. Results: A substantial number of putative low-copy nuclear loci were assembled and proved useful as a basis for phylogenetic inference across the libraries tested (4 to 11 times more data than previously assembled plastomes from the same libraries). Discussion: Critical aspects of assembling low-copy nuclear markers from genome skims include the minimum coverage and depth required for a sequence to be used. More stringent values of these parameters reduce the amount of assembled data and increase the relative amount of missing data, which can compromise phylogenetic inference; in turn, relaxing the same parameters might increase sequence error. These issues are discussed in the text, and parameter tuning through multiple comparisons tracking their effects on support and congruence is highly recommended when using this pipeline.
The skimmingLoci pipeline (https://github.com/mreginato/skimmingLoci) might stimulate the use of genome skims to recover nuclear loci for direct phylogenetic use, increasing the power of genome skimming data to resolve phylogenetic relationships, while reducing the amount of sequenced DNA that is commonly wasted.
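The coverage/missing-data trade-off discussed above can be illustrated numerically. The Python sketch below is not part of the skimmingLoci pipeline; the locus-by-library depth matrix is simulated, and it simply shows how raising the minimum-depth threshold increases the fraction of missing cells in the final data matrix.

```python
import numpy as np

# Hypothetical illustration of the trade-off described above: raising the
# minimum depth required to keep a locus reduces the data retained and
# raises the fraction of missing cells in the locus x library matrix.
rng = np.random.default_rng(1)
depth = rng.poisson(4, size=(100, 8))  # simulated depth per locus/library

def missing_fraction(depth, min_depth):
    # Cells below the threshold are treated as missing data.
    return float((depth < min_depth).mean())

lenient = missing_fraction(depth, 2)
strict = missing_fraction(depth, 8)
assert lenient < strict  # stricter filtering -> more missing data
```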
Subject(s)
DNA , Genome, Plant , Phylogeny , Genome, Plant/genetics , Sequence Analysis, DNA/methods , Genomic Library
ABSTRACT
The adoption of machine learning frameworks in areas beyond computer science has been facilitated by the development of user-friendly software tools that do not require an advanced understanding of computer programming. In this paper, we present a new software package (Sparse Kernel Methods, SKM) developed in the R language for implementing six of the most popular supervised machine learning algorithms (generalized boosted machines, generalized linear models, support vector machines, random forest, Bayesian regression models, and deep neural networks) with the optional use of sparse kernels. SKM focuses on user simplicity: it does not try to include all available machine learning algorithms, but rather presents the most important aspects of these six algorithms in an easy-to-understand format. Another relevant contribution of this package is a function for the computation of seven different kernels: Linear, Polynomial, Sigmoid, Gaussian, Exponential, Arc-Cosine 1, and Arc-Cosine L (with L = 2, 3, …), together with their sparse versions, which allow users to create kernel machines without modifying the statistical machine learning algorithm. It is important to point out that the main contribution of our package resides in the functionality for computing the sparse versions of these seven basic kernels, which is indispensable for reducing the computational resources needed to implement kernel machine learning methods without a significant loss in prediction performance. Performance of SKM is evaluated in a genome-based prediction framework using both a maize and a wheat data set. However, the use of this package is not restricted to genome prediction problems; it can be applied in many different domains.
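Two of the kernels listed above have simple closed forms that are easy to write out directly. The Python sketch below is not SKM's implementation; it computes the standard Gaussian kernel and the degree-1 arc-cosine kernel of Cho & Saul from a small feature matrix, which mirrors what a kernel-construction function does.

```python
import numpy as np

# Illustrative kernel computations (Gaussian and Arc-Cosine 1),
# written from their textbook formulas, not taken from the SKM package.
def gaussian_kernel(X, gamma=0.1):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def arc_cosine_1_kernel(X):
    # Cho & Saul (2009) degree-1 arc-cosine kernel:
    # K[i, j] = (1/pi) ||x_i|| ||x_j|| (sin t + (pi - t) cos t), t = angle(x_i, x_j)
    norms = np.linalg.norm(X, axis=1)
    cos = np.clip((X @ X.T) / np.outer(norms, norms), -1.0, 1.0)
    theta = np.arccos(cos)
    return (1 / np.pi) * np.outer(norms, norms) * (np.sin(theta) + (np.pi - theta) * cos)

X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
K = arc_cosine_1_kernel(X)
# On the diagonal this kernel reduces to the squared norm of each row.
print(np.round(np.diag(K), 6))  # [1. 4. 2.]
```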
ABSTRACT
Nowadays, there are thousands of publicly available gene expression datasets that can be analyzed in silico using specialized software or the R programming language. However, transcriptomic studies consider experimental conditions individually, giving one independent result per comparison. Here we describe Gene Expression Variation Analysis (GEVA), a new R package that accepts multiple differential expression analysis results as input and performs several statistical steps, such as weighted summarization, quantile partitioning, and clustering, to find genes whose differential expression varied least across all experiments. The experimental conditions can be divided into groups, which we call factors, to which additional ANOVA (Fisher's and Levene's) tests are applied to identify genes differentially expressed either specifically in response to one factor or dependently on all factors. The final results present three possible classifications for relevant genes: similar, factor-dependent, and factor-specific. To validate these results after GEVA's development, 28 transcriptomic datasets were tested using 11 different combinations of the available parameters, including several clustering, quantile, and summarization methods. The final classifications were validated using knockout studies from different organisms, as they lack genes whose differential expression would otherwise be expected. Although some of the final classifications differed depending on the choice of parameters, the test results from the default parameters corroborated the published experimental studies regarding the selected datasets. Thus, we conclude that GEVA can effectively find similarities between groups of biological conditions and could therefore be a robust alternative for multiple comparison analyses.
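The distinction between a "similar" gene and a "factor-dependent" gene can be illustrated with a toy one-way ANOVA. The sketch below is hypothetical and much simpler than GEVA's actual pipeline (which adds weighted summarization, quantile partitioning, and clustering); it only shows how an F statistic separates a gene with stable log-fold-changes from one whose response shifts between factor groups. All values are invented.

```python
import numpy as np

# Hand-rolled one-way ANOVA F statistic (a stand-in for Fisher's test).
def f_statistic(groups):
    all_vals = np.concatenate(groups)
    grand = all_vals.mean()
    k, n = len(groups), len(all_vals)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# "Similar" gene: consistent log-fold-change in both factor groups.
similar = [np.array([2.0, 2.1, 1.9]), np.array([2.0, 1.95, 2.05])]
# "Factor-dependent" gene: log-fold-change shifts between factor groups.
dependent = [np.array([2.0, 2.1, 1.9]), np.array([-1.0, -0.9, -1.1])]

# A large F flags the gene whose expression depends on the factor.
assert f_statistic(similar) < f_statistic(dependent)
```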
Subject(s)
Gene Expression Profiling , Software , Cluster Analysis , Gene Expression Profiling/methods , Programming Languages , Transcriptome
ABSTRACT
BACKGROUND: Finding meaningful gene-gene interactions and the main transcription factors (TFs) in co-expression networks is one of the most important challenges in gene expression data mining. RESULTS: Here, we developed the R package "CeTF", which integrates the Partial Correlation with Information Theory (PCIT) and Regulatory Impact Factors (RIF) algorithms applied to gene expression data from microarray, RNA-seq, or single-cell RNA-seq platforms. This approach allows identification of the transcription factors most likely to regulate a given network in different biological systems - for example, regulation of gene pathways in tumor stromal cells and tumor cells of the same tumor. This pipeline can be easily integrated into high-throughput analyses. To demonstrate the application of the CeTF package, we analyzed gastric cancer RNA-seq data obtained from TCGA (The Cancer Genome Atlas) and found the HOXB3 gene to be the second most relevant TF with a high regulatory impact (TFs-HRi) regulating gene pathways in the cell cycle. CONCLUSION: This preliminary finding shows the potential of CeTF to list master regulators of gene networks. CeTF was designed as a user-friendly tool that provides many highly automated functions without requiring the user to perform many complicated processes. It is available on Bioconductor ( http://bioconductor.org/packages/CeTF ) and GitHub ( http://github.com/cbiagii/CeTF ).
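The core idea behind PCIT can be shown with the standard first-order partial correlation formula. This Python sketch is not the CeTF code: it only illustrates how an apparent gene-gene correlation can collapse once a shared regulator is conditioned on, which is how indirect edges are pruned; the correlation values are invented.

```python
import numpy as np

# First-order partial correlation of x and y given z:
# r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
def partial_corr(r_xy, r_xz, r_yz):
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Genes x and y appear correlated (0.8), but both track a regulator z.
r_xy, r_xz, r_yz = 0.8, 0.9, 0.9
print(round(partial_corr(r_xy, r_xz, r_yz), 3))  # near zero: an indirect edge
```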
Subject(s)
Information Theory , Transcription Factors , Gene Expression Regulation , Gene Regulatory Networks , Software , Transcription Factors/genetics
ABSTRACT
Multilocus sequence typing (MLST) is a standard tool in population genetics and bacterial epidemiology that assesses the genetic variation present in a reduced number of housekeeping genes (typically seven) along the genome. This methodology assigns arbitrary integer identifiers to genetic variations at these loci, allowing efficient comparison of bacterial isolates using allele-based methods. Now, the increasing availability of whole-genome sequences for hundreds to thousands of strains of the same bacterial species has made it possible to apply and extend MLST schemes by automatically extracting allele information from the genomes. The PubMLST database is the most comprehensive resource of described schemes available for a wide variety of species. Here we present MLSTar, the first R package that allows users to (i) connect with the PubMLST database to select a target scheme, (ii) screen a desired set of genomes to assign alleles and sequence types, and (iii) interact with other widely used R packages to analyze and produce graphical representations of the data. We applied MLSTar to analyze more than 2,500 bacterial genomes from different species, showing high accuracy and performance comparable to previously published command-line tools. MLSTar can be freely downloaded from http://github.com/iferres/MLSTar.
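The allele-based logic that MLSTar automates against PubMLST is simple to sketch: each locus sequence maps to an integer allele identifier, and the resulting allelic profile maps to a sequence type (ST). The Python example below is purely illustrative; the loci, sequences, allele numbers, and STs are made up, and real schemes use full gene sequences rather than four-base strings.

```python
# Hypothetical two-locus scheme: sequence -> allele ID per locus,
# and allelic profile -> sequence type (ST).
alleles = {
    "adk":  {"ATGC": 1, "ATGA": 2},
    "gyrB": {"GGCC": 1, "GGCT": 2},
}
profiles = {(1, 1): "ST-10", (2, 2): "ST-73"}

def sequence_type(genome_loci):
    # Build the allelic profile in scheme order, then look up the ST.
    profile = tuple(alleles[locus][seq] for locus, seq in genome_loci.items())
    return profiles.get(profile, "novel ST")

isolate = {"adk": "ATGC", "gyrB": "GGCC"}
print(sequence_type(isolate))  # ST-10
```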
ABSTRACT
Targeted sequencing (TS) is growing as a screening methodology used in research and medical genetics to identify genomic alterations causing human diseases. In general, a list of possible genomic variants is derived from mapped reads through a variant calling step. This processing step is usually based on variant coverage, although it may be affected by several factors. Therefore, undercovered but clinically relevant variants may not be reported, affecting pathology diagnosis or treatment. Thus, prior quality control of the experiment is critical to determine variant detection accuracy and to avoid erroneous medical conclusions. Several quality control tools exist, but they focus on issues related to whole-genome sequencing. In TS, however, quality control should assess experiment, gene, and genomic region performance based on the achieved coverage. Here, we propose the TarSeqQC R package for quality control in TS experiments. The tool is freely available in the Bioconductor repository. TarSeqQC was used to analyze two datasets; low-performance primer pools and features were detected, enhancing the quality of the experimental results. Read count profiles were also explored, showing TarSeqQC's effectiveness as an exploration tool. Our proposal may be a valuable bioinformatic tool for routine TS experiments in both research and medical genetics.
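The coverage-based assessment described above can be reduced to a small sketch: summarize per-base depth within each targeted feature and flag the features that fall below a minimum, since variants there risk going undetected. This Python example is hypothetical (not TarSeqQC's API); the feature names, depths, and threshold are invented.

```python
import numpy as np

# Simulated per-base depth within two targeted features.
features = {
    "BRCA1_ex2": np.array([120, 130, 118, 125]),
    "TP53_ex5":  np.array([8, 5, 11, 7]),  # undercovered region
}

def flag_low_coverage(features, min_mean_depth=20):
    # Flag features whose mean depth falls below the QC threshold.
    return [name for name, depth in features.items()
            if depth.mean() < min_mean_depth]

print(flag_low_coverage(features))  # ['TP53_ex5']
```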
Subject(s)
Computational Biology/methods , Genomics/methods , High-Throughput Nucleotide Sequencing , Software , Computational Biology/standards , Datasets as Topic , Genomics/standards , Humans , Neoplasms/genetics , Quality Control , Reproducibility of Results , Software/standards , User-Computer Interface
ABSTRACT
We present a new software package (HZAR) that provides functions for fitting molecular genetic and morphological data from hybrid zones to classic equilibrium cline models using the Metropolis-Hastings Markov chain Monte Carlo (MCMC) algorithm. The software applies likelihood functions appropriate for different types of data, including diploid and haploid genetic markers and quantitative morphological traits. The modular design allows flexibility in fitting cline models of varying complexity. To facilitate hypothesis testing, an autofit function is included that allows automated model selection from a set of nested cline models. Cline parameter values, such as cline centre and cline width, are estimated and may be compared statistically across clines. The package is written in the R language and is available through the Comprehensive R Archive Network (CRAN; http://cran.r-project.org/). Here, we describe HZAR and demonstrate its use with a sample data set from a well-studied hybrid zone in western Panama between white-collared (Manacus candei) and golden-collared manakins (M. vitellinus). Comparisons of our results with previously published results for this hybrid zone validate the HZAR software. We extend the analysis of this hybrid zone by fitting additional models to the molecular data where appropriate.
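The centre and width parameters mentioned above come from the classic equilibrium sigmoid cline. The Python sketch below (illustrative, not HZAR's R code) writes that model out directly: allele frequency rises from pMin to pMax along a transect, with centre c and width w defined as the inverse of the maximum slope; the transect values are invented.

```python
import numpy as np

# Classic sigmoid cline: p(x) = pMin + (pMax - pMin) / (1 + exp(-4 (x - c) / w)),
# where c is the cline centre and w the cline width (1 / maximum slope).
def cline(x, c, w, p_min=0.0, p_max=1.0):
    return p_min + (p_max - p_min) / (1.0 + np.exp(-4.0 * (x - c) / w))

x = np.array([-50.0, 0.0, 50.0])  # transect distance (e.g. km)
freq = cline(x, c=0.0, w=20.0)    # frequency is exactly 0.5 at the centre
print(np.round(freq, 3))
```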