ABSTRACT
Living species vary significantly in phenotype and genomic content. Sophisticated statistical methods linking genes with phenotypes within a species have led to breakthroughs in complex genetic diseases and genetic breeding. Despite the abundance of genomic and phenotypic data available for thousands of species, finding genotype-phenotype associations across species is challenging due to the non-independence of species data resulting from common ancestry. To address this, we present CALANGO (comparative analysis with annotation-based genomic components), a phylogeny-aware comparative genomics tool to find homologous regions and biological roles associated with quantitative phenotypes across species. In two case studies, CALANGO identified both known and previously unidentified genotype-phenotype associations. The first study revealed unknown aspects of the ecological interaction between Escherichia coli, its integrated bacteriophages, and the pathogenicity phenotype. The second identified an association between maximum height in angiosperms and the expansion of a reproductive mechanism that prevents inbreeding and increases genetic diversity, with implications for conservation biology and agriculture.
ABSTRACT
MOTIVATION: Insects possess a vast phenotypic diversity and key ecological roles. Several insect species also have medical, agricultural and veterinary importance as parasites and disease vectors. Therefore, strategies to identify potential essential genes in insects may reduce the resources needed to find molecular players in central processes of insect biology. However, most predictors of essential genes in multicellular eukaryotes using machine learning rely on expensive and laborious experimental data to be used as gene features, such as gene expression profiles or protein-protein interactions, even though some of this information may not be available for the majority of insect species with genomic sequences available. RESULTS: Here, we present and validate a machine learning strategy to predict essential genes in insects using sequence-based intrinsic attributes (statistical and physicochemical data) together with the predictions of subcellular location and transcriptomic data, if available. We gathered information available in public databases describing essential and non-essential genes for Drosophila melanogaster (fruit fly, Diptera) and Tribolium castaneum (red flour beetle, Coleoptera). We proceeded by computing intrinsic and extrinsic attributes that were used to train statistical models in one species and tested by their capability of predicting essential genes in the other. Even models trained using only intrinsic attributes are capable of predicting genes in the other insect species, including the prediction of lineage-specific essential genes. Furthermore, the inclusion of RNA-Seq data is a major factor to increase classifier performance. AVAILABILITY AND IMPLEMENTATION: The code, data and final models produced in this study are freely available at https://github.com/g1o/GeneEssentiality/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.