Results 1 - 20 of 25
1.
BMC Bioinformatics ; 25(1): 181, 2024 May 08.
Article in English | MEDLINE | ID: mdl-38720247

ABSTRACT

BACKGROUND: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the data are of variable quality, are procured differently, and contain unwanted noise that hampers the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in attempts to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin. RESULTS: We aimed to investigate the impact of data preprocessing steps (focusing on normalization, batch effect correction, and data scaling) through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer. CONCLUSION: By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance measured by weighted F1-score in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.


Subject(s)
Machine Learning , Neoplasms , RNA-Seq , Humans , RNA-Seq/methods , Neoplasms/genetics , Transcriptome/genetics , Sequence Analysis, RNA/methods , Gene Expression Profiling/methods , Computational Biology/methods
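
The abstract above reports performance as a weighted F1-score. As a minimal sketch of that metric (not the authors' code; the function name and the toy tissue labels are assumptions made for illustration), each class's F1 can be averaged with weights proportional to the number of true samples in that class:

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its share of samples
    return score


# Hypothetical tissue-of-origin labels, for demonstration only.
truth = ["lung", "lung", "liver", "liver"]
guess = ["lung", "liver", "liver", "liver"]
print(round(weighted_f1(truth, guess), 4))
```

Weighting by support keeps rare tumor types from dominating the score, at the cost of hiding poor performance on small classes.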
2.
Data Brief ; 45: 108641, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36426049

ABSTRACT

The data in this article are associated with the research paper "GigaAssay - an adaptable high-throughput saturation mutagenesis assay" [1]. The raw data are sequence reads of HIV-1 Tat cDNA amplified from cellular genomic DNA in a new single-pot saturation mutagenesis assay designated the "GigaAssay". A bioinformatic pipeline and the parameters used to analyze the data are described. Raw, processed, analyzed, and filtered data are reported. The data are processed to calculate the Tat-driven transcription activity for cells with each possible single amino acid substitution in Tat. These data can be reused to interpret Tat intermolecular interactions and HIV latency. This is one of the largest and most complete datasets regarding the impact of amino acid substitutions within a single protein on a molecular function.

3.
Genomics ; 114(4): 110439, 2022 07.
Article in English | MEDLINE | ID: mdl-35905834

ABSTRACT

High-throughput assay systems have had a large impact on understanding the mechanisms of basic cell functions. However, high-throughput assays that directly assess molecular functions are limited. Herein, we describe the "GigaAssay", a modular high-throughput one-pot assay system for measuring molecular functions of thousands of genetic variants at once. In this system, each cell was infected with one virus from a library encoding thousands of Tat mutant proteins, with each viral particle encoding a random unique molecular identifier (UMI). We demonstrate proof of concept by measuring transcription of a GFP reporter in an engineered reporter cell line driven by binding of the HIV Tat transcription factor to the HIV long terminal repeat. Infected cells were flow-sorted into 3 bins based on their GFP fluorescence readout. The transcriptional activity of each Tat mutant was calculated from the ratio of signals from each bin. The use of UMIs in the GigaAssay produced a high average accuracy (95%) and positive predictive value (98%) determined by comparison to literature benchmark data, known C-terminal truncations, and blinded independent mutant tests. Including the substitution tolerance with structure/function analysis shows restricted substitution types spatially concentrated in the Cys-rich region. Tat has abundant intragenic epistasis (10%) when single and double mutants are compared.


Subject(s)
HIV-1 , tat Gene Products, Human Immunodeficiency Virus , Cell Line , HIV Long Terminal Repeat , HIV-1/genetics , Mutagenesis , Transcriptional Activation , tat Gene Products, Human Immunodeficiency Virus/genetics , tat Gene Products, Human Immunodeficiency Virus/metabolism
4.
Trends Genet ; 38(1): 12-21, 2022 01.
Article in English | MEDLINE | ID: mdl-34340871

ABSTRACT

Human-specific endogenous retrovirus H (HERVH) is highly expressed in both naive and primed stem cells and is essential for pluripotency. Despite the proven relationship between HERVH expression and pluripotency, there is no single definitive model for the function of HERVH. Instead, several hypotheses of a regulatory function have been put forward, including HERVH acting as enhancers, long noncoding RNAs (lncRNAs), and most recently as markers of topologically associating domain (TAD) boundaries. Recently, several enhancer-associated lncRNAs have been characterized, which bind to Mediator and are necessary for promoter-enhancer folding interactions. We propose a synergistic model of HERVH function combining relevant findings and discuss the current limitations for its role in regulation, including the lack of evidence for a pluripotency-associated target gene.


Subject(s)
Endogenous Retroviruses , RNA, Long Noncoding , Endogenous Retroviruses/metabolism , Enhancer Elements, Genetic , Humans , RNA, Long Noncoding/metabolism , Stem Cells/metabolism
5.
Sci Rep ; 11(1): 4482, 2021 02 24.
Article in English | MEDLINE | ID: mdl-33627720

ABSTRACT

The study aimed to utilize machine learning (ML) approaches and genomic data to develop a prediction model for bone mineral density (BMD) and identify the best modeling approach for BMD prediction. The genomic and phenotypic data of the Osteoporotic Fractures in Men Study (n = 5130) were analyzed. Genetic risk score (GRS) was calculated from 1103 associated SNPs for each participant after a comprehensive genotype imputation. Data were normalized and divided into a training set (80%) and a validation set (20%) for analysis. Random forest, gradient boosting, neural network, and linear regression were used to develop BMD prediction models separately. Ten-fold cross-validation was used for hyperparameter optimization. Mean square error and mean absolute error were used to assess model performance. When using GRS and phenotypic covariates as the predictors, the performance of all ML models and of linear regression in BMD prediction was similar. However, when replacing GRS with the 1103 individual SNPs in the model, ML models performed significantly better than linear regression (with lasso regularization), and the gradient boosting model performed the best. Our study suggested that ML models, especially gradient boosting, can improve BMD prediction in genomic data.


Subject(s)
Bone Density/genetics , Bone Density/physiology , Aged , Fractures, Bone/genetics , Fractures, Bone/pathology , Genomics/methods , Genotype , Humans , Linear Models , Machine Learning , Male , Polymorphism, Single Nucleotide/genetics , Risk Assessment , Risk Factors
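
The evaluation setup described in the abstract above (an 80%/20% split, scored by mean square error and mean absolute error) can be sketched as follows. This is an illustration under assumptions, not the study's pipeline: the function names, the fixed random seed, and the toy numbers are all hypothetical.

```python
import random


def train_validation_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into training and validation parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy participant identifiers standing in for the cohort.
train, validation = train_validation_split(list(range(10)))
print(len(train), len(validation))

# Toy observed vs. predicted values on a validation set.
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
print(mean_squared_error(observed, predicted), mean_absolute_error(observed, predicted))
```

Squared error penalizes large misses more heavily than absolute error, which is one reason to report both.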
6.
Calcif Tissue Int ; 107(4): 353-361, 2020 10.
Article in English | MEDLINE | ID: mdl-32728911

ABSTRACT

The study aims were to develop fracture prediction models by using machine learning approaches and genomic data, as well as to identify the best modeling approach for fracture prediction. The genomic data of the Osteoporotic Fractures in Men cohort study (n = 5130) were analyzed. After a comprehensive genotype imputation, genetic risk score (GRS) was calculated from 1103 associated single nucleotide polymorphisms for each participant. Data were normalized and split into a training set (80%) and a validation set (20%) for analysis. Random forest, gradient boosting, neural network, and logistic regression were used to develop prediction models for major osteoporotic fractures separately, with GRS, bone density, and other risk factors as predictors. In model training, the synthetic minority oversampling technique was used to account for the low fracture rate, and tenfold cross-validation was employed for hyperparameter optimization. In testing, the area under the curve (AUC) and accuracy were used to assess model performance. The McNemar test was employed to examine the accuracy difference between models. The results showed that the prediction performance of gradient boosting was the best, with an AUC of 0.71 and an accuracy of 0.88, and the GRS ranked as the 7th most important variable in the model. The performance of random forest and neural network was also significantly better than that of logistic regression. This study suggested that improving fracture prediction in older men can be achieved by incorporating genetic profiling and by utilizing the gradient boosting approach. This result should not be extrapolated to women or young individuals.


Subject(s)
Bone Density , Fractures, Bone/diagnosis , Machine Learning , Risk Assessment , Activities of Daily Living , Aged , Aged, 80 and over , Cohort Studies , Genomics , Humans , Male , Phenotype
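
The abstract above uses the McNemar test to compare the accuracy of two models on the same test cases. The abstract does not say which variant was applied, so the sketch below assumes the exact (binomial) two-sided form; the function name and toy predictions are hypothetical. Only the discordant cases, where exactly one model is correct, enter the test.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts cases only model A got right
    and c counts cases only model B got right.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, m in triples if a == t and m != t)
    c = sum(1 for t, a, m in triples if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    # Under the null, discordant cases fall on either side with probability 1/2.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy example: model A is right on all ten cases, model B on only four.
truth = [1] * 10
model_a = [1] * 10
model_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(mcnemar_exact(truth, model_a, model_b))
```

For large numbers of discordant pairs, a chi-square approximation is commonly used instead of the exact sum.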
7.
Bioessays ; 41(12): e1900126, 2019 12.
Article in English | MEDLINE | ID: mdl-31693213

ABSTRACT

Genome editing with engineered nucleases (GEENs) introduces site-specific DNA double-strand breaks (DSBs) and repairs DSBs via nonhomologous end-joining (NHEJ) pathways that eventually create indels (insertions/deletions) in a genome. We ask whether the features of indels resulting from gene editing can be customized. A review of the literature reveals how gene editing technologies via NHEJ pathways impact gene editing. The survey consolidates a body of literature that suggests that the type (insertion, deletion, and complex) and the approximate length of indel edits can be somewhat customized with different GEENs and by manipulating the expression of key NHEJ genes. Structural data suggest that binding of GEENs to DNA may interfere with binding of key components of DNA repair complexes, favoring either classical- or alternative-NHEJ. The hypotheses have some limitations, but if validated, will enable scientists to better control indel makeup, holding promise for basic science and clinical applications of gene editing. Also see the video abstract here https://youtu.be/vTkJtUsLi3w.


Subject(s)
Gene Editing/methods , CRISPR-Cas Systems/genetics , DNA/genetics , DNA/metabolism , DNA Breaks, Double-Stranded , Humans , Transcription Activator-Like Effector Nucleases/metabolism , Zinc Finger Nucleases/metabolism
9.
Mob DNA ; 10: 39, 2019.
Article in English | MEDLINE | ID: mdl-31497073

ABSTRACT

BACKGROUND: Despite the long-held assumption that transposons are normally only expressed in the germ-line, recent evidence shows that transcripts of transposable element (TE) sequences are frequently found in the somatic cells. However, the extent of variation in TE transcript levels across different tissues and different individuals is unknown, and the co-expression between TEs and host gene mRNAs has not been examined. RESULTS: Here we report the variation in TE derived transcript levels across tissues and between individuals observed in the non-tumorous tissues collected for The Cancer Genome Atlas. We found core TE co-expression modules consisting mainly of transposons, showing correlated expression across broad classes of TEs. Despite this co-expression within tissues, there are individual TE loci that exhibit tissue-specific expression patterns when compared across tissues. The core TE modules were negatively correlated with other gene modules that consisted of immune response genes in interferon signaling. KRAB Zinc Finger Proteins (KZFPs) were over-represented gene members of the TE modules, showing positive correlation across multiple tissues. But we did not find overlap between TE-KZFP pairs that are co-expressed and TE-KZFP pairs that are bound in published ChIP-seq studies. CONCLUSIONS: We find unexpected variation in TE derived transcripts, within and across non-tumorous tissues. We describe a broad view of the RNA state for non-tumorous tissues exhibiting higher levels of TE transcripts. Tissues with higher levels of TE transcripts have a broad range of TEs co-expressed, with high expression of a large number of KZFPs, and lower RNA levels of immune genes.

10.
Mob DNA ; 10: 29, 2019.
Article in English | MEDLINE | ID: mdl-31320939

ABSTRACT

Though transposable elements make up around half of the human genome, the repetitive nature of their sequences makes it difficult to accurately align conventional sequencing reads. However, in light of new advances in sequencing technology, such as increased read length and paired-end libraries, these repetitive regions are now becoming easier to align to. This study investigates the mappability of transposable elements with 50 bp, 76 bp and 100 bp paired-end read libraries. With respect to those read lengths and allowing for 3 mismatches during alignment, over 68, 85, and 88% of all transposable elements in the RepeatMasker database are uniquely mappable, suggesting that accurate locus-specific mapping of older transposable elements is well within reach.
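
The notion of unique mappability under a mismatch allowance, as used in the abstract above, can be illustrated with a toy check. This is a deliberately simplified sketch and not the study's method: it assumes single-end reads, the forward strand only, and a brute-force scan, and the function names and sequences are made up for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def match_positions(read, reference, max_mismatches=3):
    """Start positions where the read aligns with at most max_mismatches."""
    width = len(read)
    return [
        start
        for start in range(len(reference) - width + 1)
        if hamming(read, reference[start:start + width]) <= max_mismatches
    ]


def is_uniquely_mappable(read, reference, max_mismatches=3):
    """A read is uniquely mappable if exactly one position is within tolerance."""
    return len(match_positions(read, reference, max_mismatches)) == 1


# Toy reference with a repeated "ACGT" segment.
reference = "ACGTACGTTTGACCA"
print(match_positions("TTGACCA", reference, max_mismatches=0))
print(is_uniquely_mappable("ACGT", reference, max_mismatches=0))
```

Longer reads span more flanking sequence, so fewer positions stay within the mismatch tolerance, which is consistent with the trend across read lengths reported in the abstract.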

11.
Mol Biol Evol ; 35(1): 50-65, 2018 01 01.
Article in English | MEDLINE | ID: mdl-29309688

ABSTRACT

Experimental evolution affords the opportunity to investigate adaptation to stressful environments. Studies combining experimental evolution with whole-genome resequencing have provided insight into the dynamics of adaptation and a new tool to uncover genes associated with polygenic traits. Here, we selected for starvation resistance in populations of Drosophila melanogaster for over 80 generations. In response, the starvation-selected lines developed an obese condition, storing nearly twice the level of total lipids as their unselected controls. Although these fats provide a ∼3-fold increase in starvation resistance, the imbalance in lipid homeostasis incurs an evolutionary cost. Some of these tradeoffs resemble obesity-associated pathologies in mammals including metabolic depression, low activity levels, dilated cardiomyopathy, and disrupted sleeping patterns. To determine the genetic basis of these traits, we resequenced genomic DNA from the selected lines and their controls. We found 1,046,373 polymorphic sites, many of which diverged between selection treatments. In addition, we found a wide range of genetic heterogeneity between the replicates of the selected lines, suggesting multiple mechanisms of adaptation. Genome-wide heterozygosity was low in the selected populations, with many large blocks of SNPs nearing fixation. We found candidate loci under selection by using an algorithm to control for the effects of genetic drift. These loci were mapped to a set of 382 genes, which are associated with many processes including nutrient response, catabolic metabolism, and lipid droplet function. The results of our study speak to the evolutionary origins of obesity and provide new targets to understand the polygenic nature of obesity in a unique model system.


Subject(s)
Drosophila melanogaster/genetics , Obesity/genetics , Starvation/genetics , Acclimatization , Adaptation, Physiological/genetics , Animals , Directed Molecular Evolution/methods , Disease Models, Animal , Evolution, Molecular , Genome, Insect/genetics , Genome-Wide Association Study/methods , Models, Genetic , Multifactorial Inheritance , Selection, Genetic/genetics
12.
J Mol Evol ; 83(3-4): 137-146, 2016 Oct.
Article in English | MEDLINE | ID: mdl-27770175

ABSTRACT

Evolutionary constraint for insertions and deletions (indels) is not necessarily equal to constraint for nucleotide substitutions for any given region of a genome. Knowing the variation in indel-specific evolutionary rates across the sequence will aid our understanding of evolutionary constraints on indels, and help us infer how indels have contributed to the evolution of the sequence. However, unlike for nucleotide substitutions, there has been no phylogenetic method that can statistically infer significantly different rates of indels across the sequence space independent of substitution rates. Here, we have developed software that finds sites with accelerated evolutionary rates specific to indels, by introducing a scaling parameter that only applies to the indel rates and not to the nucleotide substitution rates. Using the software, we show that we can find regions of accelerated rates of indels in the protein alignments of primate genomes. We also confirm that the sites that have high rates of indels are different from the sites that have high rates of nucleotide substitutions within the protein sequences. By identifying regions with accelerated rates of indels independent of nucleotide substitutions, we will be able to better understand the impact of indel mutations on protein sequence evolution.


Subject(s)
INDEL Mutation , Models, Genetic , Mutation Rate , Animals , Computer Simulation , Evolution, Molecular , Humans , Nucleotides/genetics , Phylogeny , Proteins/genetics , Sequence Deletion , Software , Species Specificity
13.
Science ; 347(6217): 1258522, 2015 Jan 02.
Article in English | MEDLINE | ID: mdl-25554792

ABSTRACT

Variation in vectorial capacity for human malaria among Anopheles mosquito species is determined by many factors, including behavior, immunity, and life history. To investigate the genomic basis of vectorial capacity and explore new avenues for vector control, we sequenced the genomes of 16 anopheline mosquito species from diverse locations spanning ~100 million years of evolution. Comparative analyses show faster rates of gene gain and loss, elevated gene shuffling on the X chromosome, and more intron losses, relative to Drosophila. Some determinants of vectorial capacity, such as chemosensory genes, do not show elevated turnover but instead diversify through protein-sequence changes. This dynamism of anopheline genes and genomes may contribute to their flexible capacity to take advantage of new ecological niches, including adapting to humans as primary hosts.


Subject(s)
Anopheles/genetics , Evolution, Molecular , Genome, Insect , Insect Vectors/genetics , Malaria/transmission , Animals , Anopheles/classification , Base Sequence , Chromosomes, Insect/genetics , Drosophila/genetics , Humans , Insect Vectors/classification , Molecular Sequence Data , Phylogeny , Sequence Alignment
14.
Mol Biol Evol ; 30(8): 1987-97, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23709260

ABSTRACT

Current sequencing methods produce large amounts of data, but genome assemblies constructed from these data are often fragmented and incomplete. Incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. This means that methods attempting to estimate rates of gene duplication and loss often will be misled by such errors and that rates of gene family evolution will be consistently overestimated. Here, we present a method that takes these errors into account, allowing one to accurately infer rates of gene gain and loss among genomes even with low assembly and annotation quality. The method is implemented in the newest version of the software package CAFE, along with several other novel features. We demonstrate the accuracy of the method with extensive simulations and reanalyze several previously published data sets. Our results show that errors in genome annotation do lead to higher inferred rates of gene gain and loss but that CAFE 3 sufficiently accounts for these errors to provide accurate estimates of important evolutionary parameters.


Subject(s)
Genome , Molecular Sequence Annotation/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Computational Biology/methods , Evolution, Molecular , Genomics/methods , Reproducibility of Results
15.
Nat Commun ; 3: 913, 2012 Jun 26.
Article in English | MEDLINE | ID: mdl-22735441

ABSTRACT

Ganoderma lucidum is a widely used medicinal macrofungus in traditional Chinese medicine that creates a diverse set of bioactive compounds. Here we report its 43.3-Mb genome, encoding 16,113 predicted genes, obtained using next-generation sequencing and optical mapping approaches. The sequence analysis reveals an impressive array of genes encoding cytochrome P450s (CYPs), transporters and regulatory proteins that cooperate in secondary metabolism. The genome also encodes one of the richest sets of wood degradation enzymes among all of the sequenced basidiomycetes. In all, 24 physical CYP gene clusters are identified. Moreover, 78 CYP genes are coexpressed with lanosterol synthase, and 16 of these show high similarity to fungal CYPs that specifically hydroxylate testosterone, suggesting their possible roles in triterpenoid biosynthesis. The elucidation of the G. lucidum genome makes this organism a potential model system for the study of secondary metabolic pathways and their regulation in medicinal fungi.


Subject(s)
Genome, Fungal/genetics , Reishi/genetics , Fungal Proteins/genetics , Reishi/metabolism
16.
Fly (Austin) ; 6(2): 121-5, 2012.
Article in English | MEDLINE | ID: mdl-22634624

ABSTRACT

Genes occasionally change their location in the genome through inter-chromosomal duplication and loss. These changes happen as mistakes during recombination or through retrotransposition. In Han and Hahn 2011,(1) we surveyed the genomes of ten Drosophila species to identify and characterize the gene transposition events in the history of these species. In the paper, we showed that the rate of gene transposition in Drosophila is higher than previously appreciated. To understand the process of gene transposition, we examined the sequences, locations, and functions of the transposed genes. Based on the elevated rate of sequence evolution in transposed genes and the frequent movements near the centromeres and telomeres, we could not reject the hypothesis that these are mutations fixed through relaxed selection. But, by examining the functions of transposed genes more carefully, we found that genes with male-specific functions and genes with female-specific functions move in opposite directions involving the X chromosome. We also found an over-representation of chromosome-related functions among the transposed genes. These observations suggest the possibility of particular selection pressures contributing to gene transpositions in Drosophila.


Subject(s)
Chromosomes, Insect , Drosophila/genetics , Gene Rearrangement , Genes, Insect , Animals , Female , Male
17.
Genetics ; 190(2): 813-25, 2012 Feb.
Article in English | MEDLINE | ID: mdl-22095076

ABSTRACT

Gene transposition puts a new gene copy in a novel genomic environment. Moreover, genes moving between the autosomes and the X chromosome experience change in several evolutionary parameters. Previous studies of gene transposition have not utilized the phylogenetic framework that becomes possible with the availability of whole genomes from multiple species. Here we used parsimonious reconstruction on the genomic distribution of gene families to analyze interchromosomal gene transposition in Drosophila. We identified 782 genes that have moved chromosomes within the phylogeny of 10 Drosophila species, including 87 gene families with multiple independent movements on different branches of the phylogeny. Using this large catalog of transposed genes, we detected accelerated sequence evolution in duplicated genes that transposed when compared to the parental copy at the original locus. We also observed a more refined picture of the biased movement of genes from the X chromosome to the autosomes. The bias of X-to-autosome movement was significantly stronger for RNA-based movements than for DNA-based movements, and among DNA-based movements there was an excess of genes moving onto the X chromosome as well. Genes involved in female-specific functions moved onto the X chromosome while genes with male-specific functions moved off the X. There was a significant overrepresentation of proteins involving chromosomal function among transposed genes, suggesting that genetic conflict between sexes and among chromosomes may be a driving force behind gene transposition in Drosophila.


Subject(s)
Chromosomes, Insect , DNA Transposable Elements , Drosophila/genetics , Genes, Insect , Animals , Chromosome Segregation , Female , Gene Duplication , Genome, Insect , Male , Recombination, Genetic
18.
Evolution ; 65(1): 231-45, 2011 Jan.
Article in English | MEDLINE | ID: mdl-20731717

ABSTRACT

Developmental mechanisms play an important role in determining the costs, limits, and evolutionary consequences of phenotypic plasticity. One issue central to these claims is the hypothesis of developmental decoupling, where alternate morphs result from evolutionarily independent developmental pathways. We address this assumption through a microarray study that tests whether differences in gene expression between alternate morphs are as divergent as those between sexes, a classic example of developmental decoupling. We then examine whether genes with morph-biased expression are less conserved than genes with shared expression between morphs, as predicted if developmental decoupling relaxes pleiotropic constraints on divergence. We focus on the developing horns and brains of two species of horned beetles with impressive sexual- and morph-dimorphism in the expression of horns and fighting behavior. We find that patterns of gene expression were as divergent between morphs as they were between sexes. However, overall patterns of gene expression were also highly correlated across morphs and sexes. Morph-biased genes were more evolutionarily divergent, suggesting a role of relaxed pleiotropic constraints or relaxed selection. Together these results suggest that alternate morphs are to some extent developmentally decoupled, and that this decoupling has significant evolutionary consequences. However, alternative morphs may not be as developmentally decoupled as sometimes assumed and such hypotheses of development should be revisited and refined.


Subject(s)
Coleoptera/anatomy & histology , Coleoptera/genetics , Animals , Biological Evolution , Coleoptera/classification , Coleoptera/growth & development , Female , Gene Expression Profiling , Gene Expression Regulation, Developmental , Genetic Pleiotropy , Hawaii , Male , Phenotype , Phylogeny , Sex Characteristics , Virginia
19.
Evolution ; 64(6): 1541-57, 2010 Jun.
Article in English | MEDLINE | ID: mdl-20298429

ABSTRACT

The two "rules of speciation"--the Large X-effect and Haldane's rule--hold throughout the animal kingdom, but the underlying genetic mechanisms that cause them are still unclear. Two predominant explanations--the "dominance theory" and faster male evolution--both have some empirical support, suggesting that the genetic basis of these rules is likely multifarious. We revisit one historical explanation for these rules, based on dysfunctional genetic interactions involving genes recently moved between chromosomes. We suggest that gene movement specifically off or onto the X chromosome is another mechanism that could contribute to the two rules, especially as X chromosome movements can be subject to unique sex-specific and sex chromosome specific consequences in hybrids. Our hypothesis is supported by patterns emerging from comparative genomic data, including a strong bias in interchromosomal gene movements involving the X and an overrepresentation of male reproductive functions among chromosomally relocated genes. In addition, our model indicates that the contribution of gene movement to the two rules in any specific group will depend upon key developmental and reproductive parameters that are taxon specific. We provide several testable predictions that can be used to assess the importance of gene movement as a contributor to these rules in the future.


Subject(s)
Genes , Genetic Speciation , X Chromosome , Animals , Female , Male
20.
BMC Bioinformatics ; 10: 356, 2009 Oct 27.
Article in English | MEDLINE | ID: mdl-19860910

ABSTRACT

BACKGROUND: Evolutionary trees are central to a wide range of biological studies. In many of these studies, tree nodes and branches need to be associated (or annotated) with various attributes. For example, in studies concerned with organismal relationships, tree nodes are associated with taxonomic names, whereas tree branches have lengths and oftentimes support values. Gene trees used in comparative genomics or phylogenomics are usually annotated with taxonomic information, genome-related data, such as gene names and functional annotations, as well as events such as gene duplications, speciations, or exon shufflings, combined with information related to the evolutionary tree itself. The data standards currently used for evolutionary trees have limited capacities to incorporate such annotations of different data types. RESULTS: We developed an XML language, named phyloXML, for describing evolutionary trees, as well as various associated data items. PhyloXML provides elements for commonly used items, such as branch lengths, support values, taxonomic names, and gene names and identifiers. By using "property" elements, phyloXML can be adapted to novel and unforeseen use cases. We also developed various software tools for reading, writing, conversion, and visualization of phyloXML formatted data. CONCLUSION: PhyloXML is an XML language defined by a complete schema in XSD that allows storing and exchanging the structures of evolutionary trees as well as associated data. More information about phyloXML itself, the XSD schema, as well as tools implementing and supporting phyloXML, is available at http://www.phyloxml.org.


Subject(s)
Biological Evolution , Computational Biology/methods , Genomics/methods , Phylogeny , Software , Databases, Genetic
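
To give a feel for the nested-clade structure that the phyloXML abstract above describes, the sketch below builds a small tree with Python's standard `xml.etree.ElementTree`. It is an illustration only: the helper names, the tree, and the branch lengths are made up, and the output has not been validated against the XSD schema, which remains the authority on element names and ordering.

```python
import xml.etree.ElementTree as ET

PHYLOXML_NS = "http://www.phyloxml.org"


def add_clade(parent, name=None, branch_length=None):
    """Append a <clade> to parent, optionally with a name and branch length."""
    clade = ET.SubElement(parent, "clade")
    if name is not None:
        ET.SubElement(clade, "name").text = name
    if branch_length is not None:
        ET.SubElement(clade, "branch_length").text = str(branch_length)
    return clade


def build_example_tree():
    """Return a toy rooted tree ((A, B), C) serialized as an XML string."""
    root = ET.Element("phyloxml", {"xmlns": PHYLOXML_NS})
    phylogeny = ET.SubElement(root, "phylogeny", {"rooted": "true"})
    ET.SubElement(phylogeny, "name").text = "toy tree"
    top = add_clade(phylogeny)                    # root clade
    inner = add_clade(top, branch_length=0.06)    # ancestor of A and B
    add_clade(inner, "A", 0.102)
    add_clade(inner, "B", 0.23)
    add_clade(top, "C", 0.4)
    return ET.tostring(root, encoding="unicode")


print(build_example_tree())
```

Because each clade is an element that can contain further clades, annotations such as taxonomy or support values can be attached at exactly the node they describe.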