Results 1 - 13 of 13
1.
BMC Ecol Evol ; 24(1): 11, 2024 Jan 20.
Article in English | MEDLINE | ID: mdl-38245667

ABSTRACT

Abrupt environmental changes can lead to evolutionary shifts in trait evolution. Identifying these shifts is an important step in understanding the evolutionary history of phenotypes. The detection performances of different methods are influenced by many factors, including different numbers of shifts, shift sizes, where a shift occurs on a tree, and the types of phylogenetic structure. Furthermore, the model assumptions are oversimplified, so are likely to be violated in real data, which could cause the methods to fail. We perform simulations to assess the effect of these factors on the performance of shift detection methods. To make the comparisons more complete, we also propose an ensemble variable selection method (R package ELPASO) and compare it with existing methods (R packages ℓ1ou and PhylogeneticEM). The performances of methods are highly dependent on the selection criterion. ℓ1ou+pBIC is usually the most conservative method and it performs well when signal sizes are large. ℓ1ou+BIC is the least conservative method and it performs well when signal sizes are small. The ensemble method provides more balanced choices between those two methods. Moreover, the performances of all methods are heavily impacted by measurement error, tree reconstruction error and shifts in variance.


Subjects
Phylogeny, Phenotype
2.
Stat Med ; 42(30): 5676-5693, 2023 12 30.
Article in English | MEDLINE | ID: mdl-37848186

ABSTRACT

Non-Negative Matrix Factorization (NMF) is a widely used dimension reduction method that factorizes a non-negative data matrix into two lower dimensional non-negative matrices: one is the basis or feature matrix which consists of the variables and the other is the coefficients matrix which is the projections of data points to the new basis. The features can be interpreted as sub-structures of the data. The number of sub-structures in the feature matrix is also called the rank. This parameter controls the model complexity and is the only tuning parameter for the NMF model. An appropriate rank will extract the key latent features while minimizing the noise from the original data. However, due to the large amount of optimization error always present in the NMF computation, the rank selection has been a difficult problem. We develop a novel rank selection method based on hypothesis testing, using a deconvolved bootstrap distribution to assess the significance level accurately. Through simulations, we compare our method with a rank selection method based on hypothesis testing using bootstrap distribution without deconvolution and a method based on cross-validation; we demonstrate that our method is not only accurate at estimating the true ranks for NMF, especially when the features are hard to distinguish, but also efficient at computation. When applied to real microbiome data (e.g., OTU data and functional metagenomic data), our method also shows the ability to extract interpretable subcommunities in the data.
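
To make the role of the rank concrete, the following is a minimal sketch of a rank-constrained NMF fit in plain Python. It uses the standard multiplicative update rules for the squared-error objective, which is an assumption made for illustration: the abstract does not say which NMF solver the authors use, and the rank selection test itself is not shown. The function names (matmul, transpose, frobenius_error, nmf) and the parameters iterations, seed and eps are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))


def nmf(v, rank, iterations=200, seed=0, eps=1e-12):
    """Approximate non-negative V (n x m) by W (n x rank) times H (rank x m).

    Assumption: multiplicative updates for squared error, not the authors' solver.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start so that no entry is stuck at zero.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h
```

Fitting the same matrix at several candidate ranks and comparing frobenius_error gives the raw material for a rank choice. Because the updates only find a local optimum from a random start, the residual mixes noise with optimization error, which is the difficulty the abstract describes.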


Subjects
Algorithms, Microbiota, Humans, Research Design
3.
Syst Biol ; 72(3): 559-574, 2023 Jun 17.
Article in English | MEDLINE | ID: mdl-35904761

ABSTRACT

Organismal traits can evolve in a coordinated way, with correlated patterns of gains and losses reflecting important evolutionary associations. Discovering these associations can reveal important information about the functional and ecological linkages among traits. Phylogenetic profiles treat individual genes as traits distributed across sets of genomes and can provide a fine-grained view of the genetic underpinnings of evolutionary processes in a set of genomes. Phylogenetic profiling has been used to identify genes that are functionally linked and to identify common patterns of lateral gene transfer in microorganisms. However, comparative analysis of phylogenetic profiles and other trait distributions should take into account the phylogenetic relationships among the organisms under consideration. Here, we propose the Community Coevolution Model (CCM), a new coevolutionary model to analyze the evolutionary associations among traits, with a focus on phylogenetic profiles. In the CCM, traits are considered to evolve as a community with interactions, and the transition rate for each trait depends on the current states of other traits. Surpassing other comparative methods for pairwise trait analysis, CCM has the additional advantage of being able to examine multiple traits as a community to reveal more dependency relationships. We also develop a simulation procedure to generate phylogenetic profiles with correlated evolutionary patterns that can be used as benchmark data for evaluation purposes. A simulation study demonstrates that CCM is more accurate than other methods including the Jaccard Index and three tree-aware methods. The parameterization of CCM makes the interpretation of the relations between genes more direct, which leads to Darwin's scenario being identified easily based on the estimated parameters. We show that CCM is more efficient and fits real data better than other methods resulting in higher likelihood scores with fewer parameters. 
An examination of 3786 phylogenetic profiles across a set of 659 bacterial genomes highlights linkages between genes with common functions, including many patterns that would not have been identified under a nonphylogenetic model of common distribution. We also applied the CCM to 44 proteins in the well-studied Mitochondrial Respiratory Complex I and recovered associations that mapped well onto the structural associations that exist in the complex. [Coevolution; evolutionary rates; gene network; graphical models; phylogenetic profiles; phylogeny.].


Subjects
Biological Evolution, Proteins, Phylogeny, Phenotype, Bacterial Genome
4.
Microb Genom ; 7(8)2021 08.
Article in English | MEDLINE | ID: mdl-34356001

ABSTRACT

Although obesity is associated with many metabolic diseases, a significant proportion (10-30%) of obese individuals are recognized as 'metabolically healthy obeses' (MHOs). The aim of the current study is to characterize the gut microbiome of MHOs as compared to 'metabolically unhealthy obeses' (MUOs). We compared the gut microbiome of 172 MHO and 138 MUO individuals from Chongqing (China) (inclined to eat red meat and food with a spicy taste), and performed validation with selected biomarkers in 40 MHOs and 33 MUOs from Quanzhou (China) (inclined to eat seafood and food with a light/bland taste). The genera Alistipes, Faecalibacterium and Odoribacter had increased abundance in both Chongqing and Quanzhou MHOs. We also observed different microbial functions in MUOs compared to MHOs, including an increased abundance of genes associated with glycan biosynthesis and metabolism. In addition, the microbial gene markers identified from the Chongqing cohort show moderate accuracy [AUC (area under the receiver operating characteristic curve)=0.69] for distinguishing MHOs from MUOs in the Quanzhou cohort. These findings indicate that the gut microbiome is significantly distinct between MHOs and MUOs, implicating the potential of the gut microbiome in stratification and refined management of obesity.


Subjects
Bacteria/classification, Gastrointestinal Microbiome, Obesity, Bacteria/genetics, Biomarkers, China, Cohort Studies, Diet, Feces/microbiology, Gastrointestinal Microbiome/genetics, Humans, Metabolic Diseases, Metagenome
5.
Stat Med ; 40(4): 897-919, 2021 02 20.
Article in English | MEDLINE | ID: mdl-33219557

ABSTRACT

In this article, we present a new variable selection method for regression and classification purposes, particularly for microbiome analysis. Our method, called subsampling ranking forward selection (SuRF), is based on LASSO penalized regression, subsampling and forward-selection methods. SuRF offers major advantages over existing variable selection methods in terms of both sparsity of selected models and model inference. We provide an R package that can implement our method for generalized linear models. We apply our method to classification problems from microbiome data, using a novel agglomeration approach to deal with the special tree-like correlation structure of the variables. Existing methods arbitrarily choose a taxonomic level a priori before performing the analysis, whereas by combining SuRF with these aggregated variables, we are able to identify the key biomarkers at the appropriate taxonomic level, as suggested by the data. We present simulations in multiple sparse settings to demonstrate that our approach performs better than several other popularly used existing approaches in recovering the true variables. We apply SuRF to two microbiome datasets: one about prediction of pouchitis and another for identifying samples from two healthy individuals. We find that SuRF can provide a better or comparable prediction with other methods while controlling the false positive rate of variable selection.


Subjects
Data Analysis, Microbiota, Humans
6.
Biometrics ; 77(4): 1369-1384, 2021 12.
Article in English | MEDLINE | ID: mdl-33006392

ABSTRACT

In this paper, we study the problem of computing a principal component analysis of data affected by Poisson noise. We assume samples are drawn from independent Poisson distributions. We want to estimate principal components of a fixed transformation of the latent Poisson means. Our motivating example is microbiome data, though the methods apply to many other situations. We develop a semiparametric approach to correct the bias of variance estimators, both for untransformed and transformed (with particular attention to log-transformation) Poisson means. Furthermore, we incorporate methods for correcting different exposure or sequencing depth in the data. In addition to identifying the principal components, we also address the nontrivial problem of computing the principal scores in this semiparametric framework. Most previous approaches tend to take a more parametric line: for example, fitting a log-normal Poisson (PLN) model. We compare our method with the PLN approach and find that in many cases our method is better at identifying the main principal components of the latent log-transformed Poisson means, and as a further major advantage, takes far less time to compute. Comparing methods on real and simulated data, we see that our method also appears to be more robust to outliers than the parametric method.
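
The bias mentioned above can be illustrated for the simplest case, untransformed means with equal sequencing depth. If the counts are independent Poisson draws given the latent means, the law of total covariance gives Cov(X) = Cov(Λ) + diag(E[Λ]), so the sample variances overstate the latent variances by the mean counts. The sketch below applies only this moment correction; it is not the authors' semiparametric estimator, it does not cover the log-transformed case or varying depth, and the function name corrected_covariance is hypothetical.

```python
def corrected_covariance(counts):
    """Moment estimate of the covariance of latent Poisson means.

    counts: list of samples, each a list of counts (samples x variables).
    Assumption: equal exposure for every sample and untransformed means.
    """
    n, p = len(counts), len(counts[0])
    means = [sum(row[j] for row in counts) / n for j in range(p)]
    # Ordinary unbiased sample covariance of the observed counts.
    cov = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in counts) / (n - 1)
            for j in range(p)] for i in range(p)]
    # Poisson noise adds the mean to each variance, so subtract it from the diagonal.
    for j in range(p):
        cov[j][j] -= means[j]
    return cov
```

The off-diagonal entries are left alone because independent Poisson noise does not contribute to covariances. With few samples the corrected matrix need not be positive semidefinite, so a practical version would need further care before extracting principal components.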


Subjects
Microbiota, Poisson Distribution, Principal Component Analysis, Research Design
7.
BMC Bioinformatics ; 21(1): 450, 2020 Oct 12.
Article in English | MEDLINE | ID: mdl-33045987

ABSTRACT

BACKGROUND: The vast majority of microbiome research so far has focused on the structure of the microbiome at a single time-point. There have been several studies that measure the microbiome from a particular environment over time. A few models have been developed by extending time series models to accommodate specific features in microbiome data to address questions of stability and interactions of the microbiome time series. Most research has observed the stability and mean reversion for some microbiomes. However, little has been done to study the mean reversion rates of these stable microbes and how sampling frequencies are related to such conclusions. In this paper, we begin to rectify this situation. We analyse two widely studied microbial time series data sets on four healthy individuals. We choose to study healthy individuals because we are interested in the baseline temporal dynamics of the microbiome. RESULTS: For this analysis, we focus on the temporal dynamics of individual genera, absorbing all interactions in a stochastic term. We use a simple stochastic differential equation model to assess the following three questions. (1) Does the microbiome exhibit temporal continuity? (2) Does the microbiome have a stable state? (3) To better understand the temporal dynamics, how frequently should data be sampled in future studies? We find that a simple Ornstein-Uhlenbeck model which incorporates both temporal continuity and reversion to a stable state fits the data for almost every genus better than a Brownian motion model that contains only temporal continuity. The Ornstein-Uhlenbeck model also fits the data better than modelling separate time points as independent. Under the Ornstein-Uhlenbeck model, we calculate the variance of the estimated mean reversion rate (the speed with which each genus returns to its stable state). Based on this calculation, we are able to determine the optimal sample schemes for studying temporal dynamics. 
CONCLUSIONS: There is evidence of temporal continuity for most genera; there is clear evidence of a stable state; and the optimal sampling frequency for studying temporal dynamics is in the range of one sample every 0.8-3.2 days.
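
As an illustration of the model class named above, here is a minimal sketch of the Ornstein-Uhlenbeck transition density and the resulting log-likelihood of one observed series. It assumes the usual parameterization dX = theta (mu - X) dt + sigma dW, with mu the stable state and theta the mean reversion rate; the abstract does not give the authors' parameterization, data transformation or fitting procedure, and the function names are hypothetical.

```python
import math


def ou_transition(x0, mu, theta, sigma, dt):
    """Mean and variance of X(t + dt) given X(t) = x0 under an OU process."""
    decay = math.exp(-theta * dt)
    mean = mu + (x0 - mu) * decay            # pulled towards the stable state mu
    variance = sigma ** 2 / (2 * theta) * (1 - decay ** 2)
    return mean, variance


def ou_log_likelihood(series, times, mu, theta, sigma):
    """Log-likelihood of consecutive observations, conditioning on the first one."""
    total = 0.0
    pairs = zip(zip(series, times), zip(series[1:], times[1:]))
    for (x0, t0), (x1, t1) in pairs:
        mean, var = ou_transition(x0, mu, theta, sigma, t1 - t0)
        total += -0.5 * (math.log(2 * math.pi * var) + (x1 - mean) ** 2 / var)
    return total
```

The sampling interval enters only through exp(-theta * dt). When dt is very large this factor is near zero and successive samples look independent; when dt is very small it is near one and the series looks like Brownian motion. Either extreme carries little information about theta, which is why a question about optimal sampling frequency arises at all.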


Subjects
Microbiota, Biological Models, Healthy Volunteers, Humans, Kinetics, Stochastic Processes
8.
Sci Rep ; 9(1): 13424, 2019 09 17.
Article in English | MEDLINE | ID: mdl-31530820

ABSTRACT

The gut microbiota (GM) is related to obesity and other metabolic diseases. To detect GM markers for obesity in patients with different metabolic abnormalities and investigate their relationships with clinical indicators, 1,914 Chinese adults were enrolled for 16S rRNA gene sequencing in this retrospective study. Based on GM composition, Random forest classifiers were constructed to screen the obesity patients with (Group OA) or without metabolic diseases (Group O) from healthy individuals (Group H), and high accuracies were observed for the discrimination of Group O and Group OA (areas under the receiver operating curve (AUC) equal to 0.68 and 0.76, respectively). Furthermore, six GM markers were shared by obesity patients with various metabolic disorders (Bacteroides, Parabacteroides, Blautia, Alistipes, Romboutsia and Roseburia). As for the discrimination with Group O, Group OA exhibited low accuracy (AUC = 0.57). Nonetheless, GM classifications to distinguish between Group O and the obese patients with specific metabolic abnormalities were not accurate (AUC values from 0.59 to 0.66). Common biomarkers were identified for the obesity patients with high uric acid, high serum lipids and high blood pressure, such as Clostridium XIVa, Bacteroides and Roseburia. A total of 20 genera were associated with multiple significant clinical indicators. For example, Blautia, Romboutsia, Ruminococcus2, Clostridium sensu stricto and Dorea were positively correlated with indicators of bodyweight (including waistline and body mass index) and serum lipids (including low density lipoprotein, triglyceride and total cholesterol). In contrast, the aforementioned clinical indicators were negatively associated with Bacteroides, Roseburia, Butyricicoccus, Alistipes, Parasutterella, Parabacteroides and Clostridium IV. 
Generally, these biomarkers hold the potential to predict obesity-related metabolic abnormalities, and interventions based on these biomarkers might be beneficial to weight loss and metabolic risk improvement.


Subjects
Gastrointestinal Microbiome/physiology, Obesity/metabolism, Obesity/microbiology, Adult, Biomarkers, Body Mass Index, Feces/microbiology, Female, Humans, Male, Middle Aged
9.
BMC Bioinformatics ; 20(1): 349, 2019 Jun 20.
Article in English | MEDLINE | ID: mdl-31221105

ABSTRACT

BACKGROUND: Testing model adequacy is important before a DNA substitution model is chosen for phylogenetic inference. Using a mis-specified model can negatively impact phylogenetic inference; for example, the maximum likelihood method can be inconsistent when the DNA sequences are generated under a tree topology which is in the Felsenstein Zone and analyzed with a mis-specified or inadequate model. However, model adequacy testing in phylogenetics is underdeveloped. RESULTS: Here we develop a simple, general, powerful and robust model test based on Pearson's goodness-of-fit test and binning of site patterns. We demonstrate through simulation that this test is robust in its high power to reject the inadequate models for a large range of different ways of binning site patterns while the Type I error is controlled well. In the real data analysis we discovered many cases where models chosen by another method can be rejected by this new test; in particular, our proposed test rejects the most complex DNA model (GTR+I+Γ) while the Goldman-Cox test fails to reject the commonly used simple models. CONCLUSIONS: Model adequacy testing and bootstrap should be used together to assess reliability of conclusions after model selection and model fitting have already been applied to choose the model and fit it. The new goodness-of-fit test proposed in this paper is a simple and powerful model adequacy testing method serving such a regular model checking purpose. We caution against deriving strong conclusions from analyses based on inadequate models. At a minimum, those results derived from inadequate models can now be readily flagged using the new test, and reported as such.
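
The two ingredients named in the RESULTS, binning of site patterns and Pearson's statistic, can be sketched as follows. This only illustrates the general idea: the binning rule passed in as bin_of is hypothetical, the expected bin probabilities would in practice come from the fitted substitution model and tree, and the reference distribution and degrees of freedom used by the authors are not reproduced here.

```python
from collections import Counter


def site_pattern_counts(alignment):
    """Count site patterns (alignment columns) across equal-length sequences."""
    return Counter("".join(column) for column in zip(*alignment))


def bin_counts(pattern_counts, bin_of, n_bins):
    """Pool pattern counts into bins; bin_of maps a pattern to a bin index."""
    bins = [0] * n_bins
    for pattern, count in pattern_counts.items():
        bins[bin_of(pattern)] += count
    return bins


def pearson_statistic(observed, expected_probs):
    """Pearson X^2 = sum over bins of (O - E)^2 / E, with E = n * p."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, expected_probs))
```

Binning matters because the number of possible site patterns grows as 4 to the power of the number of taxa, so most individual patterns have expected counts far too small for a Pearson test. Pooling them into a modest number of bins keeps the expected counts large enough for the statistic to behave well.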


Subjects
DNA/genetics, Genetic Models, Base Sequence, Bias, Binding Sites, Computer Simulation, Genetic Databases, Humans, Likelihood Functions, Phylogeny
10.
BMC Evol Biol ; 19(1): 22, 2019 01 14.
Article in English | MEDLINE | ID: mdl-30642241

ABSTRACT

BACKGROUND: An excess of nonsynonymous substitutions, over neutrality, is considered evidence of positive Darwinian selection. Inference for proteins often relies on estimation of the nonsynonymous to synonymous ratio (ω = dN/dS) within a codon model. However, to ease computational difficulties, ω is typically estimated assuming an idealized substitution process where (i) all nonsynonymous substitutions have the same rate (regardless of impact on organism fitness) and (ii) instantaneous double and triple (DT) nucleotide mutations have zero probability (despite evidence that they can occur). It follows that estimates of ω represent an imperfect summary of the intensity of selection, and that tests based on the ω > 1 threshold could be negatively impacted. RESULTS: We developed a general-purpose parametric (GPP) modelling framework for codons. This novel approach allows specification of all possible instantaneous codon substitutions, including multiple nonsynonymous rates (MNRs) and instantaneous DT nucleotide changes. Existing codon models are specified as special cases of the GPP model. We use GPP models to implement likelihood ratio tests for ω > 1 that accommodate MNRs and DT mutations. Through both simulation and real data analysis, we find that failure to model MNRs and DT mutations reduces power in some cases and inflates false positives in others. False positives under traditional M2a and M8 models were very sensitive to DT changes. This was exacerbated by the choice of frequency parameterization (GY vs. MG), with rates sometimes > 90% under MG. By including MNRs and DT mutations, accuracy and power were greatly improved under the GPP framework. However, we also find that over-parameterized models can perform less well, and this can contribute to degraded performance of LRTs. CONCLUSIONS: We suggest GPP models should be used alongside traditional codon models. 
Further, all codon models should be deployed within an experimental design that includes (i) assessing robustness to model assumptions, and (ii) investigation of non-standard behaviour of MLEs. As the goal of every analysis is to avoid false conclusions, more work is needed on model selection methods that consider both the increase in fit engendered by a model parameter and the degree to which that parameter is affected by un-modelled evolutionary processes.


Subjects
Codon/genetics, Genetic Models, Mutation Rate, Mutation/genetics, Nucleotides/genetics, Genetic Selection, Computer Simulation, Molecular Evolution, Streptococcus/genetics
11.
Microbiome ; 5(1): 110, 2017 08 31.
Article in English | MEDLINE | ID: mdl-28859695

ABSTRACT

BACKGROUND: Learning the structure of microbial communities is critical in understanding the different community structures and functions of microbes in distinct individuals. We view microbial communities as consisting of many subcommunities which are formed by certain groups of microbes functionally dependent on each other. The focus of this paper is on methods for extracting the subcommunities from the data, in particular Non-Negative Matrix Factorization (NMF). Our methods can be applied to both OTU data and functional metagenomic data. We apply the existing unsupervised NMF method and also develop a new supervised NMF method for extracting interpretable information from classification problems. RESULTS: The relevance of the subcommunities identified by NMF is demonstrated by their excellent performance for classification. Through three data examples, we demonstrate how to interpret the features identified by NMF to draw meaningful biological conclusions and discover hitherto unidentified patterns in the data. Comparing whole metagenomes of various mammals (Muegge et al., Science 332:970-974, 2011), the biosynthesis of macrolides pathway is found in hindgut-fermenting herbivores, but not carnivores. This is consistent with results in veterinary science that macrolides should not be given to non-ruminant herbivores. For time series microbiome data from various body sites (Caporaso et al., Genome Biol 12:50, 2011), a shift in the microbial communities is identified for one individual. The shift occurs at around the same time in the tongue and gut microbiomes, indicating that the shift is a genuine biological trait, rather than an artefact of the method. For whole metagenome data from IBD patients and healthy controls (Qin et al., Nature 464:59-65, 2010), we identify differences in a number of pathways (some known, others new). CONCLUSIONS: NMF is a powerful tool for identifying the key features of microbial communities. 
These identified features can not only be used to perform difficult classification problems with a high degree of accuracy, they are also very interpretable and can lead to important biological insights into the structure of the communities. In addition, NMF is a dimension-reduction method (similar to PCA) in that it reduces the extremely complex microbial data into a low-dimensional representation, allowing a number of analyses to be performed more easily, for example, searching for temporal patterns in the microbiome. When we are interested in the differences between the structures of two groups of communities, supervised NMF provides a better way to do this, while retaining all the advantages of NMF, e.g. interpretability and a simple biological intuition.


Subjects
Gastrointestinal Microbiome/genetics, Metagenome, Metagenomics/methods, Microbial Consortia, Algorithms, Humans, Supervised Machine Learning, Unsupervised Machine Learning
12.
PLoS One ; 9(4): e94279, 2014.
Article in English | MEDLINE | ID: mdl-24732341

ABSTRACT

We present a simple and effective method for combining distance matrices from multiple genes on identical taxon sets to obtain a single representative distance matrix from which to derive a combined-gene phylogenetic tree. The method applies singular value decomposition (SVD) to extract the greatest common signal present in the distances obtained from each gene. The first right eigenvector of the SVD, which corresponds to a weighted average of the distance matrices of all genes, can thus be used to derive a representative tree from multiple genes. We apply our method to three well known data sets and estimate the uncertainty using bootstrap methods. Our results show that this method works well for these three data sets and that the uncertainty in these estimates is small. A simulation study is conducted to compare the performance of our method with several other distance based approaches (namely SDM, SDM* and ACS97), and we find the performances of all these approaches are comparable in the consensus setting. The computational complexity of our method is similar to that of SDM. Besides constructing a representative tree from multiple genes, we also demonstrate how the subsequent eigenvalues and eigenvectors may be used to identify if there are conflicting signals in the data and which genes might be influential or outliers for the estimated combined-gene tree.
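
The core step described above can be sketched without a linear algebra library. Each gene's distance matrix is flattened into one row (for example its upper triangle, in a fixed order of taxon pairs), and the dominant right singular vector of the stacked rows is found by power iteration. Power iteration is a substitute chosen here to keep the example self-contained; the abstract only says that an SVD is applied, and any scaling or normalization the authors perform beforehand is not shown. The function name and the iterations parameter are hypothetical.

```python
import math


def first_right_singular_vector(rows, iterations=100):
    """Dominant right singular vector of A, where each row of A is one
    gene's vectorized distance matrix (genes x taxon pairs)."""
    m = len(rows[0])
    v = [1.0 / math.sqrt(m)] * m  # positive start, suitable for distance data
    for _ in range(iterations):
        # u = A v gives one weight per gene
        u = [sum(a * b for a, b in zip(row, v)) for row in rows]
        # v = A^T u is a weighted sum of the genes' distance vectors
        v = [sum(u[g] * rows[g][j] for g in range(len(rows))) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v
```

The update v = A^T u shows why the result is a weighted average of the per-gene distances: each gene contributes in proportion to its weight in u. The returned vector has unit length, so it would need rescaling before being reshaped into a distance matrix for tree building.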


Subjects
Algorithms, Genes, Phylogeny, Animals, Chloroplasts/genetics, Computer Simulation, Genetic Databases, Humans
13.
Stat Appl Genet Mol Biol ; 11(4): Article 14, 2012 Sep 25.
Article in English | MEDLINE | ID: mdl-23023698

ABSTRACT

We analytically derive the first and second derivatives of the likelihood in maximum likelihood methods for phylogeny. These results enable the Newton-Raphson method to be used for maximising likelihood, which is important because there is a need for faster methods for optimisation of parameters in maximum likelihood methods. Furthermore, the calculation of the Hessian matrix also opens up possibilities for standard likelihood theory to be applied, for inference in phylogeny and for model selection problems. Another application of the Hessian matrix is local influence analysis, which can be used for detecting a number of biologically interesting phenomena. The pruning algorithm has been used to speed up computation of likelihoods for a tree. We explain how it can be used to speed up the computation for the first and second derivatives of the likelihood with respect to branch lengths and other parameters. The results in this paper apply not only to bifurcating trees, but also to general multifurcating trees. We demonstrate the use of our Hessian calculation for the three applications listed above, and compare with existing methods for those applications.
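
A much reduced illustration of the idea is Newton-Raphson for a single branch length between two sequences under the Jukes-Cantor (JC69) model, where both derivatives have simple closed forms. This is an assumption made to keep the example short: the paper concerns derivatives on full trees computed with the pruning algorithm, which is not reproduced here. With n sites of which k differ, and e = exp(-4t/3), the log-likelihood is, up to a constant, k log(1/4 - e/4) + (n - k) log(1/4 + 3e/4). The function names are hypothetical.

```python
import math


def jc69_derivatives(t, n_sites, n_diff):
    """First and second derivatives of the pairwise JC69 log-likelihood in t."""
    e = math.exp(-4.0 * t / 3.0)
    same = n_sites - n_diff
    first = n_diff * (4.0 / 3.0) * e / (1.0 - e) - same * 4.0 * e / (1.0 + 3.0 * e)
    second = (-(16.0 / 9.0) * n_diff * e / (1.0 - e) ** 2
              + (16.0 / 3.0) * same * e / (1.0 + 3.0 * e) ** 2)
    return first, second


def newton_branch_length(n_sites, n_diff, tol=1e-12, max_iter=100):
    """Maximize the likelihood over t; requires 0 < n_diff < 0.75 * n_sites."""
    t = n_diff / n_sites  # start from the observed proportion of differences
    for _ in range(max_iter):
        first, second = jc69_derivatives(t, n_sites, n_diff)
        t_new = t - first / second
        if t_new <= 0:
            t_new = t / 2  # simple safeguard to keep the branch length positive
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t
```

In this two-sequence case the answer can be checked against the known closed form t = -(3/4) log(1 - 4k/(3n)). The negative of the second derivative at the maximum is the observed information, the one-parameter analogue of the Hessian that the abstract proposes to use for inference.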


Subjects
Algorithms, Theoretical Models, Phylogeny, DNA Sequence Analysis/statistics & numerical data, Computer Simulation, Molecular Evolution, Likelihood Functions, Markov Chains