Results 1 - 20 of 31
1.
Patterns (N Y); 3(3): 100434, 2022 Mar 11.
Article in English | MEDLINE | ID: mdl-35510185

ABSTRACT

Gene knockout (KO) experiments are a proven, powerful approach for studying gene function. However, systematic KO experiments targeting a large number of genes are usually prohibitively expensive because of limited experimental and animal resources. Here, we present scTenifoldKnk, an efficient virtual KO tool that enables systematic investigation of gene function using data from single-cell RNA sequencing (scRNA-seq). In scTenifoldKnk analysis, a gene regulatory network (GRN) is first constructed from scRNA-seq data of wild-type samples, and a target gene is then virtually deleted from the constructed GRN. Manifold alignment is used to align the resulting reduced GRN to the original GRN and thereby identify differentially regulated genes, which are used to infer target-gene functions in the analyzed cells. We demonstrate that scTenifoldKnk-based virtual KO analysis recapitulates the main findings of real-animal KO experiments and recovers the expected functions of genes in relevant cell types.
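
For readers who want the mechanics, here is a minimal sketch of the virtual-knockout idea in Python, assuming a regulator-by-target GRN adjacency matrix has already been estimated. The published scTenifoldKnk builds the GRN by principal-component regression and compares the two networks by manifold alignment, which the simple two-step influence comparison below only approximates; all gene names and data are synthetic.

```python
import numpy as np

def virtual_ko_ranking(A, genes, target):
    """Virtually knock out `target` by zeroing its outgoing edges in the
    GRN adjacency matrix A (regulators in rows), then rank genes by the
    change in their two-step regulatory influence profile. A crude
    stand-in for scTenifoldKnk's manifold-alignment comparison."""
    t = genes.index(target)
    A_ko = A.copy()
    A_ko[t, :] = 0.0                     # the virtual deletion
    P, P_ko = A @ A, A_ko @ A_ko         # two-step influence before/after
    shift = np.linalg.norm(P - P_ko, axis=1)
    order = np.argsort(shift)[::-1]
    return [(genes[i], float(shift[i])) for i in order if i != t]

rng = np.random.default_rng(0)
genes = [f"g{i}" for i in range(30)]
A = rng.normal(scale=0.2, size=(30, 30))
print(virtual_ko_ranking(A, genes, "g0")[:5])   # most perturbed genes
```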

2.
Patterns (N Y); 1(9): 100139, 2020 Dec 11.
Article in English | MEDLINE | ID: mdl-33336197

ABSTRACT

We present scTenifoldNet, a machine-learning workflow built upon principal-component regression, low-rank tensor approximation, and manifold alignment, for constructing and comparing single-cell gene regulatory networks (scGRNs) using data from single-cell RNA sequencing. By comparing the constructed scGRNs, scTenifoldNet reveals regulatory changes in gene expression between samples. On real data, scTenifoldNet identifies specific gene expression programs associated with different biological processes, providing critical insights into the regulatory networks governing cellular transcriptional activity.
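
The network-construction step can be sketched directly: each gene is regressed on the leading principal components of all other genes, and the coefficients are mapped back to gene space to form one column of the scGRN. This is a simplified, illustrative rendering of one stage of the workflow; the tensor-denoising and manifold-alignment stages are omitted, and the toy data are synthetic.

```python
import numpy as np

def pcr_grn(X, n_pc=3):
    """Build a gene regulatory network by principal-component regression:
    each gene is regressed on the leading PCs of all other genes, and the
    PC coefficients are mapped back to per-gene weights. X is a
    (cells x genes) expression matrix."""
    n_genes = X.shape[1]
    grn = np.zeros((n_genes, n_genes))
    for j in range(n_genes):
        y = X[:, j]
        Z = np.delete(X, j, axis=1)
        Zc = Z - Z.mean(axis=0)
        U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
        pcs = U[:, :n_pc] * s[:n_pc]
        beta_pc, *_ = np.linalg.lstsq(pcs, y - y.mean(), rcond=None)
        beta = Vt[:n_pc].T @ beta_pc     # back to original gene space
        grn[np.arange(n_genes) != j, j] = beta
    return grn

rng = np.random.default_rng(1)
print(pcr_grn(rng.poisson(2.0, size=(100, 10)).astype(float)).shape)
```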

3.
Cells; 9(1), 2019 Dec 19.
Article in English | MEDLINE | ID: mdl-31861624

ABSTRACT

As single-cell RNA sequencing (scRNA-seq) data become widely available, cell-to-cell variability in gene expression, or single-cell expression variability (scEV), has been increasingly appreciated. However, it remains unclear whether this variability is functionally important and, if so, what its implications are for multi-cellular organisms. Here, we analyzed multiple scRNA-seq data sets from lymphoblastoid cell lines (LCLs), lung airway epithelial cells (LAECs), and dermal fibroblasts (DFs) and, for each cell type, selected a group of homogeneous cells with highly similar expression profiles. We estimated scEV levels for genes after correcting for the mean-variance dependency in the data and identified 465, 466, and 364 highly variable genes (HVGs) in LCLs, LAECs, and DFs, respectively. The functions of these HVGs were enriched for biological processes precisely relevant to the function of the cell type from which the scRNA-seq data were generated: for example, cytokine signaling pathways were enriched in HVGs identified in LCLs, collagen formation in LAECs, and keratinization in DFs. We repeated the same analysis with scRNA-seq data from induced pluripotent stem cells (iPSCs) and identified only 79 HVGs with no statistically significantly enriched functions; the overall scEV in iPSCs was of negligible magnitude. Our results support the "variation is function" hypothesis, arguing that scEV is required for cell type-specific, higher-level system function. Thus, quantifying and characterizing scEV are important for our understanding of normal and pathological cellular processes.
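
The mean-variance correction behind HVG selection can be illustrated with a short sketch: fit a trend of log-variance on log-mean and rank genes by their residuals. This is a generic stand-in for the paper's estimator, with synthetic data and an illustrative quadratic trend.

```python
import numpy as np

def highly_variable_genes(X, top_frac=0.05):
    """Select HVGs after removing the mean-variance dependency: fit a
    quadratic trend of log-variance on log-mean and keep the genes with
    the largest positive residuals. X is a (cells x genes) matrix."""
    mu = X.mean(axis=0) + 1e-8
    var = X.var(axis=0) + 1e-8
    lm, lv = np.log10(mu), np.log10(var)
    coef = np.polyfit(lm, lv, deg=2)      # mean-variance trend
    resid = lv - np.polyval(coef, lm)     # variability beyond the trend
    k = max(1, int(top_frac * X.shape[1]))
    return np.argsort(resid)[::-1][:k]

rng = np.random.default_rng(2)
X = rng.negative_binomial(5, 0.3, size=(200, 1000)).astype(float)
print(highly_variable_genes(X)[:10])
```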


Subjects
Gene Expression Profiling/methods, Gene Regulatory Networks, Single-Cell Analysis/methods, Algorithms, Cell Line, Gene Expression Regulation, Humans, Organ Specificity, Sequence Analysis, RNA/methods
4.
Econom Stat; 9: 140-155, 2019 Jan.
Article in English | MEDLINE | ID: mdl-30740554

ABSTRACT

A semiparametric varying-coefficient mixed regressive spatial autoregressive model is used to study covariate effects on spatially dependent responses, where the effects of some covariates are allowed to vary with other variables. A semiparametric series-based least squares estimation procedure is proposed, with instrumental variables introduced and the conditional expectations approximated by series expansions. The estimators of both the nonparametric and parametric components of the model are shown to be consistent, and their asymptotic distributions are derived. The proposed estimators perform well in simulations. The method is applied to a data set on teen pregnancy to investigate the effects of neighborhood and other social and economic factors on the teen pregnancy rate.
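
The parametric core of such an estimator can be sketched with two-stage least squares, instrumenting the endogenous spatial lag with [X, WX, W²X]; the series approximation of the varying coefficients is omitted here, so this is an illustration of the idea rather than the paper's full procedure.

```python
import numpy as np

def sar_2sls(y, X, W):
    """Two-stage least squares for the mixed regressive spatial
    autoregressive model y = rho*W y + X beta + e, using [X, WX, W^2 X]
    as instruments for the endogenous spatial lag Wy."""
    D = np.column_stack([W @ y, X])             # regressors incl. spatial lag
    Z = np.column_stack([X, W @ X, W @ W @ X])  # instruments
    P = Z @ np.linalg.pinv(Z.T @ Z) @ Z.T       # projection onto instruments
    return np.linalg.solve(D.T @ P @ D, D.T @ P @ y)   # [rho, beta...]

rng = np.random.default_rng(3)
n = 200
W = rng.random((n, n)); np.fill_diagonal(W, 0); W /= W.sum(1, keepdims=True)
X = rng.normal(size=(n, 2))
y = np.linalg.solve(np.eye(n) - 0.4 * W, X @ [1.0, -2.0] + rng.normal(size=n))
print(sar_2sls(y, X, W))   # rho estimate should be near 0.4
```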

5.
Biometrics; 74(4): 1301-1310, 2018 Dec.
Article in English | MEDLINE | ID: mdl-29738627

ABSTRACT

In many applications, non-Gaussian data such as binary or count observations are recorded over a continuous domain, and a smooth underlying structure exists for describing such data. We develop a new functional data method for this kind of data when the observations are regularly spaced on the continuous domain. Our method, referred to as Exponential Family Functional Principal Component Analysis (EFPCA), assumes the data are generated from an exponential family distribution and that the matrix of canonical parameters has a low-rank structure. The proposed method flexibly accommodates not only standard one-way functional data but also two-way (or bivariate) functional data. In addition, we introduce a new cross-validation method for estimating the latent rank of a generalized data matrix. We demonstrate the efficacy of the proposed methods in a comprehensive simulation study. The method is also applied to UK mortality data, which are binomially distributed and two-way functional across age groups and calendar years. The results offer novel insights into the underlying mortality pattern.
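
A simplified stand-in for the low-rank canonical-parameter fit is projected gradient descent: ascend the binomial log-likelihood, then project onto rank-r matrices by truncated SVD. The paper's EFPCA estimator and its rank-selecting cross-validation are more refined; the sketch below is only meant to convey the model.

```python
import numpy as np

def binomial_low_rank(Y, n_trials, rank=2, lr=0.05, iters=500):
    """Estimate a low-rank canonical-parameter matrix Theta for binomial
    data Y ~ Bin(n_trials, sigmoid(Theta)): gradient step on the binomial
    log-likelihood, then truncated-SVD projection onto rank-r matrices."""
    Theta = np.zeros_like(Y, dtype=float)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Theta))
        Theta += lr * (Y - n_trials * p)       # gradient of log-likelihood
        U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
        Theta = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # rank-r projection
    return Theta

rng = np.random.default_rng(4)
true = np.outer(rng.normal(size=30), rng.normal(size=20))
Y = rng.binomial(10, 1 / (1 + np.exp(-true)))
fit = binomial_low_rank(Y, 10)
print(np.round(np.corrcoef(true.ravel(), fit.ravel())[0, 1], 2))
```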


Subjects
Biometry/methods, Computer Simulation/statistics & numerical data, Principal Component Analysis/methods, Age Factors, Calendars as Topic/statistics & numerical data, Humans, Mortality, United Kingdom
6.
Comput Struct Biotechnol J; 15: 243-254, 2017.
Article in English | MEDLINE | ID: mdl-28280526

ABSTRACT

Recently, the study of protein structures using angular representations has attracted much attention among structural biologists. The main challenge is how to efficiently model the continuous conformational space of protein structures based on the differences and similarities between different Ramachandran plots. Despite the existence of statistical methods for modeling angular protein data, there is still a substantial need for more sophisticated and faster statistical tools to model large-scale circular datasets. To address this need, we have developed a nonparametric method for the collective estimation of multiple bivariate density functions for a collection of populations of protein backbone angles. The proposed method accounts for the circular nature of the angular data using trigonometric splines, which are more efficient than existing methods. This collective density estimation approach is widely applicable whenever multiple density functions must be estimated from populations with common features. Moreover, the coefficients of the adaptive basis expansion for the fitted densities provide a low-dimensional representation that is useful for visualization, clustering, and classification of the densities. The proposed method offers a novel and unique perspective on two important and challenging problems in protein structure research: structure-based protein classification and angular-sampling-based protein loop structure prediction.
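
The periodic-basis idea is easy to demonstrate in the univariate case with a truncated trigonometric (Fourier) series density estimator; the paper develops the bivariate, spline-based version for Ramachandran data, so treat this as a conceptual sketch.

```python
import numpy as np

def circular_density(angles, n_harmonics=8):
    """Nonparametric density estimate on the circle via a truncated
    trigonometric basis: f(t) = 1/(2*pi) + (1/pi) * sum_k a_k cos(kt)
    + b_k sin(kt), with a_k = mean(cos(k*X)), b_k = mean(sin(k*X))."""
    coefs = [(np.cos(k * angles).mean(), np.sin(k * angles).mean())
             for k in range(1, n_harmonics + 1)]
    def pdf(theta):
        f = np.full_like(theta, 1 / (2 * np.pi), dtype=float)
        for k, (a, b) in enumerate(coefs, start=1):
            f += (a * np.cos(k * theta) + b * np.sin(k * theta)) / np.pi
        return np.clip(f, 0, None)   # truncate small negative ripples
    return pdf

rng = np.random.default_rng(5)
sample = rng.vonmises(mu=1.0, kappa=4.0, size=2000)
grid = np.linspace(-np.pi, np.pi, 5)
print(np.round(circular_density(sample)(grid), 3))
```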

7.
IEEE Trans Image Process; 25(12): 5713-5726, 2016 Dec.
Article in English | MEDLINE | ID: mdl-28114064

ABSTRACT

This paper studies the problem of detecting nanoparticles in noisy transmission electron microscopy (TEM) images and then fitting each nanoparticle with an elliptic shape model. To achieve robustness in the presence of the low contrast and high noise of TEM images, we propose an approach that fuses two complementary kinds of image information: pixel intensity and gradient (the first derivative of intensity). Our approach entails two main steps: first, after the necessary pre-processing, both intensity-based and gradient-based information are used to process the same TEM image, producing two independent sets of results; second, a binary integer programming (BIP) problem is formulated to resolve conflicts between the two sets of results. Solving the BIP problem determines the final nanoparticle identification. We apply our method to a set of TEM images taken at different microscopic resolutions and noise levels. The empirical results show the merit of the proposed method: it can process a 1024×1024-pixel TEM image in a few minutes, and the processed outcomes are quite robust.
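
The conflict-resolution step can be written as a small binary integer program: keep the highest-scoring subset of candidate detections such that no two conflicting candidates are both kept. The sketch below uses SciPy's milp solver with hypothetical scores and conflict pairs; the paper's actual BIP formulation differs in its details.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def resolve_conflicts(scores, conflicts):
    """Keep the highest-scoring subset of candidate ellipse detections
    such that no two conflicting candidates (e.g., one from the
    intensity pass and one from the gradient pass claiming the same
    particle) are both kept."""
    n = len(scores)
    rows = np.zeros((len(conflicts), n))
    for r, (i, j) in enumerate(conflicts):
        rows[r, i] = rows[r, j] = 1.0          # x_i + x_j <= 1
    res = milp(c=-np.asarray(scores, float),   # milp minimizes, so negate
               constraints=LinearConstraint(rows, ub=np.ones(len(conflicts))),
               integrality=np.ones(n), bounds=Bounds(0, 1))
    return np.flatnonzero(res.x > 0.5)

# hypothetical candidates: pairs (0,1) and (2,3) are mutually exclusive
print(resolve_conflicts([0.9, 0.7, 0.6, 0.8], [(0, 1), (2, 3)]))  # [0 3]
```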

8.
Pattern Recognit; 60: 681-691, 2016 Dec.
Article in English | MEDLINE | ID: mdl-28066030

ABSTRACT

We propose a Sparse exponential family Principal Component Analysis (SePCA) method, suitable for any type of data following an exponential family distribution, that achieves simultaneous dimension reduction and variable selection for better interpretation of the results. Because of the generality of exponential family distributions, the method applies to a wide range of problems, in particular the analysis of high-dimensional next-generation sequencing data and genetic mutation data in genomics. A sparsity-inducing penalty produces sparse principal component loading vectors, so that the principal components can focus on informative variables. By using an equivalent dual form of the optimization problem formulated for SePCA, we derive optimal solutions with efficient iterative closed-form updating rules. Results from both simulation experiments and real-world applications demonstrate the superiority of SePCA in reconstruction accuracy and computational efficiency over traditional exponential family PCA (ePCA) and the existing Sparse PCA (SPCA) and Sparse Logistic PCA (SLPCA) algorithms.
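
A generic sparse-PCA sketch conveys the role of the sparsity-inducing penalty: penalized power iteration that soft-thresholds the loading vector each step. Note this uses the Gaussian (covariance) formulation; SePCA's exponential-family likelihood and dual-form updates are not reproduced here.

```python
import numpy as np

def sparse_pc(X, lam=0.2, iters=200):
    """First sparse principal component by penalized power iteration:
    alternate a power step with soft-thresholding of the loadings."""
    S = np.cov(X, rowvar=False)
    v = np.linalg.svd(S)[0][:, 0]              # warm start: dense PC1
    for _ in range(iters):
        v = S @ v
        v = np.sign(v) * np.maximum(np.abs(v) - lam * np.abs(v).max(), 0)
        n = np.linalg.norm(v)
        if n == 0:
            break
        v /= n
    return v

rng = np.random.default_rng(6)
base = rng.normal(size=(300, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(300, 3)),   # 3 informative vars
               rng.normal(size=(300, 7))])                # 7 noise vars
print(np.round(sparse_pc(X), 2))   # loadings concentrate on the first 3
```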

9.
Biostatistics; 16(4): 754-71, 2015 Oct.
Article in English | MEDLINE | ID: mdl-25987650

ABSTRACT

Motivated by data recording the effects of an exercise intervention on subjects' physical activity over time, we develop a model to assess the effects of a treatment when the data are functional with three levels (subjects, weeks, and days in our application) and possibly incomplete. The model has three-level mean structure effects, all stratified by treatment, and subject random effects, including a general subject effect and nested effects for the three levels. The mean and random structures are specified as smooth curves measured at various time points. The association structure of the three-level data is induced through the random curves, which are summarized using a few important principal components. We use penalized splines to model the mean curves and the principal component curves, and we cast the proposed model into a mixed-effects model framework for model fitting, prediction, and inference. We develop an algorithm that fits the model iteratively using the Expectation/Conditional Maximization Either (ECME) version of the EM algorithm and eigenvalue decompositions. Selection of the number of principal components and the handling of incomplete data are incorporated into the algorithm. The performance of a Wald-type hypothesis test is also discussed. The method is applied to the physical activity data and evaluated empirically in a simulation study.
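
The building block that the paper nests at three levels is functional PCA of curves on a common grid; a bare-bones version is sketched below (no penalized-spline smoothing, no ECME, synthetic curves), just to make the "random curves summarized by a few principal components" idea concrete.

```python
import numpy as np

def fpca(curves, n_pc=2):
    """Functional PCA by eigendecomposition of the sample covariance of
    curves observed on a common grid; returns the mean curve, the
    principal component curves, and per-curve scores."""
    mean = curves.mean(axis=0)
    C = np.cov(curves - mean, rowvar=False)    # pointwise covariance
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1][:n_pc]
    pcs = evecs[:, order]                      # principal component curves
    scores = (curves - mean) @ pcs             # subject-level scores
    return mean, pcs, scores

rng = np.random.default_rng(7)
t = np.linspace(0, 1, 50)
curves = (np.sin(2 * np.pi * t)
          + rng.normal(size=(40, 1)) * np.cos(2 * np.pi * t)
          + 0.1 * rng.normal(size=(40, 50)))
mean, pcs, scores = fpca(curves)
print(scores.shape)   # (40, 2)
```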


Subjects
Algorithms, Clinical Trials as Topic/statistics & numerical data, Exercise Therapy/statistics & numerical data, Models, Statistical, Outcome Assessment, Health Care/statistics & numerical data, Research Design/statistics & numerical data, Humans
10.
J Comput Graph Stat; 24(1): 84-103, 2015 Jan 01.
Article in English | MEDLINE | ID: mdl-25914514

ABSTRACT

Principal component analysis (PCA) is a popular dimension reduction method used to reduce the complexity of high-dimensional datasets and extract their informative aspects. When the data distribution is skewed, data transformation is commonly applied before PCA. Such a transformation is usually obtained from previous studies, prior knowledge, or trial and error. In this work, we develop a model-based method that integrates data transformation into PCA and finds an appropriate transformation by maximum profile likelihood. Extensions of the method to handle functional data and missing values are also developed. Several numerical algorithms are provided for efficient computation. The proposed method is illustrated with simulated and real-world data examples.
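
A minimal version of the idea, restricted to the Box-Cox family and a fixed-rank PCA model, is a grid search that scores each candidate transformation by its Gaussian profile likelihood including the Jacobian term; the paper's method is more general, so the sketch below is illustrative only.

```python
import numpy as np

def best_boxcox_for_pca(X, rank=1, lams=np.linspace(-1, 2, 31)):
    """Choose a Box-Cox transformation for PCA by profile likelihood:
    transform the (positive) data, fit a rank-r PCA, and score the
    Gaussian likelihood of the residuals plus the Jacobian of the
    transformation; return the best lambda."""
    n = X.size
    best = (-np.inf, None)
    for lam in lams:
        Y = np.log(X) if abs(lam) < 1e-9 else (X**lam - 1) / lam
        Yc = Y - Y.mean(axis=0)
        s = np.linalg.svd(Yc, compute_uv=False)
        rss = (s[rank:] ** 2).sum()               # residual sum of squares
        ll = -0.5 * n * np.log(rss / n) + (lam - 1) * np.log(X).sum()
        if ll > best[0]:
            best = (ll, lam)
    return best[1]

rng = np.random.default_rng(8)
Z = rng.normal(size=(200, 1)) @ rng.normal(size=(1, 5)) \
    + 0.2 * rng.normal(size=(200, 5))
X = np.exp(Z)                     # lognormal-ish: log should be preferred
print(best_boxcox_for_pca(X))     # typically close to 0 (the log transform)
```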

11.
BMC Bioinformatics; 15 Suppl 15: S4, 2014.
Article in English | MEDLINE | ID: mdl-25474163

ABSTRACT

BACKGROUND: Protein-ligand binding is essential for many proteins to perform their functions, and protein-ligand binding sites are the residues of proteins that physically bind to ligands. Despite recent advances in the computational prediction of protein-ligand binding sites, state-of-the-art methods search for solved structures similar to the query and predict the binding sites based on those structures. However, such structural information is not always available. RESULTS: In this paper, we propose a sequence-based approach to identify protein-ligand binding residues. We propose a combination technique to reduce the effect of the choice of sliding residue window when encoding input feature vectors. Moreover, because samples of ligand-binding and non-ligand-binding sites are highly imbalanced, we construct several balanced data sets and train a random forest (RF) classifier on each. The ensemble of these RF classifiers forms a sequence-based protein-ligand binding site predictor. CONCLUSIONS: Experimental results on the CASP9 and CASP8 data sets demonstrate that our method compares favorably with state-of-the-art protein-ligand binding site prediction methods.
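
The imbalance-handling strategy is straightforward to sketch with scikit-learn: undersample the majority class into several balanced sets, train one random forest per set, and average the ensemble's votes. The sliding-window feature encoding is omitted; the data below are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def balanced_rf_ensemble(X, y, n_members=5, seed=0):
    """Draw several balanced subsets by undersampling the majority
    class, train one random forest per subset, and combine the members
    by averaging their predicted probabilities."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    members = []
    for _ in range(n_members):
        sub = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sub])
        members.append(
            RandomForestClassifier(n_estimators=100).fit(X[idx], y[idx]))
    def predict_proba(Xnew):
        return np.mean([m.predict_proba(Xnew)[:, 1] for m in members], axis=0)
    return predict_proba

rng = np.random.default_rng(9)
X = rng.normal(size=(1000, 20))
y = (rng.random(1000) < 0.1).astype(int)   # ~10% "binding" residues
X[y == 1] += 0.5                           # weak signal in the rare class
print(balanced_rf_ensemble(X, y)(X[:5]).round(2))
```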


Subjects
Artificial Intelligence, Proteins/chemistry, Sequence Analysis, Protein/methods, Amino Acids/chemistry, Binding Sites, Ligands, Protein Conformation
12.
J Biol Rhythms; 29(4): 231-42, 2014 Aug.
Article in English | MEDLINE | ID: mdl-25238853

ABSTRACT

Identification of circadian-regulated genes from temporal transcriptome data is important for studying the regulatory mechanisms of the circadian system. However, computational methods that adopt different strategies for identifying cycling transcripts often yield inconsistent results, even for the same dataset, making it challenging to choose the optimal method for a specific circadian study. To address this challenge, we evaluate five popular methods, ARSER (ARS), COSOPT (COS), Fisher's G test (FIS), HAYSTACK (HAY), and JTK_CYCLE (JTK), on both simulated and empirical datasets. Our results show that increasing the total number of samples (by improving the sampling frequency or lengthening the sampling time window) helps computational methods accurately identify circadian transcripts and measure circadian phase. For a given total number of samples, higher sampling frequency matters more for HAY and JTK, whereas a longer sampling time window is more crucial for ARS and COS, as verified on simulated and empirical datasets from which circadian signals were computationally identified. The preference for higher sampling frequency or a longer sampling time window is also evident for JTK, ARS, and COS when estimating the circadian phases of simulated periodic profiles. Our results further indicate that attention should be paid to the significance threshold used by each method to select circadian genes, especially when analyzing the same empirical dataset with two or more methods. In summary, for any study involving genome-wide identification of circadian genes from transcriptome data, our evaluation provides guidance for selecting an optimal method based on the specific goal and experimental design.
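
As a concrete example of the kind of detector being compared, here is a basic cosinor (harmonic regression) test, the model family underlying ARS-style methods: fit sine and cosine terms at a 24-h period and F-test them against a flat model. This is a textbook method, not a reimplementation of any of the five evaluated tools.

```python
import numpy as np
from scipy import stats

def cosinor_pvalue(expr, times, period=24.0):
    """Cosinor test for rhythmicity: regress expression on cos/sin terms
    at the given period and F-test against a constant model; returns
    (p-value, estimated phase in hours)."""
    w = 2 * np.pi / period
    X = np.column_stack([np.ones_like(times),
                         np.cos(w * times), np.sin(w * times)])
    beta, *_ = np.linalg.lstsq(X, expr, rcond=None)
    rss1 = ((expr - X @ beta) ** 2).sum()
    rss0 = ((expr - expr.mean()) ** 2).sum()
    df2 = len(times) - 3
    F = ((rss0 - rss1) / 2) / (rss1 / df2)
    phase = (np.arctan2(beta[2], beta[1]) / w) % period
    return stats.f.sf(F, 2, df2), phase

t = np.arange(0, 48, 4.0)                  # 2 days sampled every 4 h
rng = np.random.default_rng(10)
y = 1 + 0.8 * np.cos(2 * np.pi * (t - 6) / 24) + 0.2 * rng.normal(size=t.size)
print(cosinor_pvalue(y, t))                # small p-value, phase near 6 h
```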


Subjects
Circadian Rhythm/genetics, Genome-Wide Association Study/methods, Genome/genetics, Transcriptome/genetics, Computational Biology/methods, Gene Expression Profiling/methods
13.
BMC Genomics; 15 Suppl 1: S10, 2014.
Article in English | MEDLINE | ID: mdl-24564304

ABSTRACT

To better understand the heritability of complex diseases left unexplained by conventional genome-wide association studies (GWAS), aggregated association analyses based on predefined functional regions, such as genes and pathways, have recently become popular: they evaluate the joint effect of multiple single-nucleotide polymorphisms (SNPs), which increases detection power, especially for genetic variants with weak individual effects. In this paper, we focus on aggregated analysis methods based on principal component analysis (PCA). Past PCA-based approaches mostly make inherent assumptions about the genotype data and/or the risk effect model, which may hinder the accurate detection of potential disease SNPs that influence disease phenotypes. Here, we derive a general Supervised Categorical Principal Component Analysis (SCPCA), which explicitly models categorical SNP data without imposing any risk effect model assumption. We evaluated the efficacy of SCPCA against a traditional Supervised PCA (SPCA) and a previously developed Supervised Logistic Principal Component Analysis (SLPCA) on both genotype data simulated with HAPGEN2 and Crohn's disease (CD) genotype data from the Wellcome Trust Case Control Consortium (WTCCC). Our preliminary results demonstrate the superiority of SCPCA over both SPCA and SLPCA, owing to its explicit modeling of categorical SNP data and its flexibility regarding the risk effect model.
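
For orientation, the supervised-PCA baseline the paper compares against can be sketched in a few lines: screen SNPs by univariate association with the phenotype, then take the first principal component of the selected block as the aggregated score. SCPCA itself replaces this with an explicitly categorical model, which is not reproduced here.

```python
import numpy as np
from scipy import stats

def supervised_pc(G, y, top_k=20):
    """Supervised PCA: screen SNPs in a region by univariate association
    with the binary phenotype, then take the first PC of the selected
    block as an aggregated genotype score (G: samples x SNPs, 0/1/2)."""
    r = np.array([abs(stats.pointbiserialr(y, G[:, j])[0])
                  for j in range(G.shape[1])])
    keep = np.argsort(r)[::-1][:top_k]
    B = G[:, keep] - G[:, keep].mean(axis=0)
    score = np.linalg.svd(B, full_matrices=False)[0][:, 0]
    return score, keep

rng = np.random.default_rng(11)
G = rng.integers(0, 3, size=(500, 100)).astype(float)
y = (G[:, :5].sum(axis=1) + rng.normal(scale=2, size=500) > 5).astype(int)
print(supervised_pc(G, y)[1][:10])   # causal SNPs 0-4 should rank highly
```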


Subjects
Crohn Disease/genetics, Polymorphism, Single Nucleotide, Principal Component Analysis/methods, Algorithms, Genetic Variation, Genome-Wide Association Study, Genotype, Humans, Linkage Disequilibrium, Models, Genetic
14.
J Am Stat Assoc; 109(508): 1355-1367, 2014 Dec 01.
Article in English | MEDLINE | ID: mdl-25642005

ABSTRACT

In genome-wide association studies, the primary task is to detect biomarkers, in the form of single nucleotide polymorphisms (SNPs), that have nontrivial associations with a disease phenotype and other important clinical/environmental factors. However, the extremely large number of SNPs relative to the sample size inhibits the application of classical methods such as multiple logistic regression, and the most commonly used approach is still to analyze one SNP at a time. In this paper, we propose to consider the genotypes of the SNPs simultaneously via a logistic analysis of variance (ANOVA) model, which expresses the logit-transformed mean of the SNP genotypes as the sum of SNP effects, effects of the disease phenotype and/or other clinical variables, and interaction effects. We use a reduced-rank representation of the interaction-effect matrix for dimension reduction, and we employ an L1 penalty in a penalized likelihood framework to filter out SNPs that have no association. We develop a majorization-minimization algorithm for computational implementation, and we propose a modified BIC criterion to select the penalty parameters and determine the rank. Applied to a multiple sclerosis data set and simulated data sets, the proposed method shows promise in biomarker detection.
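
To see what simultaneous analysis with an L1 penalty buys over one-SNP-at-a-time testing, here is a generic stand-in using L1-penalized logistic regression in scikit-learn; the paper's logistic ANOVA model, reduced-rank interaction structure, and MM algorithm are not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(12)
n, p = 400, 200
G = rng.integers(0, 3, size=(n, p)).astype(float)   # SNP genotypes 0/1/2
logit = 0.8 * G[:, 0] - 0.8 * G[:, 1]               # two causal SNPs
y = rng.random(n) < 1 / (1 + np.exp(-(logit - logit.mean())))

# L1-penalized logistic model over all SNPs simultaneously; the nonzero
# coefficients are the retained candidate biomarkers.
fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(G, y)
print(np.flatnonzero(fit.coef_[0]))                  # should include 0 and 1
```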

15.
Article in English | MEDLINE | ID: mdl-26357039

ABSTRACT

The low minor allele frequencies (MAFs) and weak individual effects of rare variant single nucleotide polymorphisms (SNPs) make genome-wide association studies (GWAS) of such variants difficult with conventional statistical methods. Collapsing, which aggregates the effects of rare variants belonging to the same gene, is the most common way to enhance the detection of rare variant effects in association analyses with a given trait. In this paper, we propose a novel framework, MAF-based logistic principal component analysis (MLPCA), which derives aggregated statistics by explicitly modeling the correlation between rare variant SNP data, which are categorical. The aggregated statistics derived by MLPCA can then be tested as a surrogate variable in regression models to detect gene-environment interactions involving rare variants. In addition, MLPCA searches for the optimal linear combination over the best MAF-based subset of rare variants, chosen to have maximum association with the given trait. We compared the power of our MLPCA-based methods with four existing collapsing methods in gene-environment interaction association analysis using both our simulated data set and Genetic Analysis Workshop 17 (GAW17) data. Our experimental results demonstrate that MLPCA on two forms of genotype data representation achieves higher statistical power than the existing methods and can be further improved by introducing an appropriate sparsity penalty. The improvement stems from deriving aggregated statistics that explicitly model categorical SNP data and from searching for the maximally associated subset of SNPs for collapsing, which better captures the combined effect of individual rare variants and their interaction with environmental factors.
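
A simple collapsing-plus-interaction test makes the setting concrete: aggregate a gene's rare variants into a burden score (here with the common Madsen-Browning-style MAF weights rather than MLPCA components) and test the gene-environment interaction in a logistic model. Names and data are illustrative.

```python
import numpy as np
import statsmodels.api as sm

def gxe_burden_test(G, E, y, maf=None):
    """Collapse rare variants into one MAF-weighted burden score and
    test the gene-environment interaction term in a logistic model;
    returns the p-value of the interaction."""
    if maf is None:
        maf = G.mean(axis=0) / 2 + 1e-6
    burden = G @ (1 / np.sqrt(maf * (1 - maf)))   # Madsen-Browning weights
    X = sm.add_constant(np.column_stack([burden, E, burden * E]))
    fit = sm.Logit(y, X).fit(disp=0)
    return fit.pvalues[3]

rng = np.random.default_rng(13)
G = (rng.random((600, 30)) < 0.02).astype(float)    # rare variants
E = rng.normal(size=600)                            # environmental exposure
risk = 0.8 * G.sum(axis=1) * E
y = (rng.random(600) < 1 / (1 + np.exp(-(risk - risk.mean())))).astype(int)
print(gxe_burden_test(G, E, y))
```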


Subjects
Computational Biology/methods, Gene-Environment Interaction, Polymorphism, Single Nucleotide/genetics, Principal Component Analysis/methods, Computer Simulation, Models, Genetic
16.
Neuroinformatics; 11(4): 477-93, 2013 Oct.
Article in English | MEDLINE | ID: mdl-23842791

ABSTRACT

In this work, we propose a spatial-temporal two-way regularized regression method for reconstructing neural source signals from EEG/MEG time course measurements. The proposed method estimates dipole locations and amplitudes simultaneously by minimizing a single penalized least squares criterion. The novelty of our methodology is the simultaneous consideration of three desirable properties of the reconstructed source signals: spatial focality, spatial smoothness, and temporal smoothness. These properties are achieved with three separate penalty functions in the penalized regression framework. Specifically, we impose a roughness penalty in the temporal domain for temporal smoothness, and a sparsity-inducing penalty and a graph Laplacian penalty in the spatial domain for spatial focality and spatial smoothness. We develop a computationally efficient multilevel block coordinate descent algorithm to implement the method. Using a simulation study with several settings of different spatial complexity, and two real MEG examples, we show that the proposed method outperforms existing methods that use only a subset of the three penalty functions.
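
The three penalties compose into a single criterion whose proximal-gradient update is easy to write down. The sketch below uses an identity matrix as a stand-in for the spatial graph Laplacian and random matrices for the forward operator and data, and it replaces the paper's multilevel block coordinate descent with plain proximal gradient.

```python
import numpy as np

def prox_grad_step(S, Lfwd, Lap, D, Y, lam1, lam2, lam3, lr=1e-4):
    """One proximal-gradient update for a three-penalty criterion:
    ||Y - Lfwd S||_F^2 + lam2 * tr(S' Lap S)   (spatial smoothness)
    + lam3 * ||S D'||_F^2                      (temporal roughness),
    followed by the proximal operator of lam1*||S||_1 (spatial
    focality). S: sources x time, Lfwd: sensors x sources."""
    grad = (-2 * Lfwd.T @ (Y - Lfwd @ S)
            + 2 * lam2 * Lap @ S
            + 2 * lam3 * S @ (D.T @ D))
    S = S - lr * grad
    return np.sign(S) * np.maximum(np.abs(S) - lr * lam1, 0.0)

# hypothetical sizes: 32 sensors, 80 sources, 60 time points
rng = np.random.default_rng(14)
Lfwd = rng.normal(size=(32, 80))
Y = rng.normal(size=(32, 60))
Lap = np.eye(80)                       # stand-in spatial graph Laplacian
D = np.diff(np.eye(60), n=2, axis=0)   # second-difference operator
S = np.zeros((80, 60))
for _ in range(200):
    S = prox_grad_step(S, Lfwd, Lap, D, Y, lam1=0.5, lam2=0.1, lam3=0.1)
print(f"nonzero sources: {(np.abs(S).sum(axis=1) > 0).sum()}")
```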


Subjects
Brain Mapping, Brain/physiology, Electroencephalography, Magnetoencephalography, Regression Analysis, Algorithms, Computer Simulation, Humans
17.
Proteins; 81(8): 1351-62, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23504705

ABSTRACT

Hot spot residues of proteins are fundamental interface residues that help proteins perform their functions. Detecting hot spots experimentally is costly and time-consuming, so sequence and structure information has been widely used for computational hot spot prediction. However, structural information is not always available. In this article, we investigate the problem of identifying hot spots using only physicochemical characteristics extracted from amino acid sequences. We first extracted 132 relatively independent physicochemical features from the set of 544 properties in AAindex1, an amino acid index database. Each feature was used to train a classification model with a novel encoding scheme for hot spot prediction using the IBk algorithm, an extension of the K-nearest neighbor algorithm. Combinations of the individual classifiers were explored, and the classifiers that appeared most frequently in the top-performing combinations were selected. The hot spot predictor was then built as an ensemble of these classifiers operating in a voting manner. Experimental results demonstrate that our method effectively exploits the feature space and allows flexible feature weights for different queries. On the commonly used hot spot benchmark sets, our method significantly outperformed other machine learning algorithms and state-of-the-art hot spot predictors. The program is available at http://sfb.kaust.edu.sa/pages/software.aspx.
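
The committee-of-classifiers design can be sketched with scikit-learn's KNeighborsClassifier standing in for WEKA's IBk: one classifier per feature encoding, combined by majority vote. The physicochemical encodings themselves are replaced by synthetic matrices here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_voting_ensemble(feature_sets, y, test_sets):
    """One KNN classifier per feature encoding, combined by majority
    vote; feature_sets and test_sets are parallel lists of per-encoding
    design matrices."""
    votes = []
    for X_tr, X_te in zip(feature_sets, test_sets):
        clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y)
        votes.append(clf.predict(X_te))
    return (np.mean(votes, axis=0) > 0.5).astype(int)   # majority vote

rng = np.random.default_rng(15)
y_tr = rng.integers(0, 2, size=200)
y_te = rng.integers(0, 2, size=10)
# three synthetic "encodings" with varying signal strength
train = [rng.normal(size=(200, 4)) + y_tr[:, None] * s for s in (1.0, 0.7, 0.3)]
test = [rng.normal(size=(10, 4)) + y_te[:, None] * s for s in (1.0, 0.7, 0.3)]
pred = knn_voting_ensemble(train, y_tr, test)
print((pred == y_te).mean())   # ensemble accuracy on the toy data
```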


Subjects
Proteins/chemistry, Proteins/metabolism, Algorithms, Amino Acid Sequence, Amino Acids/chemistry, Amino Acids/metabolism, Animals, Artificial Intelligence, Databases, Protein, Drosophila/chemistry, Drosophila/metabolism, Drosophila Proteins/chemistry, Drosophila Proteins/metabolism, Humans, Juvenile Hormones/chemistry, Juvenile Hormones/metabolism, Models, Molecular, Protein Interaction Maps, Receptors, Erythropoietin/chemistry, Receptors, Erythropoietin/metabolism
18.
Brief Bioinform; 14(6): 724-36, 2013 Nov.
Article in English | MEDLINE | ID: mdl-22926831

ABSTRACT

Despite considerable progress in the past decades, protein structure prediction remains one of the major unsolved problems in computational biology. Angular-sampling-based methods have been extensively studied recently because of their ability to capture the continuous conformational space of protein structures. The literature has focused on a variety of parametric models of the sequential dependencies between angle pairs along the protein chain. In this article, we present a thorough review of angular-sampling-based methods by assessing three main questions: What is the best distribution type for modeling the protein angles? How many components in a mixture model should be considered to accurately parameterize the joint distribution of the angles? And what order of local sequence-structure dependency should a prediction method consider? We assess the model fits of different methods using bivariate lag-distributions of the dihedral/planar angles. Moreover, the main information across the lags can be extracted using a technique called lag singular value decomposition (LagSVD), which considers the joint distribution of the dihedral/planar angles over different lags using a nonparametric approach and monitors the behavior of the lag-distributions with singular value decomposition. We developed graphical tools and numerical measures to compare and evaluate the performance of different model fits, as well as a web tool (http://www.stat.tamu.edu/∼madoliat/LagSVD) that produces informative animations.
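
A miniature LagSVD illustrates the technique: for each lag, histogram the joint distribution of angle pairs along the chain, stack the flattened lag-distributions as rows, and examine the singular values. The sketch is univariate and synthetic; the paper works with bivariate dihedral/planar angle distributions.

```python
import numpy as np

def lag_svd(angles, max_lag=10, bins=18):
    """For each lag k, histogram the joint distribution of
    (angle_i, angle_{i+k}), stack the flattened lag-distributions as
    rows, and return the singular values of the stacked matrix."""
    edges = np.linspace(-np.pi, np.pi, bins + 1)
    rows = []
    for k in range(1, max_lag + 1):
        H, *_ = np.histogram2d(angles[:-k], angles[k:],
                               bins=[edges, edges], density=True)
        rows.append(H.ravel())
    return np.linalg.svd(np.array(rows), compute_uv=False)

rng = np.random.default_rng(16)
# AR(1)-style angular chain: strong short-range, weak long-range dependence
x = np.zeros(5000)
for i in range(1, 5000):
    x[i] = np.angle(np.exp(1j * (0.8 * x[i - 1] + 0.6 * rng.normal())))
print(np.round(lag_svd(x)[:4], 3))
```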


Subjects
Proteins/chemistry, Markov Chains, Protein Conformation
19.
IEEE Trans Pattern Anal Mach Intell; 35(3): 669-81, 2013 Mar.
Article in English | MEDLINE | ID: mdl-22848127

ABSTRACT

This paper presents a method that enables automated morphology analysis of partially overlapping nanoparticles in electron micrographs. Morphology analysis entails three tasks: separating individual particles from an agglomerate of overlapping nano-objects, inferring each particle's missing contours, and ultimately classifying the particles by shape based on their complete contours. Our method adopts a two-stage approach: the first stage separates the particles, and the second stage simultaneously infers contours and classifies shapes. In the first stage, a modified ultimate erosion process decomposes a mixture of particles into markers, and an edge-to-marker association method then identifies the set of evidence that eventually delineates each individual object. We also provide theoretical justification of the separation capability of the first stage. In the second stage, the evidence sets become inputs to a Gaussian mixture model on B-splines, the solution of which yields joint learning of the missing contour and the particle shape. Using twelve real electron micrographs of overlapping nanoparticles, we compare the proposed method with seven state-of-the-art methods; the results show the superiority of the proposed method in terms of particle recognition rate.
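
The marker-extraction stage can be approximated with standard morphology tools: take the Euclidean distance transform of the foreground and keep its regional maxima, one marker per overlapping particle. This is only in the spirit of the paper's modified ultimate erosion; contour inference via the B-spline Gaussian mixture is not reproduced.

```python
import numpy as np
from scipy import ndimage

def particle_markers(binary_img, min_dist=3):
    """Extract one marker per (possibly overlapping) particle: compute
    the Euclidean distance transform of the foreground and label its
    regional maxima."""
    dist = ndimage.distance_transform_edt(binary_img)
    local_max = dist == ndimage.maximum_filter(dist, size=2 * min_dist + 1)
    markers, n = ndimage.label(local_max & (dist > 0))
    return markers, n

# two overlapping discs form a single blob, yet yield two markers
yy, xx = np.mgrid[0:60, 0:60]
blob = (((xx - 22) ** 2 + (yy - 30) ** 2 < 144)
        | ((xx - 38) ** 2 + (yy - 30) ** 2 < 144))
print(particle_markers(blob)[1])   # expected: 2
```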

20.
JMLR Workshop Conf Proc; 28(2): 37-45, 2013.
Article in English | MEDLINE | ID: mdl-25285330

ABSTRACT

Non-convex sparsity-inducing penalties have recently received considerable attention in sparse learning, and theoretical investigations have demonstrated their superiority over convex counterparts in several sparse learning settings. However, solving the non-convex optimization problems associated with non-convex penalties remains a major challenge. A commonly used approach is multi-stage (MS) convex relaxation (or DC programming), which relaxes the original non-convex problem to a sequence of convex problems. This approach is often impractical for large-scale problems because its computational cost is a multiple of that of solving a single convex problem. In this paper, we propose a General Iterative Shrinkage and Thresholding (GIST) algorithm that solves the non-convex optimization problem for a large class of non-convex penalties. The GIST algorithm iteratively solves a proximal operator problem, which has a closed-form solution for many commonly used penalties. At each outer iteration, a line search initialized by the Barzilai-Borwein (BB) rule finds an appropriate step size quickly. The paper also presents a detailed convergence analysis of the GIST algorithm. The efficiency of the proposed algorithm is demonstrated in extensive experiments on large-scale data sets.
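
GIST is concrete enough to sketch end to end for one penalty. Below is a small implementation for least squares plus the MCP penalty (whose proximal operator has a closed form), with a BB-initialized monotone line search; parameter choices are illustrative.

```python
import numpy as np

def mcp_value(w, lam, theta):
    """MCP penalty value (one of the non-convex penalties GIST covers)."""
    a = np.abs(w)
    return np.where(a <= theta * lam,
                    lam * a - a**2 / (2 * theta),
                    0.5 * theta * lam**2).sum()

def mcp_prox(v, lam, theta, t):
    """Closed-form proximal operator of MCP with weight t (t > 1/theta
    keeps the scalar subproblems convex)."""
    a = np.abs(v)
    shrunk = np.sign(v) * np.maximum(a - lam / t, 0) / (1 - 1 / (theta * t))
    return np.where(a > theta * lam, v,
                    np.where(a <= lam / t, 0.0, shrunk))

def gist_mcp(A, b, lam=0.1, theta=5.0, iters=100, eta=2.0, sigma=1e-4):
    """GIST for 0.5*||Ax-b||^2 + MCP(x): proximal step with a
    Barzilai-Borwein initial weight, backtracked until the objective
    sufficiently decreases (monotone line search)."""
    F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + mcp_value(x, lam, theta)
    x = np.zeros(A.shape[1]); g = A.T @ (A @ x - b); t = 1.0
    for _ in range(iters):
        while True:
            x_new = mcp_prox(x - g / t, lam, theta, t)
            if F(x_new) <= F(x) - 0.5 * sigma * t * np.sum((x_new - x) ** 2):
                break
            t *= eta                               # backtracking line search
        g_new = A.T @ (A @ x_new - b)
        dx, dg = x_new - x, g_new - g
        t = np.clip(dx @ dg / (dx @ dx + 1e-12),   # BB rule for next step
                    1 / theta + 1e-3, 1e8)
        x, g = x_new, g_new
    return x

rng = np.random.default_rng(17)
A = rng.normal(size=(100, 50))
x_true = np.zeros(50); x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true + 0.05 * rng.normal(size=100)
print(np.flatnonzero(np.abs(gist_mcp(A, b)) > 1e-6))   # ~ [0 1 2]
```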
