1.
Genes Genomics ; 43(9): 1059-1064, 2021 09.
Article in English | MEDLINE | ID: mdl-34181214

ABSTRACT

BACKGROUND: The inherent correlations among gene expression levels have received attention. Recently, it was reported that a set of approximately 1000 landmark genes can be used to predict the expression of other genes (target genes). OBJECTIVE: The objective of this study is to predict expression values of target genes from the expression values of landmark genes. METHODS: A cluster-based regression method is proposed. In the proposed method, clusters are obtained from the set of training instances of a gene, and one estimator is fitted per cluster. A test instance is assigned to one of the clusters, and the regression model corresponding to that cluster predicts its expression value. RESULTS: Performance of the proposed method is measured on the GEO (Gene Expression Omnibus) expression data and the GTEx (Genotype-Tissue Expression) expression data. In terms of mean absolute error averaged across target genes, the proposed method significantly outperforms previous approaches on the GEO expression data. CONCLUSIONS: The experimental results indicate that the combination of clustering and regression can outperform state-of-the-art methods such as generative adversarial networks and a gradient boosting based method.


Subjects
Gene Expression Profiling/statistics & numerical data , Gene Expression/genetics , Regression Analysis , Cluster Analysis , Humans
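
The cluster-then-regress idea in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the abstract names neither the clustering algorithm nor the per-cluster estimator, so k-means and ordinary least squares are stand-ins, and a single landmark value replaces the roughly 1000 landmark genes. The names `ClusterRegressor`, `kmeans_1d`, `fit_line`, and `nearest` are hypothetical.

```python
from statistics import mean


def nearest(centers, x):
    """Index of the cluster center closest to x."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))


def kmeans_1d(xs, k, iters=50):
    """Lloyd's k-means on scalars with deterministic quantile initialisation."""
    s = sorted(xs)
    centers = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest(centers, x)].append(x)
        # Keep the old center when a cluster receives no points.
        updated = [mean(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (constant model if x has no spread)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


class ClusterRegressor:
    """Assumed stand-in: k-means clusters, one linear estimator per cluster."""

    def __init__(self, k=2):
        self.k = k

    def fit(self, xs, ys):
        self.centers = kmeans_1d(xs, self.k)
        self.models = []
        for j in range(self.k):
            pts = [(x, y) for x, y in zip(xs, ys) if nearest(self.centers, x) == j]
            if pts:
                self.models.append(fit_line([p[0] for p in pts], [p[1] for p in pts]))
            else:
                self.models.append((0.0, mean(ys)))  # empty cluster: global mean
        return self

    def predict(self, x):
        # Assign the test instance to a cluster, then use that cluster's model.
        a, b = self.models[nearest(self.centers, x)]
        return a * x + b
```

A typical call would be `ClusterRegressor(k=2).fit(xs, ys).predict(x_new)`. When the relation between landmark and target expression changes between regimes, a per-cluster model can follow each regime where a single global line cannot.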
2.
Genes Genomics ; 42(2): 225-234, 2020 02.
Article in English | MEDLINE | ID: mdl-31833048

ABSTRACT

BACKGROUND: One of the apparent characteristics of bioinformatics data is the combination of a very large number of features and a relatively small number of samples. The vast number of features makes intuitive understanding of a target domain difficult. Dimensionality reduction or manifold learning has the potential to circumvent this obstacle, but a restricted set of methods has been preferred. OBJECTIVE: The objective of this study is to observe the characteristics of various dimensionality reduction methods, namely locally linear embedding (LLE), multi-dimensional scaling (MDS), principal component analysis (PCA), spectral embedding (SE), and t-distributed Stochastic Neighbor Embedding (t-SNE), on the RNA-Seq dataset from the Genotype-Tissue Expression (GTEx) project. RESULTS: The characteristics of the dimensionality reduction methods are observed on the nine groups of three different tissues in reduced spaces of dimensionality two, three, and four. The visualization results show that each dimensionality reduction method produces a very distinct reduced space. The quantitative results are obtained as the performance of k-means clustering. Clustering in the reduced space from non-linear methods such as LLE, t-SNE and SE achieved better results than clustering in the reduced space produced by linear methods like PCA and MDS. CONCLUSIONS: The experimental results suggest applying both linear and non-linear dimensionality reduction methods to the target data in order to grasp the underlying characteristics of the datasets intuitively.


Subjects
RNA-Seq , Algorithms , Cluster Analysis , Principal Component Analysis
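
As a rough illustration of what projecting samples into a low-dimensional reduced space means, here is a minimal PCA sketch, PCA being one of the linear methods compared in the abstract above. It is not the study's implementation, which the abstract does not describe: power iteration with deflation is an assumed, simple way to obtain the leading components, and the function names are hypothetical. The non-linear methods (LLE, SE, t-SNE) are not shown.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract the per-feature mean from every sample."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mu[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpair(m, iters=500):
    """Leading eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    d = len(m)
    v = [1.0 / (i + 1) for i in range(d)]  # fixed, non-random start vector
    norm = dot(v, v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [dot(row, v) for row in m]
        norm = dot(w, w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return dot(v, [dot(row, v) for row in m]), v


def pca(rows, n_components=2):
    """Return (projected samples, principal components)."""
    centered = center(rows)
    m = covariance(centered)
    d = len(m)
    components = []
    for _ in range(n_components):
        lam, v = top_eigenpair(m)
        components.append(v)
        # Deflation: remove the component just found before searching again.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    projected = [[dot(r, v) for v in components] for r in centered]
    return projected, components
```

The projected coordinates could then be handed to a clustering step such as k-means, which is how the abstract says the reduced spaces were compared quantitatively.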
3.
Article in English | MEDLINE | ID: mdl-26357316

ABSTRACT

Efficient search algorithms for finding genomic-range overlaps are essential for various bioinformatics applications. A majority of fast algorithms for searching the overlaps between a query range (e.g., a genomic variant) and a set of N reference ranges (e.g., exons) have a time complexity of O(k + log N), where k denotes a term related to the length and location of the reference ranges. Here, we present a simple but efficient algorithm that reduces k, based on the maximum reference range length. Specifically, for a given query range and the maximum reference range length, the proposed method divides the reference range set into three subsets: always, potentially, and never overlapping. Search effort can therefore be reduced by excluding the never-overlapping subset. We demonstrate that the running time of the proposed algorithm is proportional to the size of the potentially overlapping subset, which in turn is proportional to the maximum reference range length if all other conditions are the same. Moreover, an implementation of our algorithm was 13.8 to 30.0 percent faster than one of the fastest range search methods available when tested on various genomic-range data sets. The proposed algorithm has been incorporated into a disease-linked variant prioritization pipeline for WGS (http://gnome.tchlab.org) and its implementation is available at http://ml.ssu.ac.kr/gSearch.


Subjects
Algorithms , Genomics/methods , Sequence Analysis, DNA/methods , Computer Simulation
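
One way to read the three-subset partition described in the abstract above is sketched below. It is a reconstruction from the abstract alone, not the published gSearch code: it assumes closed integer ranges `(start, end)`, a query with `qs <= qe`, and reference ranges sorted by start; `RangeIndex` and its members are hypothetical names. With maximum reference length L, any reference starting before `qs - L` must end before `qs` (never overlapping), any reference starting inside `[qs, qe]` must overlap (always overlapping), and only references starting in `[qs - L, qs)` need their end checked (potentially overlapping).

```python
from bisect import bisect_left, bisect_right


class RangeIndex:
    """Sorted reference ranges plus the maximum reference range length."""

    def __init__(self, ranges):
        self.ranges = sorted(ranges)
        self.starts = [s for s, _ in self.ranges]
        self.max_len = max((e - s for s, e in self.ranges), default=0)

    def overlaps(self, qs, qe):
        """Return all reference ranges overlapping the closed query [qs, qe]."""
        # Before lo: start < qs - max_len, so end < qs. Never overlapping.
        lo = bisect_left(self.starts, qs - self.max_len)
        # [lo, mid): start < qs, so the end must be checked. Potentially overlapping.
        mid = bisect_left(self.starts, qs)
        # [mid, hi): qs <= start <= qe. Always overlapping.
        hi = bisect_right(self.starts, qe)
        # From hi onward: start > qe. Never overlapping.
        hits = [r for r in self.ranges[lo:mid] if r[1] >= qs]
        hits.extend(self.ranges[mid:hi])
        return hits
```

In this sketch the three boundaries cost binary searches, and the only linear work is the scan over the potentially overlapping slice, whose size grows with the maximum reference range length, in line with the running-time behaviour the abstract reports.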
4.
Methods ; 69(3): 213-9, 2014 Oct 01.
Article in English | MEDLINE | ID: mdl-25072168

ABSTRACT

Faster and cheaper sequencing technologies, together with the ability to sequence uncultured microbes collected from any environment, present us with an opportunity to distill meaningful information from the millions of new genomic sequences obtained from environmental samples, collectively called a metagenome. In contrast to data from conventional cultured microbes, however, metagenomic data is extremely heterogeneous and noisy. Separating the sets of sequenced genomic fragments that belong to different microbes is therefore essential for successful assembly of microbial genomes. In this paper, we present a novel clustering method for a given metagenomic dataset. A metagenomic dataset has some distinguishing features: (i) similar sequence patterns may exist in different species, and (ii) each species has a different number of individuals in the dataset. Our method overcomes these obstacles by using the Gaussian mixture model and analysis of mixture profiles, and by taking advantage of genomic signatures extracted from the metagenomic dataset. Unlike conventional clustering methods, where clusters are discovered through global similarities of data instances, our method builds clusters by combining data instances that share local similarities captured by mixture analysis. By considering shared mixture components, our method is able to create clusters of genomic sequences even when they are globally distinct from each other. We applied our method to an artificial metagenomic dataset comprising 47 million simulated reads from 25 real microbial genomes, and analyzed the resulting clusters in terms of the number of clusters, the number of participating species, and the dominant species in each cluster. Even though our approach cannot address all challenges in the field of metagenome sequence clustering, we believe that our method can contribute a step toward achieving these goals.


Subjects
Computational Biology/methods , High-Throughput Nucleotide Sequencing/methods , Metagenome , Sequence Analysis, DNA/methods , Algorithms , Base Sequence , Cluster Analysis , Genomics , Humans
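
The abstract above relies on genomic signatures as the input to its mixture model but does not say how they are computed. A common choice, assumed here purely for illustration, is a normalized k-mer (oligonucleotide) frequency vector. The sketch below covers only that feature-extraction step, not the Gaussian mixture model or the mixture-profile analysis; `kmer_signature` is a hypothetical name.

```python
from collections import Counter
from itertools import product


def kmer_signature(seq, k=4):
    """Normalized frequency of every A/C/G/T k-mer in seq, in lexicographic order.

    Assumed signature definition; windows containing other symbols (e.g. N)
    are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]
```

Each read or fragment then becomes a fixed-length vector (256 entries for k = 4) that a clustering model can consume regardless of the fragment's length.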