1.
Sci Rep ; 13(1): 10478, 2023 06 28.
Article in English | MEDLINE | ID: mdl-37380723

ABSTRACT

Machine learning-based pathogenicity prediction helps interpret rare missense variants of BRCA1 and BRCA2, which are associated with hereditary cancers. Recent studies have shown that classifiers trained using variants of a specific gene or a set of genes related to a particular disease perform better than those trained using all variants, due to their higher specificity, despite the smaller training dataset size. In this study, we further investigated the advantages of "gene-specific" machine learning compared to "disease-specific" machine learning. We used 1068 rare (gnomAD minor allele frequency (MAF) < 0.005) missense variants of 28 genes associated with hereditary cancers for our investigation. Popular machine learning classifiers were employed: regularized logistic regression, extreme gradient boosting, random forests, support vector machines, and deep neural networks. As features, we used MAFs from multiple populations, functional prediction and conservation scores, and positions of variants. The disease-specific training dataset included the gene-specific training dataset and was > 7 × larger. However, we observed that gene-specific training variants were sufficient to produce the optimal pathogenicity predictor if a suitable machine learning classifier was employed. Therefore, we recommend gene-specific over disease-specific machine learning as an efficient and effective method for predicting the pathogenicity of rare BRCA1 and BRCA2 missense variants.


Subjects
Machine Learning , Missense Mutation , Virulence , Gene Frequency , Computer Neural Networks
2.
PLoS One ; 17(7): e0271260, 2022.
Article in English | MEDLINE | ID: mdl-35901023

ABSTRACT

In numerous classification problems, class distribution is not balanced. For example, positive examples are rare in the fields of disease diagnosis and credit card fraud detection. General machine learning methods are known to be suboptimal for such imbalanced classification. One popular solution is to balance training data by oversampling the underrepresented (or undersampling the overrepresented) classes before applying machine learning algorithms. However, despite its popularity, the effectiveness of sampling has not been rigorously and comprehensively evaluated. This study assessed combinations of seven sampling methods and eight machine learning classifiers (56 varieties in total) using 31 datasets with varying degrees of imbalance. We used the areas under the precision-recall curve (AUPRC) and the receiver operating characteristic curve (AUROC) as the performance measures. The AUPRC is known to be more informative for imbalanced classification than the AUROC. We observed that sampling significantly changed the performance of the classifier (paired t-tests, P < 0.05) in only a few cases (12.2% in AUPRC and 10.0% in AUROC). Surprisingly, sampling was more likely to reduce than to improve classification performance. Moreover, the adverse effects of sampling were more pronounced in AUPRC than in AUROC. Among the sampling methods, undersampling performed worse than the others. Also, sampling was more effective for improving linear classifiers. Most importantly, we did not need sampling to obtain the optimal classifier for most of the 31 datasets. In addition, we found two interesting examples in which sampling significantly reduced AUPRC while significantly improving AUROC (paired t-tests, P < 0.05). In conclusion, the applicability of sampling is limited because it could be ineffective or even harmful. Furthermore, the choice of the performance measure is crucial for decision making. Our results provide valuable insights into the effect and characteristics of sampling for imbalanced classification.


Subjects
Drug-Related Side Effects and Adverse Reactions , Machine Learning , Algorithms , Area Under Curve , Humans , ROC Curve
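The two performance measures named in the abstract above (AUROC and AUPRC) can be illustrated with a small, dependency-free sketch. This is not the study's code: the function names and toy data are hypothetical, the AUPRC is approximated by step-wise average precision, and tied scores are not given any special treatment in that approximation.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count positive/negative pairs ranked correctly; a tie counts one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos


# Hypothetical imbalanced toy data: 2 positives among 10 examples.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05, 0.03, 0.01]
roc_area = auroc(labels, scores)
pr_area = average_precision(labels, scores)
```

On this toy ranking the two measures disagree in magnitude because the two high-scoring negatives cost more precision than they cost in false positive rate, which is one way to see why the choice of measure matters for imbalanced data.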
3.
Biomedicines ; 10(6), 2022 Jun 14.
Article in English | MEDLINE | ID: mdl-35740428

ABSTRACT

DNA methylation modification plays a vital role in the pathophysiology of high blood pressure (BP). Herein, we applied three machine learning (ML) algorithms including deep learning (DL), support vector machine, and random forest for detecting high BP using DNA methylome data. Peripheral blood samples of 50 elderly individuals were collected three times at three visits for DNA methylome profiling. Participants who had a history of hypertension and/or current high BP measure were considered to have high BP. The whole dataset was randomly divided to conduct a nested five-group cross-validation for prediction performance. Data in each outer training set were independently normalized using a min-max scaler, reduced dimensionality using principal component analysis, then fed into three predictive algorithms. Of the three ML algorithms, DL achieved the best performance (AUPRC = 0.65, AUROC = 0.73, accuracy = 0.69, and F1-score = 0.73). To confirm the reliability of using DNA methylome as a biomarker for high BP, we constructed mixed-effects models and found that 61,694 methylation sites located in 15,523 intragenic regions and 16,754 intergenic regions were significantly associated with BP measures. Our proposed models pioneered the methodology of applying ML and DNA methylome data for early detection of high BP in clinical practices.
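The abstract above states that each outer training set was normalized independently with a min-max scaler. A minimal sketch of that idea follows, assuming (this is not stated in the abstract) that the bounds learned on the training rows are then reused unchanged on held-out rows. The function names and data are hypothetical.

```python
def fit_min_max(train_rows):
    """Learn per-feature minima and maxima from the training rows only."""
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, lows, highs):
    """Rescale rows with previously learned bounds; constant features map to 0."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical data: two training samples and one held-out sample.
train = [[0.0, 10.0], [2.0, 20.0]]
held_out = [[1.0, 25.0]]
lows, highs = fit_min_max(train)
train_scaled = apply_min_max(train, lows, highs)
held_out_scaled = apply_min_max(held_out, lows, highs)
```

Under this assumption, held-out values may fall outside [0, 1], as the second feature does here; that is expected when the scaler never sees the held-out data.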

4.
BMC Bioinformatics ; 23(1): 109, 2022 Mar 30.
Article in English | MEDLINE | ID: mdl-35354356

ABSTRACT

BACKGROUND: In shotgun proteomics, database search engines have been developed to assign peptides to tandem mass (MS/MS) spectra and at the same time post-processing (or rescoring) approaches over the search results have been proposed to increase the number of confident peptide identifications. The most popular post-processing approaches such as Percolator and PeptideProphet have improved rates of peptide identifications by combining multiple scores from database search engines while applying machine learning techniques. Existing post-processing approaches, however, are limited when dealing with results from new search engines because their features for machine learning must be optimized specifically for each search engine. RESULTS: We propose a universal post-processing tool, called TIDD, which supports confident peptide identifications regardless of the search engine adopted. TIDD can work for any (including newly developed) search engines because it calculates universal features that assess peptide-spectrum match quality while it allows additional features provided by search engines (or users) as well. Even though it relies on universal features independent of search tools, TIDD showed similar or better performance than Percolator in terms of peptide identification. TIDD identified 10.23-38.95% more PSMs than target-decoy estimation for MSFragger, which is not supported by Percolator. TIDD offers an easy-to-use simple graphical user interface for user convenience. CONCLUSIONS: TIDD successfully eliminated the requirement for an optimal feature engineering per database search tool, and thus, can be applied directly to any database search results including newly developed ones.


Subjects
Algorithms , Tandem Mass Spectrometry , Protein Databases , Machine Learning , Peptides , Tandem Mass Spectrometry/methods
5.
Sci Rep ; 9(1): 3219, 2019 03 01.
Article in English | MEDLINE | ID: mdl-30824715

ABSTRACT

Comprehensive and accurate detection of variants from whole-genome sequencing (WGS) is a strong prerequisite for translational genomic medicine; however, low concordance between analytic pipelines is an outstanding challenge. We processed one European and one African WGS sample with 70 analytic pipelines comprising combinations of 7 short-read aligners and 10 variant calling algorithms (VCAs), and observed remarkable differences in the number of variants called by different pipelines (max/min ratio: 1.3 to 3.4). The similarity between variant call sets was determined more by VCAs than by short-read aligners. Remarkably, reported minor allele frequency had a substantial effect on concordance between pipelines (concordance rate ratio: 0.11 to 0.92; Wald tests, P < 0.001), entailing more discordant results for rare and novel variants. We compared the performance of analytic pipelines and pipeline ensembles using gold-standard variant call sets and the catalog of variants from the 1000 Genomes Project. Notably, a single pipeline using BWA-MEM and GATK-HaplotypeCaller performed comparably to the pipeline ensembles for 'callable' regions (~97%) of the human reference genome. While a single pipeline is capable of analyzing common variants in most genomic regions, our findings demonstrated the limitations and challenges in analyzing rare or novel variants, especially for non-European genomes.


Subjects
Genetic Variation , Human Genome/genetics , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Whole Genome Sequencing/methods , Algorithms , Alleles , Black People/genetics , Gene Frequency , Genotype , Haplotypes , Humans , Single Nucleotide Polymorphism , White People/genetics
6.
Cancer Cell ; 35(1): 111-124.e10, 2019 01 14.
Article in English | MEDLINE | ID: mdl-30645970

ABSTRACT

We report proteogenomic analysis of diffuse gastric cancers (GCs) in young populations. Phosphoproteome data elucidated signaling pathways associated with somatic mutations based on mutation-phosphorylation correlations. Moreover, correlations between mRNA and protein abundances provided potential oncogenes and tumor suppressors associated with patient survival. Furthermore, integrated clustering of mRNA, protein, phosphorylation, and N-glycosylation data identified four subtypes of diffuse GCs. Distinguishing these subtypes was possible by proteomic data. Four subtypes were associated with proliferation, immune response, metabolism, and invasion, respectively; and associations of the subtypes with immune- and invasion-related pathways were identified mainly by phosphorylation and N-glycosylation data. Therefore, our proteogenomic analysis provides additional information beyond genomic analyses, which can improve understanding of cancer biology and patient stratification in diffuse GCs.


Subjects
Gene Regulatory Networks , Mutation , Proteogenomics/methods , Stomach Neoplasms/genetics , Stomach Neoplasms/metabolism , Age of Onset , Female , Glycosylation , Humans , Male , Phosphorylation , Protein Interaction Maps , Survival Analysis , Exome Sequencing/methods
7.
J Proteome Res ; 16(6): 2231-2239, 2017 06 02.
Article in English | MEDLINE | ID: mdl-28452485

ABSTRACT

Proteogenomic searches are useful for novel peptide identification from tandem mass spectra. Usually, separate and multistage approaches are adopted to accurately control the false discovery rate (FDR) for proteogenomic search. Their performance on novel peptide identification has not been thoroughly evaluated, however, mainly due to the difficulty in confirming the existence of identified novel peptides. We simulated a proteogenomic search using a controlled, spike-in proteomic data set. After confirming that the results of the simulated proteogenomic search were similar to those of a real proteogenomic search using a human cell line data set, we evaluated the performance of six FDR control methods (global, separate, and multistage FDR estimation, each coupled to either a target-decoy search or a mixture model-based method) on novel peptide identification. The multistage approach showed the highest accuracy for FDR estimation. However, global and separate FDR estimation with the mixture model-based method showed higher sensitivities than the others at the same true FDR. Furthermore, the mixture model-based method performed equally well when applied without or with a reduced set of decoy sequences. Considering different prior probabilities for novel and known protein identification, we recommend using mixture model-based methods with separate FDR estimation for sensitive and reliable identification of novel peptides from proteogenomic searches.


Subjects
Peptides/analysis , Proteogenomics/methods , Cell Line , Computer Simulation , False Positive Reactions , Humans , Methods , Theoretical Models , Tandem Mass Spectrometry
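The difference between global and separate FDR estimation with a target-decoy search, as discussed in the abstract above, can be sketched as follows. This is an illustrative simplification rather than the evaluated methods: FDR is estimated as decoys divided by targets above a score cutoff, tied scores are not handled specially, and the function name, class names, and scores are hypothetical.

```python
def fdr_threshold(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose decoy/target ratio is within max_fdr.

    psms: list of (score, is_decoy) pairs; a higher score means a better match.
    Returns None when no cutoff satisfies the limit.
    """
    best = None
    targets = decoys = 0
    for score, is_decoy in sorted(psms, key=lambda p: -p[0]):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best = score
    return best


# Hypothetical peptide-spectrum matches, split into known and novel peptides.
known = [(10, False), (9, False), (8, False), (7, True)]
novel = [(9.5, False), (8.5, True), (7.5, False), (6.5, False)]

# Global estimation pools both classes; separate estimation filters each alone.
global_cutoff = fdr_threshold(known + novel, max_fdr=0.25)
separate_cutoffs = {
    "known": fdr_threshold(known, max_fdr=0.25),
    "novel": fdr_threshold(novel, max_fdr=0.25),
}
```

In this toy case the pooled cutoff (7.5) would accept a novel match that the novel-only cutoff (9.5) rejects, because the many confident known matches dilute the decoy ratio of the novel class.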
8.
BMC Genomics ; 17(Suppl 13): 1031, 2016 12 22.
Article in English | MEDLINE | ID: mdl-28155652

ABSTRACT

BACKGROUND: Proteogenomics is a promising approach for various tasks ranging from gene annotation to cancer research. Databases for proteogenomic searches are often constructed by adding peptide sequences inferred from genomic or transcriptomic evidence to reference protein sequences. Such inflation of databases has potential of identifying novel peptides. However, it also raises concerns on sensitive and reliable peptide identification. Spurious peptides included in target databases may result in underestimated false discovery rate (FDR). On the other hand, inflation of decoy databases could decrease the sensitivity of peptide identification due to the increased number of high-scoring random hits. Although several studies have addressed these issues, widely applicable guidelines for sensitive and reliable proteogenomic search have hardly been available. RESULTS: To systematically evaluate the effect of database inflation in proteogenomic searches, we constructed a variety of real and simulated proteogenomic databases for yeast and human tandem mass spectrometry (MS/MS) data, respectively. Against these databases, we tested two popular database search tools with various approaches to search result validation: the target-decoy search strategy (with and without a refined scoring-metric) and a mixture model-based method. The effect of separate filtering of known and novel peptides was also examined. The results from real and simulated proteogenomic searches confirmed that separate filtering increases the sensitivity and reliability in proteogenomic search. However, no one method consistently identified the largest (or the smallest) number of novel peptides from real proteogenomic searches. CONCLUSIONS: We propose to use a set of search result validation methods with separate filtering, for sensitive and reliable identification of peptides in proteogenomic search.


Subjects
Genetic Databases , Peptides/metabolism , Proteogenomics/methods , Humans , Peptides/chemistry , Reproducibility of Results , Search Engine , Sensitivity and Specificity , Tandem Mass Spectrometry , Yeasts/metabolism
9.
Article in English | MEDLINE | ID: mdl-26357316

ABSTRACT

Efficient search algorithms for finding genomic-range overlaps are essential for various bioinformatics applications. Most fast algorithms for searching the overlaps between a query range (e.g., a genomic variant) and a set of N reference ranges (e.g., exons) have a time complexity of O(k + log N), where k denotes a term related to the length and location of the reference ranges. Here, we present a simple but efficient algorithm that reduces k, based on the maximum reference range length. Specifically, for a given query range and the maximum reference range length, the proposed method divides the reference range set into three subsets: always, potentially, and never overlapping. Search effort can therefore be reduced by excluding the never-overlapping subset. We demonstrate that the running time of the proposed algorithm is proportional to the size of the potentially overlapping subset, which in turn is proportional to the maximum reference range length if all other conditions are the same. Moreover, an implementation of our algorithm was 13.8 to 30.0 percent faster than one of the fastest range search methods available when tested on various genomic-range data sets. The proposed algorithm has been incorporated into a disease-linked variant prioritization pipeline for WGS (http://gnome.tchlab.org) and its implementation is available at http://ml.ssu.ac.kr/gSearch.


Subjects
Algorithms , Genomics/methods , DNA Sequence Analysis/methods , Computer Simulation
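One possible reading of the three-subset idea in the abstract above is sketched below. It is not the gSearch implementation: it assumes closed integer ranges, a non-empty reference set sorted by start position, and hypothetical function names. References starting inside the query always overlap; those starting up to one maximum length before the query might overlap and need an end check; everything else can be skipped.

```python
from bisect import bisect_left, bisect_right


def build_index(ranges):
    """Sort reference ranges by start and record the maximum range length."""
    refs = sorted(ranges)
    starts = [start for start, _ in refs]
    max_len = max(end - start for start, end in refs)
    return refs, starts, max_len


def find_overlaps(index, q_start, q_end):
    """Return the reference ranges overlapping the closed query [q_start, q_end]."""
    refs, starts, max_len = index
    lo = bisect_left(starts, q_start - max_len)  # before lo: never overlapping
    mid = bisect_left(starts, q_start)           # [lo, mid): potentially overlapping
    hi = bisect_right(starts, q_end)             # [mid, hi): always overlapping
    hits = [r for r in refs[lo:mid] if r[1] >= q_start]  # only these need a check
    hits.extend(refs[mid:hi])
    return hits


# Hypothetical reference ranges (e.g., exons) and one query (e.g., a variant).
index = build_index([(1, 5), (3, 4), (6, 20), (10, 12), (30, 35)])
overlaps = find_overlaps(index, 5, 11)
```

The three binary searches cost O(log N), and the linear work is limited to the potentially overlapping slice, whose width shrinks as the maximum reference range length shrinks.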
10.
Liver Int ; 35(12): 2537-46, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26148225

ABSTRACT

BACKGROUND & AIMS: The I148M variant, caused by a C-to-G substitution in PNPLA3 (rs738409), is associated with an increased risk of nonalcoholic fatty liver disease (NAFLD). In the liver, the I148M variant reduces the hydrolytic function of PNPLA3, which results in hepatic steatosis; however, its association with other clinical phenotypes such as adiposity and metabolic diseases is not well established. METHODS: To identify the impact of the I148M variant on clinical risk factors of NAFLD, we recruited 1363 generally healthy Korean males after excluding alcoholic and secondary causes of hepatic steatosis. Central adiposity was assessed by computed tomography, and hepatic steatosis was evaluated by abdominal ultrasonography. RESULTS: The participants were predominantly middle-aged (49.0 ± 7.1 years; range 30-60 years), and the frequency of NAFLD was 44.2%. The rs738409-G allele carriers had a 1.19-fold increased risk for NAFLD (minor allele frequency 0.43; allelic odds ratio 1.38; P = 4.3 × 10^-5). Interestingly, the rs738409 GG carriers showed significantly lower levels of visceral and subcutaneous adiposity (P < 0.001 and P = 0.015, respectively), BMI (P < 0.001), triglycerides (P < 0.001) and insulin resistance (P = 0.002) compared to CC carriers. These negative associations between clinical risk factors and rs738409-G dosage were more prominent in the non-NAFLD group than in the NAFLD group. CONCLUSIONS: The I148M variant, although increasing the risk of NAFLD, was associated with reduced levels of central adiposity, BMI, serum triglycerides and insulin resistance, suggesting differential roles in fat storage and distribution according to cell types and metabolic status.


Subjects
Lipase/genetics , Liver , Membrane Proteins/genetics , Metabolic Diseases , Non-alcoholic Fatty Liver Disease , Abdominal Obesity , Adult , Body Mass Index , Genetic Predisposition to Disease , Humans , Insulin Resistance/genetics , Liver/metabolism , Liver/pathology , Male , Metabolic Diseases/diagnosis , Metabolic Diseases/genetics , Middle Aged , Non-alcoholic Fatty Liver Disease/diagnosis , Non-alcoholic Fatty Liver Disease/genetics , Abdominal Obesity/diagnosis , Abdominal Obesity/genetics , Single Nucleotide Polymorphism , Republic of Korea , Triglycerides/blood
11.
J Proteome Res ; 13(7): 3488-97, 2014 Jul 03.
Article in English | MEDLINE | ID: mdl-24918111

ABSTRACT

Isobaric tag-based quantification such as iTRAQ and TMT is a promising approach to mass spectrometry-based quantification in proteomics as it provides wide proteome coverage with greatly increased experimental throughput. However, it is known to suffer from inaccurate quantification and identification of a target peptide due to cofragmentation of multiple peptides, which likely leads to under-estimation of differentially expressed peptides (DEPs). A simple method of filtering out cofragmented spectra with less than 100% precursor isolation purity (PIP) would decrease the coverage of iTRAQ/TMT experiments. In order to estimate the impact of cofragmentation on quantification and identification of iTRAQ-labeled peptide samples, we generated multiplexed spectra with varying degrees of PIP by mixing two MS/MS spectra of 100% PIP obtained in global proteome profiling experiments on gastric tumor-normal tissue pair proteomes labeled by 4-plex iTRAQ. Despite cofragmentation, the simulation experiments showed that more than 99% of multiplexed spectra with PIP greater than 80% were correctly identified by three different database search engines: MODa, MS-GF+, and Proteome Discoverer. Using the multiplexed spectra that had been correctly identified, we estimated the effect of cofragmentation on peptide quantification. In 74% of the multiplexed spectra, however, the cancer-to-normal expression ratio was compressed, and a fair number of spectra showed the "ratio inflation" phenomenon. On the basis of the estimated distribution of distortions on quantification, we were able to calculate cutoff values for DEP detection from cofragmented spectra, which were corrected according to a specific PIP and probability of type I (or type II) error. When we applied these corrected cutoff values to real cofragmented spectra with PIP larger than or equal to 70%, we were able to identify reliable DEPs by removing about 25% of DEPs, which are highly likely to be false positives. Our experimental results provide useful insight into the effect of cofragmentation on isobaric tag-based quantification methods. The simulation procedure as well as the corrected cutoff calculation method could be adopted for quantifying the effect of cofragmentation and reducing false positives (or false negatives) in DEP identification with general quantification experiments based on isobaric labeling techniques.


Subjects
Peptide Fragments/chemistry , Proteome/chemistry , Amino Acid Sequence , Computer Simulation , Humans , Molecular Sequence Data , Peptide Mapping , Proteolysis , Proteome/metabolism , Proteomics , Stomach Neoplasms/metabolism , Tandem Mass Spectrometry
12.
Hum Mutat ; 35(8): 936-44, 2014 Aug.
Article in English | MEDLINE | ID: mdl-24829188

ABSTRACT

As whole genome sequencing (WGS) uncovers variants associated with rare and common diseases, an immediate challenge is to minimize false-positive findings due to sequencing and variant calling errors. False positives can be reduced by combining results from orthogonal sequencing methods, but doing so is costly. Here, we present variant filtering approaches using logistic regression (LR) and ensemble genotyping to minimize false positives without sacrificing sensitivity. We evaluated the methods using paired WGS datasets of an extended family prepared using two sequencing platforms and a validated set of variants in NA12878. Using LR or ensemble genotyping-based filtering, false-negative rates were significantly reduced by 1.1- to 17.8-fold at the same levels of false discovery rates (5.4% for heterozygous and 4.5% for homozygous single nucleotide variants (SNVs); 30.0% for heterozygous and 18.7% for homozygous insertions; 25.2% for heterozygous and 16.6% for homozygous deletions) compared to filtering based on genotype quality scores. Moreover, ensemble genotyping excluded > 98% (105,080 of 107,167) of false positives while retaining > 95% (897 of 937) of true positives in de novo mutation (DNM) discovery in NA12878, and performed better than a consensus method using two sequencing platforms. Our proposed methods were effective in prioritizing phenotype-associated variants, and ensemble genotyping would be essential to minimize false-positive DNM candidates.


Subjects
Algorithms , Human Genome , Incidental Findings , Mutation , Single Nucleotide Polymorphism , Precursor Cell Lymphoblastic Leukemia-Lymphoma/genetics , Tumor Cell Line , False Positive Reactions , Genotyping Techniques/statistics & numerical data , Heterozygote , High-Throughput Nucleotide Sequencing , Homozygote , Humans , Logistic Models , Molecular Sequence Annotation , Insertional Mutagenesis , Pedigree
13.
Hum Mutat ; 35(5): 537-47, 2014 May.
Article in English | MEDLINE | ID: mdl-24478219

ABSTRACT

Whole-genome sequencing (WGS) studies are uncovering disease-associated variants in both rare and nonrare diseases. Using next-generation sequencing for WGS requires a series of computational methods for alignment, variant detection, and annotation, and the accuracy and reproducibility of annotation results are essential for clinical implementation. However, annotating WGS with up-to-date genomic information is still challenging for biomedical researchers. Here, we present one of the fastest and most scalable annotation, filtering, and analysis pipelines, gNOME, to prioritize phenotype-associated variants while minimizing false-positive findings. The intuitive graphical user interface of gNOME facilitates the selection of phenotype-associated variants, and the result summaries are provided at variant, gene, and genome levels. Moreover, enrichment results for specific variants, genes, and gene sets, computed between two groups or against population-scale WGS datasets that are already integrated in the pipeline, can aid interpretation. We found a small number of discordant results between annotation software tools, in part due to different reporting strategies for variants with complex impacts. Using two published whole-exome datasets of uveal melanoma and bladder cancer, we demonstrated gNOME's accuracy of variant annotation and the enrichment of loss-of-function variants in known cancer pathways. The gNOME Web server and source code are freely available to the academic community (http://gnome.tchlab.org).


Subjects
Human Genome , High-Throughput Nucleotide Sequencing , Software , Exome , Genomics , Humans , Internet , Molecular Sequence Annotation , Phenotype , Single Nucleotide Polymorphism
14.
BMB Rep ; 46(1): 41-6, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23351383

ABSTRACT

Identifying genes indispensable for an organism's life and their characteristics is one of the central questions in current biological research, and hence it would be helpful to develop computational approaches towards the prediction of essential genes. The performance of a predictor is usually measured by the area under the receiver operating characteristic curve (AUC). We propose a novel method by implementing genetic algorithms to maximize the partial AUC that is restricted to a specific interval of lower false positive rate (FPR), the region relevant to follow-up experimental validation. Our predictor uses various features based on sequence information, protein-protein interaction network topology, and gene expression profiles. A feature selection wrapper was developed to alleviate the over-fitting problem and to weigh each feature's relevance to prediction. We evaluated our method using the proteome of budding yeast. Our implementation of genetic algorithms maximizing the partial AUC below 0.05 or 0.10 of FPR outperformed other popular classification methods.


Subjects
Algorithms , Area Under Curve , Essential Genes , ROC Curve , Saccharomyces cerevisiae/genetics
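The partial AUC that the abstract above uses as an optimization target can be computed as sketched below. This covers only the objective, not the genetic algorithm or the feature selection wrapper; the function name and toy data are hypothetical, the ROC curve is integrated with trapezoids, and tied scores are treated as separate steps.

```python
def partial_auc(labels, scores, max_fpr=0.1):
    """Area under the ROC curve restricted to false positive rates in [0, max_fpr]."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    area = 0.0
    for _, y in sorted(zip(scores, labels), key=lambda t: -t[0]):
        if y == 1:
            tp += 1
        else:
            fp += 1
        fpr, tpr = fp / n_neg, tp / n_pos
        if fpr > max_fpr:
            # Clip the final segment at max_fpr by linear interpolation.
            frac = (max_fpr - prev_fpr) / (fpr - prev_fpr)
            tpr_cut = prev_tpr + frac * (tpr - prev_tpr)
            return area + (max_fpr - prev_fpr) * (prev_tpr + tpr_cut) / 2
        area += (fpr - prev_fpr) * (prev_tpr + tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
    return area


# Hypothetical ranking: 2 essential genes (label 1) among 12 candidates.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.99, 0.95, 0.90, 0.85, 0.80, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
pauc_20 = partial_auc(labels, scores, max_fpr=0.2)
full_auc = partial_auc(labels, scores, max_fpr=1.0)
```

The unnormalized partial area is at most max_fpr, so values for different cutoffs are not directly comparable unless they are rescaled.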
15.
Bioinformatics ; 28(16): 2176-7, 2012 Aug 15.
Article in English | MEDLINE | ID: mdl-22730434

ABSTRACT

BACKGROUND: Various processes such as annotation and filtering of variants or comparison of variants in different genomes are required in whole-genome or exome analysis pipelines. However, processing different databases and searching among millions of genomic loci is not trivial. RESULTS: gSearch compares sequence variants in the Genome Variation Format (GVF) or Variant Call Format (VCF) with a pre-compiled annotation or with variants in other genomes. Its search algorithms are subsequently optimized and implemented in a multi-threaded manner. The proposed method is not a stand-alone annotation tool with its own reference databases. Rather, it is a search utility that readily accepts public or user-prepared reference files in various formats including GVF, Generic Feature Format version 3 (GFF3), Gene Transfer Format (GTF), VCF and Browser Extensible Data (BED) format. Compared to existing tools such as ANNOVAR, gSearch runs more than 10 times faster. For example, it is capable of annotating 52.8 million variants with allele frequencies in 6 min. AVAILABILITY: gSearch is available at http://ml.ssu.ac.kr/gSearch. It can be used as an independent search tool or can easily be integrated to existing pipelines through various programming environments such as Perl, Ruby and Python.


Subjects
Genomics/methods , DNA Sequence Analysis/methods , Software , Algorithms , Molecular Sequence Annotation , Search Engine
16.
BMC Bioinformatics ; 13 Suppl 17: S23, 2012.
Article in English | MEDLINE | ID: mdl-23282007

ABSTRACT

BACKGROUND: Multidimensional scaling (MDS) is a widely used approach to dimensionality reduction. It has been applied to feature selection and visualization in various areas. Among diverse MDS methods, the classical MDS is a simple and theoretically sound solution for projecting data objects onto a low dimensional space while preserving the original distances among them as much as possible. However, it is not trivial to apply it to genome-scale data (e.g., microarray gene expression profiles) on regular desktop computers, because of its high computational complexity. RESULTS: We implemented a highly-efficient software application, called CFMDS (CUDA-based Fast MultiDimensional Scaling), which produces an approximate solution of the classical MDS based on CUDA (compute unified device architecture) and the divide-and-conquer principle. CUDA is a parallel computing architecture exploiting the power of the GPU (graphics processing unit). The principle of divide-and-conquer was adopted for circumventing the small memory problem of usual graphics cards. Our application software has been tested on various benchmark datasets including microarrays and compared with the classical MDS algorithms implemented using C# and MATLAB. In our experiments, CFMDS was more than a hundred times faster for large data than such general solutions. Regarding the quality of dimensionality reduction, our approximate solutions were as good as those from the general solutions, as the Pearson's correlation coefficients between them were larger than 0.9. CONCLUSIONS: CFMDS is an expeditious solution for the data dimensionality reduction problem. It is especially useful for efficient processing of genome-scale data consisting of several thousands of objects in several minutes.


Subjects
Genome , Genomics/methods , Software , Algorithms , Animals , Humans , Mice , Multivariate Analysis
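For reference, the classical MDS that the abstract above approximates is usually stated as follows. This is the standard textbook formulation, not the CFMDS divide-and-conquer procedure; the symbols (n objects, target dimension k, squared-distance matrix D^(2)) are introduced here for illustration.

```latex
% Double-center the matrix of squared pairwise distances D^{(2)} = (d_{ij}^2):
J = I_n - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{\top},
\qquad
B = -\tfrac{1}{2}\, J\, D^{(2)} J .
% Eigendecompose B and keep the k largest (positive) eigenvalues:
B = V \Lambda V^{\top},
\qquad
X = V_k\, \Lambda_k^{1/2} \in \mathbb{R}^{n \times k} .
```

The rows of X are the low-dimensional coordinates. The eigendecomposition of the n × n matrix B is the step that becomes expensive for genome-scale data.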
17.
PLoS One ; 6(6): e20252, 2011.
Article in English | MEDLINE | ID: mdl-21738571

ABSTRACT

The molecular basis of heat shock response (HSR), a cellular defense mechanism against various stresses, is not well understood. In this, the first comprehensive analysis of gene expression changes in response to heat shock and MG132 (a proteasome inhibitor), both of which are known to induce heat shock proteins (Hsps), we compared the responses of the normal mouse fibrosarcoma cell line RIF-1 and its thermotolerant variant cell line TR-RIF-1 (TR) to the two stresses. The cellular responses we examined included Hsp expression, cell viability, total protein synthesis patterns, and accumulation of poly-ubiquitinated proteins. We also compared the mRNA expression profiles and kinetics in the two cell lines exposed to the two stresses, using microarray analysis. In contrast to RIF-1 cells, TR cells resisted heat shock-induced changes in cell viability and whole-cell protein synthesis. The patterns of total cellular protein synthesis and accumulation of poly-ubiquitinated proteins in the two cell lines were distinct, depending on the stress and the cell line. Microarray analysis revealed that the gene expression response of TR cells to heat shock was faster and more transient than that of RIF-1 cells, while both RIF-1 and TR cells showed similar kinetics of mRNA expression in response to MG132. We also found that 2,208 genes were up-regulated more than 2-fold, and we could sort them into three groups: 1) genes regulated by both heat shock and MG132 (e.g. chaperones); 2) those regulated only by heat shock (e.g. DNA binding proteins including histones); and 3) those regulated only by MG132 (e.g. innate immunity and defense related molecules). This study shows that heat shock and MG132 share some aspects of the HSR signaling pathway while also inducing distinct stress response signaling pathways triggered by distinct abnormal proteins.


Subjects
Heat-Shock Response/drug effects; Leupeptins/pharmacology; Animals; Antiviral Agents/pharmacology; Cell Line, Tumor; Cell Survival/drug effects; Cell Survival/genetics; Heat-Shock Response/genetics; Mice; Oligonucleotide Array Sequence Analysis; Protein Biosynthesis/drug effects; Protein Biosynthesis/genetics; Respiratory Syncytial Viruses/drug effects; Ribavirin/pharmacology; Signal Transduction/drug effects; Signal Transduction/genetics
18.
Bioinformatics ; 23(9): 1141-7, 2007 May 01.
Article in English | MEDLINE | ID: mdl-17350973

ABSTRACT

MOTIVATION: MicroRNAs (miRNAs) and mRNAs constitute an important part of gene regulatory networks, influencing diverse biological phenomena. Elucidating closely related miRNAs and mRNAs can be an essential first step towards the discovery of their combinatorial effects on different cellular states. Here, we propose a probabilistic learning method to identify synergistic miRNAs involved in the regulation of their condition-specific target genes (mRNAs) from multiple information sources, i.e. computationally predicted target genes of miRNAs and their respective expression profiles. RESULTS: We used data sets consisting of miRNA-target gene binding information and expression profiles of miRNAs and mRNAs from human cancer samples. Our method allowed us to detect functionally correlated miRNA-mRNA modules involved in specific biological processes from multiple data sources by using a balanced fitness function and efficient searching over multiple populations. The proposed algorithm found two miRNA-mRNA modules that were highly correlated with respect to their expression and biological function. Moreover, the mRNAs included in the same module showed much higher correlations when the related miRNAs were highly expressed, demonstrating our method's ability to find coherent miRNA-mRNA modules. Most members of these modules have been reported to be closely related to cancer. Consequently, our method can provide a primary source of miRNA and target sets presumed to constitute closely related parts of gene regulatory pathways.


Subjects
Artificial Intelligence; Genetics, Population; MicroRNAs/genetics; Models, Genetic; Pattern Recognition, Automated/methods; RNA, Messenger/genetics; Sequence Analysis, RNA/methods; Base Sequence; Binding Sites; Computer Simulation; Evolution, Molecular; Molecular Sequence Data
19.
IEEE Trans Syst Man Cybern B Cybern ; 35(6): 1302-10, 2005 Dec.
Article in English | MEDLINE | ID: mdl-16366254

ABSTRACT

Bayesian model averaging (BMA) can resolve the overfitting problem by explicitly incorporating the model uncertainty into the analysis procedure. Hence, it can be used to improve the generalization performance of Bayesian network classifiers. Until now, BMA of Bayesian network classifiers has only been performed in some restricted forms, e.g., the model is averaged given a single node-order, because of its heavy computational burden. However, it can be hard to obtain a good node-order when the available training dataset is sparse. To alleviate this problem, we propose BMA of Bayesian network classifiers over several distinct node-orders obtained using the Markov chain Monte Carlo sampling technique. The proposed method was examined using two synthetic problems and four real-life datasets. First, we show that the proposed method is especially effective when the given dataset is very sparse. The classification accuracy of averaging over multiple node-orders was higher in most cases than that achieved using a single node-order in our experiments. We also present experimental results for test datasets with unobserved variables, where the quality of the averaged node-order is more important. Through these experiments, we show that the difference in classification performance between the cases of multiple node-orders and single node-order is related to the level of noise, confirming the relative benefit of averaging over multiple node-orders for incomplete data. We conclude that BMA of Bayesian network classifiers over multiple node-orders has an apparent advantage when the given dataset is sparse and noisy, despite the method's heavy computational cost.
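
The idea of averaging over node-orders drawn by Markov chain Monte Carlo can be illustrated with a minimal sketch. This is not the authors' implementation: the swap proposal, the function names (`sample_orders`, `average_predictions`), and the toy scoring and prediction functions are all assumptions made for illustration. Because the sampler draws orders in proportion to their (unnormalized) posterior score, the model-averaged prediction is simply the unweighted mean of the per-order predictions.

```python
import math
import random


def sample_orders(log_score, n_nodes, n_samples, burn_in=200, seed=0):
    """Metropolis-Hastings over node-orders with a swap-two-positions proposal.

    log_score is a caller-supplied (hypothetical) function returning the
    unnormalized log posterior of a node-order.
    """
    rng = random.Random(seed)
    order = list(range(n_nodes))
    current = log_score(order)
    samples = []
    for step in range(burn_in + n_samples):
        i, j = rng.sample(range(n_nodes), 2)
        proposal = order[:]
        proposal[i], proposal[j] = proposal[j], proposal[i]
        proposed = log_score(proposal)
        # The swap proposal is symmetric, so the acceptance probability
        # is min(1, exp(proposed - current)).
        if proposed >= current or rng.random() < math.exp(proposed - current):
            order, current = proposal, proposed
        if step >= burn_in:
            samples.append(tuple(order))
    return samples


def average_predictions(samples, predict_proba, x):
    """Average class probabilities over the sampled orders (equal weights)."""
    total = None
    for order in samples:
        probs = predict_proba(order, x)
        total = list(probs) if total is None else [t + p for t, p in zip(total, probs)]
    return [t / len(samples) for t in total]


def toy_log_score(order):
    # Hypothetical score: favors orders that place node 0 early.
    return -float(order.index(0))


def toy_predict_proba(order, x):
    # Hypothetical two-class classifier whose confidence depends on the order.
    p = 0.9 if order[0] == 0 else 0.6
    return [p, 1.0 - p]


orders = sample_orders(toy_log_score, n_nodes=4, n_samples=500)
bma = average_predictions(orders, toy_predict_proba, x=None)
```

In a real setting, `log_score` would be replaced by a marginal-likelihood computation for the order and `predict_proba` by the classifier learned under that order; both are placeholders here.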


Subjects
Algorithms; Bayes Theorem; Information Storage and Retrieval/methods; Neural Networks, Computer; Pattern Recognition, Automated/methods; Artificial Intelligence; Cluster Analysis; Databases, Factual
20.
J Bioinform Comput Biol ; 3(1): 61-77, 2005 Feb.
Article in English | MEDLINE | ID: mdl-15751112

ABSTRACT

Combined analysis of microarray and drug-activity datasets has the potential to reveal valuable knowledge about various relations among gene expression and drug activity in the malignant cell. In this paper, we apply Bayesian networks, a tool for compact representation of the joint probability distribution, to such analysis. To alleviate the data dimensionality problem, the huge datasets were condensed using a feature abstraction technique. The proposed analysis method was applied to the NCI60 dataset (http://discover.nci.nih.gov) consisting of gene expression profiles and drug activity patterns on human cancer cell lines. The Bayesian networks learned from the condensed dataset identified most of the salient pairwise correlations and some known relationships among several features in the original dataset, confirming the effectiveness of the proposed feature abstraction method. In addition, a survey of the recent literature confirms that several relationships appearing in the learned Bayesian network are biologically meaningful.
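
For readers unfamiliar with the term, the "compact representation of the joint probability distribution" refers to the standard Bayesian network factorization. The notation below is introduced here for illustration and does not come from the abstract: $X_1, \dots, X_n$ are the variables (for example, abstracted gene-expression and drug-activity features) and $\mathrm{Pa}(X_i)$ is the set of parents of $X_i$ in the network graph.

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

Each factor involves only a variable and its parents, which is why a sparse network needs far fewer parameters than the full joint table.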


Subjects
Antineoplastic Agents/administration & dosage; Artificial Intelligence; Gene Expression Profiling/methods; Gene Expression Regulation, Neoplastic/drug effects; Neoplasm Proteins/metabolism; Neoplasms/metabolism; Oligonucleotide Array Sequence Analysis/methods; Pattern Recognition, Automated/methods; Algorithms; Databases, Factual; Drug Design; Humans; Neoplasm Proteins/genetics; Neoplasms/genetics; Signal Transduction/drug effects