Results 1 - 20 of 36
1.
Nucleic Acids Res ; 50(D1): D988-D995, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34791404

ABSTRACT

Ensembl (https://www.ensembl.org) is unique in its flexible infrastructure for access to genomic data and annotation. It has been designed to efficiently deliver annotation at scale for all eukaryotic life, and it also provides deep comprehensive annotation for key species. Genomes representing a greater diversity of species are increasingly being sequenced. In response, we have focussed our recent efforts on expediting the annotation of new assemblies. Here, we report the release of the greatest annual number of newly annotated genomes in the history of Ensembl via our dedicated Ensembl Rapid Release platform (http://rapid.ensembl.org). We have also developed a new method to generate comparative analyses at scale for these assemblies and, for the first time, we have annotated non-vertebrate eukaryotes. Meanwhile, we continually improve, extend and update the annotation for our high-value reference vertebrate genomes and report the details here. We have a range of specific software tools for specific tasks, such as the Ensembl Variant Effect Predictor (VEP) and the newly developed interface for the Variant Recoder. All Ensembl data, software and tools are freely available for download and are accessible programmatically.


Subjects
Genetic Databases , Genome/genetics , Molecular Sequence Annotation , Software , Animals , Computational Biology/classification , Humans
2.
São Paulo; s.n; s.n; 2022. 186 p. tables, graphs, illustrations.
Thesis in Portuguese | LILACS | ID: biblio-1397348

ABSTRACT

The methodological and instrumental advances resulting from the Human Genome Project created the framework needed for the emergence of next-generation DNA sequencing technologies, which are characterized by reduced cost, low operational demand and the generation of a large volume of data per experiment. At the same time, the increase in computational processing power has driven the development of large-scale genetic analyses, making it possible to study individualized genomic traits that had previously been little explored or not explored at all. Among these traits, those related to structural variation in genomes have received much attention. Processed pseudogenes, or retrocopies, are structural variations caused by the duplication of coding genes through the transposition of their mature messenger RNA by the LINE-1 enzymatic machinery. Retrocopies can be fixed (i.e., present in all genomes of a given species and included in the reference genome assembly) or unfixed, being polymorphic, germline or somatic. However, knowledge about unfixed retrocopies is still limited due to the lack of bioinformatics tools dedicated to their identification and annotation in DNA sequencing data. This work therefore presents sideRETRO, a computer program specialized in detecting processed pseudogenes that are absent from the reference genome but present in whole-genome and exome sequencing data from other individuals. In addition to reporting the presence of unfixed retrocopies, sideRETRO can annotate several other characteristics of these events: the genomic coordinate of the processed pseudogene insertion, which comprises the chromosome, the insertion point and the DNA strand (leading or lagging); the genomic context of the event (exonic, intronic or intergenic); genotyping (present or absent); and haplotyping (homozygous or heterozygous). To assess the tool's performance, sideRETRO was run on simulated data and on real data experimentally validated by an independent group. In summary, this thesis describes the development and use of sideRETRO, a robust and efficient computational tool designed to identify and annotate unfixed processed pseudogenes. Finally, it is worth noting that sideRETRO fills a methodological gap and enables new hypotheses and systematic investigations in the field of structural variant calling.


Subjects
Genetic Polymorphism/genetics , Computational Biology/classification , Computational Biology/instrumentation , Costs and Cost Analysis , Genomics/instrumentation , DNA Sequence Analysis/instrumentation , Clinical Coding
3.
Database (Oxford) ; 2020, 2020 01 01.
Article in English | MEDLINE | ID: mdl-32294192

ABSTRACT

Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents related to their interest. As such, automated biomedical document classification attracts much interest. Such classification is critical in the curation of biological databases, because biocurators must scan through a vast number of articles to identify pertinent information within documents most relevant to the database. This is a slow, labor-intensive process that can benefit from effective automation. We present a document classification scheme aiming to identify papers containing information relevant to a specific topic, among a large collection of articles, for supporting the biocuration classification task. Our framework is based on a meta-classification scheme we have introduced before; here we incorporate into it features gathered from figure captions, in addition to those obtained from titles and abstracts. We trained and tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify MGI-selected papers that are relevant for GXD. The dataset consists of ~60 000 documents (5469 labeled as relevant; 52 866 as irrelevant), gathered throughout 2012-2016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, f-measure 0.738 and Matthews correlation coefficient 0.711, demonstrating that the proposed framework effectively addresses the high imbalance in the GXD classification task. Moreover, our classifier's performance is significantly improved by utilizing information from image captions compared to using titles and abstracts alone; this observation clearly demonstrates that image captions provide substantial information for supporting biomedical document classification and curation. Database URL.
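
As a quick consistency check on the figures quoted in this abstract, the F-measure is the harmonic mean of precision and recall, and the Matthews correlation coefficient (MCC) can be approximated from a confusion matrix. The sketch below is illustrative only. It reconstructs approximate counts by applying the reported precision and recall to the full class sizes, which is an assumption (the abstract does not say how the evaluation split was drawn), and all function names are hypothetical.

```python
import math


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def counts_from_rates(precision, recall, n_pos, n_neg):
    """Approximate confusion-matrix counts implied by precision and recall.

    Assumption: the rates are applied to the whole class totals, which is
    only a stand-in for the real (unstated) evaluation split.
    """
    tp = recall * n_pos          # true positives implied by recall
    fn = n_pos - tp              # relevant documents that were missed
    fp = tp / precision - tp     # false positives implied by precision
    tn = n_neg - fp              # remaining irrelevant documents
    return tp, fp, fn, tn


# Figures quoted in the abstract.
P, R = 0.698, 0.784
N_RELEVANT, N_IRRELEVANT = 5469, 52866

f1 = f_measure(P, R)
approx_mcc = mcc(*counts_from_rates(P, R, N_RELEVANT, N_IRRELEVANT))
print(f"F-measure ~ {f1:.3f}, MCC ~ {approx_mcc:.3f}")
```

Under that assumption both derived values land within about 0.001 of the reported 0.738 and 0.711, so the four quoted metrics are mutually consistent.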


Subjects
Biomedical Research/statistics & numerical data , Computational Biology/methods , Data Curation/methods , Factual Databases , Animals , Biomedical Research/classification , Biomedical Research/methods , Computational Biology/classification , Data Mining/methods , Humans , Internet
4.
IEEE Trans Neural Netw Learn Syst ; 31(8): 2857-2867, 2020 08.
Article in English | MEDLINE | ID: mdl-31170082

ABSTRACT

In the postgenome era, many problems in bioinformatics have arisen due to the generation of large amounts of imbalanced data. In particular, the computational classification of precursor microRNA (pre-miRNA) involves a high imbalance in the classes. For this task, a classifier is trained to identify RNA sequences having the highest chance of being miRNA precursors. The main issue is that well-known pre-miRNAs are usually just a few in comparison to the hundreds of thousands of candidate sequences in a genome, which results in highly imbalanced data. This imbalance has a strong influence on most standard classifiers and, if not properly addressed, the classifier is not able to work properly in a real-life scenario. This work provides a comparative assessment of recent deep neural architectures for dealing with the large imbalanced data issue in the classification of pre-miRNAs. We present and analyze recent architectures in a benchmark framework with genomes of animals and plants, with increasing imbalance ratios up to 1:2000. We also propose a new graphical way of comparing classifiers' performance in the context of high class imbalance. The comparative results obtained show that, at a very high imbalance, deep belief neural networks can provide the best performance.
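
To see why imbalance at this scale undermines accuracy-driven evaluation, consider a degenerate classifier that labels every candidate sequence as negative. The sketch below is a hypothetical illustration at the 1:2000 ratio mentioned in the abstract; the counts are made up for the example and are not taken from the paper.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    """Fraction of true positives that the classifier recovers."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical dataset at a 1:2000 positive-to-negative ratio.
n_pos = 100
n_neg = 2000 * n_pos

# An "always negative" classifier: no positives are ever predicted.
acc = accuracy(tp=0, fp=0, fn=n_pos, tn=n_neg)
rec = recall(tp=0, fn=n_pos)
print(f"accuracy = {acc:.4f}, recall = {rec:.1f}")
```

The trivial classifier scores above 99.9% accuracy while finding no pre-miRNA at all, which is why imbalance-aware measures are needed when comparing classifiers on such data.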


Subjects
Computational Biology/classification , Computational Biology/methods , Factual Databases/classification , Deep Learning/classification , Computer Neural Networks , Plants/classification , Animals , Elasticity , Humans
6.
ScientificWorldJournal ; 2014: 179105, 2014.
Article in English | MEDLINE | ID: mdl-25276846

ABSTRACT

This paper analyses the effect of the effort distribution along the software development lifecycle on the prevalence of software defects. This analysis is based on data that was collected by the International Software Benchmarking Standards Group (ISBSG) on the development of 4,106 software projects. Data mining techniques have been applied to gain a better understanding of the behaviour of the project activities and to identify a link between the effort distribution and the prevalence of software defects. This analysis has been complemented with the use of a hierarchical clustering algorithm with a dissimilarity based on the likelihood ratio statistic, for exploratory purposes. As a result, different behaviours have been identified for this collection of software development projects, allowing for the definition of risk control strategies to diminish the number and impact of the software defects. It is expected that the use of similar estimations might greatly improve the awareness of project managers on the risks at hand.


Subjects
Algorithms , Software , Cluster Analysis , Computational Biology/classification , Computational Biology/methods , Data Mining/classification , Data Mining/methods , Discriminant Analysis , Reproducibility of Results , Software Design , Software Validation
9.
BMC Bioinformatics ; 14: 350, 2013 Dec 03.
Article in English | MEDLINE | ID: mdl-24299119

ABSTRACT

BACKGROUND: Drosophila melanogaster has been established as a model organism for investigating the developmental gene interactions. The spatio-temporal gene expression patterns of Drosophila melanogaster can be visualized by in situ hybridization and documented as digital images. Automated and efficient tools for analyzing these expression images will provide biological insights into the gene functions, interactions, and networks. To facilitate pattern recognition and comparison, many web-based resources have been created to conduct comparative analysis based on the body part keywords and the associated images. With the fast accumulation of images from high-throughput techniques, manual inspection of images will impose a serious impediment on the pace of biological discovery. It is thus imperative to design an automated system for efficient image annotation and comparison. RESULTS: We present a computational framework to perform anatomical keywords annotation for Drosophila gene expression images. The spatial sparse coding approach is used to represent local patches of images in comparison with the well-known bag-of-words (BoW) method. Three pooling functions including max pooling, average pooling and Sqrt (square root of mean squared statistics) pooling are employed to transform the sparse codes to image features. Based on the constructed features, we develop both an image-level scheme and a group-level scheme to tackle the key challenges in annotating Drosophila gene expression pattern images automatically. To deal with the imbalanced data distribution inherent in image annotation tasks, the undersampling method is applied together with majority vote. Results on Drosophila embryonic expression pattern images verify the efficacy of our approach. CONCLUSION: In our experiment, the three pooling functions perform comparably well in feature dimension reduction. The undersampling with majority vote is shown to be effective in tackling the problem of imbalanced data. Moreover, combining sparse coding and image-level scheme leads to consistent performance improvement in keywords annotation.
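
The three pooling functions named in this abstract can be sketched in a few lines. This is a minimal illustration, assuming each image is represented by a list of per-patch sparse-code vectors of equal length. The data layout, the toy values and the function names are assumptions, not the authors' implementation.

```python
import math


def _columns(codes):
    """Transpose equal-length code vectors into per-dimension columns."""
    return list(zip(*codes))


def max_pooling(codes):
    """Largest coefficient per dictionary dimension across all patches."""
    return [max(col) for col in _columns(codes)]


def average_pooling(codes):
    """Mean coefficient per dictionary dimension across all patches."""
    return [sum(col) / len(col) for col in _columns(codes)]


def sqrt_pooling(codes):
    """Square root of the mean squared coefficient per dimension."""
    return [math.sqrt(sum(v * v for v in col) / len(col))
            for col in _columns(codes)]


# Toy sparse codes: 4 patches, 3 dictionary atoms (values are made up).
patch_codes = [
    [0.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
]

print(max_pooling(patch_codes))
print(average_pooling(patch_codes))
print(sqrt_pooling(patch_codes))
```

Each function collapses a variable number of patch codes into one fixed-length feature vector per image, which is what allows a standard classifier to be trained on top of the sparse codes.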


Subjects
Drosophila melanogaster/cytology , Drosophila melanogaster/genetics , Developmental Gene Expression Regulation , Insect Genome/genetics , Genetic Models , Molecular Sequence Annotation/methods , Animals , Cell Differentiation/genetics , Cell Division/genetics , Computational Biology/classification , Computational Biology/methods , Drosophila melanogaster/embryology , Gene Expression Profiling/classification , Gene Expression Profiling/methods , High-Throughput Screening Assays , Molecular Sequence Annotation/classification , Predictive Value of Tests , Support Vector Machine
10.
Hum Mutat ; 34(1): 200-9, 2013 Jan.
Article in English | MEDLINE | ID: mdl-22949379

ABSTRACT

Mismatch repair (MMR) gene sequence variants of uncertain clinical significance are often identified in suspected Lynch syndrome families, and this constitutes a challenge for both researchers and clinicians. Multifactorial likelihood model approaches provide a quantitative measure of MMR variant pathogenicity, but first require input of likelihood ratios (LRs) for different MMR variation-associated characteristics from appropriate, well-characterized reference datasets. Microsatellite instability (MSI) and somatic BRAF tumor data for unselected colorectal cancer probands of known pathogenic variant status were used to derive LRs for tumor characteristics using the Colon Cancer Family Registry (CFR) resource. These tumor LRs were combined with variant segregation within families, and estimates of prior probability of pathogenicity based on sequence conservation and position, to analyze 44 unclassified variants identified initially in Australasian Colon CFR families. In addition, in vitro splicing analyses were conducted on the subset of variants based on bioinformatic splicing predictions. The LR in favor of pathogenicity was estimated to be ~12-fold for a colorectal tumor with a BRAF mutation-negative MSI-H phenotype. For 31 of the 44 variants, the posterior probabilities of pathogenicity were such that altered clinical management would be indicated. Our findings provide a working multifactorial likelihood model for classification that carefully considers mode of ascertainment for gene testing.
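
In a multifactorial likelihood model, independent likelihood ratios multiply the prior odds to give posterior odds. The sketch below illustrates that arithmetic using the ~12-fold tumor LR quoted in this abstract. The prior probability of 0.5 and the segregation LR of 2.0 are placeholder values chosen for illustration, not figures from the study, and the function name is hypothetical.

```python
from math import prod


def posterior_probability(prior, likelihood_ratios):
    """Combine a prior probability with independent likelihood ratios.

    posterior odds = prior odds * product of the likelihood ratios
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


# Tumor evidence alone: BRAF mutation-negative MSI-H tumor, LR ~ 12.
tumor_only = posterior_probability(0.5, [12.0])

# Tumor evidence plus a hypothetical segregation LR of 2.0.
combined = posterior_probability(0.5, [12.0, 2.0])

print(f"tumor only: {tumor_only:.3f}, combined: {combined:.3f}")
```

Because the ratios multiply, each additional independent line of evidence shifts the posterior without having to re-derive the contribution of the others.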


Subjects
Colonic Neoplasms/genetics , Computational Biology/methods , DNA Mismatch Repair/genetics , Mutation , Signal Transducing Adaptor Proteins/genetics , Alternative Splicing/genetics , Computational Biology/classification , Computational Biology/statistics & numerical data , DNA Mutational Analysis/methods , DNA Mutational Analysis/statistics & numerical data , DNA-Binding Proteins/genetics , Family Health , Humans , Likelihood Functions , Microsatellite Instability , Microsatellite Repeats/genetics , MutL Protein Homolog 1 , MutS Homolog 2 Protein/genetics , Nuclear Proteins/genetics , Proto-Oncogene Proteins B-raf/genetics , Registries/classification , Registries/statistics & numerical data
11.
Hum Mutat ; 34(1): 255-65, 2013 Jan.
Article in English | MEDLINE | ID: mdl-22949387

ABSTRACT

Classification of rare missense substitutions observed during genetic testing for patient management is a considerable problem in clinical genetics. The Bayesian integrated evaluation of unclassified variants is a solution originally developed for BRCA1/2. Here, we take a step toward an analogous system for the mismatch repair (MMR) genes (MLH1, MSH2, MSH6, and PMS2) that confer colon cancer susceptibility in Lynch syndrome by calibrating in silico tools to estimate prior probabilities of pathogenicity for MMR gene missense substitutions. A qualitative five-class classification system was developed and applied to 143 MMR missense variants. This identified 74 missense substitutions suitable for calibration. These substitutions were scored using six different in silico tools (Align-Grantham Variation Grantham Deviation, multivariate analysis of protein polymorphisms [MAPP], MutPred, PolyPhen-2.1, Sorting Intolerant From Tolerant, and Xvar), using curated MMR multiple sequence alignments where possible. The output from each tool was calibrated by regression against the classifications of the 74 missense substitutions; these calibrated outputs are interpretable as prior probabilities of pathogenicity. MAPP was the most accurate tool and MAPP + PolyPhen-2.1 provided the best combined model (R² = 0.62 and area under receiver operating characteristic = 0.93). The MAPP + PolyPhen-2.1 output is sufficiently predictive to feed as a continuous variable into the quantitative Bayesian integrated evaluation for clinical classification of MMR gene missense substitutions.


Subjects
Computational Biology/methods , DNA Mismatch Repair/genetics , Genetic Predisposition to Disease/genetics , Missense Mutation , Signal Transducing Adaptor Proteins/genetics , Adenosine Triphosphatases/genetics , Bayes Theorem , Calibration , Hereditary Nonpolyposis Colorectal Neoplasms/genetics , Computational Biology/classification , Computational Biology/standards , DNA Repair Enzymes/genetics , DNA-Binding Proteins/genetics , Humans , Mismatch Repair Endonuclease PMS2 , MutL Protein Homolog 1 , MutS Homolog 2 Protein/genetics , Nuclear Proteins/genetics , Regression Analysis , Reproducibility of Results
13.
Adv Exp Med Biol ; 736: 617-43, 2012.
Article in English | MEDLINE | ID: mdl-22161356

ABSTRACT

One of the central problems of cancer systems biology is to understand the complex molecular changes of cancerous cells and tissues, and use this understanding to support the development of new targeted therapies. EPoC (Endogenous Perturbation analysis of Cancer) is a network modeling technique for tumor molecular profiles. EPoC models are constructed from combined copy number aberration (CNA) and mRNA data and aim to (1) identify genes whose copy number aberrations significantly affect target mRNA expression and (2) generate markers for long- and short-term survival of cancer patients. Models are constructed by a combination of regression and bootstrapping methods. Prognostic scores are obtained from a singular value decomposition of the networks. We have previously analyzed the performance of EPoC using glioblastoma data from The Cancer Genome Atlas (TCGA) consortium, and have shown that the resulting network models contain both known and candidate disease-relevant genes as network hubs, and also uncover predictors of patient survival. Here, we give a practical guide to performing EPoC modeling in R, and present a set of alternative modeling frameworks.


Subjects
Computational Biology/methods , Gene Regulatory Networks/genetics , Genetic Models , Neoplasms/genetics , Systems Biology/methods , Algorithms , Computational Biology/classification , Gene Dosage , Neoplastic Gene Expression Regulation , Gene Regulatory Networks/drug effects , Genetic Predisposition to Disease/genetics , Glioblastoma/drug therapy , Glioblastoma/genetics , Humans , Neoplasms/drug therapy , Prognosis , Reproducibility of Results , Survival Analysis
14.
Rev. colomb. biotecnol ; 13(2): 84-96, Dec 1, 2011. tables, graphs
Article in Spanish | LILACS | ID: lil-645170

ABSTRACT

The strain Pseudomonas fluorescens IBUN S1602 belongs to a group of isolates from Colombian sugarcane soils that accumulate polyhydroxyalkanoate (PHA). It was selected as promising for commercial scale-up because of its affinity for inexpensive alternative substrates such as glycerol, used oils and whey, among others. Given the importance of the synthase enzyme in PHA synthesis, this work carried out a molecular analysis of the genes phaC1 and phaC2, which encode the type II synthases PhaC1 and PhaC2. To obtain the amplimers required for sequencing, PCR was performed under standardized conditions with primers designed from sequences reported in databases. Two fragments of 1680 bp and 1683 bp, corresponding to phaC1 and phaC2, were identified. Comparative analysis of the protein sequences derived from these genes shows that the IBUN S1602 synthases contain the α/β hydrolase region and 8 conserved amino acid residues that are characteristic of synthases examined worldwide. The enzyme structure was analyzed at the primary level and the secondary structure was predicted. It was concluded that the synthases of Pseudomonas fluorescens IBUN S1602 show high homology with the type II synthases reported for Pseudomonas. These results contribute to the basic understanding of PHA biosynthesis, which may in future allow PHA quality to be improved by modulating the level of synthase expressed in a recombinant organism in order to vary the molecular weight of the biopolymer, an essential property in the study of industrial applications.


Subjects
Biopolymers/administration & dosage , Biopolymers/biosynthesis , Biopolymers/classification , Biopolymers/immunology , Computational Biology/classification , Computational Biology/history , Computational Biology/instrumentation , Computational Biology/trends
15.
PLoS One ; 6(10): e26146, 2011.
Article in English | MEDLINE | ID: mdl-22022543

ABSTRACT

Integrating gene regulatory networks (GRNs) into the classification process of DNA microarrays is an important issue in bioinformatics, both because this information has a true biological interest and because it helps in the interpretation of the final classifier. We present a method called graph-constrained discriminant analysis (gCDA), which aims to integrate the information contained in one or several GRNs into a classification procedure. We show that when the integrated graph includes erroneous information, gCDA's performance is only slightly worse, thus showing robustness to misspecifications in the given GRNs. The gCDA framework also allows the classification process to take into account as many a priori graphs as there are classes in the dataset. The gCDA procedure was applied to simulated data and to three publicly available microarray datasets. gCDA shows very interesting performance when compared to state-of-the-art classification methods. The software package gcda, along with the real datasets that were used in this study, are available online: http://biodev.cea.fr/gcda/.


Subjects
Computational Biology/classification , Computational Biology/methods , Discriminant Analysis , Gene Regulatory Networks/genetics , Algorithms , Computer Simulation , Genetic Databases , Neoplastic Gene Expression Regulation , Humans , Oligonucleotide Array Sequence Analysis , Software
16.
Arch Toxicol ; 85(9): 1015-33, 2011 Sep.
Article in English | MEDLINE | ID: mdl-21523460

ABSTRACT

Thanks to the confluence of genome sequencing and bioinformatics, the number of metabolic databases has expanded from a handful in the mid-1990s to several thousand today. These databases lie within distinct families that have common ancestry and common attributes. The main families are the MetaCyc, KEGG, Reactome, Model SEED, and BiGG families. We survey these database families, as well as important individual metabolic databases, including multiple human metabolic databases. The MetaCyc family is described in particular detail. It contains well over 1,000 databases, including highly curated databases for Escherichia coli, Saccharomyces cerevisiae, Mus musculus, and Arabidopsis thaliana. These databases are available through a number of web sites that offer a range of software tools for querying and visualizing metabolic networks. These web sites also provide multiple tools for analysis of gene expression and metabolomics data, including visualization of those datasets on metabolic network diagrams and over-representation analysis of gene sets and metabolite sets.


Subjects
Computational Biology , Factual Databases , Metabolic Networks and Pathways , Computational Biology/classification , Computational Biology/standards , Factual Databases/classification , Factual Databases/standards , Genetic Databases , Enzymes/metabolism , Information Storage and Retrieval/methods , Internet , Software , User-Computer Interface
17.
PLoS One ; 6(2): e17191, 2011 Feb 16.
Article in English | MEDLINE | ID: mdl-21359184

ABSTRACT

BACKGROUND: Support vector machines (SVMs) have been widely used as an accurate and reliable method to decipher brain patterns from functional MRI (fMRI) data. Previous studies have not found a clear benefit of non-linear (polynomial kernel) SVMs over linear ones. Here, a more effective non-linear SVM using a radial basis function (RBF) kernel is compared with a linear SVM. Unlike earlier studies, which focused only on evaluating different types of SVM or on voxel selection methods, we aimed to investigate the overall performance of linear and RBF SVMs for fMRI classification together with voxel selection schemes, in terms of classification accuracy and computation time. METHODOLOGY/PRINCIPAL FINDINGS: Six different voxel selection methods were employed to decide which voxels of fMRI data would be included in SVM classifiers with linear and RBF kernels in classifying four categories of objects. The overall performance of the voxel selection and classification methods was then compared. Results showed that: (1) voxel selection had an important impact on the classification accuracy of the classifiers: in a relatively low-dimensional feature space, the RBF SVM significantly outperformed the linear SVM; in a relatively high-dimensional space, the linear SVM performed better than its counterpart; (2) considering classification accuracy and computation time together, a linear SVM with relatively more voxels as features and an RBF SVM with a small set of voxels (after PCA) achieved better accuracy at lower computational cost. CONCLUSIONS/SIGNIFICANCE: The present work provides the first empirical result of linear and RBF SVMs in classification of fMRI data, combined with voxel selection methods. Based on the findings, if only classification accuracy is of concern, an RBF SVM with an appropriately small set of voxels and a linear SVM with relatively more voxels are the two suggested solutions; if computation time matters more, an RBF SVM with a relatively small set of voxels, with part of the principal components kept as features, is the better choice.
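
The two kernels compared in this abstract differ only in how they score the similarity of two feature vectors. The following is a minimal sketch, assuming voxel activations are given as plain lists of floats. The gamma value and the function names are arbitrary placeholders, not settings from the study.

```python
import math


def linear_kernel(x, y):
    """Linear kernel: the dot product of the two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Two toy "voxel" vectors (values are made up).
u = [1.0, 2.0]
v = [3.0, 4.0]

print(linear_kernel(u, v))
print(rbf_kernel(u, v))
print(rbf_kernel(u, u))
```

The linear kernel costs one dot product per pair, whereas the RBF kernel also needs an exponential and a tuned gamma, which is one reason the trade-off between accuracy and computation time depends on how many voxels are kept.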


Subjects
Algorithms , Electronic Data Processing/methods , Magnetic Resonance Imaging/methods , Automated Pattern Recognition/methods , Software , Brain Mapping/classification , Brain Mapping/methods , Brain Mapping/statistics & numerical data , Computational Biology/classification , Computational Biology/methods , Computational Biology/statistics & numerical data , Electronic Data Processing/classification , Female , Humans , Male , Nonlinear Dynamics , Automated Pattern Recognition/classification , Reproducibility of Results , Software/classification
19.
ChemMedChem ; 4(7): 1174-81, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19384901

ABSTRACT

Fragment formal concept analysis (FragFCA) for compound classification: Signature fragment combinations for compound classes with closely related biological activity were identified using FragFCA. These combinations are used to accurately classify active test compounds on the basis of fragment mapping. FragFCA can extract class-specific fragment combinations from compounds active against different target families that have signature character and practical utility in compound classification and database searching. Formal concept analysis (FCA), originally developed in information science, has been adapted to identify relationships between fragments of compounds and their biological activity. Here applications of the FragFCA approach with practical utility for medicinal chemistry are explored. Hierarchically derived fragment populations of 24 classes of compounds active against eight target families were subjected to FragFCA analysis, and fragment combinations were identified that distinguished compounds with closely related biological activity from each other. Mapping of signature fragment combinations was carried out to classify active compounds for different target families with high accuracy. The results indicate that compound-class-specific structural information and selectivity determinants are predominantly encoded by fragment combinations, rather than individual fragments. Furthermore, class-specific fragment combinations were successfully applied in similarity searching. The results demonstrate that FragFCA is capable of identifying fragment combinations that differentiate between compound sets with closely related biological activity and that can be used to predict structure-activity relationships.


Subjects
Combinatorial Chemistry Techniques/methods , Computational Biology/methods , Drug Design , Combinatorial Chemistry Techniques/classification , Computational Biology/classification , Factual Databases , Pharmaceutical Preparations/classification , Structure-Activity Relationship
20.
Brief Bioinform ; 10(5): 537-46, 2009 Sep.
Article in English | MEDLINE | ID: mdl-19346320

ABSTRACT

Recent development of high-throughput technology has accelerated interest in the development of molecular biomarker classifiers for safety assessment, disease diagnostics and prognostics, and prediction of response for patient assignment. This article reviews and evaluates some important aspects and key issues in the development of biomarker classifiers. Development of a biomarker classifier for high-throughput data involves two components: (i) model building and (ii) performance assessment. This article focuses on feature selection in model building and cross validation for performance assessment. A 'frequency' approach to feature selection is presented and compared to the 'conventional' approach in terms of the predictive accuracy and stability of the selected feature set. The two approaches are compared based on four biomarker classifiers, each with a different feature selection method and well-known classification algorithm. In each of the four classifiers the feature predictor set selected by the frequency approach is more stable than the feature set selected by the conventional approach.
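
The abstract does not spell out how the 'frequency' approach works. One plausible reading, sketched below purely as an assumption, is that feature selection is repeated on each cross-validation training split and the final predictor set keeps the features chosen in at least a given fraction of splits. The function name, the threshold and the gene labels are all hypothetical.

```python
from collections import Counter


def frequent_features(selected_per_split, min_fraction=0.5):
    """Keep features selected in at least `min_fraction` of the splits.

    `selected_per_split` holds one collection of feature names per
    cross-validation split (an assumed input format).
    """
    n_splits = len(selected_per_split)
    if n_splits == 0:
        return []
    # Count each feature at most once per split.
    counts = Counter(f for split in selected_per_split for f in set(split))
    return sorted(f for f, c in counts.items() if c / n_splits >= min_fraction)


# Made-up selections from four cross-validation splits.
splits = [
    ["g1", "g2", "g3"],
    ["g1", "g3"],
    ["g1", "g4"],
    ["g1", "g3", "g5"],
]

stable = frequent_features(splits, min_fraction=0.75)
print(stable)
```

Features that appear in only one split are dropped, which is the intuition behind the claim that a frequency-based set is more stable than one chosen in a single pass over the data.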


Subjects
Algorithms , Biomarkers , Computational Biology , Biological Models , Computational Biology/classification , Computational Biology/methods , Genetic Databases , Mathematics , Reproducibility of Results