Pesquisa | Portal Regional da BVS

Optimal Thresholding of Classifiers to Maximize F1 Measure.

Lipton, Zachary C; Elkan, Charles; Naryanaswamy, Balakrishnan.

Mach Learn Knowl Discov Databases ; 8725: 225-239, 2014.

Artigo em Inglês | MEDLINE | ID: mdl-26023687

RESUMO

This paper provides new insight into maximizing F1 measures in the context of binary classification and also in the context of multilabel classification. The harmonic mean of precision and recall, the F1 measure is widely used to evaluate the success of a binary classifier when one class is rare. Micro average, macro average, and per instance average F1 measures are used in multilabel classification. For any classifier that produces a real-valued output, we derive the relationship between the best achievable F1 value and the decision-making threshold that achieves this optimum. As a special case, if the classifier outputs are well-calibrated conditional probabilities, then the optimal threshold is half the optimal F1 value. As another special case, if the classifier is completely uninformative, then the optimal behavior is to classify all examples as positive. When the actual prevalence of positive examples is low, this behavior can be undesirable. As a case study, we discuss the results, which can be surprising, of maximizing F1 when predicting 26,853 labels for Medline documents.

Differential privacy based on importance weighting.

Ji, Zhanglong; Elkan, Charles.

Mach Learn ; 93(1): 163-183, 2013 Oct.

Artigo em Inglês | MEDLINE | ID: mdl-24482559

RESUMO

This paper analyzes a novel method for publishing data while still protecting privacy. The method is based on computing weights that make an existing dataset, for which there are no confidentiality issues, analogous to the dataset that must be kept private. The existing dataset may be genuine but public already, or it may be synthetic. The weights are importance sampling weights, but to protect privacy, they are regularized and have noise added. The weights allow statistical queries to be answered approximately while provably guaranteeing differential privacy. We derive an expression for the asymptotic variance of the approximate answers. Experiments show that the new mechanism performs well even when the privacy budget is small, and when the public and private datasets are drawn from different populations.

Inhibition in multiclass classification.

Huerta, Ramón; Vembu, Shankar; Amigó, José M; Nowotny, Thomas; Elkan, Charles.

Neural Comput ; 24(9): 2473-507, 2012 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-22594829

RESUMO

The role of inhibition is investigated in a multiclass support vector machine formalism inspired by the brain structure of insects. The so-called mushroom bodies have a set of output neurons, or classification functions, that compete with each other to encode a particular input. Strongly active output neurons depress or inhibit the remaining outputs without knowing which is correct or incorrect. Accordingly, we propose to use a classification function that embodies unselective inhibition and train it in the large margin classifier framework. Inhibition leads to more robust classifiers in the sense that they perform better on larger areas of appropriate hyperparameters when assessed with leave-one-out strategies. We also show that the classifier with inhibition is a tight bound to probabilistic exponential models and is Bayes consistent for 3-class problems. These properties make this approach useful for data sets with a limited number of labeled examples. For larger data sets, there is no significant comparative advantage to other multiclass SVM approaches.

Assuntos

Algoritmos , Inteligência Artificial , Inibição Psicológica , Reconhecimento Automatizado de Padrão , Animais , Encéfalo/anatomia & histologia , Classificação/métodos , Humanos , Modelos Estatísticos , Processos Estocásticos , Sinapses/fisiologia

Predicting accurate probabilities with a ranking loss.

Menon, Aditya Krishna; Jiang, Xiaoqian J; Vembu, Shankar; Elkan, Charles; Ohno-Machado, Lucila.

Proc Int Conf Mach Learn ; 2012: 703-710, 2012.

Artigo em Inglês | MEDLINE | ID: mdl-25285328

RESUMO

In many real-world applications of machine learning classifiers, it is essential to predict the probability of an example belonging to a particular class. This paper proposes a simple technique for predicting probabilities based on optimizing a ranking loss, followed by isotonic regression. This semi-parametric technique offers both good ranking and regression performance, and models a richer set of probability distributions than statistical workhorses such as logistic regression. We provide experimental results that show the effectiveness of this technique on real-world applications of probability prediction.

The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text.

Krallinger, Martin; Vazquez, Miguel; Leitner, Florian; Salgado, David; Chatr-Aryamontri, Andrew; Winter, Andrew; Perfetto, Livia; Briganti, Leonardo; Licata, Luana; Iannuccelli, Marta; Castagnoli, Luisa; Cesareni, Gianni; Tyers, Mike; Schneider, Gerold; Rinaldi, Fabio; Leaman, Robert; Gonzalez, Graciela; Matos, Sergio; Kim, Sun; Wilbur, W John; Rocha, Luis; Shatkay, Hagit; Tendulkar, Ashish V; Agarwal, Shashank; Liu, Feifan; Wang, Xinglong; Rak, Rafal; Noto, Keith; Elkan, Charles; Lu, Zhiyong; Dogan, Rezarta Islamaj; Fontaine, Jean-Fred; Andrade-Navarro, Miguel A; Valencia, Alfonso.

BMC Bioinformatics ; 12 Suppl 8: S3, 2011 Oct 03.

Artigo em Inglês | MEDLINE | ID: mdl-22151929

RESUMO

BACKGROUND: Determining usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretation or measuring time savings by using automated systems. Detecting articles describing complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. Therefore the BCIII-ACT corpus was provided, which includes a training, development and test set of over 12,000 PPI relevant and non-relevant PubMed abstracts labeled manually by domain experts and recording also the human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full text articles and interaction detection method ontology concepts that had been applied to detect the PPIs reported in them. RESULTS: A total of 11 teams participated in at least one of the two PPI tasks (10 in ACT and 8 in the IMT) and a total of 62 persons were involved either as participants or in preparing data sets/evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. From the 52 runs submitted for the ACT, the highest Matthew's Correlation Coefficient (MCC) score measured was 0.55 at an accuracy of 89% and the best AUC iP/R was 68%. Most ACT teams explored machine learning methods, some of them also used lexical resources like MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, some integrated NER approaches. For the IMT, a total of 42 runs were evaluated by comparing systems against manually generated annotations done by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53%, the best MCC score 0.55. In case of competitive systems with an acceptable recall (above 35%) the macro-averaged precision ranged between 50% and 80%, with a maximum F-Score of 55%. CONCLUSIONS: The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time when compared to unranked results. Detecting associations between full text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.

Assuntos

Algoritmos , Mineração de Dados , Proteínas/metabolismo , Animais , Bases de Dados de Proteínas , Humanos , Publicações Periódicas como Assunto , PubMed

Identifying relevant data for a biological database: handcrafted rules versus machine learning.

Sehgal, Aditya Kumar; Das, Sanmay; Noto, Keith; Saier, Milton H; Elkan, Charles.

IEEE/ACM Trans Comput Biol Bioinform ; 8(3): 851-7, 2011.

Artigo em Inglês | MEDLINE | ID: mdl-21393656

RESUMO

With well over 1,000 specialized biological databases in use today, the task of automatically identifying novel, relevant data for such databases is increasingly important. In this paper, we describe practical machine learning approaches for identifying MEDLINE documents and Swiss-Prot/TrEMBL protein records, for incorporation into a specialized biological database of transport proteins named TCDB. We show that both learning approaches outperform rules created by hand by a human expert. As one of the first case studies involving two different approaches to updating a deployed database, both the methods compared and the results will be of interest to curators of many specialized databases.

Assuntos

Algoritmos , Inteligência Artificial , Mineração de Dados/métodos , Bases de Dados Genéticas , Genômica/métodos , Proteínas de Transporte , Análise por Conglomerados , Humanos , MEDLINE , Proteínas/classificação , Proteínas/genética

Learning gene regulatory networks from only positive and unlabeled data.

Cerulo, Luigi; Elkan, Charles; Ceccarelli, Michele.

BMC Bioinformatics ; 11: 228, 2010 May 05.

Artigo em Inglês | MEDLINE | ID: mdl-20444264

RESUMO

BACKGROUND: Recently, supervised learning methods have been exploited to reconstruct gene regulatory networks from gene expression data. The reconstruction of a network is modeled as a binary classification problem for each pair of genes. A statistical classifier is trained to recognize the relationships between the activation profiles of gene pairs. This approach has been proven to outperform previous unsupervised methods. However, the supervised approach raises open questions. In particular, although known regulatory connections can safely be assumed to be positive training examples, obtaining negative examples is not straightforward, because definite knowledge is typically not available that a given pair of genes do not interact. RESULTS: A recent advance in research on data mining is a method capable of learning a classifier from only positive and unlabeled examples, that does not need labeled negative examples. Applied to the reconstruction of gene regulatory networks, we show that this method significantly outperforms the current state of the art of machine learning methods. We assess the new method using both simulated and experimental data, and obtain major performance improvement. CONCLUSIONS: Compared to unsupervised methods for gene network inference, supervised methods are potentially more accurate, but for training they need a complete set of known regulatory connections. A supervised method that can be trained using only positive and unlabeled data, as presented in this paper, is especially beneficial for the task of inferring gene regulatory networks, because only an incomplete set of known regulatory connections is available in public databases such as RegulonDB, TRRD, KEGG, Transfac, and IPA.

Assuntos

Perfilação da Expressão Gênica/métodos , Redes Reguladoras de Genes/genética , Inteligência Artificial , Mineração de Dados , Bases de Dados Genéticas

The Transporter Classification Database: recent advances.

Saier, Milton H; Yen, Ming Ren; Noto, Keith; Tamang, Dorjee G; Elkan, Charles.

Nucleic Acids Res ; 37(Database issue): D274-8, 2009 Jan.

Artigo em Inglês | MEDLINE | ID: mdl-19022853

RESUMO

The Transporter Classification Database (TCDB), freely accessible at http://www.tcdb.org, is a relational database containing sequence, structural, functional and evolutionary information about transport systems from a variety of living organisms, based on the International Union of Biochemistry and Molecular Biology-approved transporter classification (TC) system. It is a curated repository for factual information compiled largely from published references. It uses a functional/phylogenetic system of classification, and currently encompasses about 5000 representative transporters and putative transporters in more than 500 families. We here describe novel software designed to support and extend the usefulness of TCDB. Our recent efforts render it more user friendly, incorporate machine learning to input novel data in a semiautomatic fashion, and allow analyses that are more accurate and less time consuming. The availability of these tools has resulted in recognition of distant phylogenetic relationships and tremendous expansion of the information available to TCDB users.

Assuntos

Bases de Dados de Proteínas , Proteínas de Membrana Transportadoras/classificação , Inteligência Artificial , Proteínas de Membrana Transportadoras/química , Proteínas de Membrana Transportadoras/genética , Filogenia , Homologia de Sequência de Aminoácidos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA