Results 1 - 12 of 12
1.
Research (Wash D C) ; 6: 0189, 2023.
Article in English | MEDLINE | ID: mdl-37727321

ABSTRACT

Offensive language detection has received considerable attention and plays a crucial role in promoting healthy communication on social platforms, as well as in promoting the safe deployment of large language models. Training data is the basis for developing detectors; however, the available offense-related dataset in Chinese is severely limited in terms of data scale and coverage when compared to English resources. This significantly affects the accuracy of Chinese offensive language detectors in practical applications, especially when dealing with hard cases or out-of-domain samples. To alleviate the limitations posed by available datasets, we introduce AugCOLD (Augmented Chinese Offensive Language Dataset), a large-scale unsupervised dataset containing 1 million samples gathered by data crawling and model generation. Furthermore, we employ a multiteacher distillation framework to enhance detection performance with unsupervised data. That is, we build multiple teachers with publicly accessible datasets and use them to assign soft labels to AugCOLD. The soft labels serve as a bridge for knowledge to be distilled from both AugCOLD and the multiple teachers to the student network, i.e., the final offensive language detector. We conduct experiments on multiple public test sets and our well-designed hard tests, demonstrating that our proposal can effectively improve the generalization and robustness of the offensive language detector.
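
The soft-labeling step described in the abstract can be illustrated with a small sketch. The abstract does not say how the teachers' outputs are combined or which loss the student uses, so the plain averaging and the binary cross-entropy below are assumptions, and the names `soft_label` and `distillation_loss` are hypothetical.

```python
import math
from statistics import mean


def soft_label(teacher_probs):
    """Combine the offensive-class probabilities from several teachers.

    Assumption: a plain average; the abstract does not specify the rule.
    """
    if not teacher_probs:
        raise ValueError("at least one teacher prediction is required")
    return mean(teacher_probs)


def distillation_loss(student_prob, soft_target, eps=1e-12):
    """Binary cross-entropy between the student's probability and a soft label."""
    # Clamp to avoid log(0) for extreme student predictions.
    p = min(max(student_prob, eps), 1.0 - eps)
    return -(soft_target * math.log(p) + (1.0 - soft_target) * math.log(1.0 - p))


# Three hypothetical teachers score one unlabeled sample.
target = soft_label([0.9, 0.7, 0.8])
loss_close = distillation_loss(0.8, target)  # student agrees with the teachers
loss_far = distillation_loss(0.5, target)    # student is undecided
```

With a soft target, the loss is smallest when the student's probability matches the teachers' consensus rather than a hard 0 or 1, which is how the unlabeled samples can still supply a training signal.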

2.
Front Digit Health ; 5: 1133987, 2023.
Article in English | MEDLINE | ID: mdl-37214342

ABSTRACT

Introduction: The growing demand for mental health support has highlighted the importance of conversational agents as human supporters worldwide and in China. These agents could increase availability and reduce the relative costs of mental health support. The provided support can be divided into two main types: cognitive and emotional. Existing work on this topic mainly focuses on constructing agents that adopt Cognitive Behavioral Therapy (CBT) principles. Such agents operate based on pre-defined templates and exercises to provide cognitive support. However, research on emotional support using such agents is limited. In addition, most of the constructed agents operate in English, highlighting the importance of conducting such studies in China. To this end, we introduce Emohaa, a conversational agent that provides cognitive support through CBT-Bot exercises and guided conversations. It also emotionally supports users through ES-Bot, enabling them to vent their emotional problems. In this study, we analyze the effectiveness of Emohaa in reducing symptoms of mental distress. Methods and Results: Following the RCT design, the current study randomly assigned participants into three groups: Emohaa (CBT-Bot), Emohaa (Full), and control. With both Intention-To-Treat (N=247) and Per-Protocol (N=134) analyses, the results demonstrated that, compared to the control group, participants who used either version of Emohaa experienced significantly greater improvements in symptoms of mental distress, including depression (F[2,244]=6.26, p=0.002), negative affect (F[2,244]=6.09, p=0.003), and insomnia (F[2,244]=3.69, p=0.026). Discussion: Based on the obtained results and participants' satisfaction with the platform, we concluded that Emohaa is a practical and effective tool for reducing mental distress.

3.
BMC Bioinformatics ; 12 Suppl 8: S2, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151901

ABSTRACT

BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. 
By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
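
The improvement percentages quoted in the RESULTS can be reproduced from the reported TAP-k scores. The sketch below assumes they are relative improvements over the best single-team score at each k, which is consistent with the figures given; the function name is illustrative.

```python
def relative_improvement(baseline, improved):
    """Percentage gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0


# TAP-k scores on the gold standard, as reported in the abstract.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}

improvements = {
    k: round(relative_improvement(best_team[k], composite[k]), 1)
    for k in best_team
}
# improvements == {5: 12.4, 10: 21.8, 20: 26.6}
```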


Subjects
Algorithms , Data Mining/methods , Genes , Animals , Data Mining/standards , Humans , National Library of Medicine (U.S.) , Periodicals as Topic , United States
4.
J Am Med Inform Assoc ; 18(5): 660-7, 2011.
Article in English | MEDLINE | ID: mdl-21613640

ABSTRACT

BACKGROUND: Due to the high cost of manual curation of key aspects from the scientific literature, automated methods for assisting this process are greatly desired. Here, we report a novel approach to facilitate MeSH indexing, a challenging task of assigning MeSH terms to MEDLINE citations for their archiving and retrieval. METHODS: Unlike previous methods for automatic MeSH term assignment, we reformulate the indexing task as a ranking problem such that relevant MeSH headings are ranked higher than irrelevant ones. Specifically, for each document we retrieve 20 neighbor documents, obtain a list of MeSH main headings from the neighbors, and rank the MeSH main headings using ListNet, a learning-to-rank algorithm. We trained our algorithm on 200 documents and tested it on a previously used benchmark set of 200 documents and a larger dataset of 1000 documents. RESULTS: Tested on the benchmark dataset, our method achieved a precision of 0.390, recall of 0.712, and mean average precision (MAP) of 0.626. In comparison to the state of the art, we observe statistically significant improvements as large as 39% in MAP (p-value <0.001). Similar significant improvements were also obtained on the larger document set. CONCLUSION: Experimental results show that our approach makes the most accurate MeSH predictions to date, which suggests its great potential in making a practical impact on MeSH indexing. Furthermore, as discussed, the proposed learning framework is robust and can be adapted to many other similar tasks beyond MeSH indexing in the biomedical domain. All data sets are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/indexing.
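
The candidate-generation step in METHODS (collect main headings from neighbor documents, then rank them) can be sketched as follows. The paper ranks with ListNet; the similarity-weighted vote used here is a much simpler stand-in chosen only to keep the example self-contained, and the function name and input shape are hypothetical.

```python
from collections import defaultdict


def rank_candidate_headings(neighbors, top_n=None):
    """Rank MeSH headings gathered from neighbor documents.

    neighbors: list of (similarity, headings) pairs, one per neighbor document.
    Stand-in scoring (not ListNet): sum the similarity of every neighbor
    that carries the heading.
    """
    scores = defaultdict(float)
    for similarity, headings in neighbors:
        for heading in set(headings):  # count each heading once per neighbor
            scores[heading] += similarity
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked if top_n is None else ranked[:top_n]


# Three hypothetical neighbors with made-up similarity scores.
example_neighbors = [
    (0.9, ["Humans", "Algorithms"]),
    (0.5, ["Humans", "Semantics"]),
    (0.3, ["Semantics"]),
]
ranked_names = [name for name, _ in rank_candidate_headings(example_neighbors)]
```

A learned ranker such as ListNet would replace the vote with a scoring function trained on features of each (document, heading) pair, but the surrounding pipeline keeps the same shape.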


Subjects
Abstracting and Indexing/methods , Medical Subject Headings , Natural Language Processing , PubMed , Algorithms , Artificial Intelligence , Automation , Humans , Semantics , United States
5.
Bioinformatics ; 27(7): 1032-3, 2011 Apr 01.
Article in English | MEDLINE | ID: mdl-21303863

ABSTRACT

MOTIVATION: Linking gene mentions in an article to entries of biological databases can greatly facilitate indexing and querying biological literature. Due to the high ambiguity of gene names, this task is particularly challenging. Manual annotation for this task is costly, time-consuming and labor-intensive. Therefore, providing assistive tools to facilitate the task is of high value. RESULTS: We developed GeneTUKit, a document-level gene normalization software for full-text articles. This software employs both local context surrounding gene mentions and global context from the whole full-text document. It can normalize genes of different species simultaneously. When participating in BioCreAtIvE III, the system obtained good results among 37 runs: the system was ranked first, fourth and seventh in terms of TAP-20, TAP-10 and TAP-5, respectively, on the 507 full-text test articles. AVAILABILITY AND IMPLEMENTATION: The software is available at http://www.qanswers.net/GeneTUKit/.


Subjects
Genes , Software , Data Mining , Reference Values
6.
BMC Bioinformatics ; 10 Suppl 1: S55, 2009 Jan 30.
Article in English | MEDLINE | ID: mdl-19208158

ABSTRACT

BACKGROUND: Considerable efforts have been made to extract protein-protein interactions from the biological literature, but little work has been done on the extraction of interaction detection methods. It is crucial to annotate the detection methods in the literature, since different detection methods lend different degrees of reliability to the reported interactions. However, the diversity of method mentions in the literature makes automatic extraction quite challenging. RESULTS: In this article, we develop a generative topic model, the Correlated Method-Word model (CMW model), to extract the detection methods from the literature. In the CMW model, we formulate the correlation between the different methods and related words in a probabilistic framework in order to infer the potential methods from the given document. By applying the model to a corpus of 5319 full-text documents annotated by the MINT and IntAct databases, we observe promising results, which outperform the best result reported in the BioCreative II challenge evaluation. CONCLUSION: From the promising experimental results, we can see that the CMW model overcomes the issues caused by the diversity in the method mentions and properly captures the in-depth correlations between the detection methods and related words. The improvement over the baseline methods confirms that the dependence assumptions of the model are reasonable and that the model is suitable for practical processing.


Subjects
Computational Biology/methods , Information Storage and Retrieval/methods , Protein Interaction Mapping , Pattern Recognition, Automated
7.
Article in English | MEDLINE | ID: mdl-21243102

ABSTRACT

This paper presents a novel method of generating extractive summaries for multiple documents. Given a cluster of documents, we first construct a graph where each vertex represents a sentence and edges are created according to the asymmetric relationship between sentences. Then we develop a method to measure the importance of a subset of vertices by adding a super-vertex into the original graph. The importance of such a super-vertex is quantified as super-centrality, a quantitative measure for the importance of a subset of vertices within the whole graph. Finally, we propose a heuristic algorithm to find the best summary. Our method is evaluated with extensive experiments. The comparative results show that the proposed method outperforms other methods on several datasets.

8.
Genome Biol ; 9 Suppl 2: S12, 2008.
Article in English | MEDLINE | ID: mdl-18834490

ABSTRACT

BACKGROUND: Deciphering physical protein-protein interactions is fundamental to elucidating both the functions of proteins and biological processes. The development of high-throughput experimental technologies such as the yeast two-hybrid screening has produced an explosion in data relating to interactions. Since manual curation is intensive in terms of time and cost, there is an urgent need for text-mining tools to facilitate the extraction of such information. The BioCreative (Critical Assessment of Information Extraction systems in Biology) challenge evaluation provided common standards and shared evaluation criteria to enable comparisons among different approaches. RESULTS: During the benchmark evaluation of BioCreative 2006, all of our results ranked in the top three places. In the task of filtering articles irrelevant to physical protein interactions, our method contributes a precision of 75.07%, a recall of 81.07%, and an AUC (area under the receiver operating characteristic curve) of 0.847. In the task of identifying protein mentions and normalizing mentions to molecule identifiers, our method is competitive among runs submitted, with a precision of 34.83%, a recall of 24.10%, and an F1 score of 28.5%. In extracting protein interaction pairs, our profile-based method was competitive on the SwissProt-only subset (precision = 36.95%, recall = 32.68%, and F1 score = 30.40%) and on the entire dataset (30.96%, 29.35%, and 26.20%, respectively). From the biologist's point of view, however, these findings are far from satisfactory. The error analysis presented in this report provides insight into how performance could be improved: three-quarters of false negatives were due to protein normalization problems (532/698), and about one-quarter were due to problems with correctly extracting interactions for this system. CONCLUSION: We present a text-mining framework to extract physical protein-protein interactions from the literature. 
Three key issues are addressed, namely filtering irrelevant articles, identifying protein names and normalizing them to molecule identifiers, and extracting protein-protein interactions. Our system is among the top three performers in the benchmark evaluation of BioCreative 2006. The tool will be helpful for manual interaction curation and can greatly facilitate the process of extracting protein-protein interactions.
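
As a worked check of the protein normalization figures in RESULTS, the F1 score is the harmonic mean of precision and recall. The sketch below applies that standard formula to the reported precision (34.83%) and recall (24.10%); the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (any consistent scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Protein mention normalization, values in percent as reported.
normalization_f1 = round(f1_score(34.83, 24.10), 1)
# normalization_f1 == 28.5
```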


Subjects
Databases, Bibliographic , Protein Interaction Mapping , Databases, Protein
9.
BMC Bioinformatics ; 9 Suppl 3: S4, 2008 Apr 11.
Article in English | MEDLINE | ID: mdl-18426549

ABSTRACT

BACKGROUND: Efficient features play an important role in automated text classification, which facilitates access to large-scale data. In the bioscience field, biological structures and terminologies are described by a large number of features; domain-dependent features would significantly improve the classification performance. How to effectively select and integrate different types of features to improve biological literature classification performance is the major issue studied in this paper. RESULTS: To efficiently classify the biological literature, we propose a novel feature value schema TF*ML, features ranging from the lower-level domain-independent "string feature" to the higher-level domain-dependent "semantic template feature", and proper integrations among the features. Compared to our previous approaches, the performance is improved in terms of AUC and F-Score by 11.5% and 8.8% respectively, and outperforms the best performance achieved in BioCreAtIvE 2006. CONCLUSIONS: Different types of features possess different discriminative capabilities in literature classification; proper integration of domain-independent and domain-dependent features would significantly improve the performance and overcome the over-fitting on data distribution.


Subjects
Algorithms , Artificial Intelligence , Dictionaries as Topic , Natural Language Processing , Pattern Recognition, Automated/methods , Periodicals as Topic , Terminology as Topic , Vocabulary, Controlled
10.
Int J Med Inform ; 75(6): 443-55, 2006 Jun.
Article in English | MEDLINE | ID: mdl-16095962

ABSTRACT

PURPOSE: Over recent years, there has been a growing interest in extracting entities and relations from biomedical literature. A vast number of systems and approaches have been proposed to extract biological relations, but none of them achieves satisfactory results. These methodologies are either parsing-based or pattern-based, and are either not competent to handle the grammatical complexities of biomedical texts or too complicated to be adapted. It is well known that appositive and coordinative propositions and similar grammatical structures are extremely common in biomedical texts, particularly in full texts. However, these problems remain unaddressed by most researchers. METHODS: In this paper, we propose a new approach, a hybrid of shallow parsing and pattern matching, to extract relations between proteins from scientific papers on biomedical themes. In the method, appositive and coordinative structures are interpreted based on the shallow parsing analysis, with both syntactic and semantic constraints. Then long sentences are split into sub-sentences, from which relations are extracted by a greedy pattern matching algorithm, along with automatically generated patterns. RESULTS: Our approach was applied to extract protein-protein interactions from full biomedical texts, and achieved an average F-score of 80% on individual verbs, and 66% on all verbs. With the help of shallow parsing analysis, pattern matching is improved remarkably. Compared with the traditional pattern matching algorithm, our approach achieves about 7% improvement in both precision and F-score. In contrast to other systems, our approach achieves performance comparable to the best. A demo system is available at http://spies.cs.tsinghua.edu.cn.


Subjects
Abstracting and Indexing/methods , Biology , Medicine , Natural Language Processing , Periodicals as Topic , Terminology as Topic , Vocabulary, Controlled , Artificial Intelligence , Database Management Systems , Databases, Bibliographic , Information Storage and Retrieval/methods , Linguistics , Semantics
11.
Bioinformatics ; 21(15): 3294-300, 2005 Aug 01.
Article in English | MEDLINE | ID: mdl-15890744

ABSTRACT

MOTIVATION: An enormous number of protein-protein interaction relationships are buried in millions of research articles published over the years, and the number is growing. Rediscovering them automatically is a challenging bioinformatics task. Solutions to this problem also reach far beyond bioinformatics. RESULTS: We study a new approach that involves automatically discovering English expression patterns, optimizing them and using them to extract protein-protein interactions. In a sister paper, we described how to generate English expression patterns related to protein-protein interactions, and this approach alone has already achieved precision and recall rates significantly higher than those of other automatic systems. This paper continues to present our theory, focusing on how to improve the patterns. A minimum description length (MDL)-based pattern-optimization algorithm is designed to reduce and merge patterns. This has significantly increased generalization power, and hence the recall and precision rates, as confirmed by our experiments. AVAILABILITY: http://spies.cs.tsinghua.edu.cn.


Subjects
Artificial Intelligence , Database Management Systems , Information Storage and Retrieval/methods , MEDLINE , Natural Language Processing , Pattern Recognition, Automated/methods , Periodicals as Topic , Protein Interaction Mapping/methods , Abstracting and Indexing/methods , Algorithms , Software , User-Computer Interface , Vocabulary, Controlled
12.
Bioinformatics ; 20(18): 3604-12, 2004 Dec 12.
Article in English | MEDLINE | ID: mdl-15284092

ABSTRACT

MOTIVATION: Although there are several databases storing protein-protein interactions, most such data still exist only in the scientific literature. They are scattered in scientific literature written in natural languages, defying data mining efforts. Much time and labor have to be spent on extracting protein pathways from literature. Our aim is to develop a robust and powerful methodology to mine protein-protein interactions from biomedical texts. RESULTS: We present a novel and robust approach for extracting protein-protein interactions from literature. Our method uses a dynamic programming algorithm to compute distinguishing patterns by aligning relevant sentences and key verbs that describe protein interactions. A matching algorithm is designed to extract the interactions between proteins. Equipped only with a dictionary of protein names, our system achieves a recall rate of 80.0% and precision rate of 80.5%. AVAILABILITY: The program is available on request from the authors.
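
The idea of deriving a pattern by aligning sentences with dynamic programming can be illustrated with a longest-common-subsequence alignment over tokens. The abstract does not describe the scoring scheme or how key verbs are weighted, so this is only a simplified stand-in; the function name and the `PROTEIN` placeholder token are assumptions made for the example.

```python
def align_pattern(tokens_a, tokens_b):
    """Return the longest common subsequence of two token lists.

    Simplified stand-in for sentence alignment: shared tokens, in order,
    form a candidate extraction pattern.
    """
    n, m = len(tokens_a), len(tokens_b)
    # table[i][j] = LCS length of tokens_a[i:] and tokens_b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if tokens_a[i] == tokens_b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to recover one optimal alignment.
    pattern = []
    i = j = 0
    while i < n and j < m:
        if tokens_a[i] == tokens_b[j]:
            pattern.append(tokens_a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pattern


# Protein names are assumed to be replaced by a placeholder beforehand.
sentence_a = "PROTEIN strongly interacts with PROTEIN".split()
sentence_b = "PROTEIN interacts directly with PROTEIN in yeast".split()
shared_pattern = align_pattern(sentence_a, sentence_b)
```

Here the two sentences share the skeleton `PROTEIN interacts with PROTEIN`, which is the kind of recurring structure a pattern-based extractor could then match against new sentences.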


Subjects
Artificial Intelligence , Natural Language Processing , Pattern Recognition, Automated/methods , Periodicals as Topic , Protein Interaction Mapping/methods , Proteins/chemistry , Proteins/metabolism , Algorithms , Database Management Systems , Databases, Bibliographic , Information Storage and Retrieval/methods , Software , Vocabulary, Controlled