Results 1 - 6 of 6
1.
JMIR Med Inform ; 12: e49607, 2024 Apr 04.
Article in English | MEDLINE | ID: mdl-38596859

ABSTRACT

Background: Biomedical natural language processing tasks are best performed with English models, and translation tools have undergone major improvements. On the other hand, building annotated biomedical data sets remains a challenge. Objective: The aim of our study is to determine whether the use of English tools to extract and normalize French medical concepts based on translations provides comparable performance to that of French models trained on a set of annotated French clinical notes. Methods: We compared 2 methods: 1 involving French-language models and 1 involving English-language models. For the native French method, the named entity recognition and normalization steps were performed separately. For the translated English method, after the first translation step, we compared a 2-step method and a terminology-oriented method that performs extraction and normalization at the same time. We used French, English, and bilingual annotated data sets to evaluate all stages (named entity recognition, normalization, and translation) of our algorithms. Results: The native French method outperformed the translated English method, with an overall F1-score of 0.51 (95% CI 0.47-0.55), compared with 0.39 (95% CI 0.34-0.44) and 0.38 (95% CI 0.36-0.40) for the 2 English methods tested. Conclusions: Despite recent improvements in translation models, there is a significant difference in performance between the 2 approaches in favor of the native French method, which is more effective on French medical texts, even with few annotated documents.
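The F1-scores above are reported with 95% confidence intervals. The abstract does not say how those intervals were computed; the sketch below assumes a percentile bootstrap over documents, which is one common choice. The function names and the per-document `(tp, fp, fn)` layout are hypothetical and used only for illustration.

```python
import random

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_f1_ci(doc_counts, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a pooled F1 (assumed method, not from the paper).

    doc_counts: list of (tp, fp, fn) tuples, one per evaluated document.
    """
    rng = random.Random(seed)
    n = len(doc_counts)
    scores = []
    for _ in range(n_resamples):
        # Resample documents with replacement, then pool their counts.
        sample = [doc_counts[rng.randrange(n)] for _ in range(n)]
        tp = sum(c[0] for c in sample)
        fp = sum(c[1] for c in sample)
        fn = sum(c[2] for c in sample)
        scores.append(f1_from_counts(tp, fp, fn))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Resampling whole documents rather than individual entities keeps mentions from the same note together, which tends to give more honest intervals when errors cluster within a note.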

2.
JMIR Med Inform ; 2024 Jan 10.
Article in English | MEDLINE | ID: mdl-38427586

ABSTRACT

BACKGROUND: Biomedical natural language processing tasks are best performed with English models, and translation tools have undergone major improvements. On the other hand, building annotated biomedical datasets remains a challenge. OBJECTIVE: The aim of our study is to determine whether the use of English tools to extract and normalize French medical concepts based on translations provides comparable performance to that of French models trained on a set of annotated French clinical notes. METHODS: We compare two methods: one involving French-language models and one involving English-language models. For the native French method, the Named Entity Recognition (NER) and normalization steps are performed separately. For the translated English method, after the first translation step, we compare a two-step method and a terminology-oriented method that performs extraction and normalization at the same time. We used French, English and bilingual annotated datasets to evaluate all stages (NER, normalization and translation) of our algorithms. RESULTS: The native French method outperformed the translated English method, with an overall F1-score of 0.51 [0.47;0.55], compared with 0.39 [0.34;0.44] and 0.38 [0.36;0.40] for the two English methods tested. CONCLUSIONS: Despite recent improvements in translation models, there is a significant difference in performance between the two approaches in favor of the native French method, which is more effective on French medical texts, even with few annotated documents.

3.
Methods Inf Med ; 2024 Mar 05.
Article in English | MEDLINE | ID: mdl-38442906

ABSTRACT

OBJECTIVE: The objective of this study is to address the critical issue of deidentification of clinical reports to allow access to data for research purposes, while ensuring patient privacy. The study highlights the difficulties faced in sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP, for Assistance Publique-Hôpitaux de Paris) in implementing a systematic pseudonymization of text documents from its Clinical Data Warehouse. METHODS: We annotated a corpus of clinical documents according to 12 types of identifying entities and built a hybrid system that merges the results of a deep learning model with those of manual rules. RESULTS AND DISCUSSION: Our results show an overall F1-score of 0.99. We discuss implementation choices and present experiments to better understand the effort involved in such a task, including dataset size, document types, language models, and rule addition. We share guidelines and code under a 3-Clause BSD license.
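The hybrid design described above (model predictions merged with manual rules) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two regular expressions, the labels, the longest-span-wins merge policy, and the `<LABEL>` placeholders are hypothetical and are not taken from the AP-HP system or its released code.

```python
import re

# Hypothetical manual rules: (label, compiled pattern). Illustrative only.
RULES = [
    ("DATE", re.compile(r"\b\d{2}/\d{2}/\d{4}\b")),
    ("PHONE", re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b")),
]

def rule_spans(text):
    """Return (start, end, label) spans found by the manual rules."""
    return [(m.start(), m.end(), label)
            for label, pattern in RULES
            for m in pattern.finditer(text)]

def merge_spans(model_spans, rule_based):
    """Union of both sources; on overlap the longer span wins (assumed policy)."""
    kept = []
    candidates = set(model_spans) | set(rule_based)
    # Sort longest first so that longer spans claim their region before shorter ones.
    for span in sorted(candidates, key=lambda s: (s[0] - s[1], s[0])):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)

def pseudonymize(text, spans):
    """Replace each span with a <LABEL> placeholder, working right to left."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text
```

In use, `model_spans` would come from whatever tagger is available; a hand-written list is enough to try the merge step. Replacing from right to left keeps the earlier character offsets valid while the text is being edited.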

4.
Artif Intell Med ; 128: 102311, 2022 06.
Article in English | MEDLINE | ID: mdl-35534148

ABSTRACT

BACKGROUND: The development of electronic health records has provided a large volume of unstructured biomedical information. Extracting patient characteristics from these data has become a major challenge, especially in languages other than English. METHODS: Inspired by the French Text Mining Challenge (DEFT 2021) [1] in which we participated, our study proposes a multilabel classification of clinical narratives, allowing us to automatically extract the main features of a patient report. Our system is an end-to-end pipeline from raw text to labels with two main steps: named entity recognition and multilabel classification. Both steps rely on a transformer-based neural network architecture. To train our final classifier, we extended the dataset with all English and French Unified Medical Language System (UMLS) vocabularies related to human diseases. We focus our study on the multilingualism of training resources and models, with experiments combining French and English in different ways (multilingual embeddings or translation). RESULTS: We obtained an overall average micro-F1 score of 0.811 for the multilingual version, 0.807 for the French-only version and 0.797 for the translated version. CONCLUSION: Our study proposes an original multilabel classification of French clinical notes for patient phenotyping. We show that a multilingual algorithm trained on annotated real clinical notes and UMLS vocabularies leads to the best results.
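For readers unfamiliar with the metric reported above, micro-F1 in a multilabel setting pools true positives, false positives, and false negatives over all documents and labels before computing F1. The sketch below shows that computation on label sets; the function name and the example labels are hypothetical, and the paper's exact evaluation script may differ.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 for multilabel predictions.

    gold, pred: lists of label sets, one set per document, aligned by index.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in the gold set
        fp += len(p - g)   # labels predicted but absent from the gold set
        fn += len(g - p)   # gold labels the system missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Because counts are pooled before averaging, frequent labels weigh more than rare ones, which is the main difference from a macro-averaged score.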


Subject(s)
Multilingualism , Natural Language Processing , Data Mining , Humans , Language , Unified Medical Language System
5.
J Biomed Inform ; 130: 104073, 2022 06.
Article in English | MEDLINE | ID: mdl-35427797

ABSTRACT

A vast amount of crucial information about patients resides solely in unstructured clinical narrative notes. There has been a growing interest in the clinical Named Entity Recognition (NER) task using deep learning models. Such approaches require sufficient annotated data. However, there are few publicly available annotated corpora in the medical field due to the sensitive nature of clinical text. In this paper, we tackle this problem by building privacy-preserving shareable models for French clinical Named Entity Recognition using the mimic learning approach to enable knowledge transfer from a teacher model trained on a private corpus to a student model. This student model could be publicly shared without any access to the original sensitive data. We evaluated three privacy-preserving models using three medical corpora and compared the performance of our models to those of baseline models such as dictionary-based models. An overall macro F-measure of 70.6% could be achieved by a student model trained using silver annotations produced by the teacher model, compared to 85.7% for the original private teacher model. Our results revealed that these privacy-preserving mimic learning models offer a good compromise between performance and data privacy preservation.
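The data flow of mimic learning described above can be shown with a toy sketch: a teacher labels shareable, unlabeled text, and a student is trained only on those silver labels. This is a deliberately simplified illustration; the lexicon-lookup teacher and the majority-vote student are hypothetical stand-ins for the neural models used in the paper.

```python
from collections import Counter, defaultdict

def make_teacher(private_lexicon):
    """Stand-in for a model trained on private data (hypothetical)."""
    def teacher(tokens):
        return [private_lexicon.get(t.lower(), "O") for t in tokens]
    return teacher

def train_student(teacher, unlabeled_sentences):
    """Train a student using only the teacher's silver labels on shareable text."""
    counts = defaultdict(Counter)
    for tokens in unlabeled_sentences:
        for token, label in zip(tokens, teacher(tokens)):
            counts[token.lower()][label] += 1
    # The student keeps the most frequent silver label for each token it has seen.
    table = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

    def student(tokens):
        return [table.get(t.lower(), "O") for t in tokens]
    return student
```

The student only ever sees the unlabeled sentences and the teacher's outputs, never the private corpus itself, which is the property that makes it shareable in this setup.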


Subject(s)
Narration , Privacy , Humans , Natural Language Processing
6.
J Biomed Inform ; 114: 103684, 2021 02.
Article in English | MEDLINE | ID: mdl-33450387

ABSTRACT

INTRODUCTION: Concept normalization is the task of linking terms from textual medical documents to their concept in terminologies such as the UMLS®. Traditional approaches to this problem depend heavily on the coverage of available resources, which poses a problem for languages other than English. OBJECTIVE: We present a system for concept normalization in French. We consider textual mentions already extracted and labeled by a named entity recognition system, and we classify these mentions with a UMLS concept unique identifier. We take advantage of the multilingual nature of available terminologies and embedding models to improve concept normalization in French without translation or direct supervision. MATERIALS AND METHODS: We consider the task as a highly multiclass classification problem. The terms are encoded with contextualized embeddings and classified via cosine similarity and softmax. A first step uses a subset of the terminology to finetune the embeddings and train the model. A second step adds the entire target terminology, and the model is trained further with hard negative selection and softmax sampling. RESULTS: On two corpora from the Quaero FrenchMed benchmark, we show that our approach can lead to good results even with no labeled data at all, and that it outperforms existing supervised methods with labeled data. DISCUSSION: Training the system with both French and English terms improves the performance of the system on a French benchmark by a large margin, regardless of the way the embeddings were pretrained (French, English, multilingual). Our distantly supervised method can be applied to any kind of document or medical domain, as it does not require any concept-labeled documents. CONCLUSION: These experiments pave the way for simpler and more effective multilingual approaches to processing medical texts in languages other than English.
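The classification step described above (cosine similarity followed by softmax) can be sketched on toy vectors. This is a minimal illustration, not the authors' implementation: the two-dimensional embeddings, the placeholder concept identifiers, and the `temperature` parameter are assumptions, and the training procedure (finetuning, hard negative selection, softmax sampling) is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def softmax(scores, temperature=1.0):
    """Numerically stable softmax; temperature is an assumed hyperparameter."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(mention_vec, concept_vecs, temperature=0.1):
    """Return the most probable concept identifier and its probability.

    concept_vecs: dict mapping a concept identifier to its embedding.
    """
    ids = list(concept_vecs)
    sims = [cosine(mention_vec, concept_vecs[c]) for c in ids]
    probs = softmax(sims, temperature)
    best = max(range(len(ids)), key=probs.__getitem__)
    return ids[best], probs[best]
```

With placeholder concepts such as `{"CUI_A": [1.0, 0.0], "CUI_B": [0.0, 1.0]}`, a mention vector close to the first axis is assigned to `CUI_A`. Because cosine similarities lie in a narrow range, a temperature below 1 sharpens the resulting distribution.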


Subject(s)
Multilingualism , Unified Medical Language System , Language , Natural Language Processing