Search | VHL Regional Portal

1.

Evaluating the Portability of Rheumatoid Arthritis Phenotyping Algorithms: A Case Study on French EHRs.

Fabacher, Thibaut; Sauleau, Erik-André; Leclerc Du Sablon, Noémie; Bergier, Hugo; Gottenberg, Jacques-Eric; Coulet, Adrien; Névéol, Aurélie.

Stud Health Technol Inform ; 302: 768-772, 2023 May 18.

Article in English | MEDLINE | ID: mdl-37203492

ABSTRACT

Previous work has successfully used machine learning and natural language processing for the phenotyping of Rheumatoid Arthritis (RA) patients in hospitals within the United States and France. Our goal is to evaluate the adaptability of RA phenotyping algorithms to a new hospital, both at the patient and encounter levels. Two algorithms are adapted and evaluated with a newly developed RA gold standard corpus, including annotations at the encounter level. The adapted algorithms offer comparably good performance for patient-level phenotyping on the new corpus (F1 0.68 to 0.82), but lower performance for encounter-level (F1 0.54). Regarding adaptation feasibility and cost, the first algorithm incurred a heavier adaptation burden because it required manual feature engineering. However, it is less computationally intensive than the second, semi-supervised, algorithm.

Subject(s)

Arthritis, Rheumatoid , Electronic Health Records , Humans , Algorithms , Arthritis, Rheumatoid/diagnosis , Machine Learning , Natural Language Processing

2.

Privacy-preserving mimic models for clinical named entity recognition in French.

Bannour, Nesrine; Wajsbürt, Perceval; Rance, Bastien; Tannier, Xavier; Névéol, Aurélie.

J Biomed Inform ; 130: 104073, 2022 06.

Article in English | MEDLINE | ID: mdl-35427797

ABSTRACT

A vast amount of crucial information about patients resides solely in unstructured clinical narrative notes. There has been a growing interest in clinical Named Entity Recognition (NER) task using deep learning models. Such approaches require sufficient annotated data. However, there is little publicly available annotated corpora in the medical field due to the sensitive nature of the clinical text. In this paper, we tackle this problem by building privacy-preserving shareable models for French clinical Named Entity Recognition using the mimic learning approach to enable the knowledge transfer through a teacher model trained on a private corpus to a student model. This student model could be publicly shared without any access to the original sensitive data. We evaluated three privacy-preserving models using three medical corpora and compared the performance of our models to those of baseline models such as dictionary-based models. An overall macro F-measure of 70.6% could be achieved by a student model trained using silver annotations produced by the teacher model, compared to 85.7% for the original private teacher model. Our results revealed that these privacy-preserving mimic learning models offer a good compromise between performance and data privacy preservation.

Subject(s)

Narration , Privacy , Humans , Natural Language Processing

3.

Diversity in Health Informatics: Mentoring and Leadership.

Moen, Anne; Chronaki, Catherine; Petelos, Elena; Voulgaraki, Despina; Turk, Eva; Névéol, Aurélie.

Stud Health Technol Inform ; 281: 1031-1035, 2021 May 27.

Article in English | MEDLINE | ID: mdl-34042835

ABSTRACT

Diversity, inclusion and interdisciplinary collaboration are drivers for healthcare innovation and adoption of new, technology-mediated services. The importance of diversity has been highlighted by the United Nations' in SDG5 "Achieve gender equality and empower all women and girls", to drive adoption of social and digital innovation. Women play an instrumental role in health care and are in position to bring about significant changes to support ongoing digitalization and transformation. At the same time, women are underrepresented in Science, Technology, Engineering and Mathematics (STEM). To some extent, the same holds for health care informatics. This paper sums up input to strategies for peer mentoring to ensure diversity in health informatics, to target systemic inequalities and build sustainable, intergenerational communities, improve digital health literacy and build capacity in digital health without losing the human touch.

Subject(s)

Medical Informatics , Mentoring , Engineering , Female , Humans , Leadership , Mentors

4.

Can reproducibility be improved in clinical natural language processing? A study of 7 clinical NLP suites.

Digan, William; Névéol, Aurélie; Neuraz, Antoine; Wack, Maxime; Baudoin, David; Burgun, Anita; Rance, Bastien.

J Am Med Inform Assoc ; 28(3): 504-515, 2021 03 01.

Article in English | MEDLINE | ID: mdl-33319904

ABSTRACT

BACKGROUND: The increasing complexity of data streams and computational processes in modern clinical health information systems makes reproducibility challenging. Clinical natural language processing (NLP) pipelines are routinely leveraged for the secondary use of data. Workflow management systems (WMS) have been widely used in bioinformatics to handle the reproducibility bottleneck. OBJECTIVE: To evaluate if WMS and other bioinformatics practices could impact the reproducibility of clinical NLP frameworks. MATERIALS AND METHODS: Based on the literature across multiple researcho fields (NLP, bioinformatics and clinical informatics) we selected articles which (1) review reproducibility practices and (2) highlight a set of rules or guidelines to ensure tool or pipeline reproducibility. We aggregate insight from the literature to define reproducibility recommendations. Finally, we assess the compliance of 7 NLP frameworks to the recommendations. RESULTS: We identified 40 reproducibility features from 8 selected articles. Frameworks based on WMS match more than 50% of features (26 features for LAPPS Grid, 22 features for OpenMinted) compared to 18 features for current clinical NLP framework (cTakes, CLAMP) and 17 features for GATE, ScispaCy, and Textflows. DISCUSSION: 34 recommendations are endorsed by at least 2 articles from our selection. Overall, 15 features were adopted by every NLP Framework. Nevertheless, frameworks based on WMS had a better compliance with the features. CONCLUSION: NLP frameworks could benefit from lessons learned from the bioinformatics field (eg, public repositories of curated tools and workflows or use of containers for shareability) to enhance the reproducibility in a clinical setting.

Subject(s)

Natural Language Processing , Reproducibility of Results , Computational Biology , Database Management Systems , Medical Informatics

5.

Ten simple rules to make your research more sustainable.

Ligozat, Anne-Laure; Névéol, Aurélie; Daly, Bénédicte; Frenoux, Emmanuelle.

PLoS Comput Biol ; 16(9): e1008148, 2020 09.

Article in English | MEDLINE | ID: mdl-32970666

Subject(s)

Biomedical Research , Carbon Footprint/standards , Cooperative Behavior , Sustainable Development , Conservation of Natural Resources , Health Knowledge, Attitudes, Practice , Humans , Internet , Travel

6.

Evaluation of an automatic article selection method for timelier updates of the Comet Core Outcome Set database.

Norman, Christopher R; Gargon, Elizabeth; Leeflang, Mariska M G; Névéol, Aurélie; Williamson, Paula R.

Database (Oxford) ; 20192019 01 01.

Article in English | MEDLINE | ID: mdl-31697361

ABSTRACT

Curated databases of scientific literature play an important role in helping researchers find relevant literature, but populating such databases is a labour intensive and time-consuming process. One such database is the freely accessible Comet Core Outcome Set database, which was originally populated using manual screening in an annually updated systematic review. In order to reduce the workload and facilitate more timely updates we are evaluating machine learning methods to reduce the number of references needed to screen. In this study we have evaluated a machine learning approach based on logistic regression to automatically rank the candidate articles. Data from the original systematic review and its four first review updates were used to train the model and evaluate performance. We estimated that using automatic screening would yield a workload reduction of at least 75% while keeping the number of missed references around 2%. We judged this to be an acceptable trade-off for this systematic review, and the method is now being used for the next round of the Comet database update.

Subject(s)

Data Curation , Data Mining , Databases, Factual , Machine Learning , Systematic Reviews as Topic

7.

Measuring the impact of screening automation on meta-analyses of diagnostic test accuracy.

Norman, Christopher R; Leeflang, Mariska M G; Porcher, Raphaël; Névéol, Aurélie.

Syst Rev ; 8(1): 243, 2019 10 28.

Article in English | MEDLINE | ID: mdl-31661028

ABSTRACT

BACKGROUND: The large and increasing number of new studies published each year is making literature identification in systematic reviews ever more time-consuming and costly. Technological assistance has been suggested as an alternative to the conventional, manual study identification to mitigate the cost, but previous literature has mainly evaluated methods in terms of recall (search sensitivity) and workload reduction. There is a need to also evaluate whether screening prioritization methods leads to the same results and conclusions as exhaustive manual screening. In this study, we examined the impact of one screening prioritization method based on active learning on sensitivity and specificity estimates in systematic reviews of diagnostic test accuracy. METHODS: We simulated the screening process in 48 Cochrane reviews of diagnostic test accuracy and re-run 400 meta-analyses based on a least 3 studies. We compared screening prioritization (with technological assistance) and screening in randomized order (standard practice without technology assistance). We examined if the screening could have been stopped before identifying all relevant studies while still producing reliable summary estimates. For all meta-analyses, we also examined the relationship between the number of relevant studies and the reliability of the final estimates. RESULTS: The main meta-analysis in each systematic review could have been performed after screening an average of 30% of the candidate articles (range 0.07 to 100%). No systematic review would have required screening more than 2308 studies, whereas manual screening would have required screening up to 43,363 studies. Despite an average 70% recall, the estimation error would have been 1.3% on average, compared to an average 2% estimation error expected when replicating summary estimate calculations. CONCLUSION: Screening prioritization coupled with stopping criteria in diagnostic test accuracy reviews can reliably detect when the screening process has identified a sufficient number of studies to perform the main meta-analysis with an accuracy within pre-specified tolerance limits. However, many of the systematic reviews did not identify a sufficient number of studies that the meta-analyses were accurate within a 2% limit even with exhaustive manual screening, i.e., using current practice.

Subject(s)

Automation , Diagnostic Tests, Routine , Mass Screening , Humans , Diagnostic Tests, Routine/standards , Reproducibility of Results , Research Design , Sensitivity and Specificity , Systematic Reviews as Topic , Meta-Analysis as Topic

8.

Expanding the Diversity of Texts and Applications: Findings from the Section on Clinical Natural Language Processing of the International Medical Informatics Association Yearbook.

Névéol, Aurélie; Zweigenbaum, Pierre.

Yearb Med Inform ; 27(1): 193-198, 2018 Aug.

Article in English | MEDLINE | ID: mdl-30157523

ABSTRACT

OBJECTIVES: To summarize recent research and present a selection of the best papers published in 2017 in the field of clinical Natural Language Processing (NLP). METHODS: A survey of the literature was performed by the two editors of the NLP section of the International Medical Informatics Association (IMIA) Yearbook. Bibliographic databases PubMed and Association of Computational Linguistics (ACL) Anthology were searched for papers with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. A total of 709 papers were automatically ranked and then manually reviewed based on title and abstract. A shortlist of 15 candidate best papers was selected by the section editors and peer-reviewed by independent external reviewers to come to the three best clinical NLP papers for 2017. RESULTS: Clinical NLP best papers provide a contribution that ranges from methodological studies to the application of research results to practical clinical settings. They draw from text genres as diverse as clinical narratives across hospitals and languages or social media. CONCLUSIONS: Clinical NLP continued to thrive in 2017, with an increasing number of contributions towards applications compared to fundamental methods. Methodological work explores deep learning and system adaptation across language variants. Research results continue to translate into freely available tools and corpora, mainly for the English language.

Subject(s)

Natural Language Processing , Health Personnel , Humans , Medical Informatics

9.

Three Dimensions of Reproducibility in Natural Language Processing.

Cohen, K Bretonnel; Xia, Jingbo; Zweigenbaum, Pierre; Callahan, Tiffany J; Hargraves, Orin; Goss, Foster; Ide, Nancy; Névéol, Aurélie; Grouin, Cyril; Hunter, Lawrence E.

LREC Int Conf Lang Resour Eval ; 2018: 156-165, 2018 May.

Article in English | MEDLINE | ID: mdl-29911205

ABSTRACT

Despite considerable recent attention to problems with reproducibility of scientific research, there is a striking lack of agreement about the definition of the term. That is a problem, because the lack of a consensus definition makes it difficult to compare studies of reproducibility, and thus to have even a broad overview of the state of the issue in natural language processing. This paper proposes an ontology of reproducibility in that field. Its goal is to enhance both future research and communication about the topic, and retrospective meta-analyses. We show that three dimensions of reproducibility, corresponding to three kinds of claims in natural language processing papers, can account for a variety of types of research reports. These dimensions are reproducibility of a conclusion, of a finding, and of a value. Three biomedical natural language processing papers by the authors of this paper are analyzed with respect to these dimensions.

10.

Clinical Natural Language Processing in languages other than English: opportunities and challenges.

Névéol, Aurélie; Dalianis, Hercules; Velupillai, Sumithra; Savova, Guergana; Zweigenbaum, Pierre.

J Biomed Semantics ; 9(1): 12, 2018 03 30.

Article in English | MEDLINE | ID: mdl-29602312

ABSTRACT

BACKGROUND: Natural language processing applied to clinical text or aimed at a clinical outcome has been thriving in recent years. This paper offers the first broad overview of clinical Natural Language Processing (NLP) for languages other than English. Recent studies are summarized to offer insights and outline opportunities in this area. MAIN BODY: We envision three groups of intended readers: (1) NLP researchers leveraging experience gained in other languages, (2) NLP researchers faced with establishing clinical text processing in a language other than English, and (3) clinical informatics researchers and practitioners looking for resources in their languages in order to apply NLP techniques and tools to clinical practice and/or investigation. We review work in clinical NLP in languages other than English. We classify these studies into three groups: (i) studies describing the development of new NLP systems or components de novo, (ii) studies describing the adaptation of NLP architectures developed for English to another language, and (iii) studies focusing on a particular clinical application. CONCLUSION: We show the advantages and drawbacks of each method, and highlight the appropriate application context. Finally, we identify major challenges and opportunities that will affect the impact of NLP on clinical practice and public health studies in a context that encompasses English as well as other languages.

Subject(s)

Natural Language Processing , Humans , Semantics

11.

Data Extraction and Synthesis in Systematic Reviews of Diagnostic Test Accuracy: A Corpus for Automating and Evaluating the Process.

Norman, Christopher; Leeflang, Mariska; Névéol, Aurélie.

AMIA Annu Symp Proc ; 2018: 817-826, 2018.

Article in English | MEDLINE | ID: mdl-30815124

ABSTRACT

BACKGROUND: Systematic reviews are critical for obtaining accurate estimates of diagnostic test accuracy, yet these require extracting information buried in free text articles, an often laborious process. OBJECTIVE: We create a dataset describing the data extraction and synthesis processes in 63 DTA systematic reviews, and demonstrate its utility by using it to replicate the data synthesis in the original reviews. METHOD: We construct our dataset using a custom automated extraction pipeline complemented with manual extraction, verification, and post-editing. We evaluate using manual assessment by two annotators and by comparing against data extracted from source files. RESULTS: The constructed dataset contains 5,848 test results for 1,354 diagnostic tests from 1,738 diagnostic studies. We observe an extraction error rate of 0.06-0.3%. CONCLUSIONS: This constitutes the first dataset describing the later stages of the DTA systematic review process, and is intended to be useful for automating or evaluating the process.

Subject(s)

Datasets as Topic , Diagnostic Tests, Routine , Information Storage and Retrieval , Systematic Reviews as Topic

12.

Design of an extensive information representation scheme for clinical narratives.

Deléger, Louise; Campillos, Leonardo; Ligozat, Anne-Laure; Névéol, Aurélie.

J Biomed Semantics ; 8(1): 37, 2017 Sep 11.

Article in English | MEDLINE | ID: mdl-28893314

ABSTRACT

BACKGROUND: Knowledge representation frameworks are essential to the understanding of complex biomedical processes, and to the analysis of biomedical texts that describe them. Combined with natural language processing (NLP), they have the potential to contribute to retrospective studies by unlocking important phenotyping information contained in the narrative content of electronic health records (EHRs). This work aims to develop an extensive information representation scheme for clinical information contained in EHR narratives, and to support secondary use of EHR narrative data to answer clinical questions. METHODS: We review recent work that proposed information representation schemes and applied them to the analysis of clinical narratives. We then propose a unifying scheme that supports the extraction of information to address a large variety of clinical questions. RESULTS: We devised a new information representation scheme for clinical narratives that comprises 13 entities, 11 attributes and 37 relations. The associated annotation guidelines can be used to consistently apply the scheme to clinical narratives and are https://cabernet.limsi.fr/annotation_guide_for_the_merlot_french_clinical_corpus-Sept2016.pdf . CONCLUSION: The information scheme includes many elements of the major schemes described in the clinical natural language processing literature, as well as a uniquely detailed set of relations.

Subject(s)

Biological Ontologies , Data Mining/methods , Electronic Health Records , Natural Language Processing , Humans

13.

Contribution of Natural Language Processing in Predicting Rehospitalization Risk.

Norman, Christopher; Nguyen, Thu Van; Névéol, Aurélie.

Med Care ; 55(8): 781, 2017 08.

Article in English | MEDLINE | ID: mdl-28549001

Subject(s)

Natural Language Processing , Patient Readmission , Humans , Risk , Risk Factors

14.

Automatic classification of registered clinical trials towards the Global Burden of Diseases taxonomy of diseases and injuries.

Atal, Ignacio; Zeitoun, Jean-David; Névéol, Aurélie; Ravaud, Philippe; Porcher, Raphaël; Trinquart, Ludovic.

BMC Bioinformatics ; 17(1): 392, 2016 Sep 22.

Article in English | MEDLINE | ID: mdl-27659604

ABSTRACT

BACKGROUND: Clinical trial registries may allow for producing a global mapping of health research. However, health conditions are not described with standardized taxonomies in registries. Previous work analyzed clinical trial registries to improve the retrieval of relevant clinical trials for patients. However, no previous work has classified clinical trials across diseases using a standardized taxonomy allowing a comparison between global health research and global burden across diseases. We developed a knowledge-based classifier of health conditions studied in registered clinical trials towards categories of diseases and injuries from the Global Burden of Diseases (GBD) 2010 study. The classifier relies on the UMLS® knowledge source (Unified Medical Language System®) and on heuristic algorithms for parsing data. It maps trial records to a 28-class grouping of the GBD categories by automatically extracting UMLS concepts from text fields and by projecting concepts between medical terminologies. The classifier allows deriving pathways between the clinical trial record and candidate GBD categories using natural language processing and links between knowledge sources, and selects the relevant GBD classification based on rules of prioritization across the pathways found. We compared automatic and manual classifications for an external test set of 2,763 trials. We automatically classified 109,603 interventional trials registered before February 2014 at WHO ICTRP. RESULTS: In the external test set, the classifier identified the exact GBD categories for 78 % of the trials. It had very good performance for most of the 28 categories, especially "Neoplasms" (sensitivity 97.4 %, specificity 97.5 %). The sensitivity was moderate for trials not relevant to any GBD category (53 %) and low for trials of injuries (16 %). For the 109,603 trials registered at WHO ICTRP, the classifier did not assign any GBD category to 20.5 % of trials while the most common GBD categories were "Neoplasms" (22.8 %) and "Diabetes" (8.9 %). CONCLUSIONS: We developed and validated a knowledge-based classifier allowing for automatically identifying the diseases studied in registered trials by using the taxonomy from the GBD 2010 study. This tool is freely available to the research community and can be used for large-scale public health studies.

15.

Clinical Information Extraction at the CLEF eHealth Evaluation lab 2016.

Névéol, Aurélie; Cohen, K Bretonnel; Grouin, Cyril; Hamon, Thierry; Lavergne, Thomas; Kelly, Liadh; Goeuriot, Lorraine; Rey, Grégoire; Robert, Aude; Tannier, Xavier; Zweigenbaum, Pierre.

CEUR Workshop Proc ; 1609: 28-42, 2016 Sep.

Article in English | MEDLINE | ID: mdl-29308065

ABSTRACT

This paper reports on Task 2 of the 2016 CLEF eHealth evaluation lab which extended the previous information extraction tasks of ShARe/CLEF eHealth evaluation labs. The task continued with named entity recognition and normalization in French narratives, as offered in CLEF eHealth 2015. Named entity recognition involved ten types of entities including disorders that were defined according to Semantic Groups in the Unified Medical Language System® (UMLS®), which was also used for normalizing the entities. In addition, we introduced a large-scale classification task in French death certificates, which consisted of extracting causes of death as coded in the International Classification of Diseases, tenth revision (ICD10). Participant systems were evaluated against a blind reference standard of 832 titles of scientific articles indexed in MEDLINE, 4 drug monographs published by the European Medicines Agency (EMEA) and 27,850 death certificates using Precision, Recall and F-measure. In total, seven teams participated, including five in the entity recognition and normalization task, and five in the death certificate coding task. Three teams submitted their systems to our newly offered reproducibility track. For entity recognition, the highest performance was achieved on the EMEA corpus, with an overall F-measure of 0.702 for plain entities recognition and 0.529 for normalized entity recognition. For entity normalization, the highest performance was achieved on the MEDLINE corpus, with an overall F-measure of 0.552. For death certificate coding, the highest performance was 0.848 F-measure.

16.

Natural language processing of radiology reports for the detection of thromboembolic diseases and clinically relevant incidental findings.

Pham, Anne-Dominique; Névéol, Aurélie; Lavergne, Thomas; Yasunaga, Daisuke; Clément, Olivier; Meyer, Guy; Morello, Rémy; Burgun, Anita.

BMC Bioinformatics ; 15: 266, 2014 Aug 07.

Article in English | MEDLINE | ID: mdl-25099227

ABSTRACT

BACKGROUND: Natural Language Processing (NLP) has been shown effective to analyze the content of radiology reports and identify diagnosis or patient characteristics. We evaluate the combination of NLP and machine learning to detect thromboembolic disease diagnosis and incidental clinically relevant findings from angiography and venography reports written in French. We model thromboembolic diagnosis and incidental findings as a set of concepts, modalities and relations between concepts that can be used as features by a supervised machine learning algorithm. A corpus of 573 radiology reports was de-identified and manually annotated with the support of NLP tools by a physician for relevant concepts, modalities and relations. A machine learning classifier was trained on the dataset interpreted by a physician for diagnosis of deep-vein thrombosis, pulmonary embolism and clinically relevant incidental findings. Decision models accounted for the imbalanced nature of the data and exploited the structure of the reports. RESULTS: The best model achieved an F measure of 0.98 for pulmonary embolism identification, 1.00 for deep vein thrombosis, and 0.80 for incidental clinically relevant findings. The use of concepts, modalities and relations improved performances in all cases. CONCLUSIONS: This study demonstrates the benefits of developing an automated method to identify medical concepts, modality and relations from radiology reports in French. An end-to-end automatic system for annotation and classification which could be applied to other radiology reports databases would be valuable for epidemiological surveillance, performance monitoring, and accreditation in French hospitals.

Subject(s)

Computational Biology/methods , Incidental Findings , Natural Language Processing , Pulmonary Embolism/diagnostic imaging , Radiology , Research Report , Tomography, X-Ray Computed , Algorithms , Humans

17.

De-identification of clinical notes in French: towards a protocol for reference corpus development.

Grouin, Cyril; Névéol, Aurélie.

J Biomed Inform ; 50: 151-61, 2014 Aug.

Article in English | MEDLINE | ID: mdl-24380818

ABSTRACT

BACKGROUND: To facilitate research applying Natural Language Processing to clinical documents, tools and resources are needed for the automatic de-identification of Electronic Health Records. OBJECTIVE: This study investigates methods for developing a high-quality reference corpus for the de-identification of clinical documents in French. METHODS: A corpus comprising a variety of clinical document types covering several medical specialties was pre-processed with two automatic de-identification systems from the MEDINA suite of tools: a rule-based system and a system using Conditional Random Fields (CRF). The pre-annotated documents were revised by two human annotators trained to mark ten categories of Protected Health Information (PHI). The human annotators worked independently and were blind to the system that produced the pre-annotations they were revising.The best pre-annotation system was applied to another random selection of 100 documents.After revision by one annotator, this set was used to train a statistical de-identification system. RESULTS: Two gold standard sets of 100 documents were created based on the consensus of two human revisions of the automatic pre-annotations.The annotation experiment showed that (i) automatic pre-annotation obtained with the rule-based system performed better (F=0.813) than the CRF system (F=0.519), (ii) the human annotators spent more time revising the pre-annotations obtained with the rule-based system (from 102 to 160minutes for 50 documents), compared to the CRF system (from 93 to 142minutes for 50 documents), (iii) the quality of human annotation is higher when pre-annotations are obtained with the rule-based system (F-measure ranging from 0.970 to 0.987), compared to the CRF system (F-measure ranging from 0.914 to 0.981).Finally, only 20 documents from the training set were needed for the statistical system to outperform the pre-annotation systems that were trained on corpora from a medical speciality and hospital different from those in the reference corpus developed herein. CONCLUSION: We find that better pre-annotations increase the quality of the reference corpus but require more revision time. A statistical de-identification method outperforms our rule-based system when as little as 20 custom training documents are available.

Subject(s)

Electronic Health Records , France , Humans , Natural Language Processing

18.

Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text.

Yepes, Antonio Jimeno; Prieur-Gaston, Elise; Névéol, Aurélie.

BMC Bioinformatics ; 14: 146, 2013 Apr 30.

Article in English | MEDLINE | ID: mdl-23631733

ABSTRACT

BACKGROUND: Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier for accessing biomedical information for non-native speakers. Recent progress in machine translation suggests that this technique could help make English texts accessible to speakers of other languages. However, the lack of adequate specialized corpora needed to train statistical models currently limits the quality of automatic translations in the biomedical domain. RESULTS: We show how a large-sized parallel corpus can automatically be obtained for the biomedical domain, using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINE and abstract text automatically retrieved from journal websites, which substantially extends the corpora used in previous work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish) we use the Moses package to train a statistical machine translation model that outperforms previous models for automatic translation of biomedical text. CONCLUSIONS: We have built translation data sets in the biomedical domain that can easily be extended to other languages available in MEDLINE. These sets can successfully be applied to train statistical machine translation models. While further progress should be made by incorporating out-of-domain corpora and domain-specific lexicons, we believe that this work improves the automatic translation of biomedical texts.

Subject(s)

MEDLINE , Translating , Linguistics/methods , Models, Statistical , Publishing

19.

Improving information retrieval using Medical Subject Headings Concepts: a test case on rare and chronic diseases.

Darmoni, Stéfan J; Soualmia, Lina F; Letord, Catherine; Jaulent, Marie-Christine; Griffon, Nicolas; Thirion, Benoît; Névéol, Aurélie.

J Med Libr Assoc ; 100(3): 176-83, 2012 Jul.

Article in English | MEDLINE | ID: mdl-22879806

ABSTRACT

BACKGROUND: As more scientific work is published, it is important to improve access to the biomedical literature. Since 2000, when Medical Subject Headings (MeSH) Concepts were introduced, the MeSH Thesaurus has been concept based. Nevertheless, information retrieval is still performed at the MeSH Descriptor or Supplementary Concept level. OBJECTIVE: The study assesses the benefit of using MeSH Concepts for indexing and information retrieval. METHODS: Three sets of queries were built for thirty-two rare diseases and twenty-two chronic diseases: (1) using PubMed Automatic Term Mapping (ATM), (2) using Catalog and Index of French-language Health Internet (CISMeF) ATM, and (3) extrapolating the MEDLINE citations that should be indexed with a MeSH Concept. RESULTS: Type 3 queries retrieve significantly fewer results than type 1 or type 2 queries (about 18,000 citations versus 200,000 for rare diseases; about 300,000 citations versus 2,000,000 for chronic diseases). CISMeF ATM also provides better precision than PubMed ATM for both disease categories. DISCUSSION: Using MeSH Concept indexing instead of ATM is theoretically possible to improve retrieval performance with the current indexing policy. However, using MeSH Concept information retrieval and indexing rules would be a fundamentally better approach. These modifications have already been implemented in the CISMeF search engine.

Subject(s)

Abstracting and Indexing/statistics & numerical data , Databases as Topic/statistics & numerical data , Medical Subject Headings/statistics & numerical data , Terminology as Topic , Algorithms , Chronic Disease , Electronic Data Processing , France , Humans , Information Storage and Retrieval , Language , MEDLINE/statistics & numerical data , Quality Control , Rare Diseases

20.

Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE.

Névéol, Aurélie; Wilbur, W John; Lu, Zhiyong.

Database (Oxford) ; 2012: bas026, 2012.

Article in English | MEDLINE | ID: mdl-22685160

ABSTRACT

High-throughput experiments and bioinformatics techniques are creating an exploding volume of data that are becoming overwhelming to keep track of for biologists and researchers who need to access, analyze and process existing data. Much of the available data are being deposited in specialized databases, such as the Gene Expression Omnibus (GEO) for microarrays or the Protein Data Bank (PDB) for protein structures and coordinates. Data sets are also being described by their authors in publications archived in literature databases such as MEDLINE and PubMed Central. Currently, the curation of links between biological databases and the literature mainly relies on manual labour, which makes it a time-consuming and daunting task. Herein, we analysed the current state of link curation between GEO, PDB and MEDLINE. We found that the link curation is heterogeneous depending on the sources and databases involved, and that overlap between sources is low, <50% for PDB and GEO. Furthermore, we showed that text-mining tools can automatically provide valuable evidence to help curators broaden the scope of articles and database entries that they review. As a result, we made recommendations to improve the coverage of curated links, as well as the consistency of information available from different databases while maintaining high-quality curation. Database URLs: http://www.ncbi.nlm.nih.gov/PubMed, http://www.ncbi.nlm.nih.gov/geo/, http://www.rcsb.org/pdb/

Subject(s)

Data Mining/methods , Data Mining/standards , Databases, Bibliographic , Databases, Genetic , MEDLINE , Abstracting and Indexing/standards , Database Management Systems

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL