Results 1 - 20 of 78
1.
Stud Health Technol Inform ; 316: 1487-1491, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176485

ABSTRACT

This article presents our experience in developing an ontological model that can be used to build clinical decision support systems (CDSS). We used the largest international biomedical terminological metathesaurus, the Unified Medical Language System (UMLS), as the basis of our model. This metathesaurus has been adapted into Russian using an automated hybrid translation system with expert control. The resulting product was named the National Unified Terminological System (NUTS). We added more than 33 million scientific and clinical relationships between NUTS terms, extracted from the texts of scientific articles and electronic health records. We also computed weights for each relationship, standardized their values, and created a symptom checker for preliminary diagnostics on this basis. We expect that NUTS will help solve the task of named entity recognition (NER) and increase the interoperability of terms across different CDSS.


Subject(s)
Electronic Health Records , Knowledge Bases , Unified Medical Language System , Decision Support Systems, Clinical , Natural Language Processing , Humans , Russia , Vocabulary, Controlled
2.
Stud Health Technol Inform ; 316: 1492-1493, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176486

ABSTRACT

This article presents experience in constructing the National Unified Terminological System (NUTS), which has an ontological structure based on the international Unified Medical Language System (UMLS). UMLS has been adapted and enriched with formulations from national directories, with relationships extracted from the texts of scientific articles and electronic health records, and with weight coefficients.


Subject(s)
Electronic Health Records , Unified Medical Language System , Natural Language Processing , Terminology as Topic , Vocabulary, Controlled
3.
Stud Health Technol Inform ; 316: 771-775, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176907

ABSTRACT

Ontologies play a key role in representing and structuring domain knowledge. In the biomedical domain, the need for this type of representation is crucial for structuring, coding, and retrieving data. However, available ontologies do not encompass all the relevant concepts and relationships. In this paper, we propose the framework SiMHOMer (Siamese Models for Health Ontologies Merging) to semantically merge and integrate the most relevant ontologies in the healthcare domain, with a first focus on diseases, symptoms, drugs, and adverse events. We propose to rely on the siamese neural models we developed and trained on biomedical data, BioSTransformers, to identify new relevant relations between concepts and to create new semantic relations, the objective being to build a new merging ontology that could be used in applications. To validate the proposed approach and the new relations, we relied on the UMLS Metathesaurus and the Semantic Network. Our first results show promising improvements for future research.


Subject(s)
Biological Ontologies , Semantics , Neural Networks, Computer , Humans , Unified Medical Language System
4.
Diagnostics (Basel) ; 14(11)2024 Jun 06.
Article in English | MEDLINE | ID: mdl-38893730

ABSTRACT

In recent years, Convolutional Neural Network (CNN) models have demonstrated notable advancements in various domains such as image classification and Natural Language Processing (NLP). Despite their success in image classification tasks, their potential impact on medical image retrieval, particularly in text-based medical image retrieval (TBMIR) tasks, has not yet been fully realized. This could be attributed to the complexity of the ranking process, as there is ambiguity in treating TBMIR as an image retrieval task rather than a traditional information retrieval or NLP task. To address this gap, our paper proposes a novel approach to re-ranking medical images using a Deep Matching Model (DMM) and Medical-Dependent Features (MDF). These features incorporate categorical attributes such as medical terminologies and imaging modalities. Specifically, our DMM aims to generate effective representations for query and image metadata using a personalized CNN, facilitating matching between these representations. By using MDF, a semantic similarity matrix based on Unified Medical Language System (UMLS) meta-thesaurus, and a set of personalized filters taking into account some ranking features, our deep matching model can effectively consider the TBMIR task as an image retrieval task, as previously mentioned. To evaluate our approach, we performed experiments on the medical ImageCLEF datasets from 2009 to 2012. The experimental results show that the proposed model significantly enhances image retrieval performance compared to the baseline and state-of-the-art approaches.

5.
BMC Med Inform Decis Mak ; 23(Suppl 4): 299, 2024 Feb 07.
Article in English | MEDLINE | ID: mdl-38326827

ABSTRACT

BACKGROUND: In this era of big data, data harmonization is an important step to ensure reproducible, scalable, and collaborative research. Thus, terminology mapping is a necessary step to harmonize heterogeneous data. For example, the mapping between the Medical Dictionary for Regulatory Activities (MedDRA) and the International Classification of Diseases (ICD) is essential for drug safety and pharmacovigilance research. Our main objective is to provide a quantitative and qualitative analysis of the mapping status between MedDRA and ICD. We focus on evaluating the current mapping status between MedDRA and ICD through the Unified Medical Language System (UMLS) and Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). We summarized the current mapping statistics and evaluated the quality of the current MedDRA-ICD mapping; for unmapped terms, we used our self-developed algorithm to rank the best possible mapping candidates for additional mapping coverage. RESULTS: The identified MedDRA-ICD mapped pairs cover 27.23% of the overall MedDRA preferred terms (PT). The systematic quality analysis demonstrated that, among the mapped pairs provided by UMLS, only 51.44% are considered an exact match. For the 2400 sampled unmapped terms, 56 of the 2400 MedDRA Preferred Terms (PT) could have exact match terms from ICD. CONCLUSION: Some of the mapped pairs between MedDRA and ICD are not exact matches due to differences in granularity and focus. For 72% of the unmapped PT terms, the identified exact match pairs illustrate the possibility of identifying additional mapped pairs. Referring to its own mapping standard, some of the unmapped terms should qualify for the expansion of MedDRA to ICD mapping in UMLS.


Subject(s)
Adverse Drug Reaction Reporting Systems , International Classification of Diseases , Humans , Unified Medical Language System , Pharmacovigilance , Algorithms
6.
Artif Intell Med ; 148: 102758, 2024 02.
Article in English | MEDLINE | ID: mdl-38325934

ABSTRACT

The development of intelligent systems that use social media data for decision-making processes in numerous domains such as politics, business, marketing, and finance, has been made possible by the popularity of social media platforms. However, the utilization of textual data from social media in the healthcare management industry is still somewhat limited when it is compared to other industries. Investigating how current machine learning and natural language processing technologies can be used in the healthcare industry to gauge public sentiment is an important study. Earlier works on healthcare sentiment analysis have utilized traditional word embedding models trained on the general and medical corpus. However, integration of medical knowledge to pre-trained word embedding models has not been considered yet. Word embedding models trained on the general corpus led to the problem of lacking medical knowledge and the models trained on the small size of the medical corpus have limitations in capturing semantic and syntactic properties. This research proposes a new word embedding model named Word Embedding Integrated with Medical Knowledge Vector (WE-iMKVec). The proposed model integrates sentiment lexicons and medical knowledgebases into the pre-trained word embedding to enrich the properties of word embedding. A new medical-aware sentiment polarity score is proposed for the utilization in learning neural-network sentiment and these vectors incorporate with the original pre-trained word vectors. The resulting vectors are enriched with lexicon vectors and the medical knowledge vectors: Adverse Drug Reaction (ADR) vector and Unified Medical Language System (UMLS) vector are used to build the proposed WE-iMKVec model. WE-iMKVec is validated on the five different social media healthcare review datasets and the empirical results showed its superiority over traditional word embedding models in medical sentiment analysis. 
The highest improvement can be found in the patients.info medical condition dataset where the proposed model outperforms three conventional word2vec models (Google-News, PubMed-PMC, and Drug Reviews) by 12.7 %, 31.4 %, and 25.4 % respectively in terms of F1 score.


Subject(s)
Deep Learning , Sentiment Analysis , Humans , Neural Networks, Computer , Machine Learning , Natural Language Processing
7.
Heliyon ; 10(1): e22766, 2024 Jan 15.
Article in English | MEDLINE | ID: mdl-38163107

ABSTRACT

A transient ischemic attack (TIA) affects millions of people worldwide. Although TIA risk factors have been identified individually, a systemic quantitative analysis of all health factors relevant to TIA using electronic medical records (EMR) remains lacking. This study employed a data-driven approach, leveraging hospital EMR data to create a TIA patient health factor graph. This graph consisted of 737 TIA and 737 control patient nodes, 740 health factor nodes, and over 33,000 relations between patients and factors. For all health factors in the graph, the connection delta ratios (CDRs) were determined and ranked, generating a quantitative distribution of TIA health factors. A literature review confirmed 56 risk factors in the distribution and unveiled a potential new risk factor "rhinosinusitis" for future validation. Moreover, the patient graph was visualized together with the TIA knowledge graph in the Unified Medical Language System. This integration enables clinicians to access and visualize patient data and international standard knowledge within a unified graph. In conclusion, graph CDR analysis can effectively quantify the distribution of TIA risk factors. The resulting TIA risk factor distribution might be instrumental in developing new risk prediction machine learning models for screening and early detection of TIA.

8.
Heliyon ; 9(6): e16818, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37332929

ABSTRACT

Embeddings are fundamental resources often reused for building intelligent systems in the biomedical context. As a result, evaluating the quality of previously trained embeddings and ensuring they cover the desired information is critical for the success of applications. This paper proposes a new evaluation methodology to test the coverage of embeddings against a targeted domain of interest. It defines measures to assess the terminology, similarity, and analogy coverage, which are core aspects of the embeddings. Then, it discusses the experimentation carried out on existing biomedical embeddings in the specific context of pulmonary diseases. The proposed methodology and measures are general and may be applied to any application domain.

9.
Stud Health Technol Inform ; 305: 97-101, 2023 Jun 29.
Article in English | MEDLINE | ID: mdl-37386967

ABSTRACT

Currently, there is very little research aimed at developing medical knowledge extraction tools for major West Slavic languages (Czech, Polish, and Slovak). This project lays the groundwork for a general medical knowledge extraction pipeline, introducing the resource vocabularies available for the respective languages (UMLS resources, ICD-10 translations and national drug databases). It demonstrates the utility of this approach on a case study using a large proprietary corpus of Czech oncology records consisting of more than 40 million words written about more than 4,000 patients. After correlating MedDRA terms found in patients' records with drugs prescribed to them, significant non-obvious associations were found between selected medical conditions being mentioned and the probability of certain drugs being prescribed over the course of the patient's treatment, in some cases increasing the probability of prescriptions by over 250%. This direction of research, producing large amounts of annotated data, is a prerequisite for training deep learning models and predictive systems.


Subject(s)
Databases, Pharmaceutical , Language , Humans , International Classification of Diseases , Knowledge , Medical Oncology
10.
Stud Health Technol Inform ; 305: 186-189, 2023 Jun 29.
Article in English | MEDLINE | ID: mdl-37386992

ABSTRACT

The development of clinical search engines is a pressing task in medical informatics. The main challenge in this area is high-quality processing of unstructured text. The interdisciplinary ontological metathesaurus UMLS can be used to address this problem. Currently, there is no unified method for aggregating relevant information from UMLS. In this research, we represented UMLS as a graph model and performed a spot check of the UMLS structure to identify basic problems. We then created a new graph metric and integrated it into two program modules that we developed for aggregating relevant knowledge from UMLS.


Subject(s)
Medical Informatics , Unified Medical Language System , Interdisciplinary Studies , Knowledge , Search Engine
11.
Artif Intell Med ; 140: 102551, 2023 06.
Article in English | MEDLINE | ID: mdl-37210157

ABSTRACT

Text-Based Medical Image Retrieval (TBMIR) has been known to be successful in retrieving medical images with textual descriptions. Usually, these descriptions are very brief and cannot express the whole visual content of the image in words, which negatively affects retrieval performance. One of the solutions offered in the literature is to form a Bayesian Network thesaurus that takes advantage of medical terms extracted from the image datasets. Although this solution is interesting, it is not efficient because it depends heavily on the co-occurrence measure, the layer arrangement, and the arc directions. A significant drawback of the co-occurrence measure is that it generates many uninteresting co-occurring terms. Several studies have applied association rule mining and its measures to discover the correlation between terms. In this paper, we propose a new efficient association Rule Based Bayesian Network (R2BN) model for TBMIR using updated medically-dependent features (MDF) based on the Unified Medical Language System (UMLS). The MDF are a set of medical terms that refer to the imaging modalities, the image color, the searched object dimension, etc. The proposed model presents the association rules mined from MDF in the form of a Bayesian Network model. Then, it exploits the association rule measures (support, confidence, and lift) to prune the Bayesian Network model for efficient computation. The proposed R2BN model is combined with a probabilistic model from the literature to predict the relevance of an image to a given query. Experiments are carried out with ImageCLEF medical retrieval task collections from 2009 to 2013. Results show that our proposed model significantly enhances image retrieval accuracy compared to the state-of-the-art retrieval models.
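The three association rule measures named in this abstract (support, confidence, and lift) have standard definitions, which the following minimal Python sketch illustrates. It is not the R2BN implementation: the function names and the toy term sets in `records` are hypothetical, chosen only to show how the measures are computed for a rule such as "mri -> brain".

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], items: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    wanted = frozenset(items)
    return sum(1 for t in transactions if wanted <= t) / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Estimate of P(consequent | antecedent).

    Assumes the antecedent occurs at least once (non-zero support).
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions: List[FrozenSet[str]],
         antecedent: Iterable[str],
         consequent: Iterable[str]) -> float:
    """Confidence divided by the consequent's support; values above 1
    mean the two sides co-occur more often than under independence."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))


# Hypothetical sets of medical terms attached to four images.
records = [
    frozenset({"mri", "brain", "grayscale"}),
    frozenset({"mri", "brain"}),
    frozenset({"ct", "lung", "grayscale"}),
    frozenset({"mri", "knee"}),
]

rule_support = support(records, {"mri", "brain"})
rule_confidence = confidence(records, {"mri"}, {"brain"})
rule_lift = lift(records, {"mri"}, {"brain"})
```

A pruning step of the kind the abstract describes could then discard rules whose measures fall below chosen thresholds; the thresholds themselves are not given in the abstract.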


Subject(s)
Information Storage and Retrieval , Models, Statistical , Bayes Theorem , Unified Medical Language System
12.
Stud Health Technol Inform ; 302: 823-824, 2023 May 18.
Article in English | MEDLINE | ID: mdl-37203506

ABSTRACT

This paper describes a first attempt to map UMLS concepts to pictographs as a resource for translation systems for the medical domain. An evaluation of pictographs from two freely available sets shows that for many concepts no pictograph could be found and that word-based lookup is inadequate for this task.


Subject(s)
Unified Medical Language System
13.
Heliyon ; 9(3): e14636, 2023 Mar.
Article in English | MEDLINE | ID: mdl-37020943

ABSTRACT

Background and objectives: Medical notes are narratives that describe the health of the patient in free text format. These notes can be more informative than structured data such as the history of medications or disease conditions. They are routinely collected and can be used to evaluate the patient's risk for developing chronic diseases such as dementia. This study investigates different methodologies for transforming routine care notes into dementia risk classifiers and evaluates the generalizability of these classifiers to new patients and new health care institutions. Methods: The notes collected over the relevant history of the patient are lengthy. In this study, TF-ICF is used to select keywords with the highest discriminative ability between at risk dementia patients and healthy controls. The medical notes are then summarized in the form of occurrences of the selected keywords. Two different encodings of the summary are compared. The first encoding consists of the average of the vector embedding of each keyword occurrence as produced by the BERT or Clinical BERT pre-trained language models. The second encoding aggregates the keywords according to UMLS concepts and uses each concept as an exposure variable. For both encodings, misspellings of the selected keywords are also considered in an effort to improve the predictive performance of the classifiers. A neural network is developed over the first encoding and a gradient boosted trees model is applied to the second encoding. Patients from a single health care institution are used to develop all the classifiers which are then evaluated on held-out patients from the same health care institution as well as test patients from two other health care institutions. 
Results: The results indicate that it is possible to identify patients at risk for dementia one year ahead of the onset of the disease using medical notes with an AUC of 75% when a gradient boosted trees model is used in conjunction with exposure variables derived from UMLS concepts. However, this performance is not maintained with an embedded feature space and when the classifier is applied to patients from other health care institutions. Moreover, an analysis of the top predictors of the gradient boosted trees model indicates that different features inform the classification depending on whether or not spelling variants of the keywords are included. Conclusion: The present study demonstrates that medical notes can enable risk prediction models for complex chronic diseases such as dementia. However, additional research efforts are needed to improve the generalizability of these models. These efforts should take into consideration the length and localization of the medical notes; the availability of sufficient training data for each disease condition; and the variabilities resulting from different feature engineering techniques.

14.
J Med Internet Res ; 24(11): e40361, 2022 11 25.
Article in English | MEDLINE | ID: mdl-36427233

ABSTRACT

BACKGROUND: Electronic medical records (EMRs) of patients with lung cancer (LC) capture a variety of health factors. Understanding the distribution of these factors will help identify key factors for risk prediction in preventive screening for LC. OBJECTIVE: We aimed to generate an integrated biomedical graph from EMR data and Unified Medical Language System (UMLS) ontology for LC, and to generate an LC health factor distribution from a hospital EMR of approximately 1 million patients. METHODS: The data were collected from 2 sets of 1397 patients with and those without LC. A patient-centered health factor graph was plotted with 108,000 standardized data, and a graph database was generated to integrate the graphs of patient health factors and the UMLS ontology. With the patient graph, we calculated the connection delta ratio (CDR) for each of the health factors to measure the relative strength of the factor's relationship to LC. RESULTS: The patient graph had 93,000 relations between the 2794 patient nodes and 650 factor nodes. An LC graph with 187 related biomedical concepts and 188 horizontal biomedical relations was plotted and linked to the patient graph. Searching the integrated biomedical graph with any number or category of health factors resulted in graphical representations of relationships between patients and factors, while searches using any patient presented the patient's health factors from the EMR and the LC knowledge graph (KG) from the UMLS in the same graph. Sorting the health factors by CDR in descending order generated a distribution of health factors for LC. The top 70 CDR-ranked factors of disease, symptom, medical history, observation, and laboratory test categories were verified to be concordant with those found in the literature. 
CONCLUSIONS: By collecting standardized data of thousands of patients with and those without LC from the EMR, it was possible to generate a hospital-wide patient-centered health factor graph for graph search and presentation. The patient graph could be integrated with the UMLS KG for LC and thus enable hospitals to bring continuously updated international standard biomedical KGs from the UMLS for clinical use in hospitals. CDR analysis of the graph of patients with LC generated a CDR-sorted distribution of health factors, in which the top CDR-ranked health factors were concordant with the literature. The resulting distribution of LC health factors can be used to help personalize risk evaluation and preventive screening recommendations.


Subject(s)
Electronic Health Records , Lung Neoplasms , Humans , Retrospective Studies , Unified Medical Language System , Lung Neoplasms/epidemiology , Hospitals
15.
Proc Int World Wide Web Conf ; 2022: 1037-1046, 2022 Apr.
Article in English | MEDLINE | ID: mdl-36108322

ABSTRACT

The Unified Medical Language System (UMLS) Metathesaurus construction process mainly relies on lexical algorithms and manual expert curation for integrating over 200 biomedical vocabularies. A lexical-based learning model (LexLM) was developed to predict synonymy among Metathesaurus terms and largely outperforms a rule-based approach (RBA) that approximates the current construction process. However, the LexLM has the potential for being improved further because it only uses lexical information from the source vocabularies, while the RBA also takes advantage of contextual information. We investigate the role of multiple types of contextual information available to the UMLS editors, namely source synonymy (SS), source semantic group (SG), and source hierarchical relations (HR), for the UMLS vocabulary alignment (UVA) problem. In this paper, we develop multiple variants of context-enriched learning models (ConLMs) by adding to the LexLM the types of contextual information listed above. We represent these context types in context-enriched knowledge graphs (ConKGs) with four variants ConSS, ConSG, ConHR, and ConAll. We train these ConKG embeddings using seven KG embedding techniques. We create the ConLMs by concatenating the ConKG embedding vectors with the word embedding vectors from the LexLM. We evaluate the performance of the ConLMs using the UVA generalization test datasets with hundreds of millions of pairs. Our extensive experiments show a significant performance improvement from the ConLMs over the LexLM, namely +5.0% in precision (93.75%), +0.69% in recall (93.23%), +2.88% in F1 (93.49%) for the best ConLM. Our experiments also show that the ConAll variant including the three context types takes more time, but does not always perform better than other variants with a single context type. 
Finally, our experiments show that the pairs of terms with high lexical similarity benefit most from adding contextual information, namely +6.56% in precision (94.97%), +2.13% in recall (93.23%), +4.35% in F1 (94.09%) for the best ConLM. The pairs with lower degrees of lexical similarity also show performance improvement with +0.85% in F1 (96%) for low similarity and +1.31% in F1 (96.34%) for no similarity. These results demonstrate the importance of using contextual information in the UVA problem.
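The F1 values reported in this abstract can be checked against the reported precision and recall, since F1 is their harmonic mean. The short Python sketch below does this for the two precision/recall pairs quoted above; the function and variable names are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported in the abstract.
best_overall = f1_score(93.75, 93.23)      # best ConLM, all pairs
high_similarity = f1_score(94.97, 93.23)   # pairs with high lexical similarity

print(round(best_overall, 2))     # 93.49
print(round(high_similarity, 2))  # 94.09
```

Both results agree with the F1 figures stated in the abstract.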

16.
J Biomed Inform ; 131: 104118, 2022 07.
Article in English | MEDLINE | ID: mdl-35690349

ABSTRACT

OBJECTIVE: To propose a new vector-based relatedness metric that derives word vectors from the intrinsic structure of biomedical ontologies, without consulting external resources such as large-scale biomedical corpora. MATERIALS AND METHODS: SNOMED CT on the mapping layer of UMLS was used as a testbed ontology. Vectors were created for every concept at the end of all semantic relations (attribute-value relations and descendants as well as the is_a relation) of the defining concept. The cosine similarity between the averages of those vectors with respect to each defining concept was computed to produce a final semantic relatedness. RESULTS: Two benchmark sets that include a total of 62 biomedical term pairs were used for evaluation. Spearman's rank coefficient of the current method was 0.655, 0.744, and 0.742 with the relatedness rated by physicians, coders, and medical experts, respectively. The proposed method was comparable to a word-embedding method and outperformed path-based, information content-based, and another multiple relation-based relatedness metrics. DISCUSSION: The current study demonstrated that the addition of attribute relations to the is_a hierarchy of SNOMED CT better conforms to the human sense of relatedness than models based on taxonomic relations. The current approach also showed that it is robust to the design inconsistency of ontologies. CONCLUSION: Unlike the previous vector-based approach, the current study exploited the intrinsic semantic structure of an ontology, precluding the need for external textual resources to obtain context information of defining terms. Future research is recommended to prove the validity of the current method with other biomedical ontologies.
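The final step described in the methods, taking the cosine similarity between the averages of two sets of vectors, can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract does not say how the individual concept vectors are constructed, so the three-dimensional toy vectors and all function names here are assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def relatedness(related_to_a: List[Vector], related_to_b: List[Vector]) -> float:
    """Cosine similarity between the averaged vectors of two concepts."""
    return cosine_similarity(mean_vector(related_to_a), mean_vector(related_to_b))


# Hypothetical vectors for the concepts related to two defining concepts.
concept_a_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
concept_b_vectors = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

score = relatedness(concept_a_vectors, concept_b_vectors)
```

In this toy case the two concepts share one related vector out of two, and the score comes out at about 0.5.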


Subject(s)
Biological Ontologies , Systematized Nomenclature of Medicine , Humans , Natural Language Processing , Semantics , Unified Medical Language System
17.
BMC Med Res Methodol ; 22(1): 141, 2022 05 14.
Article in English | MEDLINE | ID: mdl-35568796

ABSTRACT

BACKGROUND: Screening for eligible patients continues to pose a great challenge for many clinical trials. This has led to a rapidly growing interest in standardizing computable representations of eligibility criteria (EC) in order to develop tools that leverage data from electronic health record (EHR) systems. Although laboratory procedures (LP) represent a common entity of EC that is readily available and retrievable from EHR systems, there is a lack of interoperable data models for this entity of EC. A public, specialized data model that utilizes international, widely-adopted terminology for LP, e.g. Logical Observation Identifiers Names and Codes (LOINC®), is much needed to support automated screening tools. OBJECTIVE: The aim of this study is to establish a core dataset for LP most frequently requested to recruit patients for clinical trials using LOINC terminology. Employing such a core dataset could enhance the interface between study feasibility platforms and EHR systems and significantly improve automatic patient recruitment. METHODS: We used a semi-automated approach to analyze 10,516 screening forms from the Medical Data Models (MDM) portal's data repository that are pre-annotated with Unified Medical Language System (UMLS). An automated semantic analysis based on concept frequency is followed by an extensive manual expert review performed by physicians to analyze complex recruitment-relevant concepts not amenable to automatic approach. RESULTS: Based on analysis of 138,225 EC from 10,516 screening forms, 55 laboratory procedures represented 77.87% of all UMLS laboratory concept occurrences identified in the selected EC forms. We identified 26,413 unique UMLS concepts from 118 UMLS semantic types and covered the vast majority of Medical Subject Headings (MeSH) disease domains. 
CONCLUSIONS: Only a small set of common LP covers the majority of laboratory concepts in screening EC forms which supports the feasibility of establishing a focused core dataset for LP. We present ELaPro, a novel, LOINC-mapped, core dataset for the most frequent 55 LP requested in screening for clinical trials. ELaPro is available in multiple machine-readable data formats like CSV, ODM and HL7 FHIR. The extensive manual curation of this large number of free-text EC as well as the combining of UMLS and LOINC terminologies distinguishes this specialized dataset from previous relevant datasets in the literature.
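The finding that a small set of procedures accounts for most concept occurrences rests on a frequency count followed by a cumulative coverage calculation. The Python sketch below shows one way such a calculation could look; it is not the ELaPro pipeline, and the function name, the 75% target, and the toy list of occurrences are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def smallest_covering_set(occurrences: Iterable[str],
                          target: float) -> Tuple[List[str], float]:
    """Pick the most frequent concepts until they cover `target` of all
    occurrences; return the chosen concepts and the coverage reached."""
    counts = Counter(occurrences)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0

    chosen: List[str] = []
    covered = 0
    for concept, count in counts.most_common():
        if covered / total >= target:
            break
        chosen.append(concept)
        covered += count
    return chosen, covered / total


# Hypothetical laboratory concept occurrences from annotated criteria.
toy_occurrences = (
    ["hemoglobin"] * 5 + ["creatinine"] * 3 + ["platelets"] + ["bilirubin"]
)

core_set, coverage = smallest_covering_set(toy_occurrences, target=0.75)
print(core_set)   # ['hemoglobin', 'creatinine']
print(coverage)   # 0.8
```

In the study itself, the manual expert review described in the methods complements this kind of automated frequency analysis.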


Subject(s)
Logical Observation Identifiers Names and Codes , Medical Subject Headings , Humans , Semantics
18.
BMC Med Inform Decis Mak ; 22(1): 114, 2022 04 29.
Article in English | MEDLINE | ID: mdl-35488252

ABSTRACT

BACKGROUND: Health providers create Electronic Health Records (EHRs) to describe the conditions and procedures used to treat their patients. Medical notes entered by medical staff in the form of free text are a particularly insightful component of EHRs. There is a great interest in applying machine learning tools on medical notes in numerous medical informatics applications. Learning vector representations, or embeddings, of terms in the notes, is an important pre-processing step in such applications. However, learning good embeddings is challenging because medical notes are rich in specialized terminology, and the number of available EHRs in practical applications is often very small. METHODS: In this paper, we propose a novel algorithm to learn embeddings of medical terms from a limited set of medical notes. The algorithm, called definition2vec, exploits external information in the form of medical term definitions. It is an extension of a skip-gram algorithm that incorporates textual definitions of medical terms provided by the Unified Medical Language System (UMLS) Metathesaurus. RESULTS: To evaluate the proposed approach, we used a publicly available Medical Information Mart for Intensive Care (MIMIC-III) EHR data set. We performed quantitative and qualitative experiments to measure the usefulness of the learned embeddings. The experimental results show that definition2vec keeps the semantically similar medical terms together in the embedding vector space even when they are rare or unobserved in the corpus. We also demonstrate that learned vector embeddings are helpful in downstream medical informatics applications. CONCLUSION: This paper shows that medical term definitions can be helpful when learning embeddings of rare or previously unseen medical terms from a small corpus of specialized documents such as medical notes.


Subject(s)
Electronic Health Records , Unified Medical Language System , Algorithms , Humans , Machine Learning
19.
Front Artif Intell ; 5: 1051724, 2022.
Article in English | MEDLINE | ID: mdl-36714202

ABSTRACT

Objective: The adoption of electronic health records (EHRs) has produced enormous amounts of data, creating research opportunities in clinical data sciences. Several concept recognition systems have been developed to facilitate clinical information extraction from these data. While studies exist that compare the performance of many concept recognition systems, they are typically developed internally and may be biased due to different internal implementations, parameters used, and limited number of systems included in the evaluations. The goal of this research is to evaluate the performance of existing systems to retrieve relevant clinical concepts from EHRs. Methods: We investigated six concept recognition systems, including CLAMP, cTAKES, MetaMap, NCBO Annotator, QuickUMLS, and ScispaCy. Clinical concepts extracted included procedures, disorders, medications, and anatomical location. The system performance was evaluated on two datasets: the 2010 i2b2 and the MIMIC-III. Additionally, we assessed the performance of these systems in five challenging situations, including negation, severity, abbreviation, ambiguity, and misspelling. Results: For clinical concept extraction, CLAMP achieved the best performance on exact and inexact matching, with an F-score of 0.70 and 0.94, respectively, on i2b2; and 0.39 and 0.50, respectively, on MIMIC-III. Across the five challenging situations, ScispaCy excelled in extracting abbreviation information (F-score: 0.86) followed by NCBO Annotator (F-score: 0.79). CLAMP outperformed in extracting severity terms (F-score 0.73) followed by NCBO Annotator (F-score: 0.68). CLAMP outperformed other systems in extracting negated concepts (F-score 0.63). Conclusions: Several concept recognition systems exist to extract clinical information from unstructured data. This study provides an external evaluation by end-users of six commonly used systems across different extraction tasks. 
Our findings suggest that CLAMP provides the most comprehensive set of annotations for clinical concept extraction tasks and associated challenges. Comparing standard extraction tasks across systems provides guidance to other clinical researchers when selecting a concept recognition system relevant to their clinical information extraction task.

20.
Article in English | MEDLINE | ID: mdl-36776766

ABSTRACT

Biomedical ontologies provide formalized information and knowledge in the biomedical domain. Over the years, biomedical ontologies have played an important role in facilitating biomedical research and applications. Common quality issues of biomedical ontologies include inconsistent naming of concepts, redundant concepts, redundant relations, incomplete/incorrect concept definitions, and incomplete/incorrect class hierarchies. In this work, we focus on addressing the incompleteness of the class hierarchy in SNOMED CT. We develop a substring replacement approach, leveraging concepts' lexical features and existing IS-A relations to identify potential missing IS-A relations in SNOMED CT. To evaluate the effectiveness of our approach, we performed both automated and manual validation. For the automated evaluation, we leverage relations from external terminologies in the Unified Medical Language System (UMLS) to validate the identified missing IS-A relations. For the manual validation, a randomly selected 100 samples from the results are reviewed by a domain expert. Applying our approach to the March 2022 release of SNOMED CT US Edition, we identified 3,228 potential missing IS-A relations, among which 63 were validated through the UMLS. The evaluation by the domain expert revealed that 89 out of 100 (a precision of 89%) missing IS-A relations are valid cases, showing the effectiveness of this substring replacement approach to facilitate the quality assurance of IS-A relations in SNOMED CT.
