Results 1 - 20 of 4,369
1.
MAbs ; 16(1): 2361928, 2024.
Article in English | MEDLINE | ID: mdl-38844871

ABSTRACT

The naïve human antibody repertoire has theoretical access to an estimated >10¹⁵ antibodies. Identifying subsets of this prohibitively large space where therapeutically relevant antibodies may be found is useful for the development of these agents. It was previously demonstrated that, despite the immense sequence space, different individuals can produce the same antibodies. It was also shown that therapeutic antibodies, which typically follow seemingly unnatural development processes, can nonetheless arise naturally and independently. To check for biases in how the sequence space is explored, we data mined public repositories to identify 220 bioprojects with a combined seven billion reads. From these, we created a subset of human bioprojects that we make available as the AbNGS database (https://naturalantibody.com/ngs/). AbNGS contains 135 bioprojects with four billion productive human heavy variable region sequences and 385 million unique complementarity-determining region (CDR)-H3s. We find that 270,000 (0.07% of 385 million) unique CDR-H3s are highly public in that they occur in at least five of the 135 bioprojects. Of 700 unique therapeutic CDR-H3s, 6% have direct matches in the small set of 270,000. This observation extends to matches on both the CDR-H3 and the V-gene call. Thus, the subspace of shared ('public') CDR-H3s is a useful starting point for therapeutic antibody design.
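The "public" CDR-H3 tally described above can be sketched as a per-bioproject membership count. This is a minimal illustration with hypothetical project IDs and toy sequences, not data or code from AbNGS:

```python
from collections import defaultdict

def find_public_cdrh3(project_sequences, min_projects=5):
    """Return the 'public' CDR-H3s that occur in at least
    `min_projects` distinct bioprojects."""
    occurrence = defaultdict(set)
    for project_id, sequences in project_sequences.items():
        for seq in set(sequences):  # count each project at most once
            occurrence[seq].add(project_id)
    return {seq for seq, projects in occurrence.items()
            if len(projects) >= min_projects}

# Toy data: "ARDYW" appears in all five projects, each other
# sequence in only one, so only "ARDYW" is public.
projects = {f"PRJ{i}": ["ARDYW", f"UNIQUE{i}"] for i in range(5)}
public = find_public_cdrh3(projects, min_projects=5)
```

At AbNGS scale the same logic would run over a database rather than in-memory dictionaries, but the membership-count idea is unchanged.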


Subject(s)
Biological Products , Complementarity Determining Regions , Data Mining , Drug Discovery , Humans , Data Mining/methods , Drug Discovery/methods , Biological Products/immunology , Complementarity Determining Regions/genetics , Complementarity Determining Regions/immunology , Immunoglobulin Variable Region/immunology , Immunoglobulin Variable Region/genetics
2.
J Med Internet Res ; 26: e48491, 2024 Jun 06.
Article in English | MEDLINE | ID: mdl-38843521

ABSTRACT

BACKGROUND: Social media has become an increasingly popular and critical tool for users to digest diverse information and express their perceptions and attitudes. While most studies endeavor to delineate the emotional responses of social media users, there is limited research exploring the factors associated with the emergence of emotions, particularly negative ones, during news consumption. OBJECTIVE: We aim to first depict the web coverage by news organizations on social media and then explore the crucial elements of news coverage that trigger the public's negative emotions. Our findings can act as a reference for responsible parties and news organizations in times of crisis. METHODS: We collected 23,705 Facebook posts with 1,019,317 comments from the public pages of representative news organizations in Hong Kong. We used text mining techniques, such as topic models and Bidirectional Encoder Representations from Transformers, to analyze news components and public reactions. Beyond descriptive analysis, we used regression models to shed light on how news coverage on social media is associated with the public's negative emotional responses. RESULTS: Our results suggest that coverage of pandemic situations, antipandemic measures, and supportive actions is likely to reduce the public's negative emotions, while comments on posts mentioning the central government and the Government of Hong Kong show more negativity. Negative and neutral media tones can alleviate anger and interact with the subjects and issues in the news to affect users' negative emotions. Post length has a curvilinear relationship with users' negative emotions. CONCLUSIONS: This study sheds light on the impacts of various components of news coverage on social media (issues, subjects, media tone, and length) on the public's negative emotions (anger, fear, and sadness). Our comprehensive analysis provides a reference framework for efficient crisis communication during similar pandemics, now or in the future. Although this research is the first to extend the analysis of the relationship between news-coverage components and negative user emotions to social media, it echoes previous studies of traditional media and its derivatives, such as web newspapers. As the COVID-19 pandemic gradually draws to a close, the common ground between this research and previous studies also helps delineate a clearer territory in the field of health crisis communication.


Subject(s)
COVID-19 , Emotions , Social Media , Humans , COVID-19/psychology , COVID-19/epidemiology , Hong Kong , Pandemics , Mass Media/statistics & numerical data , SARS-CoV-2 , Data Mining/methods
3.
BMC Health Serv Res ; 24(1): 636, 2024 May 17.
Article in English | MEDLINE | ID: mdl-38760814

ABSTRACT

BACKGROUND: In Japan, over 450 public health centers played a central role in the operation of the local public health system in response to the COVID-19 pandemic. This study aimed to identify key issues for improving the public health center system ahead of future pandemics. METHODS: We conducted a cross-sectional study using an online questionnaire. The respondents were first-line workers in public health centers or local governments during the pandemic. We solicited open-ended responses concerning improvements needed for future pandemics. Issues were identified from these descriptions using morphological analysis and a topic model with KHcoder3.0. The number of topics was estimated using perplexity as the measure, and Latent Dirichlet Allocation was used to identify the meaning of each topic. RESULTS: We received open-ended responses from 784 (48.6%) of the 1,612 survey respondents, including 111 physicians, 330 nurses, and 172 administrative staff. Morphological analysis processed these descriptions into 36,632 words. The topic model summarized them into eight issues: 1) establishment of a crisis management system, 2) division of functions among public health centers, prefectures, and medical institutions, 3) clear role distribution among public health center staff, 4) training of specialists, 5) an information sharing system (information about infectious diseases and government policies), 6) response to excessive workload (support from other local governments, cooperation within public health centers, and outsourcing), 7) streamlining operations, and 8) balance with regular duties. CONCLUSIONS: This study identified key issues that need to be addressed to prepare Japan's public health centers for future pandemics. These findings are vital for discussions aimed at strengthening the public health system based on experiences from the COVID-19 pandemic.
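The topic-number selection described here (perplexity as the measure, LDA for topic identification) can be sketched with scikit-learn. The study itself used KHcoder3.0 on Japanese text; the toy corpus and candidate topic counts below are purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Repeated toy responses standing in for the 784 survey answers.
responses = [
    "crisis management system staffing",
    "information sharing between health centers",
    "workload support from local governments",
    "training of infectious disease specialists",
] * 10

X = CountVectorizer().fit_transform(responses)

# Choose the number of topics that minimises perplexity on the corpus,
# mirroring the perplexity-based selection in the abstract.
best_k, best_perplexity = None, float("inf")
for k in (2, 4, 8):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    p = lda.perplexity(X)
    if p < best_perplexity:
        best_k, best_perplexity = k, p
```

In practice, perplexity would be computed on held-out documents rather than the training corpus to avoid favouring larger topic counts.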


Subject(s)
COVID-19 , Pandemics , Humans , Japan , COVID-19/epidemiology , Cross-Sectional Studies , Surveys and Questionnaires , Data Mining/methods , Public Health , SARS-CoV-2 , Male
4.
BMC Plant Biol ; 24(1): 373, 2024 May 08.
Article in English | MEDLINE | ID: mdl-38714965

ABSTRACT

BACKGROUND: As one of the world's most important beverage crops, tea plants (Camellia sinensis) are renowned for their unique flavors and numerous beneficial secondary metabolites, drawing researchers to investigate how tea quality forms. With the increasing availability of tea plant transcriptome data in public databases, large-scale co-expression analyses have become feasible to meet the demand for functional characterization of tea plant genes. However, as multidimensional noise increases, larger-scale co-expression analyses are not always more effective. Analyzing a subset of samples generated by effectively downsampling and reorganizing the global sample set often leads to more accurate co-expression results. Moreover, global co-expression analyses are likely to overlook condition-specific gene interactions, which may be more important and worthy of exploration. RESULTS: Here, we employed k-means clustering to organize and classify the global tea plant samples into clusters. Metadata annotations were then performed on these clustered samples to determine the "conditions" represented by each cluster. Subsequently, we conducted weighted gene co-expression network analysis (WGCNA) separately on the global samples and on the clustered samples, yielding global modules and cluster-specific modules. Comparative analyses demonstrated that cluster-specific modules achieve higher accuracy in co-expression analysis. To measure the degree of condition specificity of genes within condition-specific clusters, we introduced the correlation difference value (CDV). Incorporating the CDV into co-expression analyses allows the condition specificity of genes to be assessed. This approach proved instrumental in identifying a series of high-CDV transcription factor-encoding genes upregulated during sustained cold treatment in Camellia sinensis leaves and buds, and in pinpointing a pair of genes that participate in the antioxidant defense system of tea plants under sustained cold stress. CONCLUSIONS: To summarize, downsampling and reorganizing the sample set improved the accuracy of co-expression analysis, and cluster-specific modules were more accurate in capturing condition-specific gene interactions. The introduction of the CDV allowed condition specificity to be assessed in gene co-expression analyses. Using this approach, we identified a series of high-CDV transcription factor-encoding genes related to sustained cold stress in Camellia sinensis. This study highlights the importance of considering condition specificity in co-expression analysis and provides insights into the regulation of the cold stress response in Camellia sinensis.
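A correlation difference value of the kind described can be sketched as the gap between a gene pair's correlation inside a condition cluster and its correlation across all samples. The exact formula is defined in the paper; this is one plausible reading, with synthetic expression data, not the authors' implementation:

```python
import numpy as np

def correlation_difference_value(x, y, cluster_mask):
    """Illustrative CDV: within-cluster Pearson correlation of a gene
    pair minus its global correlation over all samples."""
    global_r = np.corrcoef(x, y)[0, 1]
    cluster_r = np.corrcoef(x[cluster_mask], y[cluster_mask])[0, 1]
    return cluster_r - global_r

rng = np.random.default_rng(0)
n = 100
mask = np.zeros(n, dtype=bool)
mask[:50] = True  # first 50 samples: the "cold-treatment" cluster

# Two genes strongly co-expressed only inside the cluster.
base = rng.normal(size=50)
x = np.concatenate([base, rng.normal(size=50)])
y = np.concatenate([base + 0.1 * rng.normal(size=50), rng.normal(size=50)])

cdv = correlation_difference_value(x, y, mask)  # large positive value
```

A gene pair that is equally correlated everywhere would score near zero, so a high CDV flags cluster-specific co-expression.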


Subject(s)
Camellia sinensis , Camellia sinensis/genetics , Camellia sinensis/metabolism , Cluster Analysis , Genes, Plant , Gene Expression Profiling/methods , Data Mining/methods , Transcriptome , Gene Expression Regulation, Plant , Gene Regulatory Networks
5.
PLoS One ; 19(5): e0302595, 2024.
Article in English | MEDLINE | ID: mdl-38718024

ABSTRACT

Diabetes mellitus is one of the oldest diseases known to humankind, dating back to ancient Egypt. The disease is a chronic metabolic disorder that heavily burdens healthcare providers worldwide due to the steady yearly increase in patients. Worryingly, diabetes affects not only the aging population but also children. It is imperative to control this problem, as diabetes can lead to many health complications. Over time, computer technology has been integrated into healthcare systems. The use of artificial intelligence helps healthcare systems diagnose diabetes more efficiently, deliver better care, and become more patient-centric. Among the advanced data mining techniques in artificial intelligence, stacking is one of the most prominent methods applied in the diabetes domain. Hence, this study investigates the potential of stacking ensembles. The aims of this study are to reduce the high complexity inherent in stacking, which contributes to longer training times, and to remove outliers from the diabetes data to improve classification performance. To address these concerns, a novel machine learning method called Stacking Recursive Feature Elimination-Isolation Forest was introduced for diabetes prediction. Stacking is combined with Recursive Feature Elimination to design an efficient model for diabetes diagnosis that uses fewer features as resources. The method also incorporates Isolation Forest for outlier removal. The study uses accuracy, precision, recall, F1 measure, training time, and standard deviation metrics to evaluate classification performance. The proposed method achieved an accuracy of 79.077% on the PIMA Indians Diabetes dataset and 97.446% on the Diabetes Prediction dataset, outperforming many existing methods and demonstrating effectiveness in the diabetes domain.
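The overall pipeline shape (Isolation Forest outlier removal, then Recursive Feature Elimination feeding a stacking ensemble) can be sketched in scikit-learn. The synthetic data, estimator choices, and hyperparameters below are illustrative assumptions, not the authors' configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (IsolationForest, RandomForestClassifier,
                              StackingClassifier)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a diabetes dataset (not PIMA).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1) Remove outliers from the training portion with Isolation Forest.
inliers = IsolationForest(random_state=0).fit_predict(X_tr) == 1
X_tr, y_tr = X_tr[inliers], y_tr[inliers]

# 2) RFE shrinks the feature set; 3) stacking learns on what remains.
model = make_pipeline(
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=4),
    StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
)
accuracy = model.fit(X_tr, y_tr).score(X_te, y_te)
```

Fitting the outlier filter only on training data avoids leaking test-set information into the cleaning step.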


Subject(s)
Diabetes Mellitus , Machine Learning , Humans , Diabetes Mellitus/diagnosis , Algorithms , Data Mining/methods , Support Vector Machine , Male
6.
Health Informatics J ; 30(2): 14604582241240680, 2024.
Article in English | MEDLINE | ID: mdl-38739488

ABSTRACT

Objective: This study examined major themes and sentiments, and their trajectories and interactions over time, using subcategories of Reddit data. The aim was to facilitate decision-making for psychosocial rehabilitation. Materials and Methods: We utilized natural language processing techniques, including topic modeling and sentiment analysis, on a dataset of more than 38,000 topics, comments, and posts collected from a subreddit dedicated to the experiences of people who tested positive for COVID-19. In this longitudinal exploratory analysis, we studied the dynamics between the most dominant topics and subjects' emotional states over an 18-month period. Results: Our findings highlight the evolution of the textual and sentimental status of major topics discussed by COVID-19 survivors over an extended period of the pandemic. We particularly studied the pre- and post-vaccination eras as a turning point in the timeline of the pandemic. The results show that not only does the relevance of topics change over time, but the emotions attached to them also vary. Major social events, such as the administration of vaccines or the enforcement of nationwide policies, are also reflected in the discussions and inquiries of social media users, particularly in the emotional state (i.e., the sentiment and polarity of feelings) of those who have experienced COVID-19 personally. Discussion: Cumulative societal knowledge regarding the COVID-19 pandemic shapes the patterns with which people discuss their experiences, concerns, and opinions. The subjects' emotional state with respect to different topics was also affected by extraneous factors and events, such as vaccination.
Conclusion: By mining major topics, sentiments, and trajectories demonstrated in COVID-19 survivors' interactions on Reddit, this study contributes to the emerging body of scholarship on COVID-19 survivors' mental health outcomes, providing insights into the design of mental health support and rehabilitation services for COVID-19 survivors.
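A pre-/post-vaccination sentiment comparison of the kind described can be sketched with a toy lexicon scorer. The study used proper topic models and sentiment analysis on real Reddit data; the lexicon, posts, and cut-off date below are purely illustrative:

```python
from statistics import mean

# Tiny illustrative lexicon; real analyses use models such as VADER
# or transformer-based sentiment classifiers.
LEXICON = {"grateful": 1, "relieved": 1,
           "scared": -1, "exhausted": -1, "anxious": -1}

def score(text):
    """Sum of lexicon polarities over the words in a post."""
    return sum(LEXICON.get(w, 0) for w in text.lower().split())

posts = [
    ("2020-11", "so scared and exhausted after testing positive"),
    ("2020-12", "still anxious about symptoms"),
    ("2021-05", "grateful for the vaccine and relieved"),
    ("2021-06", "relieved that recovery is going well"),
]

# Split the timeline at the (assumed) vaccine rollout month.
pre = mean(score(t) for m, t in posts if m < "2021-01")
post = mean(score(t) for m, t in posts if m >= "2021-01")
```

Comparing the two period averages is the simplest version of treating vaccination as a turning point in the sentiment trajectory.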


Subject(s)
COVID-19 , SARS-CoV-2 , Survivors , Humans , COVID-19/psychology , COVID-19/epidemiology , Survivors/psychology , Data Mining/methods , Pandemics , Natural Language Processing , Social Media/trends , Longitudinal Studies
7.
PLoS One ; 19(5): e0303231, 2024.
Article in English | MEDLINE | ID: mdl-38771886

ABSTRACT

Extracting biological interactions from published literature helps us understand complex biological systems, accelerate research, and support decision-making in drug or treatment development. Despite efforts to automate the extraction of biological relations using text mining tools and machine learning pipelines, manual curation continues to serve as the gold standard. However, the rapidly increasing volume of literature on biological relations poses challenges for its manual curation and refinement. These challenges are further compounded because only a small fraction of the published literature is relevant to biological relation extraction, and the embedded sentences of relevant sections have complex structures, which can lead to incorrect inference of relationships. To overcome these challenges, we propose GIX, an automated and robust Gene Interaction Extraction framework based on pre-trained large language models fine-tuned through extensive evaluations on various gene/protein interaction corpora, including LLL and RegulonDB. GIX identifies relevant publications with minimal keywords, optimises sentence selection to reduce computational overhead, simplifies sentence structure while preserving meaning, and provides a confidence factor indicating the reliability of extracted relations. GIX's Stage-2 relation extraction method performed well on benchmark protein/gene interaction datasets, assessed using 10-fold cross-validation, surpassing state-of-the-art approaches. We demonstrated that the proposed method, although fully automated, performs as well as manual relation extraction, with enhanced robustness. We also observed GIX's capability to augment existing datasets with new sentences, incorporating newly discovered biological terms and processes. Further, we demonstrated GIX's real-world applicability by inferring E. coli gene circuits.


Subject(s)
Data Mining , Data Mining/methods , Natural Language Processing , Machine Learning , Computational Biology/methods , Humans , Algorithms
8.
Sensors (Basel) ; 24(9)2024 Apr 30.
Article in English | MEDLINE | ID: mdl-38732962

ABSTRACT

Being motivated has positive influences on task performance. However, motivation can result from various motives that affect different parts of the brain. Analyzing the motivation effect across all affected areas requires a large number of EEG electrodes, resulting in high cost, inflexibility, and burden to users. In many real-world applications, only the motivation effect is required for performance evaluation, regardless of the motive. Analyzing the relationships between the motivation-affected brain areas associated with task performance could limit the number of required electrodes. This study introduces a method to identify the cognitive motivation effect with a reduced number of EEG electrodes. The temporal association rule mining (TARM) concept was used to analyze the relationships between attention and memorization brain areas under the effect of motivation in a cognitive motivation task. To improve accuracy, the artificial bee colony (ABC) algorithm was applied with the central limit theorem (CLT) concept to optimize the TARM parameters. The results show that our method can identify the motivation effect with only the FCz and P3 electrodes, with an average classification accuracy of 74.5% on individual tests.


Subject(s)
Algorithms , Cognition , Electroencephalography , Motivation , Motivation/physiology , Electroencephalography/methods , Humans , Cognition/physiology , Male , Adult , Female , Brain/physiology , Young Adult , Electrodes , Data Mining/methods
9.
PLoS One ; 19(5): e0301262, 2024.
Article in English | MEDLINE | ID: mdl-38722864

ABSTRACT

Frequent sequence pattern mining is an excellent tool for discovering patterns in event chains. In complex systems, events from parallel processes are present, often without proper labelling. To identify the groups of events related to a subprocess, frequent sequential pattern mining can be applied. Since most algorithms provide too many frequent sequences, making the results difficult to interpret, the resulting frequent patterns must be post-processed. The available visualisation techniques do not allow easy access to the multiple properties that support a faster and better understanding of event scenarios. To address this issue, our work proposes an intuitive and interactive solution, introducing three novel network-based sequence visualisation methods that can reduce information processing time from a cognitive perspective. The proposed visualisation methods offer a more information-rich and easily understandable interpretation of sequential pattern mining results than the usual text-like output of pattern mining algorithms. The first uses the confidence values of the transitions to create a weighted network, while the second enriches the confidence-based adjacency matrix with similarities of the transitive nodes. The enriched matrix enables a similarity-based Multidimensional Scaling (MDS) projection of the sequences. The third method uses a similarity measure based on the overlap of the occurrences of the sequences' supporting events. The applicability of the method is demonstrated on an industrial alarm management problem and on the analysis of clickstreams of a website. The method was fully implemented in a Python environment. The results show that the proposed methods are highly applicable to the interactive processing of frequent sequences, supporting the exploration of the inner mechanisms of complex systems.
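The first visualisation method, a confidence-weighted transition network, can be sketched from miner output as follows. The alarm names and support counts are hypothetical, with confidence(A→B) = support(A,B) / support(A):

```python
from collections import Counter

# Toy frequent 2-sequences with their supports, standing in for the
# output of a sequential pattern miner on alarm logs.
sequences = {("alarm_A", "alarm_B"): 30,
             ("alarm_A", "alarm_C"): 10,
             ("alarm_B", "alarm_C"): 20}
item_support = Counter({"alarm_A": 40, "alarm_B": 50, "alarm_C": 30})

# Edge weights are transition confidences; feeding `edges` into a
# graph library such as networkx would yield the weighted network.
edges = {(a, b): s / item_support[a] for (a, b), s in sequences.items()}
```

Plotting these edges with widths proportional to confidence gives the at-a-glance view of dominant event transitions that the text-like miner output lacks.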


Subject(s)
Algorithms , Data Mining/methods , Humans
10.
PLoS One ; 19(5): e0301608, 2024.
Article in English | MEDLINE | ID: mdl-38691555

ABSTRACT

The application of pattern mining algorithms to extract movement patterns from sports big data can improve training specificity by facilitating a more granular evaluation of movement. Since movement patterns can only occur as consecutive, non-consecutive, or non-sequential, this study aimed to identify the best set of movement patterns for player movement profiling in professional rugby league and to quantify the similarity among distinct movement patterns. Three pattern mining algorithms (l-length Closed Contiguous [LCCspm], Longest Common Subsequence [LCS], and AprioriClose) were used to extract patterns profiling the match-game movements of elite rugby football league hookers (n = 22 players) and wingers (n = 28 players) across 319 matches. The Jaccard similarity score was used to quantify the similarity between the algorithms' movement patterns, and machine learning classification modelling identified which algorithm's movement patterns best separated playing positions. LCCspm and LCS movement patterns shared a Jaccard similarity score of 0.19. AprioriClose movement patterns shared no significant Jaccard similarity with LCCspm (0.008) or LCS (0.009) patterns. The closed contiguous movement patterns profiled by LCCspm best separated players into playing positions. A Multi-layered Perceptron classification algorithm achieved the highest accuracy of 91.02%, with precision, recall, and F1 scores of 0.91. Therefore, we recommend extracting closed contiguous (consecutive) rather than non-consecutive or non-sequential movement patterns for separating groups of players.
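The Jaccard similarity score used to compare the algorithms' pattern sets reduces to intersection over union. The pattern names below are hypothetical, not the study's output:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of movement patterns."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical pattern sets from two miners.
lccspm = {"jog-sprint", "sprint-jog", "walk-jog"}
lcs = {"jog-sprint", "walk-stand", "stand-walk", "walk-jog"}

similarity = jaccard(lccspm, lcs)  # 2 shared patterns / 5 total = 0.4
```

Scores near 0, like the 0.008-0.009 reported for AprioriClose, mean the miners surface almost entirely disjoint pattern sets.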


Subject(s)
Algorithms , Football , Movement , Humans , Football/physiology , Movement/physiology , Athletic Performance/physiology , Male , Machine Learning , Athletes , Data Mining/methods , Adult , Rugby
11.
PLoS One ; 19(5): e0304090, 2024.
Article in English | MEDLINE | ID: mdl-38776300

ABSTRACT

BACKGROUND: The aim of the How Farm Vets Cope project was to co-design, with farm veterinary surgeons, a set of web-based resources to help them and others deal with the different situations they can face. As part of the wider project, participants were recruited for one-to-one semi-structured phone interviews. These interviews focused on elements of job satisfaction and on how the participants coped during periods of poor mental wellbeing or with setbacks and failure. METHODS: Transcripts of these interviews were analysed using both quantitative methods of sentiment analysis and text mining, including term frequency/inverse document frequency and rapid automated keyword extraction, and qualitative content analysis. The twin aims of the analysis were to identify the important themes discussed by the participants and to compare the results of the two methods to see what differences, if any, arose. RESULTS: Analysis using the afinn and nrc sentiment lexicons identified emotional themes of anticipation and trust. Rapid automated keyword extraction highlighted issues around vets' age and support, whilst term frequency/inverse document frequency allowed individual themes, such as religion, that were not present across all responses to be identified. Content analysis supported these findings, pinpointing examples of trust in relationships with farmers and more experienced vets, along with examples of the difference good support networks can make, particularly to younger vets. FINDINGS: This work confirmed previous results in identifying the themes of trust, communication, and support as integral to the experience of practising farm veterinary surgeons. Younger or less experienced vets recognised themselves as benefiting from further support and signposting, leading to a discussion around preparing veterinary students for entry into farm animal practice. The two approaches showed very good agreement in their results. The quantitative approaches can be scaled to allow a larger number of interviews to be utilised in studies whilst still allowing the important qualitative results to be identified.


Subject(s)
Data Mining , Livestock , Mental Health , Veterinarians , Humans , Data Mining/methods , Animals , Veterinarians/psychology , Male , Female , Adult , Job Satisfaction , Farmers/psychology , Middle Aged , Interviews as Topic , Farms
12.
Stud Health Technol Inform ; 314: 98-102, 2024 May 23.
Article in English | MEDLINE | ID: mdl-38785011

ABSTRACT

This paper explores the potential of leveraging electronic health records (EHRs) for personalized health research through the application of artificial intelligence (AI) techniques, specifically Named Entity Recognition (NER). By extracting crucial patient information from clinical texts, including diagnoses, medications, symptoms, and lab tests, AI facilitates the rapid identification of relevant data, paving the way for future care paradigms. The study focuses on non-small cell lung cancer (NSCLC) in Italian clinical notes, introducing a novel set of 29 clinical entities that capture both the presence and absence (negation) of information relevant to NSCLC. Using a state-of-the-art model pretrained on Italian biomedical texts, we achieve promising results (average F1-score of 80.8%), demonstrating the feasibility of employing AI to extract biomedical information from Italian-language text.


Subject(s)
Artificial Intelligence , Electronic Health Records , Lung Neoplasms , Natural Language Processing , Italy , Humans , Lung Neoplasms/diagnosis , Carcinoma, Non-Small-Cell Lung/diagnosis , Data Mining/methods
13.
Comput Biol Med ; 176: 108539, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38728992

ABSTRACT

Nested entity recognition and relationship extraction are two tasks in the analysis of electronic medical records. However, most existing medical information extraction models treat these tasks separately, resulting in a lack of consistency between them. In this paper, we propose a joint medical entity-relation extraction model with progressive recognition and targeted assignment (PRTA). Entities and relations share the information of the sequence and word embedding layers in the joint decoding stage. They are trained simultaneously and realize information interaction by updating the shared parameters. Specifically, we design a compound triangle strategy for nested entity recognition and an adaptive multi-space interactive strategy for relationship extraction. We then construct a parameter-shared information space based on semantic continuity to decode entities and relationships. Extensive experiments were conducted on the Private Liver Disease Dataset (PLDD) provided by Beijing Friendship Hospital of Capital Medical University and on public datasets (NYT, ACE04, and ACE05). The results show that our method outperforms existing SOTA methods on most indicators and effectively handles nested entities and overlapping relationships.


Subject(s)
Electronic Health Records , Humans , Data Mining/methods , Algorithms , Databases, Factual , Liver Diseases
14.
Bioinformatics ; 40(5)2024 May 02.
Article in English | MEDLINE | ID: mdl-38775676

ABSTRACT

MOTIVATION: Cytometry comprises powerful techniques for analyzing the cell heterogeneity of a biological sample by examining the expression of protein markers. These technologies especially impact the field of oncoimmunology, where cell identification is essential for analyzing the tumor microenvironment. Several classification tools have been developed for the annotation of cytometry datasets, including supervised tools that require a training set as a reference (i.e., reference-based) and semisupervised tools based on the manual definition of a marker table. The latter is closer to the traditional annotation of cytometry data based on manual gating, but requires a manually defined marker table that cannot be extracted automatically in a reference-based fashion. Methods are therefore lacking that allow both classification approaches while maintaining the high biological interpretability afforded by the marker table. RESULTS: We present a new tool called GateMeClass (Gate Mining and Classification), which overcomes the limitations of current cytometry classification methods by allowing both semisupervised and supervised annotation based on a marker table that can be defined manually or extracted from an external annotated dataset. We measured the accuracy of GateMeClass in annotating three well-established benchmark mass cytometry datasets and one flow cytometry dataset. The performance of GateMeClass is comparable to reference-based methods and marker table-based techniques, offering greater flexibility and rapid execution times. AVAILABILITY AND IMPLEMENTATION: GateMeClass is implemented in the R language and is publicly available at https://github.com/simo1c/GateMeClass.


Subject(s)
Data Mining , Flow Cytometry , Flow Cytometry/methods , Data Mining/methods , Humans , Software , Algorithms , Tumor Microenvironment
15.
Database (Oxford) ; 20242024 May 28.
Article in English | MEDLINE | ID: mdl-38805753

ABSTRACT

While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support the biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or oligogenic variant relations, despite reports in the literature that epistatic effects between combinations of variants in different loci (or genes) are important for understanding disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared toward training tools that help in the curation of scientific literature. To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation, providing assistance in finding the most informative subset of samples to label. By pre-annotating with PubTator 85 full-text articles containing the relevant relations from the Oligogenic Diseases Database (OLIDA), text fragments featuring potential digenic variant combinations, i.e. gene-variant-gene-variant, were extracted. The resulting text fragments were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, was used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT, and BioM-BERT. More than 500,000 text fragments were considered for annotation, finally resulting in a dataset of 8,442 fragments, 794 of them positive instances, covering 95% of the originally annotated articles. When applied to gene-variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, a significant improvement over the non-fine-tuned model that underlines the relevance of the DUVEL dataset. This study shows how AL can play an important role in the creation of bioRE datasets relevant to biomedical curation applications. DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571.


Subject(s)
Supervised Machine Learning , Humans , Data Mining/methods , Data Curation/methods , Databases, Genetic
16.
J Med Syst ; 48(1): 51, 2024 May 16.
Article in English | MEDLINE | ID: mdl-38753223

ABSTRACT

Reports from spontaneous reporting systems (SRS) are hypothesis-generating. Additional evidence, such as more reports, is required to determine whether the generated drug-event associations are in fact safety signals. However, underreporting of adverse drug reactions (ADRs) delays signal detection. Through natural language processing, different sources of real-world data can be used to proactively collect additional evidence for potential safety signals. This study explores the feasibility of using Electronic Health Records (EHRs) to identify additional cases based on initial indications from spontaneous ADR reports, with the goal of strengthening the evidence base for potential safety signals. For two confirmed and two potential signals generated by the SRS of the Netherlands Pharmacovigilance Centre Lareb, targeted searches in the EHR of the Leiden University Medical Centre were performed using a text-mining-based tool, CTcue. The search for additional cases was done by constructing and running queries over the structured and free-text fields of the EHRs. We identified at least five additional cases for the confirmed signals and one additional case for each potential safety signal. The majority of the identified cases for the confirmed signals were documented in the EHRs before signal detection by the Dutch Medicines Evaluation Board. The identified cases for the potential signals were reported to Lareb as further evidence for signal detection. Our findings highlight the feasibility of performing targeted searches in the EHR, based on an underlying hypothesis, to provide further evidence for signal generation.
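The targeted free-text search can be illustrated as flagging notes in which a drug term and an adverse-event term co-occur. This is a simplified sketch: the drug and event terms below are hypothetical examples, and CTcue's actual query language and matching are more sophisticated.

```python
import re

def find_candidate_cases(notes, drug_terms, event_terms):
    """Flag EHR free-text notes in which both a drug term and an
    adverse-event term co-occur, as candidate cases for manual review."""
    drug_re = re.compile("|".join(map(re.escape, drug_terms)), re.IGNORECASE)
    event_re = re.compile("|".join(map(re.escape, event_terms)), re.IGNORECASE)
    return [i for i, note in enumerate(notes)
            if drug_re.search(note) and event_re.search(note)]

# Toy notes; real queries would also filter on structured fields
notes = [
    "Started metformin 500 mg; patient reports nausea.",
    "No current medication. Routine check-up.",
    "Metformin continued; new-onset taste disturbance noted.",
]
print(find_candidate_cases(notes, ["metformin"], ["nausea", "taste disturbance"]))
# → [0, 2]
```

Matched notes would then go to a pharmacovigilance assessor, since textual co-occurrence alone does not establish a causal drug-event link.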


Subject(s)
Adverse Drug Reaction Reporting Systems , Electronic Health Records , Pharmacovigilance , Electronic Health Records/organization & administration , Humans , Adverse Drug Reaction Reporting Systems/organization & administration , Netherlands , Natural Language Processing , Drug-Related Side Effects and Adverse Reactions/prevention & control , Data Mining/methods
17.
Syst Rev ; 13(1): 135, 2024 May 16.
Article in English | MEDLINE | ID: mdl-38755704

ABSTRACT

We aimed to compare the concordance of information extracted, and the time taken, between a large language model (OpenAI's GPT-3.5 Turbo via API) and conventional human extraction methods in retrieving information from scientific articles on diabetic retinopathy (DR). The extraction was done using GPT-3.5 Turbo as of October 2023. OpenAI's GPT-3.5 Turbo significantly reduced the time taken for extraction. Concordance was highest at 100% for the extraction of the country of study, 64.7% for significant risk factors of DR, 47.1% for exclusion and inclusion criteria, and 41.2% for odds ratio (OR) and 95% confidence interval (CI). The concordance levels appear to reflect the complexity of each prompt. This suggests that OpenAI's GPT-3.5 Turbo may be adopted to extract simple information that is easily located in the text, leaving more complex information to be extracted by the researcher. It is crucial to note that foundation models are improving rapidly, with new versions released frequently. Subsequent work can focus on retrieval-augmented generation (RAG), embedding, chunking PDFs into useful sections, and prompting to improve the accuracy of extraction.
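The extraction setup can be sketched as a prompt that asks the model for a fixed list of items per article. The prompt wording and field list below are assumptions mirroring the items compared in the study (country, risk factors, criteria, OR/CI), not the authors' actual prompts; the assembled text would be sent to the chat completions API.

```python
def build_extraction_prompt(article_text, fields):
    """Assemble a structured-extraction prompt for a chat model.
    One bullet per field keeps answers easy to compare against
    a human extractor's results."""
    field_lines = "\n".join(f"- {f}" for f in fields)
    return (
        "Extract the following items from the article below. "
        "Answer 'not reported' when an item is absent.\n"
        f"{field_lines}\n\nArticle:\n{article_text}"
    )

prompt = build_extraction_prompt(
    "A cohort study of diabetic retinopathy conducted in Singapore...",
    ["country of study", "significant risk factors",
     "inclusion/exclusion criteria", "odds ratio with 95% CI"],
)
print(prompt.splitlines()[0])
```

Simple, localized fields ("country of study") tend to be answered verbatim from the text, which is consistent with the concordance gradient the study reports.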


Subject(s)
Diabetic Retinopathy , Humans , Information Storage and Retrieval/methods , Natural Language Processing , Data Mining/methods
18.
J Med Internet Res ; 26: e48572, 2024 May 03.
Article in English | MEDLINE | ID: mdl-38700923

ABSTRACT

BACKGROUND: Adverse drug reactions (ADRs), which are the phenotypic manifestations of clinical drug toxicity in humans, are a major concern in precision clinical medicine. A comprehensive evaluation of ADRs is helpful for unbiased supervision of marketed drugs and for discovering new drugs with high success rates. OBJECTIVE: In current practice, drug safety evaluation is often oversimplified to the occurrence or nonoccurrence of ADRs. Given the limitations of current qualitative methods, there is an urgent need for a quantitative evaluation model to improve pharmacovigilance and the accurate assessment of drug safety. METHODS: In this study, we developed a mathematical model, namely the Adverse Drug Reaction Classification System (ADReCS) severity-grading model, for the quantitative characterization of ADR severity, a crucial feature for evaluating the impact of ADRs on human health. The model was constructed by mining millions of real-world historical adverse drug event reports. A new parameter called Severity_score was introduced to measure the severity of ADRs, and upper and lower score boundaries were determined for 5 severity grades. RESULTS: The ADReCS severity-grading model exhibited excellent consistency (99.22%) with the expert-grading system, the Common Terminology Criteria for Adverse Events. Hence, we graded the severity of 6277 standard ADRs for 129,407 drug-ADR pairs. Moreover, we calculated the occurrence rates of 6272 distinct ADRs for 127,763 drug-ADR pairs in large patient populations by mining real-world medication prescriptions. With the quantitative features, we demonstrated example applications in systematically elucidating ADR mechanisms and thereby discovered a list of drugs with improper dosages. CONCLUSIONS: In summary, this study represents the first comprehensive determination of both ADR severity grades and ADR frequencies. 
This endeavor establishes a strong foundation for future artificial intelligence applications in discovering new drugs with high efficacy and low toxicity. It also heralds a paradigm shift in clinical toxicity research, moving from qualitative description to quantitative evaluation.
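Grading by score boundaries can be sketched as a lookup of the Severity_score against the upper bounds of grades 1-4. The cut-off values below are hypothetical placeholders; the published ADReCS model derives its own boundaries from mined adverse-event reports.

```python
from bisect import bisect_right

# Hypothetical upper bounds of severity grades 1-4 on a [0, 1] scale;
# ADReCS determines the real boundaries from report mining.
BOUNDARIES = [0.2, 0.4, 0.6, 0.8]

def severity_grade(score):
    """Map a Severity_score in [0, 1] to one of 5 severity grades:
    the grade is 1 + the number of boundaries the score exceeds."""
    return bisect_right(BOUNDARIES, score) + 1

print([severity_grade(s) for s in (0.05, 0.35, 0.61, 0.95)])  # → [1, 2, 4, 5]
```

Using sorted boundaries with `bisect_right` keeps the grading monotonic by construction: a higher score can never receive a lower grade.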


Subject(s)
Big Data , Data Mining , Drug-Related Side Effects and Adverse Reactions , Humans , Data Mining/methods , Pharmacovigilance , Models, Theoretical , Adverse Drug Reaction Reporting Systems/statistics & numerical data
19.
Genes (Basel) ; 15(5)2024 May 11.
Article in English | MEDLINE | ID: mdl-38790243

ABSTRACT

Alzheimer's disease (AD), a multifactorial neurodegenerative disorder, is prevalent among the elderly population. It is a complex trait with mutations in multiple genes. Although the US Food and Drug Administration (FDA) has approved a few drugs for AD treatment, a definitive cure remains elusive. Research efforts persist in seeking improved treatment options for AD. Here, a hybrid pipeline is proposed to apply text mining to identify comorbid diseases for AD and an omics approach to identify the common genes between AD and five comorbid diseases-dementia, type 2 diabetes, hypertension, Parkinson's disease, and Down syndrome. We further identified the pathways and drugs for common genes. The rationale behind this approach is rooted in the fact that elderly individuals often receive multiple medications for various comorbid diseases, and an insight into the genes that are common to comorbid diseases may enhance treatment strategies. We identified seven common genes-PSEN1, PSEN2, MAPT, APP, APOE, NOTCH, and HFE-for AD and five comorbid diseases. We investigated the drugs interacting with these common genes using LINCS gene-drug perturbation. Our analysis unveiled several promising candidates, including MG-132 and Masitinib, which exhibit potential efficacy for both AD and its comorbid diseases. The pipeline can be extended to other diseases.
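The common-gene step reduces to a set intersection across per-disease gene lists. The gene lists below are toy examples for illustration; the study's real lists come from omics databases.

```python
def common_genes(gene_sets):
    """Genes shared by a disease of interest and all of its
    comorbidities: the intersection of the per-disease gene sets."""
    sets = iter(gene_sets.values())
    shared = set(next(sets))
    for s in sets:
        shared &= set(s)
    return sorted(shared)

# Toy gene lists keyed by disease (illustrative only)
gene_sets = {
    "AD": ["APP", "APOE", "PSEN1", "MAPT"],
    "dementia": ["APP", "APOE", "PSEN1", "SNCA"],
    "type 2 diabetes": ["APOE", "APP", "PSEN1", "TCF7L2"],
}
print(common_genes(gene_sets))  # → ['APOE', 'APP', 'PSEN1']
```

The intersection shrinks as diseases are added, which is why the full five-comorbidity pipeline converges on a small set of seven genes.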


Subject(s)
Alzheimer Disease , Comorbidity , Data Mining , Alzheimer Disease/genetics , Alzheimer Disease/drug therapy , Humans , Data Mining/methods , Parkinson Disease/genetics , Parkinson Disease/drug therapy , Diabetes Mellitus, Type 2/genetics , Diabetes Mellitus, Type 2/drug therapy , Down Syndrome/genetics , Down Syndrome/drug therapy , Hypertension/genetics , Hypertension/drug therapy
20.
Sci Rep ; 14(1): 11942, 2024 05 24.
Article in English | MEDLINE | ID: mdl-38789488

ABSTRACT

As enterprises deepen their digital transformation, profiling employees ("employee portraits") based on their behaviors has become a research focus. Current approaches to employee portraits often rely on simple techniques, suffer from low efficiency and limited analytical power, and consequently yield one-sided results. Therefore, a data-mining-based employee portrait model is proposed. The content of the employee portrait is analyzed in depth, and the overall framework of the model is designed. A diligence analysis model (DAM) based on an improved GAN is constructed, and employee diligence criteria are specified to realize the diligence evaluation. DAM's diligence analysis achieves high accuracy (80.39%), outperforming SA (70.24%), K-means (51.79%) and GAN (67.25%). DAM's Kappa coefficient reaches 0.7384, indicating high consistency and exceeding SA (0.6075), K-means (0.3711) and GAN (0.5661). The Local Outlier Factor (LOF) and Isolation Forest (IF) are used to detect abnormal employee behaviors and to mine abnormal behavior patterns at different time granularities. An LSTM model with an attention mechanism (Att-LSTM) is used to predict employees' software-usage behaviors and to analyze and summarize the characteristics of employee behavior from multiple perspectives. Att-LSTM yields the best predictions, with an RMSE of 0.82983, better than LSTM (0.90833) and SA (0.97767), and a MAPE of 0.80323, better than LSTM (0.86233) and SA (0.92223). The results show that the data-mining-based employee portrait method can better address the digital profiling of enterprise employees and provides a new approach for constructing enterprise-level digital employee portrait models and analyzing employee behavior.
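The abnormal-behavior step can be illustrated with a much simpler stand-in for LOF/Isolation Forest: z-score flagging of daily activity counts. This is an assumption-laden sketch, not the paper's method, and the counts and threshold are invented for illustration.

```python
from statistics import mean, stdev

def flag_outliers(values, z_threshold=2.0):
    """Simplified stand-in for LOF / Isolation Forest: flag daily
    activity counts whose z-score exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values)
            if abs(v - mu) / sigma > z_threshold]

# Toy daily software-usage counts for one employee
counts = [12, 14, 11, 13, 12, 48, 13, 12]
print(flag_outliers(counts))  # → [5]
```

Density-based detectors such as LOF handle multi-modal behavior patterns that a single global z-score cannot, which is why the paper uses them for multi-granularity mining.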


Subject(s)
Data Mining , Humans , Data Mining/methods , Employment