Results 1 - 20 of 36
1.
J Biomed Semantics ; 14(1): 1, 2023 01 31.
Article in English | MEDLINE | ID: mdl-36721225

ABSTRACT

BACKGROUND: Information pertaining to mechanisms, management and treatment of disease-causing pathogens, including viruses and bacteria, is readily available from research publications indexed in MEDLINE. However, identifying the literature that specifically characterises these pathogens and their properties based on experimental research, which is important for understanding the molecular basis of the diseases they cause, requires sifting through a large number of articles to exclude incidental mentions of the pathogens or references to pathogens in non-experimental contexts such as public health. OBJECTIVE: In this work, we lay the foundations for the development of automatic methods for characterising mentions of pathogens in scientific literature, focusing on the task of identifying research that involves the experimental study of a pathogen. No manually annotated pathogen corpora are available for this purpose, yet such resources are necessary to support the development of machine learning-based models. We therefore aim to fill this gap, producing a large data set automatically from MEDLINE under some simplifying assumptions for the task definition, and using it to explore automatic methods that specifically support the detection of experimentally studied pathogen mentions in research publications. METHODS: We automatically developed a pathogen mention characterisation literature data set, READBiomed-Pathogens, using NCBI resources, which we make available. Resources such as the NCBI Taxonomy, MeSH and GenBank can be used effectively to identify relevant literature about experimentally researched pathogens; more specifically, MeSH links experimentally researched pathogens to MEDLINE citations, including their titles and abstracts. We experiment with several machine learning-based natural language processing (NLP) algorithms that leverage this data set as training data to model the task of detecting papers that specifically describe the experimental study of a pathogen. RESULTS: We show that our data set, READBiomed-Pathogens, can be used to explore natural language processing configurations for experimental pathogen mention characterisation. READBiomed-Pathogens includes citations related to organisms including bacteria and viruses, as well as a small number of toxins and other disease-causing agents. CONCLUSIONS: We studied the characterisation of experimentally studied pathogens in scientific literature, developing several natural language processing methods supported by an automatically developed data set. As a core contribution, we presented a methodology to automatically construct a data set for pathogen identification using existing biomedical resources. The data set and the annotation code are made publicly available. An additional evaluation of the pathogen mention identification and characterisation algorithms on a small manually annotated data set shows that the data set we have generated allows characterising the pathogens of interest. TRIAL REGISTRATION: N/A.
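
As an illustration of how MeSH annotations could drive such automatic labelling, the minimal sketch below flags a citation when a pathogen descriptor carries an experiment-oriented qualifier. The descriptor and qualifier sets are hypothetical placeholders, not the actual READBiomed-Pathogens selection criteria.

```python
# A minimal sketch of MeSH-based weak labelling, assuming each citation
# arrives as a list of (descriptor, qualifiers) pairs. The sets below are
# illustrative, not the criteria used to build READBiomed-Pathogens.
PATHOGEN_DESCRIPTORS = {"Escherichia coli", "Influenza A virus"}  # stand-ins for taxonomy-linked MeSH terms
EXPERIMENTAL_QUALIFIERS = {"genetics", "metabolism", "enzymology", "growth & development"}

def label_citation(mesh_annotations):
    """Return True if any pathogen descriptor carries an experimental qualifier."""
    for descriptor, qualifiers in mesh_annotations:
        if descriptor in PATHOGEN_DESCRIPTORS and EXPERIMENTAL_QUALIFIERS & set(qualifiers):
            return True
    return False

# Example: a citation annotated with "Escherichia coli/genetics" is labelled positive.
print(label_citation([("Escherichia coli", ["genetics", "pathogenicity"])]))  # True
```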


Subject(s)
Algorithms , Natural Language Processing , Databases, Genetic , MEDLINE , Machine Learning
2.
J Med Internet Res ; 25: e35568, 2023 03 13.
Article in English | MEDLINE | ID: mdl-36722350

ABSTRACT

BACKGROUND: Assessment of the quality of medical evidence available on the web is a critical step in the preparation of systematic reviews. Existing tools that automate parts of this task validate the quality of individual studies but not of entire bodies of evidence and focus on a restricted set of quality criteria. OBJECTIVE: We proposed a quality assessment task that provides an overall quality rating for each body of evidence (BoE), as well as finer-grained justification for different quality criteria according to the Grading of Recommendation, Assessment, Development, and Evaluation formalization framework. For this purpose, we constructed a new data set and developed a machine learning baseline system (EvidenceGRADEr). METHODS: We algorithmically extracted quality-related data from all summaries of findings found in the Cochrane Database of Systematic Reviews. Each BoE was defined by a set of population, intervention, comparison, and outcome criteria and assigned a quality grade (high, moderate, low, or very low) together with quality criteria (justification) that influenced that decision. Different statistical data, metadata about the review, and parts of the review text were extracted as support for grading each BoE. After pruning the resulting data set with various quality checks, we used it to train several neural-model variants. The predictions were compared against the labels originally assigned by the authors of the systematic reviews. RESULTS: Our quality assessment data set, Cochrane Database of Systematic Reviews Quality of Evidence, contains 13,440 instances, or BoEs labeled for quality, originating from 2252 systematic reviews published on the internet from 2002 to 2020. On the basis of a 10-fold cross-validation, the best neural binary classifiers for quality criteria detected risk of bias at 0.78 F1 (P=0.68; R=0.92) and imprecision at 0.75 F1 (P=0.66; R=0.86), while the performance on inconsistency, indirectness, and publication bias criteria was lower (F1 in the range of 0.3-0.4). The prediction of the overall quality grade into 1 of the 4 levels resulted in 0.5 F1. When casting the task as a binary problem by merging the Grading of Recommendation, Assessment, Development, and Evaluation classes (high+moderate vs low+very low quality evidence), we attained 0.74 F1. We also found that the results varied depending on the supporting information provided as input to the models. CONCLUSIONS: Different factors affect the quality of evidence in the context of systematic reviews of medical evidence. Some of these (risk of bias and imprecision) can be automated with reasonable accuracy. Other quality dimensions such as indirectness, inconsistency, and publication bias prove more challenging for machine learning, largely because they are much rarer. This technology could substantially reduce reviewer workload in the future and expedite quality assessment as part of evidence synthesis.
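
The reported per-criterion scores are internally consistent; a quick check of the F1 arithmetic from the stated precision (P) and recall (R):

```python
# F1 is the harmonic mean of precision and recall, checked here against the
# reported risk-of-bias and imprecision figures from the abstract.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.68, 0.92), 2))  # 0.78 (risk of bias)
print(round(f1(0.66, 0.86), 2))  # 0.75 (imprecision)
```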


Subject(s)
Machine Learning , Humans , Systematic Reviews as Topic , Bias
3.
BMC Geriatr ; 22(1): 922, 2022 12 01.
Article in English | MEDLINE | ID: mdl-36451137

ABSTRACT

BACKGROUND: Although the elderly population is generally frail, it is important to closely monitor their health deterioration to improve care and support in residential aged care homes (RACs). Currently, the best identification approach is through time-consuming regular geriatric assessments. This study aimed to develop and validate a retrospective electronic frailty index (reFI) to track the health status of people staying at RACs using daily routine operational data records. METHODS: We have access to patient records from the Royal Freemasons Benevolent Institution RACs (Australia) for residents over the age of 65, spanning 2010 to 2021. The reFI was developed using the cumulative deficit frailty model, whose value was calculated as the ratio of the number of frailty deficits present to the total number of possible frailty indicators (32). Frailty categories were defined using population quartiles. One-, 3- and 5-year mortality were used for validation. Survival analysis was performed using the Kaplan-Meier estimate. Hazard ratios (HRs) were estimated using Cox regression analyses, and the association was assessed using receiver operating characteristic (ROC) curves. RESULTS: Two thousand five hundred eighty-eight residents were assessed, with an average length of stay of 1.2 ± 2.2 years. The RAC cohort was generally frail, with an average reFI of 0.21 ± 0.11. According to the Kaplan-Meier estimate, survival varied significantly across the frailty categories (p < 0.01). The estimated HRs were 1.12 (95% CI 1.09-1.15), 1.11 (95% CI 1.07-1.14) and 1.1 (95% CI 1.04-1.17) at 1, 3 and 5 years, respectively. The ROC analysis of the reFI for the mortality outcome showed an area under the curve (AUC) of ≥0.60 for 1-, 3- and 5-year mortality. CONCLUSION: A novel reFI was developed using routine data recorded at RACs. The reFI can identify changes in the frailty index over time for elderly people, which could potentially help in creating personalised care plans to address their health deterioration.
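
A minimal sketch of the cumulative deficit calculation described above; the quartile cut-points used for the categories are hypothetical, since the study derived them from its own cohort.

```python
# Cumulative deficit model sketch: the reFI is the share of the 32
# indicators on which a resident currently shows a deficit. The quartile
# cut-points below are illustrative placeholders, not the study's values.
def retrospective_efi(deficits_present, total_indicators=32):
    return deficits_present / total_indicators

def frailty_category(refi, quartiles=(0.13, 0.21, 0.29)):  # hypothetical cut-points
    labels = ("least frail", "mildly frail", "moderately frail", "most frail")
    for cutoff, label in zip(quartiles, labels):
        if refi <= cutoff:
            return label
    return labels[-1]

refi = retrospective_efi(7)  # 7 of 32 deficits -> ~0.22
print(round(refi, 2), frailty_category(refi))  # 0.22 moderately frail
```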


Subject(s)
Frailty , Aged , Humans , Retrospective Studies , Frailty/diagnosis , Frailty/epidemiology , Homes for the Aged , Electronics , Kaplan-Meier Estimate
4.
Trends Hear ; 25: 23312165211066174, 2021.
Article in English | MEDLINE | ID: mdl-34903103

ABSTRACT

While cochlear implants have helped hundreds of thousands of individuals, it remains difficult to predict the extent to which an individual's hearing will benefit from implantation. Several publications indicate that machine learning may improve the predictive accuracy of cochlear implant outcomes compared to classical statistical methods. However, existing studies are limited in terms of model validation and of evaluating factors like sample size on predictive performance. We conduct a thorough examination of machine learning approaches to predict word recognition scores (WRS) measured approximately 12 months after implantation in adults with post-lingual hearing loss. This is the largest retrospective study of cochlear implant outcomes to date, evaluating 2,489 cochlear implant recipients from three clinics. We demonstrate that while machine learning models significantly outperform linear models in the prediction of WRS, their overall accuracy remains limited (mean absolute error: 17.9-21.8). The models are robust across clinical cohorts, with predictive error increasing by at most 16% when evaluated on a clinic excluded from the training set. We show that predictive performance is unlikely to improve by increasing sample size alone, with a doubling of the sample size estimated to increase performance by only 3% on the combined dataset. Finally, we demonstrate how the current models could support clinical decision making, highlighting that subsets of individuals can be identified who have a 94% chance of improving WRS by at least 10 percentage points after implantation, which is likely to be clinically meaningful. We discuss several implications of this analysis, focusing on the need to improve and standardize data collection.
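
The abstract does not spell out how the doubling estimate was obtained; one standard approach consistent with such a claim is to fit a power-law learning curve to errors at increasing training sizes and extrapolate, as in this sketch with fabricated numbers.

```python
# Learning-curve extrapolation sketch: fit error ~ a * n^(-b) to errors
# observed at growing training sizes, then predict the error after doubling
# n. All numbers are fabricated for illustration, not the study's data.
import numpy as np

n = np.array([300, 600, 1200, 2400])       # training-set sizes (illustrative)
mae = np.array([20.0, 19.4, 18.8, 18.2])   # mean absolute WRS error (illustrative)

b, log_a = np.polyfit(np.log(n), np.log(mae), 1)  # linear fit in log-log space
predict = lambda size: np.exp(log_a) * size ** b

gain = 1 - predict(4800) / predict(2400)
print(f"estimated relative improvement from doubling n: {gain:.1%}")  # ~3%
```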


Subject(s)
Cochlear Implantation , Cochlear Implants , Deafness , Hearing Aids , Speech Perception , Adult , Cochlear Implantation/methods , Deafness/diagnosis , Humans , Retrospective Studies , Treatment Outcome
5.
Trends Hear ; 25: 23312165211037525, 2021.
Article in English | MEDLINE | ID: mdl-34524944

ABSTRACT

While the majority of cochlear implant recipients benefit from the device, it remains difficult to estimate the degree of benefit for a specific patient prior to implantation. Using data from 2,735 cochlear-implant recipients across three clinics, the largest retrospective study of cochlear-implant outcomes to date, we investigate the association between 21 preoperative factors and speech recognition approximately one year after implantation, and we explore the consistency of their effects across the three constituent datasets. We provide evidence of 17 statistically significant associations, in either univariate or multivariate analysis, including confirmation of associations for several predictive factors that had previously been examined only in smaller studies. Despite the large sample size, a multivariate analysis shows that the variance explained by our models remains modest across the datasets (R2=0.12-0.21). Finally, we report a novel statistical interaction indicating that the duration of deafness in the implanted ear has a stronger impact on hearing outcome when considered relative to a candidate's age. Our multicenter study highlights several real-world complexities that affect the clinical translation of predictive factors for cochlear implantation outcomes. We suggest several directions to overcome these challenges and further improve our ability to model patient outcomes with increased accuracy.
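
A hedged sketch of how such an interaction could be tested with formula-style OLS; the column names and simulated data are hypothetical, not the study's.

```python
# Interaction-term sketch: does duration of deafness matter more when
# expressed relative to age? Data are simulated so that the *relative*
# duration drives the outcome; the fitted duration_deafness:age coefficient
# then captures the interaction.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.uniform(40, 85, n),
    "duration_deafness": rng.uniform(0, 40, n),
})
rel = df["duration_deafness"] / df["age"]
df["speech_score"] = 70 - 40 * rel + rng.normal(0, 8, n)

model = smf.ols("speech_score ~ duration_deafness * age", data=df).fit()
print(model.params)  # duration_deafness:age is the interaction term
```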


Subject(s)
Cochlear Implantation , Cochlear Implants , Deafness , Speech Perception , Adult , Deafness/diagnosis , Deafness/surgery , Hearing , Humans , Retrospective Studies , Treatment Outcome
6.
Bioinformatics ; 37(8): 1156-1163, 2021 05 23.
Article in English | MEDLINE | ID: mdl-33107905

ABSTRACT

MOTIVATION: Structured semantic resources, for example biological knowledge bases and ontologies, formally define biological concepts, entities and their semantic relationships, manifested as structured axioms and unstructured texts (e.g. textual definitions). These resources contain accurate expressions of biological reality and have been used by machine-learning models to assist intelligent applications such as knowledge discovery. Current methods use both the axioms and the definitions as plain texts in representation learning (RL). However, since axioms are machine-readable while natural language is human-understandable, differences in token meaning and structure impede the representations from encoding the desired biological knowledge. RESULTS: We propose ERBK, an RL model of bio-entities. Instead of using the axioms and definitions as a textual corpus, our method uses a knowledge graph embedding method and deep convolutional neural models to encode the axioms and definitions, respectively. The resulting representations not only encode more of the underlying biological knowledge but can also be applied in zero-shot circumstances where existing approaches fall short. Experimental evaluations show that ERBK outperforms existing methods for predicting protein-protein interactions and gene-disease associations, and that it maintains promising performance under the zero-shot circumstance. We believe the representations and the method have a degree of generality and could extend to other types of bio-relations. AVAILABILITY AND IMPLEMENTATION: The source code is available at the gitlab repository https://gitlab.com/BioAI/erbk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
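
ERBK's exact architecture is described in the paper; as a flavour of the knowledge-graph-embedding family it builds on, here is a TransE-style translation score for an axiom triple (an illustrative family member, not necessarily the embedding method ERBK uses).

```python
# TransE-style scoring sketch for axiom triples (head, relation, tail):
# embeddings are learned so that head + relation ~ tail for true triples,
# so a lower distance means a more plausible triple.
import numpy as np

rng = np.random.default_rng(42)
dim = 50
embed = lambda: rng.normal(size=dim)

h, r, t = embed(), embed(), embed()  # head entity, relation, tail entity

def transe_score(h, r, t, norm=1):
    """Lower score = more plausible triple (L1 or L2 distance)."""
    return np.linalg.norm(h + r - t, ord=norm)

print(transe_score(h, r, t))
```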


Subject(s)
Knowledge Bases , Machine Learning , Humans , Language , Semantics , Software
7.
Drug Saf ; 43(9): 893-903, 2020 09.
Article in English | MEDLINE | ID: mdl-32385840

ABSTRACT

INTRODUCTION: Adverse drug reactions (ADRs) are unintended reactions caused by a drug or combination of drugs taken by a patient. The current safety surveillance system relies on spontaneous reporting systems (SRSs) and, more recently, on observational health data; however, ADR detection may be delayed and lack geographic diversity. The broad scope of social media conversations, such as those on Twitter, can include health-related topics, and consequently these data could be used to detect potentially novel ADRs with less latency. Although research on ADR detection using social media has made progress, findings are based on single information sources, and no study has yet integrated drug safety evidence from both an SRS and Twitter. OBJECTIVE: The aim of this study was to combine signals from an SRS and Twitter to facilitate the detection of safety signals and to compare the performance of the combined system with signals generated by the individual data sources. METHODS: We extracted potential drug-ADR posts from Twitter, used Monte Carlo expectation maximization to generate drug safety signals from both the US FDA Adverse Event Reporting System (FAERS) and the posts from Twitter, and then integrated these signals using a Bayesian hierarchical model. The results from the integrated system and the two individual sources were evaluated using a reference standard derived from drug labels. The area under the receiver operating characteristic curve (AUC) was computed to measure performance. RESULTS: We observed a significant improvement in the AUC of the combined system when comparing it with Twitter alone, and no improvement when comparing it with the SRS alone. The AUCs ranged from 0.587 to 0.637 for the combined SRS and Twitter system, from 0.525 to 0.534 for Twitter alone, and from 0.612 to 0.642 for the SRS alone. The results varied with the different preprocessing procedures applied to the Twitter data. CONCLUSION: The accuracy of signal detection using social media can be improved by combining its signals with those from SRSs. However, the combined system did not achieve better AUC performance than FAERS alone, which may indicate that Twitter data are not yet ready to be integrated into a purely data-driven combination system.
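
The paper's integration uses a full Bayesian hierarchical model; as a deliberately crude illustration of why combining sources can help, a precision-weighted average down-weights the noisier source.

```python
# Crude signal-combination sketch: an inverse-variance (precision) weighted
# average of per-drug-ADR signal scores from two sources. This only conveys
# the idea that a noisier source (here, Twitter) should contribute less;
# it is not the paper's Bayesian hierarchical model.
def combine(signal_srs, var_srs, signal_twitter, var_twitter):
    w_srs, w_tw = 1 / var_srs, 1 / var_twitter
    return (w_srs * signal_srs + w_tw * signal_twitter) / (w_srs + w_tw)

# Example: a strong FAERS signal and a weak, noisy Twitter signal.
print(combine(signal_srs=2.1, var_srs=0.2, signal_twitter=0.4, var_twitter=1.5))  # ~1.9
```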


Subject(s)
Adverse Drug Reaction Reporting Systems , Pharmacovigilance , Social Media , United States Food and Drug Administration , Humans , United States
8.
JAMA Netw Open ; 3(3): e200265, 2020 03 02.
Article in English | MEDLINE | ID: mdl-32119094

ABSTRACT

Importance: Mammography screening currently relies on subjective human interpretation. Artificial intelligence (AI) advances could be used to increase mammography screening accuracy by reducing missed cancers and false positives. Objective: To evaluate whether AI can overcome human mammography interpretation limitations with a rigorous, unbiased evaluation of machine learning algorithms. Design, Setting, and Participants: In this diagnostic accuracy study conducted between September 2016 and November 2017, an international, crowdsourced challenge was hosted to foster AI algorithm development focused on interpreting screening mammography. More than 1100 participants comprising 126 teams from 44 countries participated. Analysis began November 18, 2016. Main Outcomes and Measures: Algorithms used images alone (challenge 1) or combined images, previous examinations (if available), and clinical and demographic risk factor data (challenge 2) and output a score that translated to a cancer (yes/no) prediction within 12 months of screening. Algorithm accuracy for breast cancer detection was evaluated using the area under the curve, and algorithm specificity was compared with radiologists' specificity, with radiologists' sensitivity set at 85.9% (United States) and 83.9% (Sweden). An ensemble method aggregating top-performing AI algorithms and radiologists' recall assessment was developed and evaluated. Results: Overall, 144 231 screening mammograms from 85 580 US women (952 cancer positive ≤12 months from screening) were used for algorithm training and validation. A second independent validation cohort included 166 578 examinations from 68 008 Swedish women (780 cancer positive). The top-performing algorithm achieved an area under the curve of 0.858 (United States) and 0.903 (Sweden), and 66.2% (United States) and 81.2% (Sweden) specificity at the radiologists' sensitivity, lower than community-practice radiologists' specificity of 90.5% (United States) and 98.5% (Sweden). Combining the top-performing algorithms with US radiologist assessments resulted in a higher area under the curve of 0.942 and achieved a significantly improved specificity (92.0%) at the same sensitivity. Conclusions and Relevance: While no single AI algorithm outperformed radiologists, an ensemble of AI algorithms combined with radiologist assessment in a single-reader screening environment improved overall accuracy. This study underscores the potential of using machine learning methods to enhance mammography screening interpretation.
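
A sketch of the operating-point evaluation used here: choose the score threshold that matches the radiologists' sensitivity, then read off the algorithm's specificity at that threshold (synthetic data).

```python
# Specificity-at-fixed-sensitivity sketch: lower the threshold over the
# positive cases' scores until the target sensitivity (e.g. 85.9%) is met,
# then compute specificity on the negatives. Data below are synthetic.
import numpy as np

rng = np.random.default_rng(1)
y = np.concatenate([np.ones(100), np.zeros(900)])  # 100 cancers, 900 normals
scores = np.concatenate([rng.normal(2, 1, 100), rng.normal(0, 1, 900)])

def specificity_at_sensitivity(y, scores, target_sens=0.859):
    thresholds = np.sort(scores[y == 1])[::-1]  # descend until enough positives are caught
    for thr in thresholds:
        if (scores[y == 1] >= thr).mean() >= target_sens:
            return (scores[y == 0] < thr).mean(), thr
    return 0.0, thresholds[-1]

spec, thr = specificity_at_sensitivity(y, scores)
print(f"specificity {spec:.3f} at threshold {thr:.2f}")
```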


Subject(s)
Breast Neoplasms/diagnostic imaging , Deep Learning , Image Interpretation, Computer-Assisted/methods , Mammography/methods , Radiologists , Adult , Aged , Algorithms , Artificial Intelligence , Early Detection of Cancer , Female , Humans , Middle Aged , Radiology , Sensitivity and Specificity , Sweden , United States
9.
AMIA Annu Symp Proc ; 2020: 1325-1334, 2020.
Article in English | MEDLINE | ID: mdl-33936509

ABSTRACT

Recent research in predicting protein secondary structure populations (SSP) from Nuclear Magnetic Resonance (NMR) chemical shifts has helped quantitatively characterise the structural conformational properties of intrinsically disordered proteins and regions (IDP/IDR). Unlike protein secondary structure (SS) prediction, SSP prediction assumes a dynamic assignment of secondary structures that appears to correlate with disordered states. In this study, we designed single-task deep learning frameworks to predict IDP/IDR and SSP separately, and multitask deep learning frameworks that allow quantitative predictions of IDP/IDR evidenced by the simultaneously predicted SSP. According to independent test results, the single-task deep learning models improve on the prediction performance of shallow models for both SSP and IDP/IDR. IDP/IDR prediction improved further when SSP was simultaneously predicted in the multitask models. Using p53 as a use case, we demonstrate how the predicted SSP is used to explain the IDP/IDR predictions for each functional region.
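
A minimal multitask sketch of the shared-encoder idea; the dimensions and features are illustrative, not the paper's architecture.

```python
# Multitask sketch: a shared encoder with two heads, one for per-residue
# disorder (IDP/IDR, binary logit) and one for secondary structure
# populations (SSP, fractions over helix/sheet/coil).
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, n_features=64, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.disorder_head = nn.Linear(hidden, 1)  # logit: disordered or not
        self.ssp_head = nn.Linear(hidden, 3)       # helix/sheet/coil populations

    def forward(self, x):
        z = self.shared(x)
        return self.disorder_head(z), torch.softmax(self.ssp_head(z), dim=-1)

net = MultiTaskNet()
x = torch.randn(8, 64)                  # 8 residues, 64 features each (illustrative)
disorder_logit, ssp = net(x)
print(disorder_logit.shape, ssp.shape)  # torch.Size([8, 1]) torch.Size([8, 3])
```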


Subject(s)
Deep Learning , Intrinsically Disordered Proteins/chemistry , Protein Structure, Secondary
10.
Bioinformatics ; 36(2): 611-620, 2020 01 15.
Article in English | MEDLINE | ID: mdl-31350561

ABSTRACT

MOTIVATION: A biochemical reaction, or bio-event, depicts the relationships between participating entities. Current text mining research has focused on identifying bio-events in scientific literature. However, few efforts have been dedicated to normalizing bio-events extracted from scientific literature against entries in curated reaction databases, which could disambiguate the events and further support interconnecting them into biologically meaningful and complete networks. RESULTS: In this paper, we propose BioNorm, a novel method for normalizing bio-events extracted from scientific literature to entries in a bio-molecular reaction database such as IntAct. BioNorm treats event normalization as a paraphrase identification problem. It represents an entry as a natural language statement by combining multiple types of information contained in it. Then, it predicts the semantic similarity between this natural language statement and statements mentioning events in scientific literature using a long short-term memory recurrent neural network (LSTM). An event is normalized to an entry if the two statements are paraphrases. To the best of our knowledge, this is the first attempt at event normalization in biomedical text mining. Experiments were conducted using the molecular interaction data from IntAct. The results demonstrate that the method achieves an F-score of 0.87 in normalizing event-containing statements. AVAILABILITY AND IMPLEMENTATION: The source code is available at the gitlab repository https://gitlab.com/BioAI/leen and BioASQvec Plus is available on figshare https://figshare.com/s/45896c31d10c3f6d857a.
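
A sketch of the paraphrase-identification framing with a shared (Siamese) LSTM encoder; hyperparameters and tokenisation are placeholders, not BioNorm's configuration.

```python
# Siamese LSTM sketch: encode the database entry's statement and the
# literature statement with a shared LSTM and compare the final hidden
# states; a high similarity suggests the pair are paraphrases.
import torch
import torch.nn as nn

class SiameseLSTM(nn.Module):
    def __init__(self, vocab=5000, dim=100, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)

    def encode(self, token_ids):
        _, (h, _) = self.lstm(self.emb(token_ids))
        return h[-1]                                  # final hidden state

    def forward(self, entry_ids, mention_ids):
        a, b = self.encode(entry_ids), self.encode(mention_ids)
        return torch.cosine_similarity(a, b, dim=-1)  # high = likely paraphrases

model = SiameseLSTM()
entry = torch.randint(0, 5000, (1, 30))    # tokenised database-entry statement
mention = torch.randint(0, 5000, (1, 30))  # tokenised literature statement
print(model(entry, mention))
```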


Subject(s)
Data Mining , Deep Learning , Databases, Genetic , Neural Networks, Computer , Software
11.
Clin Colorectal Cancer ; 17(3): e569-e577, 2018 09.
Article in English | MEDLINE | ID: mdl-29980491

ABSTRACT

BACKGROUND: Multiple studies have defined the prognostic and potential predictive significance of the primary tumor side in metastatic colorectal cancer (CRC). However, the currently available data for early-stage disease are limited and inconsistent. MATERIALS AND METHODS: We explored the clinicopathologic, treatment, and outcome data from a multisite Australian CRC registry from 2003 to 2016. Tumors at and distal to the splenic flexure were considered a left primary (LP). RESULTS: For the 6547 patients identified, the median age at diagnosis was 69 years, 55% were men, and most (63%) had a LP. Comparing the outcomes for right primary (RP) versus LP, time-to-recurrence was similar for stage I and III disease, but longer for those with a stage II RP (hazard ratio [HR], 0.68; 95% confidence interval [CI], 0.52-0.90; P < .01). Adjuvant chemotherapy provided a consistent benefit in stage III disease, regardless of the tumor side. Overall survival (OS) was similar for those with stage I and II disease between LP and RP patients; however, those with stage III RP disease had poorer OS (HR, 1.30; 95% CI, 1.04-1.62; P < .05) and cancer-specific survival (HR, 1.55; 95% CI, 1.19-2.03; P < .01). Patients with stage IV RP, whether de novo metastatic (HR, 1.15; 95% CI, 0.95-1.39) or relapsed post-early-stage disease (HR, 1.35; 95% CI, 1.11-1.65; P < .01), had poorer OS. CONCLUSION: In early-stage CRC, the association of tumor side and effect on the time-to-recurrence and OS varies by stage. In stage III patients with an RP, poorer OS and cancer-specific survival outcomes are, in part, driven by inferior survival after recurrence, and tumor side did not influence adjuvant chemotherapy benefit.


Subject(s)
Antineoplastic Agents/therapeutic use , Colorectal Neoplasms/pathology , Neoplasm Recurrence, Local/epidemiology , Registries/statistics & numerical data , Aged , Australia/epidemiology , Chemotherapy, Adjuvant/methods , Colorectal Neoplasms/mortality , Colorectal Neoplasms/therapy , Disease-Free Survival , Female , Humans , Male , Middle Aged , Neoplasm Recurrence, Local/pathology , Neoplasm Staging , Prevalence , Prognosis , Proportional Hazards Models , Prospective Studies , Survival Analysis
12.
J Biomed Inform ; 73: 137-147, 2017 09.
Article in English | MEDLINE | ID: mdl-28797709

ABSTRACT

Word sense disambiguation identifies the proper sense of ambiguous words in text. With large terminologies such as the UMLS Metathesaurus, many ambiguities appear and highly effective disambiguation methods are required. Supervised machine learning is one approach to performing disambiguation: features extracted from the context of an ambiguous word are used to identify its proper sense. The type of features used has an impact on machine learning methods and thus affects disambiguation performance. In this work, we have evaluated several types of features derived from the context of the ambiguous word, and we have also explored more global features derived from MEDLINE using word embeddings. Results show that word embeddings improve the performance of more traditional features and also allow the use of recurrent neural network classifiers based on Long Short-Term Memory (LSTM) nodes. The combination of unigrams and word embeddings with an SVM sets a new state-of-the-art performance, with a macro accuracy of 95.97 on the MSH WSD data set.
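
A sketch of the winning feature combination: sparse unigram counts concatenated with an averaged context embedding, fed to a linear SVM. The random embedding lookup stands in for vectors trained on MEDLINE, and the toy data are not the MSH WSD set.

```python
# Unigrams + averaged word embeddings -> linear SVM, one classifier per
# ambiguous term. Embeddings and training examples are toy placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

contexts = ["cold virus infection symptoms", "cold weather exposure injury"]
senses = ["virus", "temperature"]

rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50) for w in set(" ".join(contexts).split())}  # placeholder vectors

def avg_embedding(text, dim=50):
    vecs = [embeddings[w] for w in text.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

unigrams = CountVectorizer().fit_transform(contexts).toarray()
X = np.hstack([unigrams, np.array([avg_embedding(c) for c in contexts])])
clf = LinearSVC().fit(X, senses)
print(clf.predict(X[:1]))  # ['virus']
```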


Subject(s)
Natural Language Processing , Neural Networks, Computer , Unified Medical Language System , Algorithms , MEDLINE , Memory, Short-Term
13.
Article in English | MEDLINE | ID: mdl-29399672

ABSTRACT

Biomedical word sense disambiguation (WSD) is an important intermediate task in many natural language processing applications such as named entity recognition, syntactic parsing, and relation extraction. In this paper, we employ knowledge-based approaches that also exploit recent advances in neural word/concept embeddings to improve over the state of the art in biomedical WSD, using the public MSH WSD dataset [1] as the test set. Our methods involve weak supervision: we do not use any hand-labeled examples for WSD to build our prediction models; however, we employ an existing concept mapping program, MetaMap, to obtain our concept vectors. Over the MSH WSD dataset, our linear-time method (in terms of the number of senses and words in the test instance) achieves an accuracy of 92.24%, a 3% improvement over the best known results [2] obtained via unsupervised means. A more expensive approach that we developed relies on a nearest neighbor framework and achieves an accuracy of 94.34%, essentially cutting the error rate in half. Employing dense vector representations learned from unlabeled free text has recently been shown to benefit many language processing tasks, and our efforts show that biomedical WSD is no exception to this trend. For a complex and rapidly evolving domain such as biomedicine, building labeled datasets for larger sets of ambiguous terms may be impractical. Here, we show that weak supervision that leverages recent advances in representation learning can rival supervised approaches in biomedical WSD. However, external knowledge bases (here, sense inventories) play a key role in the improvements achieved.
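
A sketch of the linear-time method's core step: average the context word vectors and pick the candidate concept with the highest cosine similarity. Random vectors stand in for embeddings learned from unlabeled text and MetaMap-derived concept vectors.

```python
# Knowledge-based WSD sketch: one pass over the context words (linear time),
# then a similarity comparison against each candidate sense's concept vector.
import numpy as np

rng = np.random.default_rng(3)
dim = 50
word_vec = {w: rng.normal(size=dim) for w in "patient fever viral infection weather".split()}
concept_vec = {"C0009264:cold_temperature": rng.normal(size=dim),   # placeholder vectors
               "C0009443:common_cold": rng.normal(size=dim)}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def disambiguate(context_words, candidate_concepts):
    ctx = np.mean([word_vec[w] for w in context_words if w in word_vec], axis=0)
    return max(candidate_concepts, key=lambda c: cosine(ctx, concept_vec[c]))

print(disambiguate(["patient", "fever", "viral", "infection"], list(concept_vec)))
```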

14.
Stud Health Technol Inform ; 216: 643-7, 2015.
Article in English | MEDLINE | ID: mdl-26262130

ABSTRACT

Social media sites, such as Twitter, are a rich source of many kinds of information, including health-related information. Accurate detection of entities such as diseases, drugs, and symptoms could be used for biosurveillance (e.g. monitoring of flu) and identification of adverse drug events. However, a critical assessment of the performance of current text mining technology on Twitter has not yet been carried out in the medical domain. Here, we describe the development of a Twitter data set annotated with relevant medical entities, which we have publicly released. The manual annotation results show that it is possible to perform high-quality annotation despite the complexity of medical terminology and the lack of context in a tweet. Furthermore, we have evaluated the capability of state-of-the-art approaches to reproduce the annotations in the data set. The best methods achieve F-scores of 55-66%. The data analysis and the preliminary results provide valuable insights into identifying medical entities in Twitter for various applications.


Subject(s)
Data Mining/methods , Disease/classification , Pharmaceutical Preparations/classification , Social Media/classification , Symptom Assessment/classification , Natural Language Processing , Population Surveillance/methods , Terminology as Topic , Vocabulary, Controlled
15.
BMC Bioinformatics ; 16: 113, 2015 Apr 08.
Article in English | MEDLINE | ID: mdl-25887792

ABSTRACT

BACKGROUND: Research in biomedical text categorization has mostly used the bag-of-words representation. Other, more sophisticated representations of text based on syntactic, semantic and argumentative properties have been less studied. In this paper, we evaluate the impact of different representations of biomedical texts as features for reproducing the MeSH annotations of some of the most frequent MeSH headings. In addition to unigrams and bigrams, these features include noun phrases, citation meta-data, citation structure, and semantic annotation of the citations. RESULTS: Traditional features like unigrams and bigrams exhibit strong performance compared to the other feature sets. Little or no improvement is obtained when using meta-data or citation structure. Noun phrases are too sparse and thus perform worse than the more traditional features. Conceptual annotation of the texts by MetaMap shows performance similar to unigrams, but adding concepts from the UMLS taxonomy does not improve on using only the mapped concepts. The combination of all the features performs considerably better than any individual feature set, and this combination improves the performance of a state-of-the-art MeSH indexer. Concerning the machine learning algorithms, we find that those that are more resilient to class imbalance obtain considerably better performance. CONCLUSIONS: We conclude that even though traditional features such as unigrams and bigrams perform strongly, it is possible to combine them with other features to effectively improve on the bag-of-words representation. We have also found that the combination of the learning algorithm and the feature sets influences the overall performance of the system; in particular, using learning algorithms resilient to class imbalance considerably improves performance, as illustrated in the sketch after this paragraph. However, when using a large set of features, care needs to be taken with the choice of algorithm due to the risk of over-fitting. Specific combinations of learning algorithms and features for individual MeSH headings could further increase the performance of an indexing system.
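
A sketch of the class-imbalance point on synthetic data: weighting classes inversely to their frequency, as in scikit-learn's class_weight="balanced", typically recovers recall on an infrequent heading.

```python
# Class-imbalance sketch: an infrequent MeSH heading (~5% positives) with a
# weak signal. The balanced classifier predicts far more positives than the
# unweighted one, trading precision for recall. Data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 20))
y = (rng.random(1000) < 0.05).astype(int)  # ~5% positive class
X[y == 1] += 0.7                           # weak signal for the positive class

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)
print("positives predicted:", plain.predict(X).sum(), "vs", balanced.predict(X).sum())
```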


Subject(s)
Abstracting and Indexing/methods , Algorithms , Information Storage and Retrieval , MEDLINE , Medical Subject Headings , Artificial Intelligence , Humans , Semantics
16.
J Biomed Inform ; 53: 300-7, 2015 Feb.
Article in English | MEDLINE | ID: mdl-25510606

ABSTRACT

Text mining of the scientific literature has been essential for setting up large public biomedical databases, which are widely used by the research community. In the biomedical domain, the existence of a large number of terminological resources and knowledge bases (KB) has enabled a myriad of machine learning methods for different text mining tasks. Unfortunately, KBs were devised for human interpretation rather than for text mining, so the performance of KB-based methods is usually lower than that of supervised machine learning methods. The disadvantage of supervised methods, though, is that they require labeled training data, making them unsuitable for large-scale biomedical text mining systems; KB-based methods do not have this limitation. In this paper, we describe a novel method to generate word-concept probabilities from a KB, which can serve as a basis for several text mining tasks. This method takes into account not only the underlying patterns within the descriptions contained in the KB but also those in texts available from large unlabeled corpora such as MEDLINE. The parameters of the model have been estimated without training data. Patterns from MEDLINE have been built using MetaMap for entity recognition and related using co-occurrences. The word-concept probabilities were evaluated on the task of word sense disambiguation (WSD), where our method obtained a higher degree of accuracy than other state-of-the-art approaches on the MSH WSD data set. We also evaluated our method on the task of document ranking using MEDLINE citations; these results also showed an increase in performance over existing baseline retrieval approaches.
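
A toy sketch of word-concept probabilities estimated from co-occurrence counts, the kind of statistic the method builds from MetaMap-annotated MEDLINE; the numbers are toy data, not the paper's model.

```python
# Word-concept probability sketch: P(concept | word) estimated as the share
# of a word's co-occurrences attributed to each concept. Counts are toy data.
cooccurrence = {  # word -> concept -> count
    "cold": {"C0009443:common_cold": 80, "C0009264:cold_temperature": 20},
    "fever": {"C0009443:common_cold": 45, "C0009264:cold_temperature": 2},
}

def word_concept_prob(word):
    counts = cooccurrence.get(word, {})
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}

print(word_concept_prob("cold"))  # {'...common_cold': 0.8, '...cold_temperature': 0.2}
```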


Subject(s)
Computational Biology/methods , Data Mining/methods , Semantics , Unified Medical Language System , Algorithms , Artificial Intelligence , Knowledge Bases , MEDLINE , Models, Statistical , Natural Language Processing , Probability
17.
PeerJ ; 2: e639, 2014.
Article in English | MEDLINE | ID: mdl-25374782

ABSTRACT

We present a method to assist in interpreting the functional impact of intergenic disease-associated SNPs that is not limited to search strategies proximal to the SNP. The method builds on two sources of external knowledge: the growing understanding of three-dimensional spatial relationships in the genome, and the substantial repository of information about relationships among genetic variants, genes, and diseases captured in the published biomedical literature. We integrate chromatin conformation capture data (Hi-C) with literature support to rank putative target genes of intergenic disease-associated SNPs. We demonstrate that this hybrid method outperforms a genomic distance baseline on a small test set of expression quantitative trait loci, as well as either method individually. In addition, we show the potential for this method to uncover relationships between intergenic SNPs and target genes across chromosomes. With more extensive chromatin conformation capture data becoming readily available, this method provides a way forward towards functional interpretation of SNPs in the context of the three-dimensional structure of the genome in the nucleus.
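
A sketch of the hybrid ranking idea: combine a normalised Hi-C contact signal with normalised literature support for each candidate gene. The equal weighting and the numbers are assumptions for illustration, not the paper's scoring function.

```python
# Hybrid ranking sketch: score each candidate target gene of an intergenic
# SNP as a weighted sum of normalised Hi-C contact frequency and normalised
# literature co-mention count. Weights and counts are illustrative.
def rank_targets(candidates, w_hic=0.5, w_lit=0.5):
    max_hic = max(c["hic"] for c in candidates) or 1
    max_lit = max(c["lit"] for c in candidates) or 1
    for c in candidates:
        c["score"] = w_hic * c["hic"] / max_hic + w_lit * c["lit"] / max_lit
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

genes = [{"gene": "GENE_A", "hic": 120, "lit": 3},   # hypothetical candidates
         {"gene": "GENE_B", "hic": 40, "lit": 15},
         {"gene": "GENE_C", "hic": 95, "lit": 0}]
for g in rank_targets(genes):
    print(g["gene"], round(g["score"], 2))  # GENE_B 0.67, GENE_A 0.6, GENE_C 0.4
```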

18.
F1000Res ; 3: 18, 2014.
Article in English | MEDLINE | ID: mdl-25285203

ABSTRACT

As the cost of genomic sequencing continues to fall, the amount of data being collected and studied for the purpose of understanding the genetic basis of disease is increasing dramatically. Much of the source information relevant to such efforts is available only from unstructured sources such as the scientific literature, and significant resources are expended in manually curating and structuring the information in the literature. As such, there have been a number of systems developed to target automatic extraction of mutations and other genetic variation from the literature using text mining tools. We have performed a broad survey of the existing publicly available tools for extraction of genetic variants from the scientific literature. We consider not just one tool but a number of different tools, individually and in combination, and apply the tools in two scenarios. First, they are compared in an intrinsic evaluation context, where the tools are tested for their ability to identify specific mentions of genetic variants in a corpus of manually annotated papers, the Variome corpus. Second, they are compared in an extrinsic evaluation context based on our previous study of text mining support for curation of the COSMIC and InSiGHT databases. Our results demonstrate that no single tool covers the full range of genetic variants mentioned in the literature. Rather, several tools have complementary coverage and can be used together effectively. In the intrinsic evaluation on the Variome corpus, the combined performance is above 0.93 in F-measure, while in the extrinsic evaluation the combined recall performance is above 0.71 for COSMIC and above 0.62 for InSiGHT, a substantial improvement over the performance of any individual tool. Based on the analysis of these results, we suggest several directions for the improvement of text mining tools for genetic variant extraction from the literature.

19.
Database (Oxford) ; 2014: bau003, 2014.
Article in English | MEDLINE | ID: mdl-24520105

ABSTRACT

A major focus of modern biological research is the understanding of how genomic variation relates to disease. Although there are significant ongoing efforts to capture this understanding in curated resources, much of the information remains locked in unstructured sources, in particular, the scientific literature. Thus, there have been several text mining systems developed to target extraction of mutations and other genetic variation from the literature. We have performed the first study of the use of text mining for the recovery of genetic variants curated directly from the literature. We consider two curated databases, COSMIC (Catalogue Of Somatic Mutations In Cancer) and InSiGHT (International Society for Gastro-intestinal Hereditary Tumours), that contain explicit links to the source literature for each included mutation. Our analysis shows that the recall of the mutations catalogued in the databases using a text mining tool is very low, despite the well-established good performance of the tool and even when the full text of the associated article is available for processing. We demonstrate that this discrepancy can be explained by considering the supplementary material linked to the published articles, not previously considered by text mining tools. Although it is anecdotally known that supplementary material contains 'all of the information', and some researchers have speculated about the role of supplementary material (Schenck et al. Extraction of genetic mutations associated with cancer from public literature. J Health Med Inform 2012;S2:2.), our analysis substantiates the significant extent to which this material is critical. Our results highlight the need for literature mining tools to consider not only the narrative content of a publication but also the full set of material related to a publication.


Subject(s)
Data Mining/methods , Genetic Variation , Publications , Genome, Human/genetics , Humans , Medical Subject Headings , Mutation/genetics , Software , Statistics as Topic
20.
J Biomed Semantics ; 4(1): 28, 2013 Oct 11.
Article in English | MEDLINE | ID: mdl-24112383

ABSTRACT

MOTIVATION: The identification of protein and gene names (PGNs) from the scientific literature requires semantic resources: terminological and lexical resources deliver the term candidates to PGN tagging solutions, and gold standard corpora (GSC) train them to identify term parameters and contextual features. Ideally, all three resources, i.e. corpora, lexica and taggers, cover the same domain knowledge, and thus support identification of the same types of PGNs and cover all of them. Unfortunately, none of the three serves as a predominant standard, and for this reason it is worth exploring how well these three resources align with each other. We systematically compare different PGN taggers against publicly available corpora and analyze the impact of the included lexical resource on their performance. In particular, we determine the performance gains from false positive filtering, which contributes to the disambiguation of identified PGNs. RESULTS: In general, the machine learning approaches (ML-Tag) for PGN tagging show higher F1-measure performance on the BioCreative-II and Jnlpba GSCs (exact matching), whereas the lexicon-based approaches (LexTag) in combination with disambiguation methods show better results on FsuPrge and PennBio. The ML-Tag solutions balance precision and recall, whereas the LexTag solutions have different precision and recall profiles at the same F1-measure across all corpora. Higher recall is achieved with larger lexical resources, which also introduce more noise (false positive results). The ML-Tag solutions certainly perform best if the test corpus is from the same GSC as the training corpus. As expected, the false negative errors characterize the test corpora, while the profiles of the false positive mistakes characterize the tagging solutions. LexTag solutions based on a large terminological resource in combination with false positive filtering produce better results and, in addition, provide concept identifiers from a knowledge source, in contrast to ML-Tag solutions. CONCLUSION: The standard ML-Tag solutions achieve high performance, but not across all corpora, and thus should be trained using several different corpora to reduce possible biases. The LexTag solutions have different profiles for precision and recall but similar F1-measure. This result is surprising and suggests that they cover a portion of the most common naming standards but cope differently with term variability across the corpora. The false positive filtering applied to LexTag solutions does improve the results, increasing precision without significantly compromising recall. The harmonisation of annotation schemes, in combination with standardized lexical resources in the tagging solutions, will enable their comparability and pave the way for a shared standard.
