1.
J Biomed Inform ; 153: 104640, 2024 May.
Article in English | MEDLINE | ID: mdl-38608915

ABSTRACT

Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating the arduous task. However, developing accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence.


Subject(s)
Artificial Intelligence , Evidence-Based Medicine , Humans , Trust , Natural Language Processing
2.
J Am Med Inform Assoc ; 31(4): 1009-1024, 2024 Apr 03.
Article in English | MEDLINE | ID: mdl-38366879

ABSTRACT

OBJECTIVES: Question answering (QA) systems have the potential to improve the quality of clinical care by providing health professionals with the latest and most relevant evidence. However, QA systems have not been widely adopted. This systematic review aims to characterize current medical QA systems, assess their suitability for healthcare, and identify areas of improvement. MATERIALS AND METHODS: We searched PubMed, IEEE Xplore, ACM Digital Library, ACL Anthology, and forward and backward citations on February 7, 2023. We included peer-reviewed journal and conference papers describing the design and evaluation of biomedical QA systems. Two reviewers screened titles, abstracts, and full-text articles. We conducted a narrative synthesis and risk of bias assessment for each study. We assessed the utility of biomedical QA systems. RESULTS: We included 79 studies and identified themes, including question realism, answer reliability, answer utility, clinical specialism, systems, usability, and evaluation methods. Clinicians' questions used to train and evaluate QA systems were restricted to certain sources, types and complexity levels. No system communicated confidence levels in the answers or sources. Many studies suffered from high risks of bias and applicability concerns. Only 8 studies completely satisfied any criterion for clinical utility, and only 7 reported user evaluations. Most systems were built with limited input from clinicians. DISCUSSION: While machine learning methods have led to increased accuracy, most studies imperfectly reflected real-world healthcare information needs. Key research priorities include developing more realistic healthcare QA datasets and considering the reliability of answer sources, rather than merely focusing on accuracy.


Subject(s)
Health Personnel , Point-of-Care Systems , Humans , Reproducibility of Results , PubMed , Machine Learning
3.
Proc Conf Assoc Comput Linguist Meet ; 2023: 15566-15589, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37674787

ABSTRACT

Relation extraction (RE) is the core NLP task of inferring semantic relationships between entities from text. Standard supervised RE techniques entail training modules to tag tokens comprising entity spans and then predict the relationship between them. Recent work has instead treated the problem as a sequence-to-sequence task, linearizing relations between entities as target strings to be generated conditioned on the input. Here we push the limits of this approach, using larger language models (GPT-3 and Flan-T5 large) than considered in prior work and evaluating their performance on standard RE tasks under varying levels of supervision. We address issues inherent to evaluating generative approaches to RE by doing human evaluations, in lieu of relying on exact matching. Under this refined evaluation, we find that: (1) Few-shot prompting with GPT-3 achieves near SOTA performance, i.e., roughly equivalent to existing fully supervised models; (2) Flan-T5 is not as capable in the few-shot setting, but supervising and fine-tuning it with Chain-of-Thought (CoT) style explanations (generated via GPT-3) yields SOTA results. We release this model as a new baseline for RE tasks.

4.
Proc Conf Assoc Comput Linguist Meet ; 2023: 236-247, 2023 May.
Article in English | MEDLINE | ID: mdl-37483390

ABSTRACT

We present TrialsSummarizer, a system that aims to automatically summarize evidence presented in the set of randomized controlled trials most relevant to a given query. Building on prior work (Marshall et al., 2020), the system retrieves trial publications matching a query specifying a combination of condition, intervention(s), and outcome(s), and ranks these according to sample size and estimated study quality. The top-k such studies are passed through a neural multi-document summarization system, yielding a synopsis of these trials. We consider two architectures: a standard sequence-to-sequence model based on BART (Lewis et al., 2019), and a multi-headed architecture intended to provide greater transparency to end-users. Both models produce fluent and relevant summaries of evidence retrieved for queries, but their tendency to introduce unsupported statements renders them inappropriate for use in this domain at present. The proposed architecture may help users verify outputs by allowing them to trace generated tokens back to inputs. The demonstration video is available at: https://vimeo.com/735605060. The prototype, source code, and model weights are available at: https://sanjanaramprasad.github.io/trials-summarizer/.
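
The rank-then-truncate step described in this abstract can be illustrated with a minimal sketch. It is not the TrialsSummarizer implementation: the `Trial` fields, the boolean quality flag, and the rule of sorting by quality first and sample size second are all assumptions made for illustration, since the abstract does not say how the two criteria are combined.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    pmid: str               # hypothetical identifier, for illustration only
    sample_size: int
    low_risk_of_bias: bool  # assumed stand-in for "estimated study quality"


def top_k(trials: list[Trial], k: int) -> list[Trial]:
    """Rank trials (assumed rule: quality first, then sample size) and keep k."""
    ranked = sorted(
        trials,
        key=lambda t: (t.low_risk_of_bias, t.sample_size),
        reverse=True,
    )
    return ranked[:k]


trials = [
    Trial("a", 50, True),
    Trial("b", 400, False),
    Trial("c", 120, True),
]
selected = top_k(trials, 2)
print([t.pmid for t in selected])
```

Under this assumed rule, the two trials flagged as low risk of bias are kept ahead of the larger but lower-quality trial; a weighted score would be an equally plausible reading of the abstract.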

5.
J Clin Epidemiol ; 153: 26-33, 2023 01.
Article in English | MEDLINE | ID: mdl-36150548

ABSTRACT

OBJECTIVES: The aim of this study is to describe and pilot a novel method for continuously identifying newly published trials relevant to a systematic review, enabled by combining artificial intelligence (AI) with human expertise. STUDY DESIGN AND SETTING: We used RobotReviewer LIVE to keep a review of COVID-19 vaccination trials updated from February to August 2021. We compared the papers identified by the system with those found by the conventional manual process by the review team. RESULTS: The manual update searches (last search date July 2021) retrieved 135 abstracts, of which 31 were included after screening (23% precision, 100% recall). By the same date, the automated system retrieved 56 abstracts, of which 31 were included after manual screening (55% precision, 100% recall). Key limitations of the system include that it is limited to searches of PubMed/MEDLINE, and considers only randomized controlled trial reports. We aim to address these limitations in future. The system is available as open-source software for further piloting and evaluation. CONCLUSION: Our system identified all relevant studies, reduced manual screening work, and enabled rolling updates on publication of new primary research.
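
The precision figures reported in this abstract follow directly from the counts it gives. The short sketch below reproduces the arithmetic; the function and variable names are illustrative and not taken from RobotReviewer LIVE.

```python
def precision(included: int, retrieved: int) -> float:
    """Fraction of retrieved abstracts that were included after screening."""
    if retrieved <= 0:
        raise ValueError("retrieved must be positive")
    return included / retrieved


# Counts taken from the abstract: both searches led to the same 31 included trials.
manual = precision(included=31, retrieved=135)
automated = precision(included=31, retrieved=56)
fewer_to_screen = 135 - 56

print(f"manual: {manual:.0%}, automated: {automated:.0%}")
print(f"abstracts saved from screening: {fewer_to_screen}")
```

Because both approaches found all 31 included trials, recall is 100% for each; the difference lies in how many abstracts had to be screened to get there.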


Subject(s)
Artificial Intelligence , COVID-19 , Humans , Pilot Projects , COVID-19 Vaccines , COVID-19/epidemiology , COVID-19/prevention & control , PubMed
6.
Proc Conf Assoc Comput Linguist Meet ; 2022: 7331-7345, 2022 May.
Article in English | MEDLINE | ID: mdl-36404800

ABSTRACT

Automated simplification models aim to make input texts more readable. Such methods have the potential to make complex information accessible to a wider audience, e.g., providing access to recent medical literature which might otherwise be impenetrable for a lay reader. However, such models risk introducing errors into automatically simplified texts, for instance by inserting statements unsupported by the corresponding original text, or by omitting key information. Providing more readable but inaccurate versions of texts may in many cases be worse than providing no such access at all. The problem of factual accuracy (and the lack thereof) has received heightened attention in the context of summarization models, but the factuality of automatically simplified texts has not been investigated. We introduce a taxonomy of errors that we use to analyze both references drawn from standard simplification datasets and state-of-the-art model outputs. We find that errors often appear in both that are not captured by existing evaluation metrics, motivating a need for research into ensuring the factual accuracy of automated simplification models.

7.
Methods Mol Biol ; 2345: 17-40, 2022.
Article in English | MEDLINE | ID: mdl-34550582

ABSTRACT

Traditionally, literature identification for systematic reviews has relied on a two-step process: first, searching databases to identify potentially relevant citations, and then manually screening those citations. A number of tools have been developed to streamline and semi-automate this process, including tools to generate terms; to visualize and evaluate search queries; to trace citation linkages; to deduplicate, limit, or translate searches across databases; and to prioritize relevant abstracts for screening. Research is ongoing into tools that can unify searching and screening into a single step, and several prototype tools have been developed. As this field grows, it is becoming increasingly important to develop and codify methods for evaluating the extent to which these tools fulfill their purpose.


Subject(s)
Databases, Factual , Automation , Mass Screening , Publications , Systematic Reviews as Topic
8.
Proc Conf Empir Methods Nat Lang Process ; 2022: 3626-3648, 2022 Dec.
Article in English | MEDLINE | ID: mdl-37103483

ABSTRACT

Pretraining multimodal models on Electronic Health Records (EHRs) provides a means of learning representations that can transfer to downstream tasks with minimal supervision. Recent multimodal models induce soft local alignments between image regions and sentences. This is of particular interest in the medical domain, where alignments might highlight regions in an image relevant to specific phenomena described in free-text. While past work has suggested that attention "heatmaps" can be interpreted in this manner, there has been little evaluation of such alignments. We compare alignments from a state-of-the-art multimodal (image and text) model for EHR with human annotations that link image regions to sentences. Our main finding is that the text has an often weak or unintuitive influence on attention; alignments do not consistently reflect basic anatomical information. Moreover, synthetic modifications - such as substituting "left" for "right" - do not substantially influence highlights. Simple techniques such as allowing the model to opt out of attending to the image and few-shot finetuning show promise in terms of their ability to improve alignments with very little or no supervision. We make our code and checkpoints open-source.

9.
Proc Conf Assoc Comput Linguist Meet ; 2022: 341-350, 2022 Nov.
Article in English | MEDLINE | ID: mdl-37484061

ABSTRACT

We provide a quantitative and qualitative analysis of self-repetition in the output of neural summarizers. We measure self-repetition as the number of n-grams of length four or longer that appear in multiple outputs of the same system. We analyze the behavior of three popular architectures (BART, T5 and Pegasus), fine-tuned on five datasets. In a regression analysis, we find that the three architectures have different propensities for repeating content across output summaries for inputs, with BART being particularly prone to self-repetition. Fine-tuning on more abstractive data, and on data featuring formulaic language, is associated with a higher rate of self-repetition. In qualitative analysis we find systems produce artefacts such as ads and disclaimers unrelated to the content being summarized, as well as formulaic phrases common in the fine-tuning domain. Our approach to corpus level analysis of self-repetition may help practitioners clean up training data for summarizers and ultimately support methods for minimizing the amount of self-repetition.
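
The self-repetition measure described in this abstract can be sketched as follows. This is a simplified illustration rather than the authors' code: it assumes whitespace tokenization and lowercasing, and it counts only 4-grams, on the reasoning that any longer repeated span also contains a repeated 4-gram.

```python
from collections import defaultdict


def repeated_ngrams(outputs: list[str], n: int = 4) -> set[tuple[str, ...]]:
    """Return the word n-grams that occur in at least two different outputs."""
    seen_in = defaultdict(set)  # n-gram -> indices of the outputs containing it
    for idx, text in enumerate(outputs):
        tokens = text.lower().split()  # assumed tokenization: whitespace split
        for i in range(len(tokens) - n + 1):
            seen_in[tuple(tokens[i:i + n])].add(idx)
    return {gram for gram, docs in seen_in.items() if len(docs) > 1}


# Made-up example summaries, for illustration only.
summaries = [
    "more research is needed to confirm these findings",
    "the drug reduced pain but more research is needed",
    "no difference was observed between groups",
]
shared = repeated_ngrams(summaries)
print(shared)
```

Repeats inside a single output are deliberately ignored here, because the measure concerns n-grams shared across multiple outputs of the same system.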

10.
AMIA Jt Summits Transl Sci Proc ; 2021: 485-494, 2021.
Article in English | MEDLINE | ID: mdl-34457164

ABSTRACT

The best evidence concerning comparative treatment effectiveness comes from clinical trials, the results of which are reported in unstructured articles. Medical experts must manually extract information from articles to inform decision-making, which is time-consuming and expensive. Here we consider the end-to-end task of both (a) extracting treatments and outcomes from full-text articles describing clinical trials (entity identification) and, (b) inferring the reported results for the former with respect to the latter (relation extraction). We introduce new data for this task, and evaluate models that have recently achieved state-of-the-art results on similar tasks in Natural Language Processing. We then propose a new method motivated by how trial results are typically presented that outperforms these purely data-driven baselines. Finally, we run a fielded evaluation of the model with a non-profit seeking to identify existing drugs that might be re-purposed for cancer, showing the potential utility of end-to-end evidence extraction systems.


Subject(s)
Natural Language Processing , Humans
11.
AMIA Jt Summits Transl Sci Proc ; 2021: 605-614, 2021.
Article in English | MEDLINE | ID: mdl-34457176

ABSTRACT

We consider the problem of automatically generating a narrative biomedical evidence summary from multiple trial reports. We evaluate modern neural models for abstractive summarization of relevant article abstracts from systematic reviews previously conducted by members of the Cochrane collaboration, using the authors' conclusions section of the review abstract as our target. We enlist medical professionals to evaluate generated summaries, and we find that summarization systems yield consistently fluent and relevant synopses, but these often contain factual inaccuracies. We propose new approaches that capitalize on domain-specific models to inform summarization, e.g., by explicitly demarcating snippets of inputs that convey key findings, and emphasizing the reports of large and high-quality trials. We find that these strategies modestly improve the factual accuracy of generated summaries. Finally, we propose a new method for automatically evaluating the factuality of generated narrative evidence syntheses using models that infer the directionality of reported findings.

12.
BMJ Glob Health ; 6(1)2021 01.
Article in English | MEDLINE | ID: mdl-33402333

ABSTRACT

INTRODUCTION: Ideally, health conditions causing the greatest global disease burden should attract increased research attention. We conducted a comprehensive global study investigating the number of randomised controlled trials (RCTs) published on different health conditions, and how this compares with the global disease burden that they impose. METHODS: We use machine learning to monitor PubMed daily, and find and analyse RCT reports. We assessed RCTs investigating the leading causes of morbidity and mortality from the Global Burden of Disease study. Using regression models, we compared numbers of actual RCTs in different health conditions to numbers predicted from their global disease burden (disability-adjusted life years (DALYs)). We investigated whether RCT numbers differed for conditions disproportionately affecting countries with lower socioeconomic development. RESULTS: We estimate 463 000 articles describing RCTs (95% prediction interval 439 000 to 485 000) were published from 1990 to July 2020. RCTs recruited a median of 72 participants (IQR 32-195). 82% of RCTs were conducted by researchers in the top fifth of countries by socio-economic development. As DALYs increased for a particular health condition by 10%, the number of RCTs in the same year increased by 5% (3.2%-6.9%), but the association was weak (adjusted R²=0.13). Conditions disproportionately affecting countries with lower socioeconomic development, including respiratory infections and tuberculosis (7000 RCTs below predicted) and enteric infections (9700 RCTs below predicted), appear relatively under-researched for their disease burden. Each 10% shift in DALYs towards countries with low and middle socioeconomic development was associated with a 4% reduction in RCTs (3.7%-4.9%). These disparities have not changed substantially over time. CONCLUSION: Research priorities are not well optimised to reduce the global burden of disease. Most RCTs are produced by highly developed countries, and the health needs of these countries have been, on average, favoured.


Subject(s)
Disabled Persons , Respiratory Tract Infections , Global Burden of Disease , Global Health , Humans , Quality-Adjusted Life Years , Randomized Controlled Trials as Topic
13.
Article in English | MEDLINE | ID: mdl-35663506

ABSTRACT

Medical question answering (QA) systems have the potential to answer clinicians' uncertainties about treatment and diagnosis on-demand, informed by the latest evidence. However, despite the significant progress in general QA made by the NLP community, medical QA systems are still not widely used in clinical environments. One likely reason for this is that clinicians may not readily trust QA system outputs, in part because transparency, trustworthiness, and provenance have not been key considerations in the design of such models. In this paper we discuss a set of criteria that, if met, we argue would likely increase the utility of biomedical QA systems, which may in turn lead to adoption of such systems in practice. We assess existing models, tasks, and datasets with respect to these criteria, highlighting shortcomings of previously proposed approaches and pointing toward what might be more usable QA systems.

14.
Proc Conf ; 2021: 4972-4984, 2021 Jun.
Article in English | MEDLINE | ID: mdl-35663507

ABSTRACT

We consider the problem of learning to simplify medical texts. This is important because most reliable, up-to-date information in biomedicine is dense with jargon and thus practically inaccessible to the lay audience. Furthermore, manual simplification does not scale to the rapidly growing body of biomedical literature, motivating the need for automated approaches. Unfortunately, there are no large-scale resources available for this task. In this work we introduce a new corpus of parallel texts in English comprising technical and lay summaries of all published evidence pertaining to different clinical topics. We then propose a new metric based on likelihood scores from a masked language model pretrained on scientific texts. We show that this automated measure better differentiates between technical and lay summaries than existing heuristics. We introduce and evaluate baseline encoder-decoder Transformer models for simplification and propose a novel augmentation to these in which we explicitly penalize the decoder for producing 'jargon' terms; we find that this yields improvements over baselines in terms of readability.

15.
J Am Med Inform Assoc ; 27(12): 1903-1912, 2020 12 09.
Article in English | MEDLINE | ID: mdl-32940710

ABSTRACT

OBJECTIVE: Randomized controlled trials (RCTs) are the gold standard method for evaluating whether a treatment works in health care but can be difficult to find and make use of. We describe the development and evaluation of a system to automatically find and categorize all new RCT reports. MATERIALS AND METHODS: Trialstreamer continuously monitors PubMed and the World Health Organization International Clinical Trials Registry Platform, looking for new RCTs in humans using a validated classifier. We combine machine learning and rule-based methods to extract information from the RCT abstracts, including free-text descriptions of trial PICO (populations, interventions/comparators, and outcomes) elements and map these snippets to normalized MeSH (Medical Subject Headings) vocabulary terms. We additionally identify sample sizes, predict the risk of bias, and extract text conveying key findings. We store all extracted data in a database, which we make freely available for download, and via a search portal, which allows users to enter structured clinical queries. Results are ranked automatically to prioritize larger and higher-quality studies. RESULTS: As of early June 2020, we have indexed 673 191 publications of RCTs, of which 22 363 were published in the first 5 months of 2020 (142 per day). We additionally include 304 111 trial registrations from the International Clinical Trials Registry Platform. The median trial sample size was 66. CONCLUSIONS: We present an automated system for finding and categorizing RCTs. This yields a novel resource: a database of structured information automatically extracted for all published RCTs in humans. We make daily updates of this database available on our website (https://trialstreamer.robotreviewer.net).


Subject(s)
Data Curation , Data Management , Databases, Factual , Randomized Controlled Trials as Topic , Bias , Evidence-Based Medicine , Humans , Medical Subject Headings
16.
Health Psychol Rev ; 14(1): 145-158, 2020 03.
Article in English | MEDLINE | ID: mdl-31941434

ABSTRACT

The evidence base in health psychology is vast and growing rapidly. These factors make it difficult (and sometimes practically impossible) to consider all available evidence when making decisions about the state of knowledge on a given phenomenon (e.g., associations of variables, effects of interventions on particular outcomes). Systematic reviews, meta-analyses, and other rigorous syntheses of the research mitigate this problem by providing concise, actionable summaries of knowledge in a given area of study. Yet, conducting these syntheses has grown increasingly laborious owing to the fast accumulation of new evidence; existing, manual methods for synthesis do not scale well. In this article, we discuss how semi-automation via machine learning and natural language processing methods may help researchers and practitioners to review evidence more efficiently. We outline concrete examples in health psychology, highlighting practical, open-source technologies available now. We indicate the potential of more advanced methods and discuss how to avoid the pitfalls of automated reviews.


Subject(s)
Behavioral Medicine , Machine Learning , Natural Language Processing , Systematic Reviews as Topic , Humans
17.
Proc Conf ; 2020: 63-69, 2020 Jul.
Article in English | MEDLINE | ID: mdl-34136886

ABSTRACT

We introduce Trialstreamer, a living database of clinical trial reports. Here we mainly describe the evidence extraction component; this extracts from biomedical abstracts key pieces of information that clinicians need when appraising the literature, and also the relations between these. Specifically, the system extracts descriptions of trial participants, the treatments compared in each arm (the interventions), and which outcomes were measured. The system then attempts to infer which interventions were reported to work best by determining their relationship with identified trial outcome measures. In addition to summarizing individual trials, these extracted data elements allow automatic synthesis of results across many trials on the same topic. We apply the system at scale to all reports of randomized controlled trials indexed in MEDLINE, powering the automatic generation of evidence maps, which provide a global view of the efficacy of different interventions combining data from all relevant clinical trials on a topic. We make all code and models freely available alongside a demonstration of the web interface.

18.
Article in English | MEDLINE | ID: mdl-34308444

ABSTRACT

Systematic review (SR) is an essential process to identify, evaluate, and summarize the findings of all relevant individual studies concerning health-related questions. However, conducting a SR is labor-intensive, as identifying relevant studies is a daunting process that entails multiple researchers screening thousands of articles for relevance. In this paper, we propose MMiDaS-AE, a Multi-modal Missing Data aware Stacked Autoencoder, for semi-automating screening for SRs. We use a multi-modal view that exploits three representations, of: 1) documents, 2) topics, and 3) citation networks. Documents that contain similar words will be nearby in the document embedding space. Models can also exploit the relationship between documents and the associated SR MeSH terms to capture article relevancy. Finally, related works will likely share the same citations, and thus closely related articles would, intuitively, be trained to be close to each other in the embedding space. However, using all three learned representations as features directly results in an unwieldy number of parameters. Thus, motivated by recent work on multi-modal auto-encoders, we adopt a multi-modal stacked autoencoder that can learn a shared representation encoding all three representations in a compressed space. However, in practice one or more of these modalities may be missing for an article (e.g., if we cannot recover citation information). Therefore, we propose to learn to impute the shared representation even when specific inputs are missing. We find this new model significantly improves performance on a dataset consisting of 15 SRs compared to existing approaches.

19.
Syst Rev ; 8(1): 163, 2019 07 11.
Article in English | MEDLINE | ID: mdl-31296265

ABSTRACT

Technologies and methods to speed up the production of systematic reviews by reducing the manual labour involved have recently emerged. Automation has been proposed or used to expedite most steps of the systematic review process, including search, screening, and data extraction. However, how these technologies work in practice and when (and when not) to use them is often not clear to practitioners. In this practical guide, we provide an overview of current machine learning methods that have been proposed to expedite evidence synthesis. We also offer guidance on which of these are ready for use, their strengths and weaknesses, and how a systematic review team might go about using them in practice.


Subject(s)
Automation/standards , Evidence-Based Medicine/organization & administration , Machine Learning/statistics & numerical data , Practice Guidelines as Topic , Systematic Reviews as Topic , Humans
20.
J Clin Epidemiol ; 115: 77-89, 2019 11.
Article in English | MEDLINE | ID: mdl-31302205

ABSTRACT

OBJECTIVES: Data Abstraction Assistant (DAA) is a software for linking items abstracted into a data collection form for a systematic review to their locations in a study report. We conducted a randomized cross-over trial that compared DAA-facilitated single-data abstraction plus verification ("DAA verification"), single data abstraction plus verification ("regular verification"), and independent dual data abstraction plus adjudication ("independent abstraction"). STUDY DESIGN AND SETTING: This study is an online randomized cross-over trial with 26 pairs of data abstractors. Each pair abstracted data from six articles, two per approach. Outcomes were the proportion of errors and time taken. RESULTS: Overall proportion of errors was 17% for DAA verification, 16% for regular verification, and 15% for independent abstraction. DAA verification was associated with higher odds of errors when compared with regular verification (adjusted odds ratio [OR] = 1.08; 95% confidence interval [CI]: 0.99-1.17) or independent abstraction (adjusted OR = 1.12; 95% CI: 1.03-1.22). For each article, DAA verification took 20 minutes (95% CI: 1-40) longer than regular verification, but 46 minutes (95% CI: 26 to 66) shorter than independent abstraction. CONCLUSION: Independent abstraction may only be necessary for complex data items. DAA provides an audit trail that is crucial for reproducible research.


Subject(s)
Abstracting and Indexing/methods , Systematic Reviews as Topic , Cross-Over Studies , Data Collection , Humans , Odds Ratio , Random Allocation , Software , Young Adult