Results 1 - 10 of 10
1.
JMIR Med Inform; 10(8): e38122, 2022 Aug 24.
Article in English | MEDLINE | ID: mdl-36001371

ABSTRACT

BACKGROUND: As more health care organizations transition to using electronic health record (EHR) systems, it is important for these organizations to maximize the secondary use of their data to support service improvement and clinical research. These organizations will find it challenging to have systems capable of harnessing the unstructured data fields in the record (clinical notes, letters, etc) and more practically have such systems interact with all of the hospital data systems (legacy and current). OBJECTIVE: We describe the deployment of the EHR interfacing information extraction and retrieval platform CogStack at University College London Hospitals (UCLH). METHODS: At UCLH, we have deployed the CogStack platform, an information retrieval platform with natural language processing capabilities. The platform addresses the problem of data ingestion and harmonization from multiple data sources using the Apache NiFi module for managing complex data flows. The platform also facilitates the extraction of structured data from free-text records through use of the MedCAT natural language processing library. Finally, data science tools are made available to support data scientists and the development of downstream applications dependent upon data ingested and analyzed by CogStack. RESULTS: The platform has been deployed at the hospital, and in particular, it has facilitated a number of research and service evaluation projects. To date, we have processed over 30 million records, and the insights produced from CogStack have informed a number of clinical research use cases at the hospital. CONCLUSIONS: The CogStack platform can be configured to handle the data ingestion and harmonization challenges faced by a hospital. More importantly, the platform enables the hospital to unlock important clinical information from the unstructured portion of the record using natural language processing technology.

2.
Artif Intell Med; 117: 102083, 2021 Jul.
Article in English | MEDLINE | ID: mdl-34127232

ABSTRACT

Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of information extraction (IE) technologies to enable clinical analysis. We present the open source Medical Concept Annotation Toolkit (MedCAT) that provides: (a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; (b) a feature-rich annotation interface for customizing and training IE models; and (c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1: 0.448-0.738 vs 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ∼8.8B words from ∼17M clinical records and further fine-tuning with ∼6K clinician-annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets and concept types, indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.
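
The F1 values quoted in this abstract combine precision and recall into one number. As a reading aid, here is a minimal sketch of that calculation. The counts are hypothetical and are not taken from the MedCAT evaluation, and the function name is an assumption for illustration only.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall for one set of extracted concepts."""
    # Precision: share of extracted concepts that are correct.
    precision = true_positives / (true_positives + false_positives)
    # Recall: share of the reference concepts that were extracted.
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct extractions, 2 spurious, 2 missed.
example_f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)
print(f"F1 = {example_f1:.3f}")
```

The sketch does not guard against empty denominators; a real evaluation would need to decide how to score a document with no extracted or no reference concepts.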


Subject(s)
Natural Language Processing; Systematized Nomenclature of Medicine; Electronic Health Records; Information Storage and Retrieval; Unified Medical Language System

3.
BMC Med; 19(1): 23, 2021 Jan 21.
Article in English | MEDLINE | ID: mdl-33472631

ABSTRACT

BACKGROUND: The National Early Warning Score (NEWS2) is currently recommended in the UK for the risk stratification of COVID-19 patients, but little is known about its ability to detect severe cases. We aimed to evaluate NEWS2 for the prediction of severe COVID-19 outcome and identify and validate a set of blood and physiological parameters routinely collected at hospital admission to improve upon the use of NEWS2 alone for medium-term risk stratification. METHODS: Training cohorts comprised 1276 patients admitted to King's College Hospital National Health Service (NHS) Foundation Trust with COVID-19 disease from 1 March to 30 April 2020. External validation cohorts included 6237 patients from five UK NHS Trusts (Guy's and St Thomas' Hospitals, University Hospitals Southampton, University Hospitals Bristol and Weston NHS Foundation Trust, University College London Hospitals, University Hospitals Birmingham), one hospital in Norway (Oslo University Hospital), and two hospitals in Wuhan, China (Wuhan Sixth Hospital and Taikang Tongji Hospital). The outcome was severe COVID-19 disease (transfer to intensive care unit (ICU) or death) at 14 days after hospital admission. Age, physiological measures, blood biomarkers, sex, ethnicity, and comorbidities (hypertension, diabetes, cardiovascular, respiratory and kidney diseases) measured at hospital admission were considered in the models. RESULTS: A baseline model of 'NEWS2 + age' had poor-to-moderate discrimination for severe COVID-19 infection at 14 days (area under receiver operating characteristic curve (AUC) in training cohort = 0.700, 95% confidence interval (CI) 0.680, 0.722; Brier score = 0.192, 95% CI 0.186, 0.197). 
A supplemented model adding eight routinely collected blood and physiological parameters (supplemental oxygen flow rate, urea, age, oxygen saturation, C-reactive protein, estimated glomerular filtration rate, neutrophil count, neutrophil/lymphocyte ratio) improved discrimination (AUC = 0.735; 95% CI 0.715, 0.757), and these improvements were replicated across seven UK and non-UK sites. However, there was evidence of miscalibration with the model tending to underestimate risks in most sites. CONCLUSIONS: NEWS2 score had poor-to-moderate discrimination for medium-term COVID-19 outcome which raises questions about its use as a screening tool at hospital admission. Risk stratification was improved by including readily available blood and physiological parameters measured at hospital admission, but there was evidence of miscalibration in external sites. This highlights the need for a better understanding of the use of early warning scores for COVID.
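
The two metrics reported in this abstract measure different things: AUC measures discrimination (whether patients with the outcome are ranked above those without), while the Brier score measures the squared error of the predicted probabilities and so is also sensitive to calibration. The following is a minimal sketch of both calculations on made-up data. The outcomes and risks are hypothetical and unrelated to the study cohorts, and the function names are assumptions for illustration.

```python
def brier_score(outcomes, predicted_risks):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    squared_errors = [(p - y) ** 2 for y, p in zip(outcomes, predicted_risks)]
    return sum(squared_errors) / len(squared_errors)


def auc(outcomes, predicted_risks):
    """Probability that a random case is ranked above a random non-case.

    Computed by comparing every case/non-case pair; ties count as half.
    """
    cases = [p for y, p in zip(outcomes, predicted_risks) if y == 1]
    non_cases = [p for y, p in zip(outcomes, predicted_risks) if y == 0]
    wins = sum(
        1.0 if c > n else 0.5 if c == n else 0.0
        for c in cases
        for n in non_cases
    )
    return wins / (len(cases) * len(non_cases))


# Hypothetical data: 1 = severe outcome within 14 days, 0 = no severe outcome.
outcomes = [0, 0, 1, 1]
predicted_risks = [0.10, 0.40, 0.35, 0.80]

example_auc = auc(outcomes, predicted_risks)
example_brier = brier_score(outcomes, predicted_risks)
print(f"AUC = {example_auc:.3f}, Brier = {example_brier:.4f}")
```

A model can keep the same AUC while systematically under- or overestimating risk, which is why the abstract reports miscalibration separately from discrimination.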


Subject(s)
COVID-19/diagnosis; Early Warning Score; Aged; COVID-19/epidemiology; COVID-19/virology; Cohort Studies; Electronic Health Records; Female; Humans; Male; Middle Aged; Pandemics; Prognosis; SARS-CoV-2/isolation & purification; State Medicine; United Kingdom/epidemiology

4.
J Vis Exp; (159), 2020 May 15.
Article in English | MEDLINE | ID: mdl-32478737

ABSTRACT

Recent studies have shown that an automated, lifespan-inclusive, transdiagnostic, and clinically based individualized risk calculator provides a powerful system for supporting the early detection of individuals at risk of psychosis on a large scale, by leveraging electronic health records (EHRs). This risk calculator has been externally validated twice and is undergoing feasibility testing for clinical implementation. Integration of this risk calculator in clinical routine should be facilitated by prospective feasibility studies, which are required to address pragmatic challenges, such as missing data, and the usability of this risk calculator in a real-world and routine clinical setting. Here, we present an approach for a prospective implementation of a real-time psychosis risk detection and alerting service in a real-world EHR system. This method leverages the CogStack platform, which is an open-source, lightweight, and distributed information retrieval and text extraction system. The CogStack platform incorporates a set of services that allow for full-text search of clinical data, lifespan-inclusive, real-time calculation of psychosis risk, early risk-alerting to clinicians, and the visual monitoring of patients over time. Our method includes: 1) ingestion and synchronization of data from multiple sources into the CogStack platform, 2) implementation of a risk calculator, whose algorithm was previously developed and validated, for timely computation of a patient's risk of psychosis, 3) creation of interactive visualizations and dashboards to monitor patients' health status over time, and 4) building automated alerting systems to ensure that clinicians are notified of patients at risk, so that appropriate actions can be pursued. This is the first study to develop and implement such a detection and alerting system in clinical routine for early detection of psychosis.


Subject(s)
Electronic Health Records/standards; Information Storage and Retrieval/standards; Psychotic Disorders/diagnosis; Algorithms; Humans; Prospective Studies; Risk Assessment

5.
Eur J Heart Fail; 22(6): 967-974, 2020 Jun.
Article in English | MEDLINE | ID: mdl-32485082

ABSTRACT

AIMS: The SARS-CoV-2 virus binds to the angiotensin-converting enzyme 2 (ACE2) receptor for cell entry. It has been suggested that angiotensin-converting enzyme inhibitors (ACEi) and angiotensin II receptor blockers (ARB), which are commonly used in patients with hypertension or diabetes and may raise tissue ACE2 levels, could increase the risk of severe COVID-19 infection. METHODS AND RESULTS: We evaluated this hypothesis in a consecutive cohort of 1200 acute inpatients with COVID-19 at two hospitals with a multi-ethnic catchment population in London (UK). The mean age was 68 ± 17 years (57% male) and 74% of patients had at least one comorbidity. Overall, 415 patients (34.6%) reached the primary endpoint of death or transfer to a critical care unit for organ support within 21 days of symptom onset. A total of 399 patients (33.3%) were taking ACEi or ARB. Patients on ACEi/ARB were significantly older and had more comorbidities. The odds ratio for the primary endpoint in patients on ACEi and ARB, after adjustment for age, sex and co-morbidities, was 0.63 (95% confidence interval 0.47-0.84, P < 0.01). CONCLUSIONS: There was no evidence for increased severity of COVID-19 in hospitalised patients on chronic treatment with ACEi or ARB. A trend towards a beneficial effect of ACEi/ARB requires further evaluation in larger meta-analyses and randomised clinical trials.


Subject(s)
Angiotensin Receptor Antagonists/therapeutic use; Betacoronavirus; Coronavirus Infections/epidemiology; Heart Failure/drug therapy; Pneumonia, Viral/epidemiology; Aged; Angiotensin-Converting Enzyme Inhibitors/therapeutic use; COVID-19; Comorbidity; Coronavirus Infections/drug therapy; Disease Progression; Female; Follow-Up Studies; Heart Failure/epidemiology; Humans; Male; Pandemics; Pneumonia, Viral/drug therapy; SARS-CoV-2; Severity of Illness Index; Treatment Outcome; United Kingdom/epidemiology

6.
IEEE J Biomed Health Inform; 24(10): 2950-2959, 2020 Oct.
Article in English | MEDLINE | ID: mdl-32149659

ABSTRACT

Clinical trials often fail to recruit an adequate number of appropriate patients. Identifying eligible trial participants is resource-intensive when relying on manual review of clinical notes, particularly in critical care settings where the time window is short. Automated review of electronic health records (EHR) may help, but much of the information is in free text rather than a computable form. We applied natural language processing (NLP) to free text EHR data using the CogStack platform to simulate recruitment into the LeoPARDS study, a clinical trial aiming to reduce organ dysfunction in septic shock. We applied an algorithm to identify eligible patients using a moving 1-hour time window, and compared patients identified by our approach with those actually screened and recruited for the trial, for the time period that data were available. We manually reviewed records of a random sample of patients identified by the algorithm but not screened in the original trial. Our method identified 376 patients, including 34 patients with EHR data available who were actually recruited to LeoPARDS in our centre. The sensitivity of CogStack for identifying patients screened was 90% (95% CI 85%, 93%). Of the 203 patients identified by both manual screening and CogStack, the index date matched in 95 (47%) and CogStack was earlier in 94 (47%). In conclusion, analysis of EHR data using NLP could effectively replicate recruitment in a critical care trial, and identify some eligible patients at an earlier stage, potentially improving trial recruitment if implemented in real time.


Subject(s)
Clinical Trials as Topic; Data Mining/methods; Electronic Health Records; Natural Language Processing; Patient Selection; Adult; Computer Simulation; Critical Care; Female; Humans; Male

7.
Bioinformatics; 34(16): 2748-2756, 2018 Aug 15.
Article in English | MEDLINE | ID: mdl-29617939

ABSTRACT

Motivation: The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files. FaStore does not use any reference sequences for compression and permits the user to choose from several lossy modes to improve the overall compression ratio, depending on the specific needs. Results: FaStore in the lossless mode achieves a significant improvement in compression ratio with respect to previously proposed algorithms. We perform an analysis on the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains, while preserving the essential genomic information and without affecting the variant calling performance. Availability and implementation: FaStore can be downloaded from https://github.com/refresh-bio/FaStore. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Data Compression/methods; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Software; Algorithms; Humans

8.
Nucleic Acids Res; 44(12): e114, 2016 Jul 08.
Article in English | MEDLINE | ID: mdl-27131376

ABSTRACT

The recent super-exponential growth in the amount of sequencing data generated worldwide has brought techniques for compressed storage into focus. Most available solutions, however, are strictly tied to specific bioinformatics formats, sometimes inheriting suboptimal design choices from them; this hinders flexible and effective data sharing. Here, we present CARGO (Compressed ARchiving for GenOmics), a high-level framework to automatically generate software systems optimized for the compressed storage of arbitrary types of large genomic data collections. Straightforward applications of our approach to FASTQ and SAM archives require a few lines of code, produce solutions that match and sometimes outperform specialized format-tailored compressors, and scale well to multi-TB datasets. All CARGO software components can be freely downloaded for academic and non-commercial use from http://bio-cargo.sourceforge.net.


Subject(s)
Computational Biology/methods; Genome; Information Storage and Retrieval/methods; Algorithms; Data Compression/methods; Genomics; Software

9.
Bioinformatics; 31(9): 1389-95, 2015 May 01.
Article in English | MEDLINE | ID: mdl-25536966

ABSTRACT

MOTIVATION: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions for this problem are disk based; the better of the two, from Cox et al. (2012), is based on the Burrows-Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gbp human genome sequencing collection with almost 45-fold coverage. RESULTS: We propose overlapping reads compression with minimizers (ORCOM), a compression algorithm dedicated to sequencing reads (DNA only). Our method makes use of the conceptually simple and easily parallelizable idea of minimizers to achieve a compression ratio of 0.317 bits per base, which fits the 134.0 Gbp dataset into only 5.31 GB of space. AVAILABILITY AND IMPLEMENTATION: http://sun.aei.polsl.pl/orcom under a free license. CONTACT: sebastian.deorowicz@polsl.pl SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
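
To illustrate the idea named in this abstract, the sketch below groups reads by a minimizer, taken here simply as the lexicographically smallest k-mer of each read, so that overlapping reads tend to land in the same bucket and can be compressed together. This is a simplified assumption for illustration and not the published implementation: the abstract does not give the exact minimizer definition or parameters, and the reads, the value of `k` and the function names are hypothetical. The last lines check the abstract's size figure from its stated bits per base.

```python
from collections import defaultdict


def minimizer(read, k):
    """Return the lexicographically smallest k-mer of a read."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_by_minimizer(reads, k):
    """Group reads that share a minimizer; overlapping reads often do."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)


# Hypothetical reads: the first two overlap, the third is unrelated.
reads = ["GATTACAGG", "TTACAGGCT", "CCGGTTTGG"]
buckets = bucket_by_minimizer(reads, k=4)
for key, members in sorted(buckets.items()):
    print(key, members)

# Size check using the figures stated in the abstract:
# 134.0 Gbp at 0.317 bits per base, 8 bits per byte, 1 GB = 1e9 bytes.
size_gb = 134.0e9 * 0.317 / 8 / 1e9
print(f"{size_gb:.2f} GB")
```

With these example reads, the two overlapping reads share the minimizer `ACAG` and fall into one bucket, while the unrelated read is bucketed under `CCGG`.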


Subject(s)
Data Compression; Genomics/methods; Sequence Analysis, DNA/methods; Algorithms; Animals; Chickens/genetics; Genome, Human; Humans

10.
Bioinformatics; 30(15): 2213-5, 2014 Aug 01.
Article in English | MEDLINE | ID: mdl-24747219

ABSTRACT

SUMMARY: Modern sequencing platforms produce huge amounts of data. Archiving them raises major problems but is crucial for reproducibility of results, one of the most fundamental principles of science. The widely used gzip compressor, used for reduction of storage and transfer costs, is not a perfect solution, so a few specialized FASTQ compressors were proposed recently. Unfortunately, they are often impractical because of slow processing, lack of support for some variants of FASTQ files, or instability. We propose DSRC 2, which offers compression ratios comparable with the best existing solutions, while being a few times faster and more flexible. AVAILABILITY AND IMPLEMENTATION: DSRC 2 is freely available at http://sun.aei.polsl.pl/dsrc. The package contains a command-line compressor, C and Python libraries for easy integration with existing software, and technical documentation with examples of usage. CONTACT: sebastian.deorowicz@polsl.pl SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Data Compression/methods; Genomics; Industry; Sequence Analysis, DNA; Algorithms; Documentation; Software; Time Factors