Search | VHL Regional Portal

1.

Summary of the National Cancer Institute 2023 Virtual Workshop on Medical Image De-identification-Part 2: Pathology Whole Slide Image De-identification, De-facing, the Role of AI in Image De-identification, and the NCI MIDI Datasets and Pipeline.

Clunie, David; Taylor, Adam; Bisson, Tom; Gutman, David; Xiao, Ying; Schwarz, Christopher G; Greve, Douglas; Gichoya, Judy; Shih, George; Kline, Adrienne; Kopchick, Ben; Farahani, Keyvan.

J Imaging Inform Med ; 2024 Jul 09.

Article in English | MEDLINE | ID: mdl-38980626

ABSTRACT

De-identification of medical images intended for research is a core requirement for data sharing initiatives, particularly as the demand for data for artificial intelligence (AI) applications grows. The Center for Biomedical Informatics and Information Technology (CBIIT) of the United States National Cancer Institute (NCI) convened a two half-day virtual workshop with the intent of summarizing the state of the art in de-identification technology and processes and exploring interesting aspects of the subject. This paper summarizes the highlights of the second day of the workshop, the recordings and presentations of which are publicly available for review. The topics covered included pathology whole slide image de-identification, de-facing, the role of AI in image de-identification, and the NCI Medical Image De-Identification Initiative (MIDI) datasets and pipeline.

2.

End-to-end pseudonymization of fine-tuned clinical BERT models : Privacy preservation with maintained data utility.

Vakili, Thomas; Henriksson, Aron; Dalianis, Hercules.

BMC Med Inform Decis Mak ; 24(1): 162, 2024 Jun 12.

Article in English | MEDLINE | ID: mdl-38915012

ABSTRACT

Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models consist of large amounts of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are very sensitive. Training data pseudonymization is a privacy-preserving technique that aims to mitigate these problems. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks. This study evaluates the effects on the predictive performance of end-to-end pseudonymization of Swedish clinical BERT models fine-tuned for five clinical NLP tasks. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also find no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs.

Subject(s)

Natural Language Processing , Humans , Privacy , Sweden , Anonyms and Pseudonyms , Computer Security/standards , Confidentiality/standards , Electronic Health Records/standards

3.

PyFaceWipe: a new defacing tool for almost any MRI contrast.

Mitew, Stanislaw; Yeow, Ling Yun; Ho, Chi Long; Bhanu, Prakash K N; Nickalls, Oliver James.

MAGMA ; 2024 Jun 21.

Article in English | MEDLINE | ID: mdl-38904745

ABSTRACT

RATIONALE AND OBJECTIVES: Defacing research MRI brain scans is often a mandatory step. With current defacing software, there are issues with Windows compatibility and researcher doubt regarding the adequacy of preservation of brain voxels in non-T1w scans. To address this, we developed PyFaceWipe, a multiplatform software for multiple MRI contrasts, which was evaluated based on its anonymisation ability and effect on downstream processing. MATERIALS AND METHODS: Multiple MRI brain scan contrasts from the OASIS-3 dataset were defaced with PyFaceWipe and PyDeface and manually assessed for brain voxel preservation, remnant facial features and effect on automated face detection. Original and PyFaceWipe-defaced data from locally acquired T1w structural scans underwent volumetry with FastSurfer and brain atlas generation with ANTS. RESULTS: 214 MRI scans of several contrasts from OASIS-3 were successfully processed with both PyFaceWipe and PyDeface. PyFaceWipe maintained complete brain voxel preservation in all tested contrasts except ASL (45%) and DWI (90%), and PyDeface in all tested contrasts except ASL (95%), BOLD (25%), DWI (40%) and T2* (25%). Manual review of PyFaceWipe showed no failures of facial feature removal. Pinna removal was less successful (6% of T1 scans showed residual complete pinna). PyDeface achieved 5.1% failure rate. Automated detection found no faces in PyFaceWipe-defaced scans, 19 faces in PyDeface scans compared with 78 from the 224 original scans. Brain atlas generation showed no significant difference between atlases created from original and defaced data in both young adulthood and late elderly cohorts. Structural volumetry dice scores were ≥ 0.98 for all structures except for grey matter which had 0.93. PyFaceWipe output was identical across the tested operating systems. CONCLUSION: PyFaceWipe is a promising multiplatform defacing tool, demonstrating excellent brain voxel preservation and competitive defacing in multiple MRI contrasts, performing favourably against PyDeface. ASL, BOLD, DWI and T2* scans did not produce recognisable 3D renders and hence should not require defacing. Structural volumetry dice scores (≥ 0.98) were higher than previously published FreeSurfer results, except for grey matter which were comparable. The effect is measurable and care should be exercised during studies. ANTS atlas creation showed no significant effect from PyFaceWipe defacing.

4.

Fast refacing of MR images with a generative neural network lowers re-identification risk and preserves volumetric consistency.

Molchanova, Nataliia; Maréchal, Bénédicte; Thiran, Jean-Philippe; Kober, Tobias; Huelnhagen, Till; Richiardi, Jonas.

Hum Brain Mapp ; 45(9): e26721, 2024 Jun 15.

Article in English | MEDLINE | ID: mdl-38899549

ABSTRACT

With the rise of open data, identifiability of individuals based on 3D renderings obtained from routine structural magnetic resonance imaging (MRI) scans of the head has become a growing privacy concern. To protect subject privacy, several algorithms have been developed to de-identify imaging data using blurring, defacing or refacing. Completely removing facial structures provides the best re-identification protection but can significantly impact post-processing steps, like brain morphometry. As an alternative, refacing methods that replace individual facial structures with generic templates have a lower effect on the geometry and intensity distribution of original scans, and are able to provide more consistent post-processing results by the price of higher re-identification risk and computational complexity. In the current study, we propose a novel method for anonymized face generation for defaced 3D T1-weighted scans based on a 3D conditional generative adversarial network. To evaluate the performance of the proposed de-identification tool, a comparative study was conducted between several existing defacing and refacing tools, with two different segmentation algorithms (FAST and Morphobox). The aim was to evaluate (i) impact on brain morphometry reproducibility, (ii) re-identification risk, (iii) balance between (i) and (ii), and (iv) the processing time. The proposed method takes 9 s for face generation and is suitable for recovering consistent post-processing results after defacing.

Subject(s)

Magnetic Resonance Imaging , Humans , Magnetic Resonance Imaging/methods , Adult , Brain/diagnostic imaging , Brain/anatomy & histology , Male , Female , Neural Networks, Computer , Imaging, Three-Dimensional/methods , Neuroimaging/methods , Neuroimaging/standards , Data Anonymization , Young Adult , Image Processing, Computer-Assisted/methods , Image Processing, Computer-Assisted/standards , Algorithms

5.

An Extensible Evaluation Framework Applied to Clinical Text Deidentification Natural Language Processing Tools: Multisystem and Multicorpus Study.

Heider, Paul M; Meystre, Stéphane M.

J Med Internet Res ; 26: e55676, 2024 May 28.

Article in English | MEDLINE | ID: mdl-38805692

ABSTRACT

BACKGROUND: Clinical natural language processing (NLP) researchers need access to directly comparable evaluation results for applications such as text deidentification across a range of corpus types and the means to easily test new systems or corpora within the same framework. Current systems, reported metrics, and the personally identifiable information (PII) categories evaluated are not easily comparable. OBJECTIVE: This study presents an open-source and extensible end-to-end framework for comparing clinical NLP system performance across corpora even when the annotation categories do not align. METHODS: As a use case for this framework, we use 6 off-the-shelf text deidentification systems (ie, CliniDeID, deid from PhysioNet, MITRE Identity Scrubber Toolkit [MIST], NeuroNER, National Library of Medicine [NLM] Scrubber, and Philter) across 3 standard clinical text corpora for the task (2 of which are publicly available) and 1 private corpus (all in English), with annotation categories that are not directly analogous. The framework is built on shell scripts that can be extended to include new systems, corpora, and performance metrics. We present this open tool, multiple means for aligning PII categories during evaluation, and our initial timing and performance metric findings. Code for running this framework with all settings needed to run all pairs are available via Codeberg and GitHub. RESULTS: From this case study, we found large differences in processing speed between systems. The fastest system (ie, MIST) processed an average of 24.57 (SD 26.23) notes per second, while the slowest (ie, CliniDeID) processed an average of 1.00 notes per second. No system uniformly outperformed the others at identifying PII across corpora and categories. Instead, a rich tapestry of performance trade-offs emerged for PII categories. CliniDeID and Philter prioritize recall over precision (with an average recall 6.9 and 11.2 points higher, respectively, for partially matching spans of text matching any PII category), while the other 4 systems consistently have higher precision (with MIST's precision scoring 20.2 points higher, NLM Scrubber scoring 4.4 points higher, NeuroNER scoring 7.2 points higher, and deid scoring 17.1 points higher). The macroaverage recall across corpora for identifying names, one of the more sensitive PII categories, included deid (48.8%) and MIST (66.9%) at the low end and NeuroNER (84.1%), NLM Scrubber (88.1%), and CliniDeID (95.9%) at the high end. A variety of metrics across categories and corpora are reported with a wider variety (eg, F2-score) available via the tool. CONCLUSIONS: NLP systems in general and deidentification systems and corpora in our use case tend to be evaluated in stand-alone research articles that only include a limited set of comparators. We hold that a single evaluation pipeline across multiple systems and corpora allows for more nuanced comparisons. Our open pipeline should reduce barriers to evaluation and system advancement.

Subject(s)

Natural Language Processing

6.

Exploring the tradeoff between data privacy and utility with a clinical data analysis use case.

Im, Eunyoung; Kim, Hyeoneui; Lee, Hyungbok; Jiang, Xiaoqian; Kim, Ju Han.

BMC Med Inform Decis Mak ; 24(1): 147, 2024 May 30.

Article in English | MEDLINE | ID: mdl-38816848

ABSTRACT

BACKGROUND: Securing adequate data privacy is critical for the productive utilization of data. De-identification, involving masking or replacing specific values in a dataset, could damage the dataset's utility. However, finding a reasonable balance between data privacy and utility is not straightforward. Nonetheless, few studies investigated how data de-identification efforts affect data analysis results. This study aimed to demonstrate the effect of different de-identification methods on a dataset's utility with a clinical analytic use case and assess the feasibility of finding a workable tradeoff between data privacy and utility. METHODS: Predictive modeling of emergency department length of stay was used as a data analysis use case. A logistic regression model was developed with 1155 patient cases extracted from a clinical data warehouse of an academic medical center located in Seoul, South Korea. Nineteen de-identified datasets were generated based on various de-identification configurations using ARX, an open-source software for anonymizing sensitive personal data. The variable distributions and prediction results were compared between the de-identified datasets and the original dataset. We examined the association between data privacy and utility to determine whether it is feasible to identify a viable tradeoff between the two. RESULTS: All 19 de-identification scenarios significantly decreased re-identification risk. Nevertheless, the de-identification processes resulted in record suppression and complete masking of variables used as predictors, thereby compromising dataset utility. A significant correlation was observed only between the re-identification reduction rates and the ARX utility scores. CONCLUSIONS: As the importance of health data analysis increases, so does the need for effective privacy protection methods. While existing guidelines provide a basis for de-identifying datasets, achieving a balance between high privacy and utility is a complex task that requires understanding the data's intended use and involving input from data users. This approach could help find a suitable compromise between data privacy and utility.

Subject(s)

Confidentiality , Data Anonymization , Humans , Confidentiality/standards , Emergency Service, Hospital , Length of Stay , Republic of Korea , Male

7.

Human-Unrecognizable Differential Private Noised Image Generation Method.

Kim, Hyeong-Geon; Shin, Jinmyeong; Choi, Yoon-Ho.

Sensors (Basel) ; 24(10)2024 May 16.

Article in English | MEDLINE | ID: mdl-38794019

ABSTRACT

Differential privacy has emerged as a practical technique for privacy-preserving deep learning. However, recent studies on privacy attacks have demonstrated vulnerabilities in the existing differential privacy implementations for deep models. While encryption-based methods offer robust security, their computational overheads are often prohibitive. To address these challenges, we propose a novel differential privacy-based image generation method. Our approach employs two distinct noise types: one makes the image unrecognizable to humans, preserving privacy during transmission, while the other maintains features essential for machine learning analysis. This allows the deep learning service to provide accurate results, without compromising data privacy. We demonstrate the feasibility of our method on the CIFAR100 dataset, which offers a realistic complexity for evaluation.

8.

A Method for Efficient De-identification of DICOM Metadata and Burned-in Pixel Text.

Macdonald, Jacob A; Morgan, Katelyn R; Konkel, Brandon; Abdullah, Kulsoom; Martin, Mark; Ennis, Cory; Lo, Joseph Y; Stroo, Marissa; Snyder, Denise C; Bashir, Mustafa R.

J Imaging Inform Med ; 2024 Apr 08.

Article in English | MEDLINE | ID: mdl-38587767

ABSTRACT

De-identification of DICOM images is an essential component of medical image research. While many established methods exist for the safe removal of protected health information (PHI) in DICOM metadata, approaches for the removal of PHI "burned-in" to image pixel data are typically manual, and automated high-throughput approaches are not well validated. Emerging optical character recognition (OCR) models can potentially detect and remove PHI-bearing text from medical images but are very time-consuming to run on the high volume of images found in typical research studies. We present a data processing method that performs metadata de-identification for all images combined with a targeted approach to only apply OCR to images with a high likelihood of burned-in text. The method was validated on a dataset of 415,182 images across ten modalities representative of the de-identification requests submitted at our institution over a 20-year span. Of the 12,578 images in this dataset with burned-in text of any kind, only 10 passed undetected with the method. OCR was only required for 6050 images (1.5% of the dataset).

9.

De-identification of clinical free text using natural language processing: A systematic review of current approaches.

Kovacevic, Aleksandar; Basaragin, Bojana; Milosevic, Nikola; Nenadic, Goran.

Artif Intell Med ; 151: 102845, 2024 May.

Article in English | MEDLINE | ID: mdl-38555848

ABSTRACT

BACKGROUND: Electronic health records (EHRs) are a valuable resource for data-driven medical research. However, the presence of protected health information (PHI) makes EHRs unsuitable to be shared for research purposes. De-identification, i.e. the process of removing PHI is a critical step in making EHR data accessible. Natural language processing has repeatedly demonstrated its feasibility in automating the de-identification process. OBJECTIVES: Our study aims to provide systematic evidence on how the de-identification of clinical free text written in English has evolved in the last thirteen years, and to report on the performances and limitations of the current state-of-the-art systems for the English language. In addition, we aim to identify challenges and potential research opportunities in this field. METHODS: A systematic search in PubMed, Web of Science, and the DBLP was conducted for studies published between January 2010 and February 2023. Titles and abstracts were examined to identify the relevant studies. Selected studies were then analysed in-depth, and information was collected on de-identification methodologies, data sources, and measured performance. RESULTS: A total of 2125 publications were identified for the title and abstract screening. 69 studies were found to be relevant. Machine learning (37 studies) and hybrid (26 studies) approaches are predominant, while six studies relied only on rules. The majority of the approaches were trained and evaluated on public corpora. The 2014 i2b2/UTHealth corpus is the most frequently used (36 studies), followed by the 2006 i2b2 (18 studies) and 2016 CEGS N-GRID (10 studies) corpora. CONCLUSION: Earlier de-identification approaches aimed at English were mainly rule and machine learning hybrids with extensive feature engineering and post-processing, while more recent performance improvements are due to feature-inferring recurrent neural networks. Current leading performance is achieved using attention-based neural models. Recent studies report state-of-the-art F1-scores (over 98 %) when evaluated in the manner usually adopted by the clinical natural language processing community. However, their performance needs to be more thoroughly assessed with different measures to judge their reliability to safely de-identify data in a real-world setting. Without additional manually labeled training data, state-of-the-art systems fail to generalise well across a wide range of clinical sub-domains.

Subject(s)

Electronic Health Records , Natural Language Processing , Humans , Machine Learning

10.

Automatic de-identification of French electronic health records: a cost-effective approach exploiting distant supervision and deep learning models.

Azzouzi, Mohamed El; Coatrieux, Gouenou; Bellafqira, Reda; Delamarre, Denis; Riou, Christine; Oubenali, Naima; Cabon, Sandie; Cuggia, Marc; Bouzillé, Guillaume.

BMC Med Inform Decis Mak ; 24(1): 54, 2024 Feb 16.

Article in English | MEDLINE | ID: mdl-38365677

ABSTRACT

BACKGROUND: Electronic health records (EHRs) contain valuable information for clinical research; however, the sensitive nature of healthcare data presents security and confidentiality challenges. De-identification is therefore essential to protect personal data in EHRs and comply with government regulations. Named entity recognition (NER) methods have been proposed to remove personal identifiers, with deep learning-based models achieving better performance. However, manual annotation of training data is time-consuming and expensive. The aim of this study was to develop an automatic de-identification pipeline for all kinds of clinical documents based on a distant supervised method to significantly reduce the cost of manual annotations and to facilitate the transfer of the de-identification pipeline to other clinical centers. METHODS: We proposed an automated annotation process for French clinical de-identification, exploiting data from the eHOP clinical data warehouse (CDW) of the CHU de Rennes and national knowledge bases, as well as other features. In addition, this paper proposes an assisted data annotation solution using the Prodigy annotation tool. This approach aims to reduce the cost required to create a reference corpus for the evaluation of state-of-the-art NER models. Finally, we evaluated and compared the effectiveness of different NER methods. RESULTS: A French de-identification dataset was developed in this work, based on EHRs provided by the eHOP CDW at Rennes University Hospital, France. The dataset was rich in terms of personal information, and the distribution of entities was quite similar in the training and test datasets. We evaluated a Bi-LSTM + CRF sequence labeling architecture, combined with Flair + FastText word embeddings, on a test set of manually annotated clinical reports. The model outperformed the other tested models with a significant F1 score of 96,96%, demonstrating the effectiveness of our automatic approach for deidentifying sensitive information. CONCLUSIONS: This study provides an automatic de-identification pipeline for clinical notes, which can facilitate the reuse of EHRs for secondary purposes such as clinical research. Our study highlights the importance of using advanced NLP techniques for effective de-identification, as well as the need for innovative solutions such as distant supervision to overcome the challenge of limited annotated data in the medical domain.

Subject(s)

Deep Learning , Humans , Data Anonymization , Electronic Health Records , Cost-Benefit Analysis , Confidentiality , Natural Language Processing

11.

High Accuracy Open-Source Clinical Data De-Identification: The CliniDeID Solution.

Meystre, Stéphane; Heider, Paul.

Stud Health Technol Inform ; 310: 1370-1371, 2024 Jan 25.

Article in English | MEDLINE | ID: mdl-38270048

ABSTRACT

Clinical data de-identification offers patient data privacy protection and eases reuse of clinical data. As an open-source solution to de-identify unstructured clinical text with high accuracy, CliniDeID applies an ensemble method combining deep and shallow machine learning with rule-based algorithms. It reached high recall and precision when recently evaluated with a selection of clinical text corpora.

Subject(s)

Algorithms , Machine Learning , Humans

12.

Exchanging words: Engaging the challenges of sharing qualitative research data.

DuBois, James M; Mozersky, Jessica; Parsons, Meredith; Walsh, Heidi A; Friedrich, Annie; Pienta, Amy.

Proc Natl Acad Sci U S A ; 120(43): e2206981120, 2023 Oct 24.

Article in English | MEDLINE | ID: mdl-37831745

ABSTRACT

In January 2023, a new NIH policy on data sharing went into effect. The policy applies to both quantitative and qualitative research (QR) data such as data from interviews or focus groups. QR data are often sensitive and difficult to deidentify, and thus have rarely been shared in the United States. Over the past 5 y, our research team has engaged stakeholders on QR data sharing, developed software to support data deidentification, produced guidance, and collaborated with the ICPSR data repository to pilot the deposit of 30 QR datasets. In this perspective article, we share important lessons learned by addressing eight clusters of questions on issues such as where, when, and what to share; how to deidentify data and support high-quality secondary use; budgeting for data sharing; and the permissions needed to share data. We also offer a brief assessment of the state of preparedness of data repositories, QR journals, and QR textbooks to support data sharing. While QR data sharing could yield important benefits to the research community, we quickly need to develop enforceable standards, expertise, and resources to support responsible QR data sharing. Absent these resources, we risk violating participant confidentiality and wasting a significant amount of time and funding on data that are not useful for either secondary use or data transparency and verification.

13.

OBIA: An Open Biomedical Imaging Archive.

Jin, Enhui; Zhao, Dongli; Wu, Gangao; Zhu, Junwei; Wang, Zhonghuang; Wei, Zhiyao; Zhang, Sisi; Wang, Anke; Tang, Bixia; Chen, Xu; Sun, Yanling; Zhang, Zhe; Zhao, Wenming; Meng, Yuanguang.

Genomics Proteomics Bioinformatics ; 21(5): 1059-1065, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37806555

ABSTRACT

With the development of artificial intelligence (AI) technologies, biomedical imaging data play an important role in scientific research and clinical application, but the available resources are limited. Here we present Open Biomedical Imaging Archive (OBIA), a repository for archiving biomedical imaging and related clinical data. OBIA adopts five data objects (Collection, Individual, Study, Series, and Image) for data organization, and accepts the submission of biomedical images of multiple modalities, organs, and diseases. In order to protect personal privacy, OBIA has formulated a unified de-identification and quality control process. In addition, OBIA provides friendly and intuitive web interfaces for data submission, browsing, and retrieval, as well as image retrieval. As of September 2023, OBIA has housed data for a total of 937 individuals, 4136 studies, 24,701 series, and 1,938,309 images covering 9 modalities and 30 anatomical sites. Collectively, OBIA provides a reliable platform for biomedical imaging data management and offers free open access to all publicly available data to support research activities throughout the world. OBIA can be accessed at https://ngdc.cncb.ac.cn/obia.

Subject(s)

Artificial Intelligence , Humans

14.

Effects of de-facing software mri_reface on utility of imaging biomarkers used in Alzheimer's disease research.

Schwarz, Christopher G; Kremers, Walter K; Weigand, Stephen D; Prakaashana, Carl M; Senjem, Matthew L; Przybelski, Scott A; Lowe, Val J; Gunter, Jeffrey L; Kantarci, Kejal; Vemuri, Prashanthi; Graff-Radford, Jonathan; Petersen, Ronald C; Knopman, David S; Jack, Clifford R.

Neuroimage Clin ; 40: 103507, 2023.

Article in English | MEDLINE | ID: mdl-37703605

ABSTRACT

Brain imaging research studies increasingly use "de-facing" software to remove or replace facial imagery before public data sharing. Several works have studied the effects of de-facing software on brain imaging biomarkers by directly comparing automated measurements from unmodified vs de-faced images, but most research brain images are used in analyses of correlations with cognitive measurements or clinical statuses, and the effects of de-facing on these types of imaging-to-cognition correlations has not been measured. In this work, we focused on brain imaging measures of amyloid (A), tau (T), neurodegeneration (N), and vascular (V) measures used in Alzheimer's Disease (AD) research. We created a retrospective sample of participants from three age- and sex-matched clinical groups (cognitively unimpaired, mild cognitive impairment, and AD dementia, and we performed region- and voxel-wise analyses of: hippocampal volume (N), white matter hyperintensity volume (V), amyloid PET (A), and tau PET (T) measures, each from multiple software pipelines, on their ability to separate cognitively defined groups and their degrees of correlation with age and Clinical Dementia Rating (CDR)-Sum of Boxes (CDR-SB). We performed each of these analyses twice: once with unmodified images and once with images de-faced with leading de-facing software mri_reface, and we directly compared the findings and their statistical strengths between the original vs. the de-faced images. Analyses with original and with de-faced images had very high agreement. There were no significant differences between any voxel-wise comparisons. Among region-wise comparisons, only three out of 55 correlations were significantly different between original and de-faced images, and these were not significant after correction for multiple comparisons. Overall, the statistical power of the imaging data for AD biomarkers was almost identical between unmodified and de-faced images, and their analyses results were extremely consistent.

Subject(s)

Alzheimer Disease , Cognitive Dysfunction , Humans , Alzheimer Disease/diagnostic imaging , Retrospective Studies , Brain/diagnostic imaging , Brain/metabolism , Cognitive Dysfunction/diagnostic imaging , Positron-Emission Tomography/methods , Biomarkers , Amyloid beta-Peptides/metabolism , Magnetic Resonance Imaging , tau Proteins

15.

A certified de-identification system for all clinical text documents for information extraction at scale.

Radhakrishnan, Lakshmi; Schenk, Gundolf; Muenzen, Kathleen; Oskotsky, Boris; Ashouri Choshali, Habibeh; Plunkett, Thomas; Israni, Sharat; Butte, Atul J.

JAMIA Open ; 6(3): ooad045, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37416449

ABSTRACT

Objectives: Clinical notes are a veritable treasure trove of information on a patient's disease progression, medical history, and treatment plans, yet are locked in secured databases accessible for research only after extensive ethics review. Removing personally identifying and protected health information (PII/PHI) from the records can reduce the need for additional Institutional Review Boards (IRB) reviews. In this project, our goals were to: (1) develop a robust and scalable clinical text de-identification pipeline that is compliant with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule for de-identification standards and (2) share routinely updated de-identified clinical notes with researchers. Materials and Methods: Building on our open-source de-identification software called Philter, we added features to: (1) make the algorithm and the de-identified data HIPAA compliant, which also implies type 2 error-free redaction, as certified via external audit; (2) reduce over-redaction errors; and (3) normalize and shift date PHI. We also established a streamlined de-identification pipeline using MongoDB to automatically extract clinical notes and provide truly de-identified notes to researchers with periodic monthly refreshes at our institution. Results: To the best of our knowledge, the Philter V1.0 pipeline is currently the first and only certified, de-identified redaction pipeline that makes clinical notes available to researchers for nonhuman subjects' research, without further IRB approval needed. To date, we have made over 130 million certified de-identified clinical notes available to over 600 UCSF researchers. These notes were collected over the past 40 years, and represent data from 2757016 UCSF patients.

16.

The way to data: opinions and recommendations for the provision of health data for secondary use.

Tichopád, Ales; Augustynek, Martin; Benes, Jirí; Dlouhý, Martin; Dolezal, Tomás; Horáková, Dana; Krsek, Michal; Lhotska, Lenka; Panzner, Petr; Penhaker, Marek; Petr, Miroslav; Pitha, Jan; Popesko, Boris; Rozánek, Martin; Táborský, Milos; Vrablík, Michal.

Cas Lek Cesk ; 162(2-3): 61-66, 2023.

Article in English | MEDLINE | ID: mdl-37474288

ABSTRACT

Healthcare data held by state-run organisations is a valuable intangible asset for society. Its use should be a priority for its administrators and the state. A completely paternalistic approach by administrators and the state is undesirable, however much it aims to protect the privacy rights of persons registered in databases. In line with European policies and the global trend, these measures should not outweigh the social benefit that arises from the analysis of these data if the technical possibilities exist to sufficiently protect the privacy rights of individuals. Czech society is having an intense discussion on the topic, but according to the authors, it is insufficiently based on facts and lacks clearly articulated opinions of the expert public. The aim of this article is to fill these gaps. Data anonymization techniques provide a solution to protect individuals' privacy rights while preserving the scientific value of the data. The risk of identifying individuals in anonymised data sets is scalable and can be minimised depending on the type and content of the data and its use by the specific applicant. Finding the optimal form and scope of deidentified data requires competence and knowledge on the part of both the applicant and the administrator. It is in the interest of the applicant, the administrator, as well as the protected persons in the databases that both parties show willingness and have the ability and expertise to communicate during the application and its processing.

Subject(s)

Confidentiality , Data Anonymization , Humans , Privacy

17.

A face-off of MRI research sequences by their need for de-facing.

Schwarz, Christopher G; Kremers, Walter K; Arani, Arvin; Savvides, Marios; Reid, Robert I; Gunter, Jeffrey L; Senjem, Matthew L; Cogswell, Petrice M; Vemuri, Prashanthi; Kantarci, Kejal; Knopman, David S; Petersen, Ronald C; Jack, Clifford R.

Neuroimage ; 276: 120199, 2023 08 01.

Article in English | MEDLINE | ID: mdl-37269958

ABSTRACT

It is now widely known that research brain MRI, CT, and PET images may potentially be re-identified using face recognition, and this potential can be reduced by applying face-deidentification ("de-facing") software. However, for research MRI sequences beyond T1-weighted (T1-w) and T2-FLAIR structural images, the potential for re-identification and quantitative effects of de-facing are both unknown, and the effects of de-facing T2-FLAIR are also unknown. In this work we examine these questions (where applicable) for T1-w, T2-w, T2*-w, T2-FLAIR, diffusion MRI (dMRI), functional MRI (fMRI), and arterial spin labelling (ASL) sequences. Among current-generation, vendor-product research-grade sequences, we found that 3D T1-w, T2-w, and T2-FLAIR were highly re-identifiable (96-98%). 2D T2-FLAIR and 3D multi-echo GRE (ME-GRE) were also moderately re-identifiable (44-45%), and our derived T2* from ME-GRE (comparable to a typical 2D T2*) matched at only 10%. Finally, diffusion, functional and ASL images were each minimally re-identifiable (0-8%). Applying de-facing with mri_reface version 0.3 reduced successful re-identification to ≤8%, while differential effects on popular quantitative pipelines for cortical volumes and thickness, white matter hyperintensities (WMH), and quantitative susceptibility mapping (QSM) measurements were all either comparable with or smaller than scan-rescan estimates. Consequently, high-quality de-facing software can greatly reduce the risk of re-identification for identifiable MRI sequences with only negligible effects on automated intracranial measurements. The current-generation echo-planar and spiral sequences (dMRI, fMRI, and ASL) each had minimal match rates, suggesting that they have a low risk of re-identification and can be shared without de-facing, but this conclusion should be re-evaluated if they are acquired without fat suppression, with a full-face scan coverage, or if newer developments reduce the current levels of artifacts and distortion around the face.

Subject(s)

Diffusion Magnetic Resonance Imaging , Magnetic Resonance Imaging , Humans , Magnetic Resonance Imaging/methods , Diffusion Magnetic Resonance Imaging/methods , Neuroimaging , Artifacts , Spin Labels

18.

De-Identification Technique with Facial Deformation in Head CT Images.

Uchida, Tatsuya; Kin, Taichi; Saito, Toki; Shono, Naoyuki; Kiyofuji, Satoshi; Koike, Tsukasa; Sato, Katsuya; Niwa, Ryoko; Takashima, Ikumi; Oyama, Hiroshi; Saito, Nobuhito.

Neuroinformatics ; 21(3): 575-587, 2023 07.

Article in English | MEDLINE | ID: mdl-37226013

ABSTRACT

Head CT, which includes the facial region, can visualize faces using 3D reconstruction, raising concern that individuals may be identified. We developed a new de-identification technique that distorts the faces of head CT images. Head CT images that were distorted were labeled as "original images" and the others as "reference images." Reconstructed face models of both were created, with 400 control points on the facial surfaces. All voxel positions in the original image were moved and deformed according to the deformation vectors required to move to corresponding control points on the reference image. Three face detection and identification programs were used to determine face detection rates and match confidence scores. Intracranial volume equivalence tests were performed before and after deformation, and correlation coefficients between intracranial pixel value histograms were calculated. Output accuracy of the deep learning model for intracranial segmentation was determined using Dice Similarity Coefficient before and after deformation. The face detection rate was 100%, and match confidence scores were < 90. Equivalence testing of the intracranial volume revealed statistical equivalence before and after deformation. The median correlation coefficient between intracranial pixel value histograms before and after deformation was 0.9965, indicating high similarity. Dice Similarity Coefficient values of original and deformed images were statistically equivalent. We developed a technique to de-identify head CT images while maintaining the accuracy of deep-learning models. The technique involves deforming images to prevent face identification, with minimal changes to the original information.

Subject(s)

Data Anonymization , Image Processing, Computer-Assisted , Humans , Image Processing, Computer-Assisted/methods , Tomography, X-Ray Computed/methods , Head/diagnostic imaging , Algorithms

19.

Utility-Preserving Anonymization in a Real-World Scenario: Evidence from the German Chronic Kidney Disease (GCKD) Study.

Pilgram, Lisa; Schäffner, Elke; Eckardt, Kai-Uwe; Prasser, Fabian.

Stud Health Technol Inform ; 302: 28-32, 2023 May 18.

Article in English | MEDLINE | ID: mdl-37203603

ABSTRACT

Data sharing provides benefits in terms of transparency and innovation. Privacy concerns in this context can be addressed by anonymization techniques. In our study, we evaluated anonymization approaches which transform structured data in a real-world scenario of a chronic kidney disease cohort study and checked for replicability of research results via 95% CI overlap in two differently anonymized datasets with different protection degrees. Calculated 95% CI overlapped in both applied anonymization approaches and visual comparison presented similar results. Thus, in our use case scenario, research results were not relevantly impacted by anonymization, which adds to the growing evidence of utility-preserving anonymization techniques.

Subject(s)

Data Anonymization , Privacy , Humans , Cohort Studies , Information Dissemination , Organizations

20.

De-identified Bayesian personal identity matching for privacy-preserving record linkage despite errors: development and validation.

Cardinal, Rudolf N; Moore, Anna; Burchell, Martin; Lewis, Jonathan R.

BMC Med Inform Decis Mak ; 23(1): 85, 2023 05 05.

Article in English | MEDLINE | ID: mdl-37147600

ABSTRACT

BACKGROUND: Epidemiological research may require linkage of information from multiple organizations. This can bring two problems: (1) the information governance desirability of linkage without sharing direct identifiers, and (2) a requirement to link databases without a common person-unique identifier. METHODS: We develop a Bayesian matching technique to solve both. We provide an open-source software implementation capable of de-identified probabilistic matching despite discrepancies, via fuzzy representations and complete mismatches, plus de-identified deterministic matching if required. We validate the technique by testing linkage between multiple medical records systems in a UK National Health Service Trust, examining the effects of decision thresholds on linkage accuracy. We report demographic factors associated with correct linkage. RESULTS: The system supports dates of birth (DOBs), forenames, surnames, three-state gender, and UK postcodes. Fuzzy representations are supported for all except gender, and there is support for additional transformations, such as accent misrepresentation, variation for multi-part surnames, and name re-ordering. Calculated log odds predicted a proband's presence in the sample database with an area under the receiver operating curve of 0.997-0.999 for non-self database comparisons. Log odds were converted to a decision via a consideration threshold Î¸ and a leader advantage threshold Î´. Defaults were chosen to penalize misidentification 20-fold versus linkage failure. By default, complete DOB mismatches were disallowed for computational efficiency. At these settings, for non-self database comparisons, the mean probability of a proband being correctly declared to be in the sample was 0.965 (range 0.931-0.994), and the misidentification rate was 0.00249 (range 0.00123-0.00429). Correct linkage was positively associated with male gender, Black or mixed ethnicity, and the presence of diagnostic codes for severe mental illnesses or other mental disorders, and negatively associated with birth year, unknown ethnicity, residential area deprivation, and presence of a pseudopostcode (e.g. indicating homelessness). Accuracy rates would be improved further if person-unique identifiers were also used, as supported by the software. Our two largest databases were linked in 44 min via an interpreted programming language. CONCLUSIONS: Fully de-identified matching with high accuracy is feasible without a person-unique identifier and appropriate software is freely available.

Subject(s)

Medical Record Linkage , Privacy , Humans , Male , Bayes Theorem , State Medicine , Software

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL