1.
JMIR AI ; 3: e52615, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38875595

ABSTRACT

Synthetic electronic health record (EHR) data generation has been increasingly recognized as an important solution to expand the accessibility and maximize the value of private health data on a large scale. Recent advances in machine learning have facilitated more accurate modeling for complex and high-dimensional data, thereby greatly enhancing the data quality of synthetic EHR data. Among various approaches, generative adversarial networks (GANs) have become the main technical path in the literature due to their ability to capture the statistical characteristics of real data. However, there is a scarcity of detailed guidance within the domain regarding the development procedures of synthetic EHR data. The objective of this tutorial is to present a transparent and reproducible process for generating structured synthetic EHR data using a publicly accessible EHR data set as an example. We cover the topics of GAN architecture, EHR data types and representation, data preprocessing, GAN training, synthetic data generation and postprocessing, and data quality evaluation. We conclude this tutorial by discussing multiple important issues and future opportunities in this domain. The source code of the entire process has been made publicly available.

2.
Annu Rev Biomed Data Sci ; 6: 443-464, 2023 08 10.
Article in English | MEDLINE | ID: mdl-37561600

ABSTRACT

The All of Us Research Program's Data and Research Center (DRC) was established to help acquire, curate, and provide access to one of the world's largest and most diverse datasets for precision medicine research. Already, over 500,000 participants are enrolled in All of Us, 80% of whom are underrepresented in biomedical research, and data are being analyzed by a community of over 2,300 researchers. The DRC created this thriving data ecosystem by collaborating with engaged participants, innovative program partners, and empowered researchers. In this review, we first describe how the DRC is organized to meet the needs of this broad group of stakeholders. We then outline guiding principles, common challenges, and innovative approaches used to build the All of Us data ecosystem. Finally, we share lessons learned to help others navigate important decisions and trade-offs in building a modern biomedical data platform.


Subjects
Biomedical Research; Population Health; Humans; Ecosystem; Precision Medicine
3.
J Am Med Inform Assoc ; 30(5): 907-914, 2023 04 19.
Article in English | MEDLINE | ID: mdl-36809550

ABSTRACT

OBJECTIVE: The All of Us Research Program makes individual-level data available to researchers while protecting the participants' privacy. This article describes the protections embedded in the multistep access process, with a particular focus on how the data were transformed to meet generally accepted re-identification risk levels. METHODS: At the time of the study, the resource consisted of 329 084 participants. Systematic amendments were applied to the data to mitigate re-identification risk (eg, generalization of geographic regions, suppression of public events, and randomization of dates). We computed the re-identification risk for each participant using a state-of-the-art adversarial model specifically assuming that it is known that someone is a participant in the program. We confirmed the expected risk is no greater than 0.09, a threshold that is consistent with guidelines from various US state and federal agencies. We further investigated how risk varied as a function of participant demographics. RESULTS: The results indicated that the 95th percentile of the re-identification risk of all the participants is below current thresholds. At the same time, we observed that risk levels were higher for certain race, ethnicity, and gender groups. CONCLUSIONS: While the re-identification risk was sufficiently low, this does not imply that the system is devoid of risk. Rather, All of Us uses a multipronged data protection strategy that includes strong authentication practices, active monitoring of data misuse, and penalization mechanisms for users who violate terms of service.


Subjects
Population Health; Humans; Male; Female; Privacy; Risk Management; Computer Security; Research Personnel
4.
J Am Med Inform Assoc ; 29(9): 1584-1592, 2022 08 16.
Article in English | MEDLINE | ID: mdl-35641135

ABSTRACT

OBJECTIVE: Deep learning models for clinical event forecasting (CEF) based on a patient's medical history have improved significantly over the past decade. However, their transition into practice has been limited, particularly for diseases with very low prevalence. In this paper, we introduce CEF-CL, a novel method based on contrastive learning to forecast in the face of a limited number of positive training instances. MATERIALS AND METHODS: CEF-CL consists of two primary components: (1) unsupervised contrastive learning for patient representation and (2) supervised transfer learning over the derived representation. We evaluate the new method along with state-of-the-art model architectures trained in a supervised manner with electronic health records data from Vanderbilt University Medical Center and the All of Us Research Program, covering 48 000 and 16 000 patients, respectively. We assess forecasting for over 100 diagnosis codes with respect to their area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC). We investigate the correlation between forecasting performance improvement and code prevalence via a Wald test. RESULTS: CEF-CL achieved an average AUROC and AUPRC performance improvement over the state-of-the-art of 8.0%-9.3% and 11.7%-32.0%, respectively. The improvement in AUROC was negatively correlated with the number of positive training instances (P < .001). CONCLUSION: This investigation indicates that clinical event forecasting can be improved significantly through contrastive representation learning, especially when the number of positive training instances is small.


Subjects
Population Health; Electronic Health Records; Forecasting; Humans
5.
AMIA Jt Summits Transl Sci Proc ; 2021: 132-141, 2021.
Article in English | MEDLINE | ID: mdl-34457127

ABSTRACT

Deep learning architectures have an extremely high capacity for modeling complex data in a wide variety of domains. However, these architectures have been limited in their ability to support complex prediction problems using insurance claims data, such as readmission at 30 days, mainly due to data sparsity. Consequently, classical machine learning methods, especially those that embed domain knowledge in handcrafted features, are often on par with, and sometimes outperform, deep learning approaches. In this paper, we illustrate how the potential of deep learning can be achieved by blending domain knowledge within deep learning architectures to predict adverse events at hospital discharge, including readmissions. More specifically, we introduce a learning architecture that fuses a representation of patient data, computed by a self-attention-based recurrent neural network, with clinically relevant features. We conduct extensive experiments on a large claims dataset and show that the blended method outperforms the standard machine learning approaches.


Subjects
Machine Learning; Patient Discharge; Hospitals; Humans; Neural Networks, Computer
6.
J Am Med Inform Assoc ; 28(4): 744-752, 2021 03 18.
Article in English | MEDLINE | ID: mdl-33448306

ABSTRACT

OBJECTIVE: Re-identification risk methods for biomedical data often assume a worst case, in which attackers know all identifiable features (eg, age and race) about a subject. Yet, worst-case adversarial modeling can overestimate risk and induce heavy editing of shared data. The objective of this study is to introduce a framework for assessing the risk considering the attacker's resources and capabilities. MATERIALS AND METHODS: We integrate 3 established risk measures (ie, prosecutor, journalist, and marketer risks) and compute re-identification probabilities for data subjects. This probability is dependent on an attacker's capabilities (eg, ability to obtain external identified resources) and the subject's decision on whether to reveal their participation in a dataset. We illustrate the framework through case studies using data from over 1 000 000 patients from Vanderbilt University Medical Center and show how re-identification risk changes when attackers are pragmatic and use 2 known resources for attack: (1) voter registration lists and (2) social media posts. RESULTS: Our framework illustrates that the risk is substantially smaller in the pragmatic scenarios than in the worst case. Our experiments yield a median worst-case risk of 0.987 (where 0 is least risky and 1 is most risky); however, the median reduction in risk was 90.1% in the voter registration scenario and 100% in the social media posts scenario. Notably, these observations hold true for a wide range of adversarial capabilities. CONCLUSIONS: This research illustrates that re-identification risk is situationally dependent and that appropriate adversarial modeling may permit biomedical data sharing on a wider scale than is currently the case.
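As a rough illustration of two of the risk measures named in this abstract, the sketch below uses a common textbook formulation, which is an assumption here and not necessarily the framework's own definition: prosecutor risk for a record is 1/f, where f is the number of records sharing its quasi-identifier values, and marketer risk is the number of distinct quasi-identifier combinations divided by the number of records. Journalist risk is omitted because it needs class sizes from an external population. The field names (`age`, `race`, `zip3`) and the toy records are hypothetical.

```python
from collections import Counter


def equivalence_classes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def prosecutor_risks(records, quasi_identifiers):
    """Per-record risk 1/f, where f is the size of the record's equivalence class."""
    classes = equivalence_classes(records, quasi_identifiers)
    return [1.0 / classes[tuple(r[q] for q in quasi_identifiers)] for r in records]


def marketer_risk(records, quasi_identifiers):
    """Expected fraction of records matched: distinct classes / number of records."""
    classes = equivalence_classes(records, quasi_identifiers)
    return len(classes) / len(records)


# Hypothetical toy dataset; field names are illustrative only.
records = [
    {"age": 34, "race": "A", "zip3": "372"},
    {"age": 34, "race": "A", "zip3": "372"},
    {"age": 51, "race": "B", "zip3": "372"},
    {"age": 67, "race": "A", "zip3": "371"},
]
qi = ["age", "race", "zip3"]

risks = prosecutor_risks(records, qi)   # one value per record
overall = marketer_risk(records, qi)    # one value for the dataset
print(risks, overall)
```

Under this worst-case view, every unique record has a risk of 1. The abstract's argument is that weighting such values by what an attacker can realistically obtain lowers them substantially.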


Subjects
Computer Security; Confidentiality; Data Anonymization; Probability; Humans; Risk; Risk Assessment
7.
J Am Med Inform Assoc ; 27(9): 1374-1382, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32930712

ABSTRACT

OBJECTIVE: Effective, scalable de-identification of personally identifying information (PII) for information-rich clinical text is critical to support secondary use, but no method is 100% effective. The hiding-in-plain-sight (HIPS) approach attempts to solve this "residual PII problem." HIPS replaces PII tagged by a de-identification system with realistic but fictitious (resynthesized) content, making it harder to detect remaining unredacted PII. MATERIALS AND METHODS: Using 2000 representative clinical documents from 2 healthcare settings (4000 total), we used a novel method to generate 2 de-identified 100-document corpora (200 documents total) in which PII tagged by a typical automated machine-learned tagger was replaced by HIPS-resynthesized content. Four readers conducted aggressive reidentification attacks to isolate leaked PII: 2 readers from within the originating institution and 2 external readers. RESULTS: Overall, mean recall of leaked PII was 26.8% and mean precision was 37.2%. Mean recall was 9% (mean precision = 37%) for patient ages, 32% (mean precision = 26%) for dates, 25% (mean precision = 37%) for doctor names, 45% (mean precision = 55%) for organization names, and 23% (mean precision = 57%) for patient names. Recall was 32% (precision = 40%) for internal and 22% (precision = 33%) for external readers. DISCUSSION AND CONCLUSIONS: Approximately 70% of leaked PII "hiding" in a corpus de-identified with HIPS resynthesis is resilient to detection by human readers in a realistic, aggressive reidentification attack scenario, more than double the rate reported in previous studies but less than the rate reported for an attack assisted by machine learning methods.


Subjects
Confidentiality; Data Anonymization; Electronic Health Records; Computer Security; Humans; Natural Language Processing
8.
AMIA Annu Symp Proc ; 2020: 1335-1344, 2020.
Article in English | MEDLINE | ID: mdl-33936510

ABSTRACT

Sharing electronic health records (EHRs) on a large scale may lead to privacy intrusions. Recent research has shown that risks may be mitigated by simulating EHRs through generative adversarial network (GAN) frameworks. Yet the methods developed to date are limited because they 1) focus on generating data of a single type (e.g., diagnosis codes), neglecting other data types (e.g., demographics, procedures or vital signs), and 2) do not represent constraints between features. In this paper, we introduce a method to simulate EHRs composed of multiple data types by 1) refining the GAN model, 2) accounting for feature constraints, and 3) incorporating key utility measures for such generation tasks. Our analysis with over 770,000 EHRs from Vanderbilt University Medical Center demonstrates that the new model achieves higher performance in terms of retaining basic statistics, cross-feature correlations, latent structural properties, feature constraints and associated patterns from real data, without sacrificing privacy.


Subjects
Electronic Health Records; Female; Humans; Male; Privacy; Research Design; Vital Signs
9.
J Am Med Inform Assoc ; 26(12): 1536-1544, 2019 12 01.
Article in English | MEDLINE | ID: mdl-31390016

ABSTRACT

OBJECTIVE: Clinical corpora can be deidentified using a combination of machine-learned automated taggers and hiding in plain sight (HIPS) resynthesis. The latter replaces detected personally identifiable information (PII) with random surrogates, allowing leaked PII to blend in or "hide in plain sight." We evaluated the extent to which a malicious attacker could expose leaked PII in such a corpus. MATERIALS AND METHODS: We modeled a scenario where an institution (the defender) externally shared an 800-note corpus of actual outpatient clinical encounter notes from a large, integrated health care delivery system in Washington State. These notes were deidentified by a machine-learned PII tagger and HIPS resynthesis. A malicious attacker obtained the corpus and performed a parrot attack intending to expose leaked PII in it. Specifically, the attacker mimicked the defender's process by manually annotating all PII-like content in half of the released corpus, training a PII tagger on these data, and using the trained model to tag the remaining encounter notes. The attacker hypothesized that untagged identifiers would be leaked PII, discoverable by manual review. We evaluated the attacker's success using measures of leak-detection rate and accuracy. RESULTS: The attacker correctly hypothesized that 211 (68%) of 310 actual PII leaks in the corpus were leaks, and wrongly hypothesized that 191 resynthesized PII instances were also leaks. One-third of actual leaks remained undetected. DISCUSSION AND CONCLUSION: A malicious parrot attack to reveal leaked PII in clinical text deidentified by machine-learned HIPS resynthesis can attenuate but not eliminate the protective effect of HIPS deidentification.
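The counts reported in the RESULTS can be turned into the usual detection metrics with a few lines of arithmetic. The sketch below only restates the abstract's own numbers. The precision figure is derived here from those counts and is not a value stated in the abstract, and the variable names are illustrative.

```python
# Counts taken from the abstract's RESULTS section.
actual_leaks = 310          # real PII leaks present in the corpus
true_positives = 211        # leaks the attacker correctly flagged
false_positives = 191       # resynthesized surrogates wrongly flagged as leaks

# Recall: share of real leaks that the attacker found.
recall = true_positives / actual_leaks

# Precision: share of the attacker's guesses that were real leaks
# (derived from the counts above, not reported in the abstract).
precision = true_positives / (true_positives + false_positives)

# Leaks that stayed hidden after the attack.
missed = actual_leaks - true_positives
missed_fraction = missed / actual_leaks

print(f"recall={recall:.3f} precision={precision:.3f} missed={missed} ({missed_fraction:.1%})")
```

On these counts, roughly half of the attacker's guesses are false alarms. This is the sense in which the attack attenuates the protection without eliminating it.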


Subjects
Computer Security; Confidentiality; Data Anonymization; Electronic Health Records; Machine Learning; Personally Identifiable Information; Ambulatory Care Facilities; Delivery of Health Care; Humans; Washington
10.
Int J Med Inform ; 83(7): 495-506, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24845147

ABSTRACT

OBJECTIVE: Models of healthcare organizations (HCOs) are often defined up front by a select few administrative officials and managers. However, given the size and complexity of modern healthcare systems, this practice does not scale easily. The goal of this work is to investigate the extent to which organizational relationships can be automatically learned from utilization patterns of electronic health record (EHR) systems. METHOD: We designed an online survey to solicit the perspectives of employees of a large academic medical center. We surveyed employees from two administrative areas: (1) Coding & Charge Entry and (2) Medical Information Services and two clinical areas: (3) Anesthesiology and (4) Psychiatry. To test our hypotheses, we selected two administrative units that have work-related responsibilities with electronic records; however, for the clinical areas we selected two disciplines with very different patient responsibilities and whose accesses and people who accessed were similar. We provided each group of employees with questions regarding the chance of interaction between areas in the medical center in the form of association rules (e.g., Given someone from Coding & Charge Entry accessed a patient's record, what is the chance that someone from Medical Information Services accesses the same record?). We compared the respondent predictions with the rules learned from actual EHR utilization using linear mixed-effects regression models. RESULTS: The findings from our survey confirm that medical center employees can distinguish between association rules of high and non-high likelihood when their own area is involved. Moreover, they can make such distinctions for any HCO area in this survey. It was further observed that, with respect to highly likely interactions, respondents from certain areas were significantly better than other respondents at making such distinctions and certain areas' associations were more distinguishable than others. CONCLUSIONS: These results illustrate that EHR utilization patterns may be consistent with the expectations of HCO employees. Our findings show that certain areas in the HCO are easier than others for employees to assess, which suggests that automated learning strategies may yield more accurate models of healthcare organizations than those based on the perspectives of a select few individuals.
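The association rules mentioned in the METHOD section can be read as conditional frequencies over access logs. The following is a minimal sketch under stated assumptions: the log is a list of hypothetical `(record_id, area)` pairs, and the rule's likelihood is taken to be the standard association-rule confidence, which may differ from how the study actually learned its rules.

```python
from collections import defaultdict


def rule_confidence(accesses, antecedent, consequent):
    """Confidence of the rule 'antecedent area accessed a record -> consequent
    area accessed the same record', estimated from (record_id, area) pairs."""
    areas_by_record = defaultdict(set)
    for record_id, area in accesses:
        areas_by_record[record_id].add(area)

    # Records touched by the antecedent area.
    with_antecedent = [a for a in areas_by_record.values() if antecedent in a]
    if not with_antecedent:
        return None  # rule is undefined when the antecedent never occurs

    # Of those, the ones also touched by the consequent area.
    both = sum(1 for a in with_antecedent if consequent in a)
    return both / len(with_antecedent)


# Hypothetical toy access log; record identifiers are made up.
accesses = [
    ("p1", "Coding & Charge Entry"),
    ("p1", "Medical Information Services"),
    ("p2", "Coding & Charge Entry"),
    ("p3", "Anesthesiology"),
    ("p3", "Coding & Charge Entry"),
    ("p3", "Medical Information Services"),
    ("p4", "Psychiatry"),
]

conf = rule_confidence(accesses, "Coding & Charge Entry", "Medical Information Services")
print(conf)
```

A survey respondent's estimate for the same question could then be compared against `conf`, which is the kind of comparison the abstract describes.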


Subjects
Delivery of Health Care, Integrated/organization & administration; Electronic Health Records/statistics & numerical data; Health Personnel; Patient Care Team/organization & administration; Systems Integration; Humans; Organizational Culture
11.
AMIA Annu Symp Proc ; 2012: 93-102, 2012.
Article in English | MEDLINE | ID: mdl-23304277

ABSTRACT

Healthcare organizations are deploying increasingly complex clinical information systems to support patient care. Traditional information security practices (e.g., role-based access control) are embedded in enterprise-level systems, but are insufficient to ensure patient privacy. This is due, in part, to the dynamic nature of healthcare, which makes it difficult to predict which care providers need access to what and when. In this paper, we show that operations modeled at a higher level of granularity (e.g., the departmental level) are stable in the context of a relational network, which may enable more effective auditing strategies. We study three months of access logs from a large academic medical center to illustrate that departmental interaction networks exhibit certain invariants, such as the number, strength, and reciprocity of relationships. We further show that the relations extracted from the network can be leveraged to assess the extent to which a patient's care satisfies expected organizational behavior.


Subjects
Health Facility Administration; Interprofessional Relations; Medical Records Systems, Computerized; Models, Organizational; Computer Security; Confidentiality; Humans; Medical Audit
12.
Secur Inform ; 1(5)2012 Feb 27.
Article in English | MEDLINE | ID: mdl-23399988

ABSTRACT

Collaborative information systems (CIS) enable users to coordinate efficiently over shared tasks in complex distributed environments. For flexibility, they provide users with broad access privileges, which, as a side-effect, leave such systems vulnerable to various attacks. Some of the more damaging malicious activities stem from internal misuse, where users are authorized to access system resources. A promising class of insider threat detection models for CIS focuses on mining access patterns from audit logs; however, current models are limited in that they assume organizations have significant resources to generate labeled cases for training classifiers or assume the user has committed a large number of actions that deviate from "normal" behavior. In lieu of the previous assumptions, we introduce an approach that detects when specific actions of an insider deviate from expectation in the context of collaborative behavior. Specifically, in this paper, we introduce a specialized network anomaly detection model, or SNAD, to detect such events. This approach assesses the extent to which a user influences the similarity of the group of users that access a particular record in the CIS. From a theoretical perspective, we show that the proposed model is appropriate for detecting insider actions in dynamic collaborative systems. From an empirical perspective, we perform an extensive evaluation of SNAD with the access logs of two distinct environments: the patient record access logs of a large electronic health record system (6,015 users, 130,457 patients and 1,327,500 accesses) and the editing logs of Wikipedia (2,394,385 revisors, 55,200 articles and 6,482,780 revisions). We compare our model with several competing methods and demonstrate SNAD is significantly more effective: on average it achieves 20-30% greater area under an ROC curve.

13.
IEEE Trans Dependable Secure Comput ; 9(3): 332-344, 2012 May.
Article in English | MEDLINE | ID: mdl-24489520

ABSTRACT

Collaborative information systems (CISs) are deployed within a diverse array of environments that manage sensitive information. Current security mechanisms detect insider threats, but they are ill-suited to monitor systems in which users function in dynamic teams. In this paper, we introduce the community anomaly detection system (CADS), an unsupervised learning framework to detect insider threats based on the access logs of collaborative environments. The framework is based on the observation that typical CIS users tend to form community structures based on the subjects accessed (e.g., patients' records viewed by healthcare providers). CADS consists of two components: 1) relational pattern extraction, which derives community structures and 2) anomaly prediction, which leverages a statistical model to determine when users have sufficiently deviated from communities. We further extend CADS into MetaCADS to account for the semantics of subjects (e.g., patients' diagnoses). To empirically evaluate the framework, we perform an assessment with three months of access logs from a real electronic health record (EHR) system in a large medical center. The results illustrate our models exhibit significant performance gains over state-of-the-art competitors. When the number of illicit users is low, MetaCADS is the best model, but as the number grows, commonly accessed semantics lead to hiding in a crowd, such that CADS is more prudent.

14.
J Biomed Inform ; 44(2): 333-42, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21277996

ABSTRACT

Modern healthcare organizations (HCOs) are composed of complex dynamic teams to ensure clinical operations are executed in a quick and competent manner. At the same time, the fluid nature of such environments hinders administrators' efforts to define access control policies that appropriately balance patient privacy and healthcare functions. Manual efforts to define these policies are labor-intensive and error-prone, often resulting in systems that endow certain care providers with overly broad access to patients' medical records while restricting other providers from legitimate and timely use. In this work, we propose an alternative method to generate these policies by automatically mining usage patterns from electronic health record (EHR) systems. EHR systems are increasingly being integrated into clinical environments and our approach is designed to be generalizable across HCOs, thus assisting in the design and evaluation of local access control policies. Our technique, which is grounded in data mining and social network analysis theory, extracts a statistical model of the organization from the access logs of its EHRs. In doing so, our approach enables the review of predefined policies, as well as the discovery of unknown behaviors. We evaluate our approach with 5 months of access logs from the Vanderbilt University Medical Center and confirm the existence of stable social structures and intuitive business operations. Additionally, we demonstrate that there is significant turnover in the interactions between users in the HCO and that policies learned at the department level afford greater stability over time.


Subjects
Electronic Health Records; Health Policy; Computer Security; Confidentiality; Data Mining; Humans; Policy
15.
ISI ; 2011: 119-124, 2011 Jul.
Article in English | MEDLINE | ID: mdl-25621314

ABSTRACT

Collaborative information systems (CIS) enable users to coordinate efficiently over shared tasks. They are often deployed in complex dynamic systems that provide users with broad access privileges, but also leave the system vulnerable to various attacks. Techniques to detect threats originating from beyond the system are relatively mature, but methods to detect insider threats are still evolving. A promising class of insider threat detection models for CIS focuses on the communities that manifest between users based on the usage of common subjects in the system. However, current methods detect only when a user's aggregate behavior is intruding, not when specific actions have deviated from expectation. In this paper, we introduce a method called specialized network anomaly detection (SNAD) to detect such events. SNAD assembles the community of users that access a particular subject and assesses if similarities of the community with and without a certain user are sufficiently different. We present a theoretical basis and perform an extensive empirical evaluation with the access logs of two distinct environments: those of a large electronic health record system (6,015 users, 130,457 patients and 1,327,500 accesses) and the editing logs of Wikipedia (2,388,955 revisors, 55,200 articles and 6,482,780 revisions). We compare SNAD with several competing methods and demonstrate it is significantly more effective: on average it achieves 20-30% greater area under an ROC curve.
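The with-and-without comparison described in this abstract can be illustrated with a small sketch. It is not the SNAD model itself: the similarity measure (Jaccard overlap of the subjects each user accessed), the averaging over user pairs, and the toy data are all assumptions chosen for illustration, and the abstract does not state how SNAD defines similarity or what threshold it applies.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two sets of accessed subjects (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def community_similarity(users, subjects_by_user):
    """Mean pairwise similarity among the given users; 0.0 if fewer than two."""
    pairs = list(combinations(sorted(users), 2))
    if not pairs:
        return 0.0
    total = sum(jaccard(subjects_by_user[u], subjects_by_user[v]) for u, v in pairs)
    return total / len(pairs)


def deviation_score(subject, user, subjects_by_user):
    """How much the similarity of the subject's access community rises when
    `user` is left out; larger values suggest the access fits the group poorly."""
    community = {u for u, s in subjects_by_user.items() if subject in s}
    with_user = community_similarity(community, subjects_by_user)
    without_user = community_similarity(community - {user}, subjects_by_user)
    return without_user - with_user


# Hypothetical access sets: which subjects (e.g., records) each user touched.
subjects_by_user = {
    "alice": {"r1", "r2", "r3"},
    "bob": {"r1", "r2", "r3"},
    "carol": {"r1", "r2"},
    "mallory": {"r1", "r8", "r9"},
}

suspicious = deviation_score("r1", "mallory", subjects_by_user)
typical = deviation_score("r1", "alice", subjects_by_user)
print(suspicious, typical)
```

In this toy data, removing `mallory` makes the remaining community around `r1` more cohesive, while removing `alice` makes it less so. That contrast is the per-access signal the abstract describes.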
