Validating a membership disclosure metric for synthetic health data.
El Emam, Khaled; Mosquera, Lucy; Fang, Xi.
  • El Emam K; Data Science, Replica Analytics Ltd., Ottawa, Ontario, Canada.
  • Mosquera L; School of Epidemiology and Public Health, University of Ottawa, Ottawa, Ontario, Canada.
  • Fang X; Research Institute, Children's Hospital of Eastern Ontario, Ottawa, Ontario, Canada.
JAMIA Open ; 5(4): ooac083, 2022 Dec.
Article in English | MEDLINE | ID: covidwho-2062926
ABSTRACT

Background:

One increasingly accepted way to evaluate the privacy of synthetic data is to measure the risk of membership disclosure. This is the accuracy, expressed as an F1 score, with which an adversary would correctly ascertain that a target individual from the same population as the real data is in the dataset used to train the generative model. It is commonly estimated using a data partitioning methodology with a 0.5 partitioning parameter.

Objective:

Validate the membership disclosure F1 score, evaluate and improve the parameterization of the partitioning method, and provide a benchmark for its interpretation.

Materials and methods:

We performed a simulated membership disclosure attack on 4 population datasets: an Ontario COVID-19 dataset, a state hospital discharge dataset, a national health survey, and an international COVID-19 behavioral survey. Two generative methods were evaluated: sequential synthesis and a generative adversarial network. A theoretical analysis and a simulation were used to determine the correct partitioning parameter that would give the same F1 score as a ground truth simulated membership disclosure attack.

Results:

The default 0.5 parameter can give quite inaccurate membership disclosure values. The proportion of records from the training dataset in the attack dataset must be equal to the sampling fraction of the real dataset from the population. The approach is demonstrated on 7 clinical trial datasets.
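
The composition rule above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `attack_composition` and `membership_f1`, and the idea of representing the adversary's claims as a list of booleans, are assumptions made here for illustration only.

```python
from typing import Sequence, Tuple


def attack_composition(attack_size: int, n_real: int, n_population: int) -> Tuple[int, int]:
    """Return (members, non_members) counts for an attack dataset.

    The share of training records is set to the sampling fraction
    n_real / n_population instead of the default 0.5.
    (Hypothetical helper, not taken from the paper.)
    """
    if not 0 < n_real <= n_population:
        raise ValueError("need 0 < n_real <= n_population")
    members = round(attack_size * n_real / n_population)
    return members, attack_size - members


def membership_f1(is_member: Sequence[bool], flagged: Sequence[bool]) -> float:
    """F1 score of the adversary's membership claims against the truth."""
    pairs = list(zip(is_member, flagged))
    tp = sum(1 for m, f in pairs if m and f)        # correctly claimed members
    fp = sum(1 for m, f in pairs if not m and f)    # non-members claimed as members
    fn = sum(1 for m, f in pairs if m and not f)    # members the adversary missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

As a toy illustration (not a result from the study), an adversary who flags every record as a member scores an F1 of about 0.18 on an attack set built with a 10% sampling fraction, but about 0.67 on a 50/50 attack set, which shows how strongly the score depends on the composition of the attack dataset.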

Conclusions:

Our proposed parameterization, together with the interpretation and generative model training guidance, provides a theoretically and empirically grounded basis for evaluating and managing membership disclosure risk for synthetic data.

Full text: Available Collection: International databases Database: MEDLINE Type of study: Experimental Studies / Observational study / Prognostic study / Randomized controlled trials Language: English Journal: JAMIA Open Year: 2022 Document Type: Article Affiliation country: Jamiaopen
