Measuring re-identification risk using a synthetic estimator to enable data sharing.
Jiang, Yangdi; Mosquera, Lucy; Jiang, Bei; Kong, Linglong; El Emam, Khaled.
  • Jiang Y; Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada.
  • Mosquera L; Replica Analytics Ltd., Ottawa, Ontario, Canada.
  • Jiang B; Replica Analytics Ltd., Ottawa, Ontario, Canada.
  • Kong L; Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada.
  • El Emam K; Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada.
PLoS One ; 17(6): e0269097, 2022.
Article in English | MEDLINE | ID: covidwho-1963000
ABSTRACT

BACKGROUND:

One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. Re-identification risk metrics are used to demonstrate that the risk of re-identification is acceptably low. Few good risk estimators exist for the attack scenario in which an adversary selects a record from the microdata sample and attempts to match it with individuals in the population.

OBJECTIVES:

Develop an accurate risk estimator for the sample-to-population attack.

METHODS:

An estimator based on creating a synthetic variant of a population dataset was developed to estimate the re-identification risk for an adversary performing a sample-to-population attack. The accuracy of the estimator was evaluated in terms of estimation error through a simulation on four different datasets. Two variants of the estimator were considered: one using a Gaussian copula and one using a d-vine copula. They were compared against three other estimators proposed in the literature.
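
As a rough illustration of the quantity being estimated, the sketch below computes a sample-to-population risk as the average of 1/F over the sample records, where F is the number of population records sharing the same quasi-identifier values. This formula, the function name `sample_to_population_risk`, and the toy records are assumptions made for illustration; the abstract does not give the exact definition. In the approach described above, the `population` argument would be a synthetic population generated from a fitted copula, which is not shown here.

```python
from collections import Counter

def sample_to_population_risk(sample, population, quasi_identifiers):
    """Average of 1/F over sample records (assumed risk definition).

    F is the size of the population equivalence class that matches the
    sample record on all quasi-identifiers.
    """
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    population_counts = Counter(key(r) for r in population)
    risks = []
    for record in sample:
        class_size = population_counts.get(key(record), 0)
        # Assumption: a sample record with no match in the (synthetic)
        # population is treated as a population unique (F = 1).
        risks.append(1.0 / max(class_size, 1))
    return sum(risks) / len(risks)

# Toy data, purely illustrative (not from the paper).
population = (
    [{"age": "30-39", "sex": "F"}] * 3
    + [{"age": "30-39", "sex": "M"}] * 2
    + [{"age": "40-49", "sex": "F"}]
)
sample = [{"age": "30-39", "sex": "F"}, {"age": "40-49", "sex": "F"}]

risk = sample_to_population_risk(sample, population, ["age", "sex"])
print(round(risk, 4))
```

Here the first sample record matches three population records and the second matches one, so the risk is the mean of 1/3 and 1.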

RESULTS:

The average of the two copula estimates consistently had a median error below 0.05 across all sampling fractions and true risk values, which was significantly more accurate than existing methods. A sensitivity analysis of how estimator accuracy varies with the accuracy of the input parameters provides further guidance for applying the estimator. The estimator was then used to assess re-identification risk and de-identify a large Ontario COVID-19 behavioral survey dataset.
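
The averaging and error summary can be sketched as follows. This is a minimal sketch, assuming that "error" means the absolute difference between the estimated and true risk; the abstract does not state the error definition. The helper names and all numbers are hypothetical and are not results from the paper.

```python
from statistics import median

def averaged_estimate(gaussian_estimate, dvine_estimate):
    """Simple average of the two copula-based risk estimates."""
    return (gaussian_estimate + dvine_estimate) / 2

def median_absolute_error(estimates, true_risks):
    """Median of |estimate - true risk| (assumed error definition)."""
    return median(abs(e - t) for e, t in zip(estimates, true_risks))

# Hypothetical values for three simulation runs (illustrative only).
gaussian_estimates = [0.10, 0.22, 0.35]
dvine_estimates = [0.14, 0.18, 0.45]
true_risks = [0.11, 0.20, 0.38]

combined = [
    averaged_estimate(g, d)
    for g, d in zip(gaussian_estimates, dvine_estimates)
]
error = median_absolute_error(combined, true_risks)
print(round(error, 4))
```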

CONCLUSIONS:

The average of two copula estimators consistently provides the most accurate re-identification risk estimate and can serve as a good basis for managing privacy risks when data are de-identified and shared.
Full text: Available Collection: International databases Database: MEDLINE Main subject: COVID-19 Type of study: Experimental Studies / Observational study / Prognostic study Topics: Variants Limits: Humans Language: English Journal: PLoS One Journal subject: Science / Medicine Year: 2022 Document Type: Article Affiliation country: Journal.pone.0269097
