Search | VHL Regional Portal

1.

Learning from Data with Heterogeneous Noise using SGD.

Song, Shuang; Chaudhuri, Kamalika; Sarwate, Anand D.

JMLR Workshop Conf Proc ; 2015: 894-902, 2015 Feb.

Article in English | MEDLINE | ID: mdl-26705435

ABSTRACT

We consider learning from data of variable quality that may be obtained from different heterogeneous sources. Addressing learning from heterogenous data in its full generality is a challenging problem. In this paper, we adopt instead a model in which data is observed through heterogeneous noise, where the noise level reflects the quality of the data source. We study how to use stochastic gradient algorithms to learn in this model. Our study is motivated by two concrete examples where this problem arises naturally: learning with local differential privacy based on data from multiple sources with different privacy requirements, and learning from data with labels of variable quality. The main contribution of this paper is to identify how heterogeneous noise impacts performance. We show that given two datasets with heterogeneous noise, the order in which to use them in standard SGD depends on the learning rate. We propose a method for changing the learning rate as a function of the heterogeneity, and prove new regret bounds for our method in two cases of interest. Experiments on real data show that our method performs better than using a single learning rate and using only the less noisy of the two datasets when the noise level is low to moderate.

2.

Signal Processing and Machine Learning with Differential Privacy: Algorithms and challenges for continuous data.

Sarwate, Anand D; Chaudhuri, Kamalika.

IEEE Signal Process Mag ; 30(5): 86-94, 2013 Sep 01.

Article in English | MEDLINE | ID: mdl-24737929

3.

Convergence Rates for Differentially Private Statistical Estimation.

Chaudhuri, Kamalika; Hsu, Daniel.

Proc Int Conf Mach Learn ; 2012: 1327-1334, 2012 Jul.

Article in English | MEDLINE | ID: mdl-25302341

ABSTRACT

Differential privacy is a cryptographically-motivated definition of privacy which has gained significant attention over the past few years. Differentially private solutions enforce privacy by adding random noise to a function computed over the data, and the challenge in designing such algorithms is to control the added noise in order to optimize the privacy-accuracy-sample size tradeoff. This work studies differentially-private statistical estimation, and shows upper and lower bounds on the convergence rates of differentially private approximations to statistical estimators. Our results reveal a formal connection between differential privacy and the notion of Gross Error Sensitivity (GES) in robust statistics, by showing that the convergence rate of any differentially private approximation to an estimator that is accurate over a large class of distributions has to grow with the GES of the estimator. We then provide an upper bound on the convergence rate of a differentially private approximation to an estimator with bounded range and bounded GES. We show that the bounded range condition is necessary if we wish to ensure a strict form of differential privacy.

4.

iDASH: integrating data for analysis, anonymization, and sharing.

Ohno-Machado, Lucila; Bafna, Vineet; Boxwala, Aziz A; Chapman, Brian E; Chapman, Wendy W; Chaudhuri, Kamalika; Day, Michele E; Farcas, Claudiu; Heintzman, Nathaniel D; Jiang, Xiaoqian; Kim, Hyeoneui; Kim, Jihoon; Matheny, Michael E; Resnic, Frederic S; Vinterbo, Staal A.

J Am Med Inform Assoc ; 19(2): 196-201, 2012.

Article in English | MEDLINE | ID: mdl-22081224

ABSTRACT

iDASH (integrating data for analysis, anonymization, and sharing) is the newest National Center for Biomedical Computing funded by the NIH. It focuses on algorithms and tools for sharing data in a privacy-preserving manner. Foundational privacy technology research performed within iDASH is coupled with innovative engineering for collaborative tool development and data-sharing capabilities in a private Health Insurance Portability and Accountability Act (HIPAA)-certified cloud. Driving Biological Projects, which span different biological levels (from molecules to individuals to populations) and focus on various health conditions, help guide research and development within this Center. Furthermore, training and dissemination efforts connect the Center with its stakeholders and educate data owners and data consumers on how to share and use clinical and biological data. Through these various mechanisms, iDASH implements its goal of providing biomedical and behavioral researchers with access to data, software, and a high-performance computing environment, thus enabling them to generate and test new hypotheses.

Subject(s)

Algorithms , Confidentiality , Information Dissemination , Medical Informatics , Forecasting , Goals , Health Insurance Portability and Accountability Act , Information Storage and Retrieval , United States

5.

Differentially Private Empirical Risk Minimization.

Chaudhuri, Kamalika; Monteleoni, Claire; Sarwate, Anand D.

J Mach Learn Res ; 12: 1069-1109, 2011 Mar.

Article in English | MEDLINE | ID: mdl-21892342

ABSTRACT

Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the Îµ-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006), to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance.

6.

Sample Complexity Bounds for Differentially Private Learning.

Chaudhuri, Kamalika; Hsu, Daniel.

JMLR Workshop Conf Proc ; 2011: 155-186, 2011.

Article in English | MEDLINE | ID: mdl-25285183

ABSTRACT

This work studies the problem of privacy-preserving classification - namely, learning a classifier from sensitive data while preserving the privacy of individuals in the training set. In particular, the learning algorithm is required in this problem to guarantee differential privacy, a very strong notion of privacy that has gained significant attention in recent years. A natural question to ask is: what is the sample requirement of a learning algorithm that guarantees a certain level of privacy and accuracy? We address this question in the context of learning with infinite hypothesis classes when the data is drawn from a continuous distribution. We first show that even for very simple hypothesis classes, any algorithm that uses a finite number of examples and guarantees differential privacy must fail to return an accurate classifier for at least some unlabeled data distributions. This result is unlike the case with either finite hypothesis classes or discrete data domains, in which distribution-free private learning is possible, as previously shown by Kasiviswanathan et al. (2008). We then consider two approaches to differentially private learning that get around this lower bound. The first approach is to use prior knowledge about the unlabeled data distribution in the form of a reference distribution chosen independently of the sensitive data. Given such a reference , we provide an upper bound on the sample requirement that depends (among other things) on a measure of closeness between and the unlabeled data distribution. Our upper bound applies to the non-realizable as well as the realizable case. The second approach is to relax the privacy requirement, by requiring only label-privacy - namely, that the only labels (and not the unlabeled parts of the examples) be considered sensitive information. An upper bound on the sample requirement of learning with label privacy was shown by Chaudhuri et al. (2006); in this work, we show a lower bound.

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL