Moritz Hardt
The Reusable Holdout: Preserving Validity in Adaptive Data Analysis
A great deal of effort has been devoted to reducing the risk of
spurious scientific discoveries, from the use of holdout sets and
sophisticated cross-validation techniques, to procedures for
controlling the false discovery rate in multiple hypothesis testing.
However, there is a fundamental disconnect between the theoretical
results and the practice of science: the theory assumes a fixed
collection of hypotheses to be tested, or learning algorithms to be
applied, selected non-adaptively before the data are gathered, whereas
science is by definition an adaptive process, in which data are shared
and re-used, and hypotheses and new studies are generated on the basis
of data exploration and previous outcomes.
Surprisingly, the challenges of adaptivity can be addressed using
insights from differential privacy, a field of study supporting a
definition of privacy tailored to private data analysis. As a
corollary we show how to safely reuse a holdout set a great many times
without undermining its validation power, even when hypotheses and
computations are chosen adaptively. Armed with this technique, the
analyst is free to explore the data ad libitum, generating and
evaluating hypotheses, verifying results on the holdout, and
backtracking as needed.
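A minimal sketch of the idea, in the spirit of the Thresholdout
mechanism: answer each adaptively chosen query from the training set
whenever it agrees with the holdout, and consult the holdout (through
added noise) only when the two disagree. The threshold and noise scale
below are illustrative placeholders, and the budget accounting used in
the formal analysis is omitted.

    import numpy as np

    class Thresholdout:
        """Reusable-holdout sketch; the threshold T and noise scale
        sigma are illustrative, not tuned parameters."""

        def __init__(self, train, holdout, T=0.04, sigma=0.01, rng=None):
            self.train = train        # training data points
            self.holdout = holdout    # holdout data points
            self.T = T
            self.sigma = sigma
            self.rng = rng or np.random.default_rng()

        def query(self, phi):
            """phi maps a single data point to [0, 1]; returns a
            validated estimate of its mean."""
            train_avg = np.mean([phi(x) for x in self.train])
            holdout_avg = np.mean([phi(x) for x in self.holdout])
            gap = abs(train_avg - holdout_avg)
            # Consult the holdout only when it clearly disagrees with
            # the training estimate (noisy threshold test).
            if gap > self.T + self.rng.laplace(0, 2 * self.sigma):
                # Answer from the holdout, perturbed with Laplace noise.
                return holdout_avg + self.rng.laplace(0, self.sigma)
            # Otherwise the training estimate is already reliable.
            return train_avg

The analyst routes every holdout evaluation (for example, the accuracy
of a candidate model) through query and is otherwise free to explore;
the noisy threshold is what keeps a long adaptive sequence of such
checks from overfitting the holdout.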
Joint work with Cynthia Dwork, Vitaly Feldman, Toni Pitassi, Omer
Reingold and Aaron Roth. |
Kobbi Nissim
The sample complexity of private learning
Learning is a task that abstracts many of the computations performed over collections of sensitive individual information, and it is therefore natural to examine it in conjunction with differential privacy.
Predating differential privacy, in 2005 Blum, Dwork, McSherry and Nissim showed that any concept class that is learnable in Kearns' model of statistical queries is also learnable with privacy. The concept of private learning was later formalized by Kasiviswanathan et al. in 2008 as the conjunction of PAC learning and differential privacy. They showed that any concept class C can be learned privately with O(log|C|) samples, via a construction that is somewhat analogous to the (non-private) Occam's Razor learner.
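A minimal sketch of such a generic learner for a finite concept class, phrased here via the exponential mechanism (the class, sample, and epsilon below are placeholders, and this is an illustration rather than the paper's exact construction):

    import numpy as np

    def private_learner(concepts, sample, epsilon, rng=None):
        """Select a hypothesis from a finite concept class using the
        exponential mechanism; `concepts` is a list of functions
        x -> {0, 1}, `sample` is a list of labeled pairs (x, y)."""
        rng = rng or np.random.default_rng()
        # Score each concept by minus its number of mistakes on the sample.
        scores = np.array([-sum(c(x) != y for x, y in sample) for c in concepts])
        # Changing one example changes any score by at most 1, so sampling
        # proportionally to exp(epsilon * score / 2) is epsilon-differentially
        # private; scores are shifted by their maximum for numerical stability.
        weights = np.exp(epsilon * (scores - scores.max()) / 2.0)
        return concepts[rng.choice(len(concepts), p=weights / weights.sum())]

With a sample whose size grows like log|C| (times factors in the privacy and accuracy parameters), the exponential weights concentrate on low-error hypotheses, which is the source of the O(log|C|) bound.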
Investigating the gap between the O(log|C|) sample complexity of private learners and the Theta(VC(C)) sample complexity of non-private learners has resulted in a rich picture of private learning. Under pure (epsilon, 0)-differential privacy it is now known that the sample complexity of proper learners (restricted to output a hypothesis in C) can be significantly higher than that of improper learners, that the latter is fully characterized by the representation dimension of C, and that the representation dimension is generally higher than the VC dimension. Under approximate (epsilon, delta)-differential privacy the picture is currently less clear, but it is known that the sample complexity of proper learners is generally higher than the VC dimension. Other models that have been investigated include active learning, semi-supervised learning, and simultaneously performing k learning tasks.
In the talk, I will define private learning, review the main results, and present some of the more recent technical results. Time permitting, I will mention how some of these results relate to the task of data sanitization.
Based on joint works with: Amos Beimel, Avrim Blum, Hai Brenner, Mark Bun, Cynthia Dwork, Shiva Kasiviswanathan, Homin Lee, Frank McSherry, Sofya Raskhodnikova, Adam Smith, Uri Stemmer, and Salil Vadhan. |
Aaron Roth
Privacy as a Tool for Robust Mechanism Design in Large Markets
Differential privacy is an algorithmic stability guarantee that can be
applied to derive asymptotically dominant strategy truthful mechanisms
for interesting problems in large markets. Example applications
include: computing a school-optimal stable matching while
incentivizing truthful reporting by the students; maximizing welfare
in any problem in which agents have convex utility functions, using
anonymous item pricings (rather than the non-anonymous bundle pricings
that arise from the VCG mechanism); and solving the equilibrium
selection problem using weak "mediators" in settings of incomplete
information.
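The connection rests on a by-now standard observation (due to McSherry
and Talwar), sketched here with utilities normalized to [0,1]: if a
mechanism M is epsilon-differentially private with respect to each
player's report, then for any player i with true type t_i, any
misreport t_i', and any fixed reports t_{-i} of the others,

    \mathbb{E}\bigl[u_i\bigl(M(t_i', t_{-i})\bigr)\bigr]
      \le e^{\varepsilon}\, \mathbb{E}\bigl[u_i\bigl(M(t_i, t_{-i})\bigr)\bigr]
      \le \mathbb{E}\bigl[u_i\bigl(M(t_i, t_{-i})\bigr)\bigr] + (e^{\varepsilon} - 1),

so misreporting gains at most e^{\varepsilon} - 1, roughly epsilon for
small epsilon, no matter what the others report. Letting epsilon tend
to zero as the market grows is what yields the asymptotic dominant
strategy guarantees above.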
For some of these problems, there are analogous "large market" results
in the economic literature, but the applications of differential
privacy tend to give results with the following fundamental
difference. In the economics literature, it is typically assumed that
agent types are drawn from a prior distribution (which often results
in replication economies -- i.e. economies in which there are many
individuals of each type), and the incentive properties are
studied in settings in which an exact outcome is realized -- i.e. an
exact Walrasian equilibrium, or an exactly school-optimal stable
matching. In contrast, differentially private mechanisms compute
approximate versions of these outcomes, but can guarantee individual
incentives in the worst case, without making any assumptions about
how type profiles arise. This allows for truthfulness even in settings
in which there are individuals with highly unusual type profiles. The
differential privacy approach can also give finite-population
guarantees even in relatively "small" economies, in which the number
of agents can be, e.g., much smaller than the number of types or the
number of available actions. |