
Cynthia Dwork Microsoft Research Paper

Q: Are there examples of that happening?

A: A famous example of a system that has wrestled with bias is the resident matching program that matches graduating medical students with residency programs at hospitals. The matching could be slanted to maximize the happiness of the residency programs, or to maximize the happiness of the medical students. Prior to 1997, the match was mostly about the happiness of the programs.

This changed in 1997 in response to “a crisis of confidence concerning whether the matching algorithm was unreasonably favorable to employers at the expense of applicants, and whether applicants could ‘game the system,’ ” according to a paper by Alvin Roth and Elliott Peranson published in The American Economic Review.
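How a match can be slanted toward one side can be illustrated with the textbook deferred-acceptance procedure, in which the side that makes the proposals ends up with its best stable outcome. The sketch below is only an illustration under simplifying assumptions (one position per program, complete preference lists, made-up names); it is not the actual residency matching algorithm, which handles many additional complications.

```python
def deferred_acceptance(proposer_prefs, receiver_prefs):
    """One-to-one deferred acceptance. Returns {proposer: receiver}.

    Assumes equally many proposers and receivers, each with a complete
    preference list (most preferred first).
    """
    # rank[r][p] is the position of proposer p on receiver r's list.
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next list index to try
    held = {}                                     # receiver -> proposer held
    free = list(proposer_prefs)

    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = held.get(r)
        if current is None:
            held[r] = p                  # receiver had no offer yet
        elif rank[r][p] < rank[r][current]:
            held[r] = p                  # receiver trades up
            free.append(current)
        else:
            free.append(p)               # rejected; p tries the next choice

    return {p: r for r, p in held.items()}


# Hypothetical preferences with two different stable matchings.
students = {"s1": ["hA", "hB"], "s2": ["hB", "hA"]}
programs = {"hA": ["s2", "s1"], "hB": ["s1", "s2"]}

student_optimal = deferred_acceptance(students, programs)
program_optimal = deferred_acceptance(programs, students)
```

In this toy case every student gets a first choice when students propose, and every program gets its first choice when programs propose, even though both outcomes are stable.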

Q: You have studied both privacy and algorithm design, and co-wrote a paper, “Fairness Through Awareness,” that came to some surprising conclusions about discriminatory algorithms and people’s privacy. Could you summarize those?

A: “Fairness Through Awareness” makes the observation that sometimes, in order to be fair, it is important to make use of sensitive information while carrying out the classification task. This may be a little counterintuitive: The instinct might be to hide information that could be the basis of discrimination.

Q: What’s an example?

A: Suppose we have a minority group in which bright students are steered toward studying math, and suppose that in the majority group bright students are steered instead toward finance. An easy way to find good students is to look for students studying finance, and if the minority is small, this simple classification scheme could find most of the bright students.

But not only is it unfair to the bright students in the minority group, it is also low utility. Now, for the purposes of finding bright students, cultural awareness tells us that “minority+math” is similar to “majority+finance.” A classification algorithm that has this sort of cultural awareness is both more fair and more useful.

Fairness means that similar people are treated similarly. A true understanding of who should be considered similar for a particular classification task requires knowledge of sensitive attributes, and removing those attributes from consideration can introduce unfairness and harm utility.
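The math-versus-finance example can be made concrete with a small synthetic population. Everything below (the head counts, the `history` major, the two rule functions) is invented for illustration and is not taken from “Fairness Through Awareness”; it only shows how a rule that ignores group membership can miss every bright student in the minority while a group-aware rule finds them.

```python
from collections import namedtuple

Student = namedtuple("Student", "group major bright")

# Synthetic population: bright majority students study finance,
# bright minority students study math.
students = (
    [Student("majority", "finance", True)] * 8
    + [Student("majority", "history", False)] * 12
    + [Student("minority", "math", True)] * 2
    + [Student("minority", "history", False)] * 3
)


def blind_rule(s):
    """Ignores group membership: pick anyone studying finance."""
    return s.major == "finance"


def aware_rule(s):
    """Uses group membership to decide which major signals 'bright'."""
    return (s.group, s.major) in {("majority", "finance"), ("minority", "math")}


def recall(rule, group=None):
    """Fraction of bright students (optionally within one group) selected."""
    bright = [s for s in students
              if s.bright and (group is None or s.group == group)]
    return sum(rule(s) for s in bright) / len(bright)


blind_overall = recall(blind_rule)
blind_minority = recall(blind_rule, "minority")
aware_overall = recall(aware_rule)
aware_minority = recall(aware_rule, "minority")
```

With these made-up numbers the blind rule finds 80 percent of bright students overall but none in the minority, while the aware rule finds all of them in both groups.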

Q: How could the university create a fairer algorithm? Would it mean more human involvement in the work that software does, collecting more personal data from students or taking a different approach when the algorithm is being created?

A: It would require serious thought about who should be treated similarly to whom. I don’t know of any magic bullets, and it is a fascinating question whether it is possible to use techniques from machine learning to help figure this out. There is some preliminary work on this problem, but this direction of research is still in its infancy.

Q: Another recent example of the problem came from Carnegie Mellon University, where researchers found that Google’s advertising system showed an ad for a career coaching service for “$200k+” executive jobs to men much more often than to women. What did that study tell us about these issues?

A: The paper is very thought-provoking. The examples described in the paper raise questions about how things are done in practice. I am currently collaborating with the authors and others to consider the differing legal implications of several ways in which an advertising system could give rise to these behaviors.

Q: What are some of the ways it could have happened? It seems that the advertiser could have targeted men, or the algorithm determined that men were more likely to click on the ad.

A: Here is a different plausible explanation: It may be that there is more competition to advertise to women, and the ad was being outbid when the web surfer was female.
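The outbidding explanation can be sketched with a toy auction. All bid amounts below are made up and say nothing about what actually happened in the study; the point is only that an advertiser who bids the same amount for every viewer can still end up shown mostly to men if competing bids for women's attention are higher.

```python
# The hypothetical career-coaching advertiser bids the same for every viewer.
OUR_BID = 1.00

# Highest competing bid on each impression, grouped by the viewer's gender.
# These values are invented for illustration.
competing_bids = {
    "men": [0.80, 0.90, 1.10, 0.70],
    "women": [1.20, 0.95, 1.30, 1.10],
}


def win_rate(bids, our_bid=OUR_BID):
    """Fraction of impressions where our bid beats the best competing bid."""
    return sum(our_bid > b for b in bids) / len(bids)


shown = {group: win_rate(bids) for group, bids in competing_bids.items()}
```

Here the ad wins three of four impressions for men and one of four for women, without gender appearing anywhere in the advertiser's bidding rule.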

Q: The law protects certain groups from discrimination. Is it possible to teach an algorithm to do the same?

A: This is a relatively new problem area in computer science, and there are grounds for optimism — for example, resources from the Fairness, Accountability and Transparency in Machine Learning workshop, which considers the role that machines play in consequential decisions in areas like employment, health care and policing. This is an exciting and valuable area for research.

Q: Whose responsibility is it to ensure that algorithms or software are not discriminatory?

A: This is better answered by an ethicist. I’m interested in how theoretical computer science and other disciplines can contribute to an understanding of what might be viable options.

The goal of my work is to put fairness on a firm mathematical foundation, but even I have just begun to scratch the surface. This entails finding a mathematically rigorous definition of fairness and developing computational methods — algorithms — that guarantee fairness.

Q: In your paper on fairness, you wrote that ideally a regulatory body or civil rights organization would impose rules governing these issues. The tech world is notoriously resistant to regulation, but do you believe it might be necessary to ensure fairness in algorithms?

A: Yes, just as regulation currently plays a role in certain contexts, such as advertising jobs and extending credit.

Q: Should computer science education include lessons on how to be aware of these issues and the various approaches to addressing them?

A: Absolutely! First, students should learn that design choices in algorithms embody value judgments and therefore bias the way systems operate. They should also learn that these things are subtle: For example, designing an algorithm for targeted advertising that is gender-neutral is more complicated than simply ensuring that gender is ignored. They need to understand that classification rules obtained by machine learning are not immune from bias, especially when historical data incorporates bias. Techniques for addressing these kinds of issues should be quickly incorporated into curricula as they are developed.


Cynthia Dwork has spent much of her career working on ways to ensure that your personal data stays private even when it is being used for scientific research.

Now, she’s also applying those mathematical methods to making certain that the conclusions researchers draw from analyzing big data sets are as accurate as possible.

Dwork, a cryptographer and distinguished scientist at Microsoft Research, and several colleagues recently published a paper in Science magazine showing how their groundbreaking work on differential privacy also can help researchers guarantee the accuracy of their results.

We spoke with her about her work and what has inspired it.

ALLISON LINN: I want to start by talking about differential privacy. How would you explain it to a person who isn’t an expert in this field?

CYNTHIA DWORK: Differential privacy is a definition of privacy that is tailored to privacy-preserving data analysis.

So, assume that you have a large data set that’s full of very useful but also very sensitive information. You’d like to be able to release statistics about that data set while simultaneously preserving the privacy of everybody who’s in the data set.

What differential privacy says is that, essentially, the same things are learned whether any individual opts in or opts out of the data set. So what that means is I wouldn’t be harmed by things that you learn from the data set. You won’t learn anything about me that you wouldn’t learn had I not been included.
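The opt-in/opt-out idea is usually written as the following condition. The notation is introduced here and does not appear in the interview: M is a randomized analysis, D and D' are any two data sets that differ in one person's record, S is any set of possible outputs, and ε is a small positive number that bounds how much one person's data can change the outcome probabilities.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,M(D') \in S\,]
```

The smaller ε is, the closer the two probabilities must be, and the less anyone can infer about whether a given individual's record was included.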

ALLISON LINN: Can you give me a real-life example of when a researcher might want to use one of these techniques?

CYNTHIA DWORK: So imagine that somebody asks, “How many members of the House of Representatives have sickle cell trait?” Our intuition says that getting an exact answer to that shouldn’t compromise the privacy of anybody in the House of Representatives because it’s a pretty big set of people and you’re just getting one number back.

But now suppose you have, in addition to the answer to that question, the exact answer to the question, “How many members of the House of Representatives, other than the Speaker of the House, have the sickle cell trait?”

Now, that also, by itself, seems like an innocuous question and getting the answer doesn’t seem to cause any problems because it’s still a pretty big set of people that you’re asking about.

But if you take these two answers together and you subtract one from the other, then you will learn the sickle cell status of the Speaker of the House.
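The subtraction can be sketched in a few lines, along with one standard remedy: adding Laplace noise to each count. The roster below is entirely made up, and the choice of ε = 0.1 and the helper names are assumptions for illustration only.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two exponential draws with mean `scale`
    # is Laplace-distributed with mean 0 and scale `scale`.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


# Made-up toy data: True means the person has the trait.
house = {f"member_{i}": (i % 12 == 0) for i in range(434)}
house["speaker"] = True


def exact_count(exclude=None):
    """Exact number of people with the trait, optionally leaving one out."""
    return sum(has_trait for name, has_trait in house.items()
               if name != exclude)


everyone = exact_count()
all_but_speaker = exact_count(exclude="speaker")
leaked = everyone - all_but_speaker  # 1 reveals the speaker has the trait


def noisy_count(epsilon, rng, exclude=None):
    """Count plus Laplace noise of scale 1/epsilon.

    A count changes by at most 1 when one person is added or removed,
    so this noise level gives epsilon-differential privacy per query.
    """
    return exact_count(exclude) + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)
noisy_difference = (noisy_count(0.1, rng)
                    - noisy_count(0.1, rng, exclude="speaker"))
```

With exact answers the difference is always the speaker's status. With noisy answers the difference is dominated by the noise, so it no longer pins down any one person, while each count is still roughly accurate for a group this large.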

ALLISON LINN: What drew you to this area of research?

CYNTHIA DWORK: Conversations with the philosopher Helen Nissenbaum. Nissenbaum is a philosopher who studies issues that arise in the context of new technologies, and she was doing some work on privacy in public. What is privacy in public when you have video cameras everywhere?

That got me thinking about privacy in general, and I realized that privacy is this sort of catch-all phrase that means many different things in different contexts. I wanted to bite off a piece of the privacy puzzle that I would be able to chew on, and so I thought of privacy-preserving data analysis.

ALLISON LINN: You have a new paper coming out this week in Science that builds on some of the ideas around differential privacy to focus on data accuracy. Can you tell me a little bit about this work?

CYNTHIA DWORK: There is a technique from the machine learning community where you take your whole data set and you split it into two parts: a training set and a holdout set. Then, you do whatever you want on the training set in order to try to come up with some hypotheses about the general population. To check the validity of your conclusion, you test whether the hypothesis holds on the holdout set.

So far, so good. But now suppose that you’d like to do more study of your training set. Now, suddenly, the questions that you’re asking of your training set depend on the holdout set, and so the holdout set can no longer be looked at as fresh data that’s totally independent of everything that you’ve done so far.

What we show is that if you only access the holdout set through a differentially private mechanism, then it is okay to reuse it over and over again.
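A rough sketch of this idea follows, loosely modeled on the thresholding approach described in the reusable-holdout paper. It is a simplification, not the authors' exact procedure: it omits the budget on how many times the holdout may be consulted, and the class name, threshold and noise scale are assumptions chosen for illustration. The analyst never sees the holdout directly; each query returns the training-set average unless it disagrees noticeably with the holdout, in which case a noisy holdout average is returned instead.

```python
import random


def laplace(scale, rng):
    # The difference of two exponential draws with mean `scale`
    # is Laplace-distributed with mean 0 and scale `scale`.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


class ReusableHoldout:
    """Answers mean queries while limiting what leaks from the holdout."""

    def __init__(self, train, holdout, threshold=0.04, noise_scale=0.01,
                 seed=0):
        self.train = list(train)
        self.holdout = list(holdout)
        self.threshold = threshold
        self.noise_scale = noise_scale
        self.rng = random.Random(seed)

    def query(self, phi):
        """phi maps one record to a number in [0, 1]; returns a mean estimate."""
        train_mean = sum(phi(x) for x in self.train) / len(self.train)
        holdout_mean = sum(phi(x) for x in self.holdout) / len(self.holdout)
        gap = abs(train_mean - holdout_mean)
        # Compare against a noisy threshold so the comparison itself
        # reveals little about the holdout.
        if gap > self.threshold + laplace(self.noise_scale, self.rng):
            # The training set looks overfit: answer from the holdout, with noise.
            return holdout_mean + laplace(self.noise_scale, self.rng)
        # The two sets agree: the training answer reveals nothing new.
        return train_mean


guard = ReusableHoldout(train=[1, 1, 1, 0], holdout=[0, 1, 0, 0])
estimate = guard.query(lambda record: record)
```

Because most answers come from the training set and the rest are noised, repeated and adaptive queries learn far less about the particular holdout sample than direct access would allow.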

ALLISON LINN: How does that guarantee that people aren’t going to draw spurious conclusions from the data?

CYNTHIA DWORK: So let’s say that this is actually a very large data set. You publish your conclusions, and now somebody else comes along and they say, “Oh, that’s interesting, I want to study a few other things in that data set.” They can do that, and then they can check their conclusions on this same holdout set, and this can be repeated.

We’re trying to capture the fact that science is an adaptive process. The second question you asked might depend on the answer to the first question. The second study or the fifth study may depend on what was published in the first four studies.

ALLISON LINN: Can you give me an example of how your new method could be applied to helping people ensure that their data is accurate?

CYNTHIA DWORK: We’re coming to a time when data sets will be very, very large, and lots of people will be studying the same data sets. I think this is going to be happening with medical data, for example, and with genomic data.

I don’t think it will be feasible to always go out and recruit completely fresh samples and start all over again, so I think that this question of remaining statistically valid in the adaptive scenario where new questions and new studies depend on the outcomes of previous studies is going to become more and more important. We are proposing a tool that will help with this process.


The Reusable Holdout: Preserving validity in adaptive data analysis, by Cynthia Dwork (Microsoft Research), Vitaly Feldman (IBM Almaden Research Center), Moritz Hardt (Google Research), Toniann Pitassi (University of Toronto), Omer Reingold (Samsung Research America) and Aaron Roth (University of Pennsylvania)
Differential privacy
The Algorithmic Foundations of Differential Privacy
Penn Research helps develop algorithm aimed at combating science’s reproducibility problem
Preserving validity in adaptive data analysis

Allison Linn is a senior writer at Microsoft Research.