FAT* Papers: Systems and Measurement

I’ve made it to Session 2 of my series of posts on the FAT* conference.

If you build it they will come. 

How should we build systems that incorporate all that we’ve learnt about fairness, accountability, and transparency? How do we go from saying “this is a problem” to saying “here’s a solution”?

Three of the four papers in this session seek (in different ways) to address this question, focusing on both the data and the algorithms that make up an ML model.

Beyond Open vs. Closed: Balancing Individual Privacy and Public Accountability in Data Sharing

The paper by Meg Young and friends from UW makes a strong argument for the idea of a data trust. We need good data to drive good policy and to evaluate technology, but there are numerous challenges — privacy, fairness, and accountability — around providing such data, not to mention issues of private vs. public ownership. Recognizing this, they present a case study of a data trust in which academics act as the liaison between the private and public entities that might want to share data and the (other) entities that might want to make use of it.

They bring up a number of interesting technology ideas from the world of privacy and fairness: differential privacy to control data releases, causal modeling to generate real-ish data for use in analysis and so on. They argue that this is probably the best way to reconcile legal and commercial hurdles over data sharing and is in fact the responsible way to go.
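
To make the differential privacy piece concrete, here’s a minimal sketch of the textbook Laplace mechanism — not anything from the paper itself — showing how a trust might answer a counting query with calibrated noise before releasing it. The query, counts, and epsilon below are all made up.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise of scale (sensitivity / epsilon).

    A counting query has sensitivity 1: adding or removing one person
    changes the true count by at most 1.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Hypothetical query against data held in the trust:
# "how many ride-share trips started downtown last month?"
true_count = 12843
print(laplace_count(true_count, epsilon=0.5))  # noisy answer, safer to release
```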

Takeaway: I think this is an interesting proposal. To some extent the devil is in the details, and the paper left me wanting more, but they have a great proof of concept to check out. While I might be quite happy with academics being the “trusted escrow agent”, I wonder whether that’s always the best option.

Of course it’s not just data you need governance for. What about models?

Model Cards for Model Reporting

This is one in a series of papers coming up right now that tackle the problem of model portability. How do I know if my model is being applied out of context with unpredictable results?

The solution from Margaret Mitchell et al. (9 authors and counting…!) is to attach a model “spec sheet” to a trained model. The spec sheet would give you important documentation about the model — how it was trained, under what training regime, on what data, with what error rates, and so on — in the hope that when the model is applied elsewhere, the spec sheet will prevent you from taking it out of context.
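
As a rough illustration of what such a spec sheet might look like in code (my own sketch; the paper proposes a documentation template, not a data structure), here’s a hypothetical model card bundled with a trained model. Every field name and value below is invented.

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    """A hypothetical 'spec sheet' that travels with a serialized model."""
    intended_use: str
    training_data: str
    training_regime: str
    evaluation_data: str
    error_rates: dict                 # e.g. per-group false positive/negative rates
    out_of_scope_uses: list = field(default_factory=list)

card = ModelCard(
    intended_use="Rank loan applications for human review",
    training_data="2015-2017 applications, US only",
    training_regime="Gradient-boosted trees, 5-fold cross-validation",
    evaluation_data="Held-out 2018 applications",
    error_rates={"overall_fpr": 0.08, "group_A_fpr": 0.06, "group_B_fpr": 0.12},
    out_of_scope_uses=["Fully automated denial", "Non-US applicants"],
)
```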

Takeaway: This is again a creative use of the idea of ‘user scripts’ as a way to carry context. I wondered when I first read it whether it makes sense to talk about a model spec in the abstract without looking at a specific domain, as some papers have done. I think the jury is still out (i.e., “more research needed”) on whether model spec sheets can be useful in full generality or whether we need to find the right “level of abstraction” to make them usable, but this is an interesting direction to explore.

But instead of building data trusts or model specs, how about instrumenting your code itself with certificates of fairness that can be monitored while the code runs?

Fairness-Aware Programming

This paper by Albarghouthi and Vinitsky takes a programming-language perspective on building fair classifiers. Suppose we could annotate our code with specifications describing what properties a classifier must satisfy and then have tools that ensure that these specifications are satisfied while the code runs[1]. That would be pretty neat!

That’s basically what they do in this paper, by taking advantage of Python decorators to encode desired fairness specs. They show how to capture notions like disparate impact and demographic parity, and even weak forms of individual fairness. One thought I did have: they are essentially trying to encode the ability to verify probabilistic statements, and I wonder if it might be easier to do this in one of the new and shiny probabilistic programming languages out there. Granted, Python is a more mainstream language (uh-oh, the PL police will be after me now).
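
To give a flavor of what decorator-based specs can look like, here’s a hedged sketch of a run-time demographic parity monitor. This is my own toy version, not the authors’ actual API; the decorator name, tolerance, and decision rule are all invented.

```python
import functools
from collections import defaultdict

def demographic_parity(group_arg: str, tolerance: float = 0.1):
    """Hypothetical decorator: watch positive-decision rates per group at run time."""
    def wrap(classifier):
        counts = defaultdict(lambda: [0, 0])  # group -> [positive decisions, total]

        @functools.wraps(classifier)
        def monitored(features, **kwargs):
            decision = classifier(features, **kwargs)
            group = kwargs[group_arg]
            counts[group][0] += int(decision)
            counts[group][1] += 1
            rates = [pos / total for pos, total in counts.values() if total > 0]
            if len(rates) > 1 and max(rates) - min(rates) > tolerance:
                print("warning: demographic parity spec violated:", dict(counts))
            return decision
        return monitored
    return wrap

@demographic_parity(group_arg="gender", tolerance=0.1)
def approve_loan(features, gender=None):
    return features["credit_score"] > 600  # stand-in decision rule

approve_loan({"credit_score": 700}, gender="F")
approve_loan({"credit_score": 550}, gender="M")
```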

I know who you are, but what am I? 

It’s great to build systems that draw on the tech we’ve developed over the last many years. But there’s still more to learn about the ongoing shenanigans happening on the internet.

Who’s the Guinea Pig? Investigating Online A/B/n Tests in-the-Wild

You don’t want to mess with Christo Wilson. He can sue you — literally. He’s part of an ACLU lawsuit against the DoJ regarding the CFAA and its potential misuse to harass researchers trying to audit web systems. In this paper Shan Jiang, John Martin and Christo turn their audit gaze to online A/B testing.

If you’ve ever compulsively reloaded the New York Times, you’ll have noticed that the headlines of articles change from time to time. Or you’ll go to a website and not see the same layout as someone else. This is because (as the paper illustrates) major websites are running experiments… on you. They are doing A/B testing of various kinds, potentially to experiment with different layouts, or even to show you different kinds of content depending on who you are.
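
For readers who haven’t seen how such experiments are usually wired up, here’s a generic sketch (not specific to any site the paper studies) of the common pattern: hash a persistent cookie or user ID into a bucket, so the same visitor keeps seeing the same variant.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically map a user to one arm of an experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

print(assign_variant("cookie-1234", "headline-test"))  # same user, same arm every time
print(assign_variant("cookie-5678", "headline-test"))
```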

The paper describes an ingenious mechanism to reveal when a website is using A/B testing and determine what factors appear to be going into the decision to show particular content. The experimental methodology is a lot of fun to read.

While the authors are very careful to point out repeatedly that they find no evidence of sinister motives behind the use of A/B testing in the wild, the fact remains that we are being experimented on constantly without any kind of IRB protection (*cough* Facebook emotional contagion *cough*). It’s not too far a leap to realize that the quest for “personalization” might mean that we eventually have no shared experience of the internet, and that’s truly frightening.

And that’s it for now. Stay tuned for more…

Footnotes:

  1. An earlier version of this note had incorrectly described the paper as doing static verification instead of run-time verification.

FAT* Papers: Framing and Abstraction

The FAT* Conference is almost upon us, and I thought that instead of live-blogging from the conference (which is always exhausting) I’d do a preview of the papers. Thankfully we aren’t (yet) at 1000 papers in the proceedings, and I can hope to read and say something not entirely stupid (ha!) about each one.

I spent a lot of time pondering how to organize my posts, and then realized the PC chairs had already done the work for me, by grouping papers into sessions. So my plan is to do a brief (BRIEF!) review of each session, hoping to draw some general themes. (ed: paper links will yield downloadable PDFs starting Tuesday Jan 29)

And with that, let’s start with Session 1: Framing and abstraction.

Those who do not learn from history are doomed to repeat it. — George Santayana

Those who learn from history are doomed to repeat it. — a Twitter user, about machine learning.

50 Years of Test (Un)fairness: Lessons for Machine Learning

Ben Hutchinson and Margaret Mitchell have a fascinating paper on the history of discourse on (un)fairness. It turns out that dating back to the 60s, and in the context of standardized testing, researchers have been worried about the same issues of bias in evaluation that we talk about today.

I’m actually understating the parallels. It’s uncanny how the discussion on fairness and unfairness evolved precisely in the way it’s evolving right now. Their paper has a beautiful table that compares measures of fairness then and now, and the paper is littered with quotes from early work in the area that mirror our current discussions about different notions of fairness, the subtleties in error rate management across racial categories, the concerns over construct validity, and so on. It was a humbling experience for me to read this paper and realize that indeed, everything old is new again.

Takeaways: Among the important takeaways from this paper are:

  • The subtle switch from measuring UNfairness to measuring fairness caused the field to eventually wither away. How should we return to studying UNfairness?
  • If we can’t get notions of fairness to align with public perceptions, it’s unlikely that we will be able to get public policy to align with our technical definitions.
  • It’s going to be important to encode values explicitly in our systems.

which is a useful segue to the next paper:

Assume a spherical cow… or a rational man.

Problem Formulation and Fairness

Samir Passi and Solon Barocas present the results of an ethnographic study into the (attempted) deployment of a predictive tool to decide which potential car buyers (leads) to send to dealers. The company building the tool sells these leads to dealers, and so wants to make sure that the quality of the leads is high.

This is a gripping read, almost like a thriller. They carefully outline the choices the team (comprising business analysts, data scientists and managers) makes at each stage, and how they go from the high level goal “send high quality leads to dealers” to a goal that is almost ML-ready: “find leads that are likely to have credit scores above 500”. As one likes to say, the journey is more important than the destination, and indeed the way in which the goals get concretized and narrowed based on technical, business and feasibility constraints is both riveting and familiar to anyone working in (or with) corporate data science.

Takeaways: The authors point out that no actual problem in automated decision-making is a pure classification or regression problem. Rather, people (yes, PEOPLE) make a series of choices that narrow the problem space down to something that is computationally tractable. And it’s a dynamic process in which data constraints as well as logistical challenges constrain the modeling. At no point in time do ethical or normative concerns surface, but the sequence of choices made clearly has an overall effect that could lead to disparate impact of some kind. They argue, correctly, that we spend far too little time paying attention to these choices and to the larger role of the pipeline around the (tiny) ML piece.

which is an even nicer segue to the last paper:

Context matters, duh!

Fairness and Abstraction in Sociotechnical Systems

This is one of my papers, together with Andrew Selbst, danah boyd, Sorelle Friedler and Janet Vertesi. And our goal was to understand the failure modes of Fair-ML systems when deployed in a pipeline. The key thesis of our paper is: Abstraction is a core principle in a computer system, but it’s also the key point of failure when dealing with a socio-technical system. 

We outline a series of traps that fair-ML papers fall into even while trying to design what look like enlightened decision systems. You’ll have to read the paper for all the details, but one key trap that I personally struggle with is the formalization trap: the desire to nail down a formal specification that can then be optimized. This is a trap because the nature of the formal specification can be contested and can evolve from context to context (even within a single organization, cf. the paper above), and a too-hasty formalization can freeze the goals in a way that might not be appropriate for the problem. In other words, don’t set a fairness definition in stone (this is important: I’m constantly asked by fellow theoryCS people what the one true definition of fairness is — so that they can go away and optimize the heck out of it).

Session Review:

When I read these three papers in sequence, I feel a black box exploding open, revealing its messy and dirty inner workings. Ever since Frank Pasquale’s The Black Box Society came out, I’ve felt a constant sentiment from non-technical people (mostly lawyers/policy people) that the goal should be to route policy around the black box of “AI” or “ML”. My contention has always been that we can’t do that: understanding the inner workings of the black box is crucial to understanding both what works and what fails in automated decision systems. Conversely, technical people have been loath to engage with the world OUTSIDE the black box, preferring to optimize our functions and throw them over the fence, Anathem-style.

I don’t think either approach is viable. Technical designers need to understand (as AOC clearly does!) that design choices that seem innocuous can have major downstream impact and that ML systems are not plug-and-play. But conversely those that wish to regulate and manage such systems need to be willing to go into the nitty gritty of how they are built and think about regulating those processes as well.

P.S

Happy families are all alike; every unhappy family is unhappy in its own way. — Tolstoy

There is one way to be fair, but many different ways of being unfair.

Every person with good credit looks the same: but people have bad credit for very different reasons.

There might be more to this idea of “looking at unfairness” vs “looking at fairness”. As I remarked to Ben and Margaret a while ago, it has the feel of an NP vs co-NP question 🙂 — and we know that we don’t know if they’re the same.


Benchmarks and reproducibility in fair ML

These days, there are lots of fairness-aware classification algorithms out there. This is great! It should mean that for any task you want to pursue you can try out a bunch of fair classifiers and pick the one that works best on your dataset under the fairness measure you like most.

Unfortunately, this has not been the case. Even in the cases where code is available, the preprocessing of a specific data set is often wrapped into the algorithm, making it hard to reuse the code and hard to see what the impact of different preprocessing choices is on the algorithm. Many authors have used the same data sets, but preprocessed in different ways and evaluated under different metrics. Which one is the best?

In an effort to address some of these questions, we’ve made a repository and written an accompanying paper detailing what we’ve found.

http://github.com/algofairness/fairness-comparison

We’ve made our best effort to include existing algorithms and represent them correctly, but if you have code that we’ve missed or see something we’ve messed up, we hope you’ll submit a pull request or just shoot us an email.

Some highlights…

Metrics: There are so many fairness metrics! Or are there? We find that a lot of them are correlated on the algorithms and datasets we looked at. In fact, there are two groups: disparate-impact-like measures and class-sensitive error measures. And accuracy measures are not a distinct group! They correlate with the class-sensitive error measures. So perhaps fairness-accuracy tradeoffs are only an issue with disparate-impact-like measures.

[Figure: correlation between fairness measures across the algorithms and datasets studied]
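
To pin down the two families, here’s a hedged sketch (my own notation, not the repository’s code) of one representative from each: a disparate-impact-style ratio of positive prediction rates, and a class-sensitive error measure, the gap in false negative rates across groups.

```python
import numpy as np

def disparate_impact(y_pred, group):
    """Ratio of positive-prediction rates: unprivileged (0) over privileged (1); 1.0 = parity."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return y_pred[group == 0].mean() / y_pred[group == 1].mean()

def fnr_gap(y_true, y_pred, group):
    """Class-sensitive error measure: absolute gap in false negative rates across groups.

    Assumes both groups contain at least one positively-labeled example.
    """
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    def fnr(g):
        pos = (y_true == 1) & (group == g)
        return (y_pred[pos] == 0).mean()
    return abs(fnr(0) - fnr(1))
```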

Stability: We look at the stability of the algorithms and metrics by taking the standard deviation of a given measure over multiple random train/test splits. Here’s a cool graph based on that analysis showing disparate impact versus accuracy.

[Figure: disparate impact vs. accuracy with stability bars (Adult dataset, race as protected attribute)]

We think it’s easier to understand the relative performance of algorithms taking this into account.
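
In case it helps, here’s roughly what that split-and-measure loop looks like — a generic sketch with a stand-in classifier, not the repository’s actual harness. `X`, `y`, and `group` are assumed to be numpy arrays.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def stability(X, y, group, n_splits=10):
    """Mean and std of accuracy and disparate impact over random train/test splits."""
    accs, dis = [], []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te, _, g_te = train_test_split(
            X, y, group, test_size=0.3, random_state=seed)
        # LogisticRegression is just a placeholder for any (fairness-aware) classifier.
        y_hat = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
        accs.append((y_hat == y_te).mean())
        dis.append(y_hat[g_te == 0].mean() / y_hat[g_te == 1].mean())
    return (np.mean(accs), np.std(accs)), (np.mean(dis), np.std(dis))
```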

Preprocessing: Given the same algorithm on the same data set, you can end up with different — potentially very different — outcomes depending on small preprocessing variations, such as whether a protected race attribute is represented with all of its possible values or collapsed to, e.g., white and not-white.

[Figure: accuracy under different preprocessing choices]
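
To make concrete the kind of variation we mean, here’s a small hypothetical example of two encodings of the same race column; a downstream fairness-aware method sees quite different protected-attribute structure in each case.

```python
import pandas as pd

df = pd.DataFrame({"race": ["White", "Black", "Asian", "White", "Hispanic"]})

# Option 1: keep every category (one-hot encoded)
multi_valued = pd.get_dummies(df["race"], prefix="race")

# Option 2: collapse to a binary white / not-white attribute
binary = (df["race"] == "White").astype(int).rename("race_white")

print(multi_valued.columns.tolist())  # ['race_Asian', 'race_Black', 'race_Hispanic', 'race_White']
print(binary.tolist())                # [1, 0, 0, 1, 0]
```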

Tradeoffs: For the measures for which we found a fairness-accuracy tradeoff, different algorithms choose different parts of the tradeoff.

So which algorithm is best? Perhaps unsurprisingly, no one algorithm dominates across all data sets.

There’s a larger ongoing discussion about reproducibility in machine learning. This is our contribution in the fairness world.


Being Hopeful about Algorithms

I’ve been attending “think-events” around algorithmic fairness of late, first in Philadelphia (courtesy of the folks at UPenn) and then in DC (courtesy of the National Academy of Sciences and the Royal Society).

At these events, one doesn’t see the kind of knee-jerk reaction to the idea of fairness in learning that I’ve documented before. But there’s a very thoughtful critique that comes from people who’ve spent a lot of time themselves thinking and working on these topics. And it goes something like this.

Do we expect more from algorithms than we expect from people? And is that reasonable?

I first heard this critique much earlier at a Dagstuhl meeting on this topic, when I was asked this question by H. V. Jagadish (who has a great course on ethics in data mining). It came up indirectly during discussions at the Philadelphia event (about which I hope to say something later) and was phrased in this form by Vint Cerf at the Sackler Forum.

I found myself unable to answer it convincingly. We’ve had thousands of years to set up institutions based on human decision making. These processes have been flawed and biased. People have made decisions with implicit and explicit bias.

Why then do we demand that algorithms do more? Why do we demand that they account for themselves and explain themselves in ways that we don’t ask human judges to do?

I used to have an answer. I argued that algorithms speak the language of mathematics and so we need to translate all our human ideals – of ethics, fairness and justice – into a form that an algorithm could understand. But then we start talking about accountability, interpretability, how an algorithm might explain itself, and what that might even mean.

Jon Kleinberg has this analogy of a learning algorithm as this incredibly obtuse friend that you bring to a party, that you have to explain EVERYTHING to. Where the food is, what the drinks are, what people are saying, and so on. We don’t have to do this for real people because they have a vast body of prior context to work with. Indeed, this prior context is what decides how they function in the world, and is made up of all kinds of heuristics and “biasing” of the space of possible outcomes (as Joanna Bryson puts it).

So it would seem that asking an algorithm for its “audit trail” is the equivalent of asking (say) a human judge “give me the entire story of your life experiences that explains why you made this decision”.

And of course we never do this. In fact, all we really do is set out a series of guidelines and expect the judges to be more or less consistent with them. Similarly for hiring, or credit decisions, or any other kind of decision making. In other words, we expect a certain degree of procedural consistency while accepting that individuals may apply discretion based on their own perspective.

So I return to the question from before. Why do we expect an automated decision making process to be any better?

There’s an optimistic take on this. We can’t expect an audit trail from a human decision maker because we don’t have the capacity to generate one. That my verdict on a dog owner might in part be due to being bitten by a dog as a child is something that I’m unlikely to be able to cogently articulate. But it is at least a little unfair that I sentence dog owners more harshly for this reason.

But if we are able to produce such an audit trail from an algorithmic decision maker we do have the hope of revealing implicit preferences and biases based on the *algorithm’s* “life experiences” aka “training data”. And so we can expect more because we have the ability to do so.

An alternate perspective on this goes as follows. We’ve built up over the decades and centuries a system of checks and balances and accountability procedures for evaluating the quality of human decision making. We have laws that require non-discrimination, we have ways to remove decision-makers who make arbitrary decisions, and we have a social structure that makes decision-makers feel a sense of responsibility for their decisions.

None of these exist for algorithmic decision-making, or realistically can. We can’t call an algorithm to account for a bad decision: ultimately all liability rests on legal persons. So the next best thing is to make an algorithm assist in the decision-making process, but require transparency so that the human decision-maker can’t blame the algorithm for bad decisions (“It’s not me, it’s the algorithm!”) — a story that played out in Cory Doctorow’s Human Readable.

There’s a tension between “let’s use automation wherever reasonable” and “wait, how are you shifting harm?”. We don’t want to stop the deployment of algorithms in decision-making, and frankly I doubt that one could even if one wanted to. But it’s also not unreasonable to express some caution (and perhaps some humility) when doing this. We’re not expecting perfection from automated decision-making: it’s perfectly reasonable to expect just that it do better than human decision makers. But why not expect that as well as a decision that we can understand? Why essentially give up by saying “the algorithm cannot both be powerful and understandable”? To me, that’s the real failure of hope.

Post-doc in Fairness at Data and Society

As part of the research we’re doing in algorithmic fairness we’re looking to hire a post-doctoral researcher who can help us bridge the gap between the more technical aspects of algorithmic fairness and the ways in which this discussion informs and is informed by the larger context in the social sciences. Specifically,

  • Candidates for this position should have a strong grasp of technical systems (including machine learning), as well as a rich understanding of socio-technical discussions. For example, candidates might have an undergraduate degree in computer science and a PhD in a social science field. Or they may have a more hybrid degree in an information school or CS program. They may be a data scientist or study data scientists.
  • Candidates should be able to translate between engineers and critics, feel comfortable at ACM/AAAI/IEEE conferences and want to publish in law reviews or social science journals as well as CS proceedings.
  • Candidates should be excited by the idea of working with researchers invested in fairness, accountability, and transparency in machine learning (e.g., fatml.org).
  • Preference given to researchers with qualitative empirical skills.

If you might be such a person, please do send in an application (Role #1).

Data & Society is a wonderful place to be if you’re at all interested in this area. danah boyd has assembled a group of thinkers that represent the best kind of holistic thinking on a topic that intersects CS, sociology, political science and the law.

A funny thing happened on the way to the arXiv….

As I mentioned in the previous post, Sorelle Friedler, Carlos Scheidegger and I just posted a note to the arXiv on worldviews for thinking about fairness and nondiscrimination.

We uploaded the article last Friday, and it appeared on the arXiv on Sunday evening. By Monday late morning (less than 24 hours after the article was posted), we received this email:

I’m a reporter for Motherboard, VICE Media’s technology news site who frequently covers bias in machine learning. I read your paper posted to arXiv and would love to interview one of you for a piece on the work.

I assumed the reporter was referring to one of the two papers we’ve written so far on algorithmic fairness. But no, from the subject line it was clear that the reporter was referring to the article we had just posted! 

I was quite nervous about this: on the one hand it was flattering and rather shocking to get a query that quickly, and on the other hand this was an unreviewed preprint.

In any case, I did the interview. And the article is now out!

On the (im)possibility of fairness…

Ever since we started thinking about algorithmic fairness and the general issue of data-driven decision-making, there’s always been this nagging issue of “well what if there are cues in data that seem racist/sexist/(–)-ist and yet provide a good signal for a decision?”

There’s no shortage of people willing to point this out: see for example my post on the standard tropes that appear whenever someone discovers bias in some algorithmic process. Most of the responses betray an unexamined belief in the truth of what algorithms discover in data, and that is not satisfying either.

So the problem we’ve faced is this. If you examine closely the computer science literature on fairness and bias, it becomes clear that people are talking at cross-purposes: essentially arguing about why your orange is not more like my apple. And it has become clear that this is because of different assumptions about the world (how biased it is, how unbiased certain features are, and so on).

Here’s the pitch:

Can we separate out assumptions and beliefs about fairness from mechanisms that we deploy to ensure it? And in doing so, can we provide a useful vocabulary for talking about these issues within a common framework?

Here’s the result of our two-year-long quest:

On the (im)possibility of fairness

What does it mean for an algorithm to be fair? Different papers use different notions of algorithmic fairness, and although these appear internally consistent, they also seem mutually incompatible. We present a mathematical setting in which the distinctions in previous papers can be made formal. In addition to characterizing the spaces of inputs (the “observed” space) and outputs (the “decision” space), we introduce the notion of a construct space: a space that captures unobservable, but meaningful variables for the prediction.
We show that in order to prove desirable properties of the entire decision-making process, different mechanisms for fairness require different assumptions about the nature of the mapping from construct space to decision space. The results in this paper imply that future treatments of algorithmic fairness should more explicitly state assumptions about the relationship between constructs and observations.

This paper has been a struggle to write. It’s a strange paper in that its technical contribution is mainly conceptual: establishing what we think are the right basic primitives that can be used to express (mathematically) concepts like fairness, nondiscrimination, and structural bias.
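
To give a feel for the setup, here are the three spaces as tiny Python stand-ins. This is my own toy rendering for the blog, not the paper’s formalism, and the feature names and numbers are made up.

```python
# Construct space: the unobservable quantities we actually care about.
person = {"propensity_to_repay": 0.8}

# Observed space: measurable proxies. The construct -> observed mapping is where
# assumptions (and structural bias) live: it may distort different groups differently.
def observe(construct, group_distortion=0.0):
    return {"credit_score": 300 + 550 * (construct["propensity_to_repay"] - group_distortion)}

# Decision space: a learned rule only ever sees the observed space.
def decide(observed):
    return observed["credit_score"] >= 650

print(decide(observe(person)))                        # True
print(decide(observe(person, group_distortion=0.2)))  # False: same construct, different outcome
```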

We owe a great debt to our many friends in the social sciences, as well as to the decades of research on this topic in those fields. Much of the conceptual development we outline has been laid out in prose form by the many theories of social justice starting with Rawls, but particularly by Roemer. Our main goal has been to mathematize some of these ideas so that we can apply them to algorithms.


There’s a great deal of trepidation with which we release this: it’s in many ways a preliminary work that raises more questions than it answers. But we’ve benefited from lots of feedback within CS and without, and hope that this might clarify some of the discussions swirling around algorithmic fairness.