Being Hopeful about Algorithms

I’ve been attending “think-events” around algorithmic fairness of late, first in Philadelphia (courtesy of the folks at UPenn) and then in DC (courtesy of the National Academy of Sciences and the Royal Society).

At these events, one doesn’t see the kind of knee-jerk reaction to the idea of fairness in learning that I’ve documented before. But there’s a very thoughtful critique that comes from people who’ve spent a lot of time themselves thinking and working on these topics. And it goes something like this.

Do we expect more from algorithms than we expect from people? And is that reasonable?

I first heard this critique much earlier at a Dagstuhl meeting on this topic, when I was asked this question by H. V. Jagadish (who has a great course on ethics in data mining). It came up indirectly during discussions at the Philadelphia event (about which I hope to say something later) and was phrased in this form by Vint Cerf at the Sackler Forum.

I found myself unable to answer it convincingly. We’ve had 1000s of years to set up institutions based on human decision making. These processes have been flawed and biased. People have made decisions with implicit and explicit bias.

Why then do we demand that algorithms do more? Why do we demand that they account for themselves and explain themselves in ways that we don’t ask human judges to do?

I used to have an answer. I argued that algorithms speak the language of mathematics and so we need to translate all our human ideals – of ethics, fairness and justice – into a form that an algorithm could understand. But then we start talking about accountability, interpretability, how an algorithm might explain itself, and what that might even mean.

Jon Kleinberg has this analogy of a learning algorithm as this incredibly obtuse friend that you bring to a party, that you have to explain EVERYTHING to. Where the food is, what the drinks are, what people are saying, and so on. We don’t have to do this for real people because they have a vast body of prior context to work with. Indeed, this prior context is what decides how they function in the world, and is made up of all kinds of heuristics and “biasing” of the space of possible outcomes (as Joanna Bryson puts it).

So it would seem that asking an algorithm for its “audit trail” is the equivalent of asking (say) a human judge “give me the entire story of your life experiences that explains why you made this decision”.

And of course we never do this. In fact, all we really do is set out a series of guidelines and expect the judges to be more or less consistent with them. Similarly for hiring, or credit decisions, or any other kind of decision making. In other words, we expect a certain degree of procedural consistency while accepting that individuals may apply discretion based on their own perspective.

So I return to the question from before. Why do we expect an automated decision making process to be any better?

There’s an optimistic take on this. We can’t expect an audit trail from a human decision maker because we don’t have the capacity to generate one. That my verdict on a dog owner might in part be due to being bitten by a dog as a child is something that I’m unlikely to be able to cogently articulate. But it is at least a little unfair that I sentence dog owners more harshly for this reason.

But if we are able to produce such an audit trail from an algorithmic decision maker, we do have the hope of revealing implicit preferences and biases based on the *algorithm’s* “life experiences” aka “training data”. And so we can expect more because we have the ability to do so.

An alternate perspective on this goes as follows. We’ve built up over the decades and centuries a system of checks and balances and accountability procedures for evaluating the quality of human decision making. We have laws that require non-discrimination, we have ways to remove decision-makers who make arbitrary decisions, and we have a social structure that makes decision-makers feel a sense of responsibility for their decisions.

None of these exist for algorithmic decision-making, or realistically can. We can’t call an algorithm to account for a bad decision: ultimately all liability rests on legal persons. So the next best thing is to make an algorithm assist in the decision making process, but require transparency so that the human decision-maker can’t blame the algorithm for bad decisions (“It’s not me, it’s the algorithm!”), a story that played out in Cory Doctorow’s Human Readable.

There’s a tension between “let’s use automation wherever reasonable”, and “wait. how are you shifting harm?”. We don’t want to stop the deployment of algorithms in decision-making, and frankly I doubt that one could even if one wanted to. But it’s also not unreasonable to express some caution (and perhaps some humility) when doing this. We’re not expecting perfection from automated decision-making: it’s perfectly reasonable to expect just that we can do better than human decision makers. But why not expect that as well as expect a decision that we can understand? Why essentially give up by saying “the algorithm cannot both be powerful and understandable”? To me, that’s the real failure of hope.

Post-doc in Fairness at Data and Society

As part of the research we’re doing in algorithmic fairness we’re looking to hire a post-doctoral researcher who can help us bridge the gap between the more technical aspects of algorithmic fairness and the ways in which this discussion informs and is informed by the larger context in the social sciences. Specifically,

  • Candidates for this position should have a strong grasp of technical systems (including machine learning), as well as a rich understanding of socio-technical discussions. For example, candidates might have an undergraduate degree in computer science and a PhD in a social science field. Or they may have a more hybrid degree in an information school or CS program. They may be a data scientist or study data scientists.
  • Candidates should be able to translate between engineers and critics, feel comfortable at ACM/AAAI/IEEE conferences and want to publish in law reviews or social science journals as well as CS proceedings.
  • Candidates should be excited by the idea of working with researchers invested in fairness, accountability, and transparency in machine learning (e.g.,
  • Preference given for researchers who have qualitative empirical skills.

If you might be such a person, please do send in an application (Role #1).

Data & Society is a wonderful place to be if you’re at all interested in this area. danah boyd has assembled a group of thinkers that represent the best kind of holistic thinking on a topic that intersects CS, sociology, political science and the law.

A funny thing happened on the way to the arXiv….

As I mentioned in the previous post, Sorelle Friedler, Carlos Scheidegger and I just posted a note to the arXiv on worldviews for thinking about fairness and nondiscrimination.

We uploaded the article last Friday, and it appeared on the arXiv on Sunday evening. By Monday late morning (less than 24 hours after the article was posted), we received this email:

I’m a reporter for Motherboard, VICE Media’s technology news site who frequently covers bias in machine learning. I read your paper posted to arXiv and would love to interview one of you for a piece on the work.

I assumed the reporter was referring to one of the two papers we’ve written so far on algorithmic fairness. But no, from the subject line it was clear that the reporter was referring to the article we had just posted! 

I was quite nervous about this: on the one hand it was flattering and rather shocking to get a query that quickly, and on the other hand this was an unreviewed preprint.

In any case, I did the interview. And the article is now out!

On the (im)possibility of fairness…

Ever since we started thinking about algorithmic fairness and the general issue of data-driven decision-making, there’s always been this nagging issue of “well what if there are cues in data that seem racist/sexist/(–)-ist and yet provide a good signal for a decision?”

There’s no shortage of people willing to point this out: see for example my post on the standard tropes that appear whenever someone discovers bias in some algorithmic process. Most of the responses betray an unexamined belief in the truth of what algorithms discover in data, and that is not satisfying either.

So the problem we’ve faced is this. If you examine closely the computer science literature on fairness and bias, it becomes clear that people are talking at cross-purposes: essentially arguing about why your orange is not more like my apple. And it has become clear that this is because of different assumptions about the world (how biased it is, how unbiased certain features are, and so on).

Here’s the pitch:

Can we separate out assumptions and beliefs about fairness from mechanisms that we deploy to ensure it? And in doing so, can we provide a useful vocabulary for talking about these issues within a common framework?

Here’s the result of our two-year long quest:

On the (im)possibility of fairness

What does it mean for an algorithm to be fair? Different papers use different notions of algorithmic fairness, and although these appear internally consistent, they also seem mutually incompatible. We present a mathematical setting in which the distinctions in previous papers can be made formal. In addition to characterizing the spaces of inputs (the “observed” space) and outputs (the “decision” space), we introduce the notion of a construct space: a space that captures unobservable, but meaningful variables for the prediction.
We show that in order to prove desirable properties of the entire decision-making process, different mechanisms for fairness require different assumptions about the nature of the mapping from construct space to decision space. The results in this paper imply that future treatments of algorithmic fairness should more explicitly state assumptions about the relationship between constructs and observations.

This paper has been a struggle to write. It’s a strange paper in that the main technical contribution is conceptual: establishing what we think are the right basic primitives that can be used to express (mathematically) concepts like fairness, nondiscrimination, and structural bias.

We owe a great debt to our many friends in the social sciences community, as well as the decades of research on this topic in the social sciences. Much of the conceptual development we outline has been laid out in prose form by the many theories of social justice starting with Rawls, but particularly by Roemer. Our main goal has been to mathematize some of these ideas so that we can apply them to algorithms.


There’s a great deal of trepidation with which we release this: it’s in many ways a preliminary work that raises more questions than it answers. But we’ve benefited from lots of feedback within CS and without, and hope that this might clarify some of the discussions swirling around algorithmic fairness.

Bloomberg profile of Richard Berk

Richard Berk is one of the founding fathers of automated risk assessment, and systems based on his work are being deployed in Pennsylvania and other locations. This Bloomberg profile of him has many interesting (and terrifying) nuggets. As always, you should read the whole thing (if Bloomberg’s horrible page rendering doesn’t trigger a headache), but here are some highlights.

What’s interesting in the system he designed is how it’s optimized for cost of incarceration, rather than for accuracy. In the particular case described in the article, this actually makes the system less harsh, because a finding of a problem triggers expensive therapy. On the other side though, there’s a political component: it’s far riskier to release someone who might commit a crime than it is to keep incarcerated someone who might be reformed. As Berk puts it:

The policy position that is taken is that it’s much more dangerous to release Darth Vader than it is to incarcerate Luke Skywalker

The problem of course is that incarcerating Luke Skywalker could turn him into a new Darth Vader, and I don’t know if this is factored into the analysis.
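To make the asymmetry concrete, here is a minimal sketch of how unequal error costs move a decision threshold. This is not Berk's system; the function names and the 10:1 cost ratio are hypothetical, chosen only to illustrate the Vader/Skywalker trade-off.

```python
def detain_threshold(cost_wrong_release, cost_wrong_detain):
    """Risk level above which detaining has lower expected cost.

    Detain when p * cost_wrong_release > (1 - p) * cost_wrong_detain,
    i.e. when p exceeds cost_wrong_detain / (cost_wrong_release + cost_wrong_detain).
    """
    return cost_wrong_detain / (cost_wrong_release + cost_wrong_detain)


def decide(p_reoffend, cost_wrong_release, cost_wrong_detain):
    """Return 'detain' or 'release' by comparing the two expected costs."""
    expected_cost_release = p_reoffend * cost_wrong_release
    expected_cost_detain = (1 - p_reoffend) * cost_wrong_detain
    return "detain" if expected_cost_release > expected_cost_detain else "release"


# Hypothetical numbers: releasing "Darth Vader" is treated as 10x worse
# than incarcerating "Luke Skywalker".
print(detain_threshold(10, 1))  # threshold drops well below 0.5
print(decide(0.2, 10, 1))       # a 20% risk leads to detention under 10:1 costs
print(decide(0.2, 1, 1))        # the same person is released under equal costs
```

Note that nothing in this calculation accounts for the incarceration itself changing the probability of future crime, which is exactly the worry above.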

He also says later

Berk argues that eliminating sensitive factors weakens the predictive power of the algorithms. “If you want me to do a totally race-neutral forecast, you’ve got to tell me what variables you’re going to allow me to use, and nobody can, because everything is confounded with race and gender,” he said.

This seems a little binary to me. It’s not an either-or where you either have to keep all sensitive attributes or throw them all out. There are ways to quantify and even subtract out the influence of certain problematic attributes without having to throw out all the information: in fact, we have a paper on this!
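As a toy illustration of what "quantify and subtract out" can mean (this is a deliberately simple stand-in, not the method from our paper), one can measure the gap between group averages of a feature and then re-center each group so the gap disappears while within-group differences are kept. The data and function names below are hypothetical.

```python
from statistics import mean


def group_gap(values, groups, a, b):
    """Quantify influence: difference between the mean feature value of groups a and b."""
    mean_a = mean([v for v, g in zip(values, groups) if g == a])
    mean_b = mean([v for v, g in zip(values, groups) if g == b])
    return mean_a - mean_b


def remove_group_means(values, groups):
    """Subtract each group's mean and add back the overall mean.

    Differences between individuals within a group are preserved;
    only the average difference between groups is removed.
    """
    overall = mean(values)
    group_means = {
        g: mean([v for v, gg in zip(values, groups) if gg == g])
        for g in set(groups)
    }
    return [v - group_means[g] + overall for v, g in zip(values, groups)]


# Hypothetical feature values for two groups.
values = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
groups = ["a", "a", "a", "b", "b", "b"]

adjusted = remove_group_means(values, groups)
print(group_gap(values, groups, "a", "b"))    # gap before adjustment
print(group_gap(adjusted, groups, "a", "b"))  # gap after adjustment
```

This only removes a difference in means for a single feature; correlations spread across many features need more careful treatment.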

As the article notes, Berk is heading to Norway:

Berk wants to predict at the moment of birth whether people will commit a crime by their 18th birthday, based on factors such as environment and the history of a new child’s parents. This would be almost impossible in the U.S., given that much of a person’s biographical information is spread out across many agencies and subject to many restrictions. He’s not sure if it’s possible in Norway, either, and he acknowledges he also hasn’t completely thought through how best to use such information.

The idea that data can be collected to make such predictions is certainly alluring and tempting. But everything we’re beginning to understand about predictions based on algorithms suggests that making such predictions in the absence of any understanding of the model behavior and why it’s making its decisions is a recipe for disaster.

I’ll note that the recidivism predictions typically work 6 months to 2 years out, and are not particularly accurate! Trying to predict 18 years out is rather scary.

Wisconsin Supreme Court decision on COMPAS

We finally have the first legal ruling on algorithmic decision making. This case comes from Wisconsin, where Eric Loomis challenged the use of COMPAS for sentencing him.

While the Supreme Court denied the appeal, it made a number of interesting observations and recommendations:

  • “risk scores may not be considered as the determinative factor in deciding whether the offender can be supervised safely and effectively in the community.”
  • “the following warning must be given to sentencing judges: “(1) the proprietary nature of COMPAS has been invoked to prevent disclosure of information relating to how factors are weighed or how risk scores are to be determined; (2) risk assessment compares defendants to a national sample, but no cross-validation study for a Wisconsin population has yet been completed; (3) some studies of COMPAS risk assessment scores have raised questions about whether they disproportionately classify minority offenders as having a higher risk of recidivism; and (4) risk assessment tools must be constantly monitored and re-normed for accuracy due to changing populations and subpopulations.”

Like Danielle Citron (the author of the Forbes article) I’m a little skeptical that this will be enough. Warning labels on cigarette boxes didn’t really stop people smoking. But I think as part of a larger effort to increase awareness of the risks, and to make people even stop and think a little before blindly forging ahead with algorithms, this is a decent first step.

At the AINow Symposium in New York (that I’ll say more about later), one proposed extreme along the policy spectrum regarding algorithmic decision-making was to place a moratorium on the use of algorithms entirely. I don’t know if that makes complete sense. But a heavy heavy dose of caution is definitely warranted, and rulings like this might lead to a patchwork of caveats and speedbumps that help us flesh out exactly where algorithmic decision making makes more or less sense.


Friday links

  • The ACLU together with four researchers in algorithmic accountability is challenging the CFAA (The Computer Fraud and Abuse Act), arguing that its provisions make it illegal to do the necessary auditing of algorithms to test for discrimination and bias.
  • The popular word2vec embedding method for words might learn biased associations, such as associating the word ‘nurse’ with the gender ‘female’ and so on. A new paper seeks to fix this problem.
  • Diversity in teams that build AI might help the algorithms themselves be less biased.
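
On the word embedding item above: one idea that comes up in this context is to estimate a "gender direction" in the embedding space and project it out of words that should be neutral. The sketch below is only an illustration of that projection step, not necessarily what the linked paper does, and the 3-dimensional vectors are made up (real word2vec vectors have hundreds of dimensions).

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def remove_direction(v, direction):
    """Subtract from v its component along the given direction."""
    d = normalize(direction)
    scale = dot(v, d)
    return [a - scale * b for a, b in zip(v, d)]


# Hypothetical toy embeddings, invented for illustration.
he = [1.0, 0.0, 0.2]
she = [-1.0, 0.0, 0.2]
nurse = [-0.6, 0.8, 0.1]

gender_direction = [s - h for s, h in zip(she, he)]
nurse_neutral = remove_direction(nurse, gender_direction)

print(dot(nurse, normalize(gender_direction)))          # nonzero: 'nurse' leans toward 'she'
print(dot(nurse_neutral, normalize(gender_direction)))  # zero after projection
```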