On stochastic parrots

I don’t often write long reviews of single papers. Maybe I should.

Stochastic Parrots have finally launched into mid-air. The paper at the heart of the huge brouhaha involving Google’s ‘resignating’ of Timnit Gebru back in December is now available, and will appear at FAccT 2021.

Reading papers in this space is always a tricky business. The world of algorithmic fairness is much broader than either algorithms or fairness (and stay tuned for my next post on this). Contributions come in many forms, from many different disciplinary and methodological traditions, and are situated in different contexts. Identifying the key contributions of a paper and how they broaden the overall discussion in the community can be tricky, especially if we define contributions based on our own traditions. And then we have to critique a paper on its own terms, rather than in terms of things we want to see. And distinguishing those two is apparently hard, as evidenced by the hours of argument I’ve been having*.

So what are the contributions of this paper? I view it as making two big arguments (each containing smaller arguments) about large language models. First, that the accounting of the costs of building large language models systematically ignores large externalities which, if factored in, would probably change the cost equation. Second, that the benefits associated with these models rest on claims of efficacy that don’t incorporate many known issues in the model-building process. There’s a third, related point about the way in which research on LLMs itself runs the risk of distracting from other NLP research.

Let’s take these one by one. The first externality is clear: the now well-documented environmental cost associated with building larger and larger models. The paper doesn’t necessarily add new information to this cost, but it points out that in the context of language models, it is unclear whether the added accuracy improvements that might come from large models are worth the environmental cost. And especially since one of the arguments made for language models is the ‘democratization of NLP tech’, the authors also remind us (Bender Rule alert!) that much of the benefit of work on language models will accrue to English-speaking (and therefore already well-resourced) populations, while the environmental harms will accrue to people in small island nations who are unlikely to benefit at all.

Are language models the biggest contributor to climate change? No. But I don’t think that’s the point here. The point is that pursuing size as a value in and of itself (see the gushing headlines over the new trillion-parameter model that just came out) as a proxy for better conveniently misses out on the harms associated with it, something that everyone who thinks about the calculus of harms and benefits in society knows they shouldn’t be doing. Further, in the spirit of ‘constraints breeding creativity’, why NOT initiate more research into how to build small and effective language models?

The second externality is what the authors refer to in part as the ‘documentation debt’ of large and complex data sets. This is not directly linked to trillion-parameter models per se, but it is clear that the collection of and access to huge text corpora enables (and justifies) the building of LLMs. Many of the problems with large data sets are well documented and are explored in detail here in an NLP context: the shifting nature of context, the problems with bias, the self-selection and non-uniformity in how these sets are collected and curated, and so on.

So much for the costs. What about the benefits? The section that gives the paper its title goes into great detail on the ways in which the perceived benefits of language models come from a deep tension between the weak and strong AI standpoints: does a language model actually understand language or not? The authors go into a number of cases where LLMs have demonstrated what appears to be an understanding of language, particularly in the case of question-answering. They argue that the apparent understanding is illusory and subjective, and this is a real problem because of our human tendency to ascribe meaning and intent to things that look like communication. They document in excruciating detail (yes, with lots of references, ahem) cases where this perception of meaning and intent can be and has been weaponized.

A more subtle danger of the focus on LLMs, one that I am ill equipped to comment on since I’m not an NLP researcher, is the argument that the focus on large language models is distorting and takes away from efforts to truly understand and build general language systems. One might very well respond with “well I can chew gum and walk at the same time”, but I think it’s fair game to be concerned about the viability of research that doesn’t solely focus on LLMs, especially as DL continues its Big Chungus-ifying march.

CS papers are considered incomplete if they raise problems and don’t present solutions. It’s beyond the scope of this post to address the many problems with that take. However, the authors don’t want to fight another battle and dutifully present ways to move forward with LLMs, amassing a thoughtful collection of process suggestions that synthesize a lot of thinking in the community and apply them to language model development.

So. How does this paper add to the overall discussion?

The NLP context is very important. It’s one thing to write a general articulation of harms and risks. And indeed, in a very broad sense, the harms and risks identified here have been noted in other contexts. But the contextualization to NLP is where the work is being done. And while the field does have a robust ongoing discussion around ethical issues, thanks in no small part to the authors of this paper and their prior work, this systematic spelling out of the problems is important to do, not something just anyone could do (I couldn’t have done it!), and valuable as a reference for future work.

Does the paper make sound arguments? Yoav Goldberg has posted a few of his concerns, and I want to address one of them, which comes up a lot in this area. Namely, the idea that taking a “one-sided” position weakens the paper.

In my view, the paper is taking a “position” only in the sense of calling out the implicit positioning of the background discussions. There’s a term in media studies coined by Barthes called ‘exnomination‘. It’s a complex notion that I can’t do full justice to here, but in its simplest form it describes the act of “setting the default”, or “ex-nominating”, as an act of power that forces any (now) non-default position to justify itself. In the context of language, you see this in ‘person’ vs ‘black person’, or (my go-to example) in Indian food, in “food” vs “non-vegetarian food”.

What has happened is that the values of largeness, scale, and “accuracy” have been set as the baseline, and now any other concern (diversity, the need to worry about marginalized populations, concern for low-resource language users etc) becomes a “political position” to be justified.

Does this paper indulge in advocacy? Of course it does. Does that make it less “scientific”? Setting aside the normative loadedness of such a question, we should recognize that papers advocate for specific research agendas all the time. The authors have indeed made the case (and you’d have to be hiding under a rock not to know this as well) that the focus on size obscures the needs of populations that aren’t represented by the majority as seen through biased data collection. You might disagree with that case, and that’s fine. But it’s well within the scope of a paper in this area to make the case for such a focus without being called ‘political’ in a pejorative manner.

And this brings me back to the point I started with. In the space of topics that broadly speak to the social impact of machine learning, we don’t just have a collection of well defined CS problems. We have a bewildering, complex and contested collection of frames with which we think about this landscape. A paper contributes by adding a framing discussion as much as by specific technical contributions. In other words, it remains true that “what is the right problem to solve” is one of the core questions in any research area. And in that respect, this paper provides a strong corrective to the idea of LLMs as an unquestionably valuable object of study.

* My initial knee-jerk reaction to this paper was that it was nice but unexceptional. Many arguments with Sorelle and Carlos have helped move me away from that knee-jerk response towards a more careful articulation of its contributions.

On “Bostock vs Clayton County” and algorithmic discrimination.



I’ve been reading a number of analyses of the landmark Gorsuch decision in the LGBTQ discrimination case. The articles linked above are very helpful in this regard, but I couldn’t help but also notice a very computational argument in Gorsuch’s reasoning that might be relevant for algorithmic discrimination.

The question at hand was whether firing someone because they were gay or trans could be viewed as being “because of sex” as per the Civil Rights Act. The opposing argument was that they weren’t fired because of their sex (or gender, to be more precise) but because they were gay or trans, and since sexual orientation/gender identity was not explicitly protected in the Civil Rights Act, it’s not a violation.

The argument from the majority, as I understand it, can be translated mathematically as follows. Consider a function WTF (“whether to fire”) that seemingly takes two parameters (s, t), where s = X denotes “sex = X” and t = Y denotes either “attraction to people of gender Y” or “presents as gender Y”.

(note that for the purpose of mathematical abstraction I’m violently conflating gender, sex and what it means to “present as gender Y”: these distinctions are very material but I can make the mathematical argument without needing to delve into the context).

The question is whether we can express WTF(s, t) = g(t) alone. As Gorsuch argues, this clearly cannot be the case else (in the case of Bostock), they’d have to also fire all women attracted to men, or (in the case of Stephens) all people presenting as women.

Clearly WTF(s, t) cannot be written as h(s) alone either (in fact, if it could, that would be blatant discrimination). In other words, the variable s contributes to the function outcome without being the sole determiner of it; i.e., in the parlance of explanations, the feature s has influence over WTF(s, t).
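To make this concrete, here’s a small sketch of the argument as a test for whether a two-argument function actually depends on its first argument. Everything here is hypothetical: the `wtf` rule below is a stand-in for the employer’s decision process, invented purely for illustration.

```python
# Hypothetical 'whether to fire' rule, invented purely for illustration.
# s = sex of the employee, t = gender of the people they are attracted to.
def wtf(s: str, t: str) -> bool:
    # Fire anyone attracted to men *only if* they are male:
    # i.e., fire gay men but not straight women.
    return s == "male" and t == "male"

def depends_only_on_t(f, s_values, t_values) -> bool:
    """Can f(s, t) be written as g(t) alone? Only if, for every fixed t,
    the output never changes as s varies."""
    return all(len({f(s, t) for s in s_values}) == 1 for t in t_values)

sexes = ["male", "female"]
attractions = ["male", "female"]

# Gorsuch's point: the rule fires a man attracted to men but not a woman
# attracted to men, so it cannot be a function of t alone -- s has influence.
assert wtf("male", "male") and not wtf("female", "male")
assert not depends_only_on_t(wtf, sexes, attractions)
```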

And in the mind of the majority, this suffices to declare that this is invalid under Title VII.

It should be clear then what the implications for algorithmic discrimination are. On the positive side, it might be sufficient to show that a protected feature has some influence on the outcome (i.e., a more disparate-impact-like analysis). But before we get too excited about this, it’s unlikely that we’ll get the clear and stark difference between WTF(s, t) and g(t) that was present in this case, so it remains to be seen what kind of ‘burden of scrutiny’ will come into play. Will it be as simple as the 4/5 rule?
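For reference, the 4/5 rule is just a ratio test on selection rates; a minimal sketch (with made-up group names and rates) might look like this:

```python
# The "four-fifths rule" from the EEOC's Uniform Guidelines: flag a process
# for possible disparate impact if any group's selection rate is less than
# 80% of the highest group's rate. Group names and rates are made up.
def four_fifths_flag(selection_rates: dict) -> bool:
    highest = max(selection_rates.values())
    return any(rate / highest < 0.8 for rate in selection_rates.values())

assert four_fifths_flag({"group_a": 0.50, "group_b": 0.30})      # 0.6 < 0.8
assert not four_fifths_flag({"group_a": 0.50, "group_b": 0.45})  # 0.9 >= 0.8
```

Whether a court would apply anything this mechanical to a model’s feature influence is, of course, exactly the open question.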

On centering, solutionism, justice and (un)fairness.


One of the topics of discussion in the broader conversation around algorithmic fairness has been the idea of decentering: that we should move technology away from the center of attention – as the thing we build to apply to people – and towards the sides – as a tool to instead help people.

This idea took me a while to understand, but makes a lot of sense. After all, we indeed wish to use “tech for good” — to help us flourish — the idea of eudaimonia that dates back to Aristotle and the birth of virtue ethics.

We can’t really do that if technology remains at the center. Centering the algorithm reinforces structure; the algorithm becomes a force multiplier to apply uniform solutions for all people. And that kind of flattening – the treatment of all the same way – is what leads to procedural ideas of fairness as consistency, as well as systematically unequal treatment of those that are different.

Centering the algorithm feeds into our worst inclinations towards tech solutionism – the idea that we should find the “one true method” and apply it everywhere.

So what should we, as computer scientists, do instead? How can we avoid centering the algorithm and instead focus on helping people flourish, while at the same time allowing ourselves to be solution-driven? One idea that I’m becoming more and more convinced of is that, as Hutchinson and Mitchell argue in their FAT* 2019 paper, we should make the shift from thinking about fairness to thinking about (un)fairness.


When we study fairness, we are necessarily looking for something universal. It must hold in all circumstances — a process cannot be fair if it only works in some cases. This universality is what leads to the idea of an all-encompassing solution – “Do this one thing and your system will be fair”. It’s what puts the algorithm at the center.

But unfairness comes in many guises, to paraphrase Tolstoy. And it looks different for different people under different circumstances. There may be general patterns of unfairness that we can identify, but they often emerge from the ground up. Indeed, as Hutchinson and Mitchell put it,

Individuals seeking justice do so when they believe that something has been unfair

Hutchinson & Mitchell. 50 Years of Test (Un)fairness: Lessons for Machine Learning. ACM FAT* 2019.

And to the extent that our focus should be on justice rather than fairness, this distinction becomes very important.

How does a study of unfairness center the people affected by algorithmic systems while still satisfying the computer scientist’s need for solutions? Because it aligns nicely with the idea of “threat models” in computer security.

Threat Models

When we say that a system is secure, it is always with respect to a particular collection of threats. We don’t allow a designer to claim that a system is universally secure against threats other than those explicitly accounted for. Similarly, we should think of different kinds of unfairness as attacks on society at large, or even attacks on groups of people. We can design tools to detect these attacks and possibly even protect against them — these are the solutions we seek. But addressing one kind of attack does not mean that we can fix a different “attack” the same way. That might require a different solution.

Identifying these attacks requires the designer to actually pay attention to the subject of the threat — the groups or individuals being targeted. Because if you don’t know their situation, how on earth do you expect to identify where their harms are coming from? This allows us a great deal more nuance in modeling, and I’d even argue that it pushes the level of abstraction for our reasoning down to the “right” level.

This search for nuance in modeling is precisely where I think computer science can excel. Our solutions here would be the conception of different forms of attack, how they relate to each other, and how we might mitigate them.

We’re already beginning to see examples of this way of thinking. One notable example that comes to mind is the set of strategies that fall under what has been termed POTs (“Protective Optimization Technologies”) due to Overdorf, Kulynych, Balsa, Troncoso and Gürses (one, two). They argue that in order to defeat the many problems introduced by optimization systems – a general framework that goes beyond decision-making to things like representations and recommendations – we should design technology that users (or their “protectors”) could use to subvert the behavior of the optimization system.

POTs have challenges of their own – for one thing they can also be gamed by players with access to more resources than others. But they are an example of what decentered solution-focused technology might look like.

I wrote this essay partly to help myself understand what decentering even might mean in a tech context, and why current formulations of fairness might be missing out on novel perspectives. I’ll have more to say on this in a later post.

On the new PA recidivism risk assessment tool

(Update: apparently as a result of all the pushback from activists, the ACLU and others, the rollout of the new tool has been pushed back at least 6 months)

The Pennsylvania Commission on Sentencing is preparing a new risk assessment tool for recidivism to aid in sentencing. The mandate for the commission (taken from their report — also see the detailed documentation at their site) is to (emphasis all mine):

adopt a Sentence Risk Assessment Instrument for the sentencing court to use to help determine the appropriate sentence within the limits established by law…The risk assessment instrument may be used as an aide in evaluating the relative risk that an offender will reoffend and be a threat to public safety. (42 Pa.C.S.§2154.7) In addition to considering the risk of re-offense and threat to public safety, Act 2010-95 also permits the risk assessment instrument to be used to determine whether a more thorough assessment is necessary, or as an aid in determining appropriate candidates for alternative sentencing (e.g., County Intermediate Punishment, State Intermediate Punishment, State Motivational Boot Camp, and Recidivism Risk Reduction Incentive).

I was hired by the ACLU of Pennsylvania to look at the documentation provided as part of this new tool and see how they built it. I submitted a report to them a little while ago.

The commission is running public hearings to take comments and I thought I’d highlight some points, especially focusing on what I think are important “FAT*” notions for any data science project of this kind.

What is the goal of the predictor?

When you build any ML system, you have to be very careful about deciding what it is that you want to predict. In PA’s model, the risk assessment tool is to be used (by mandate) for determining

  • reoffense likelihood
  • risk to public safety

Note that these are not the same thing! Using a single tool to predict both, or using its predictions to make assessments about both, is a problem.

How is this goal being measured?

You have to dig into the reports to see this (page 6): they measure recidivism as

re-arrest for a felony or misdemeanor in Pennsylvania within three years of imposition of a sentence to the community or within three years of release from confinement; or, for offenders sentenced to state prison, a recommitment to the Department of Corrections for a technical violation within three years of release from confinement.

How does the predictor goal match the measured goal?

Here’s where it gets interesting. I’m not at all clear how “risk to public safety” is measured by re-arrests. Moreover, using re-arrest as a proxy for reoffense is a big potential red flag, if we are concerned about excessive policing issues as well as patterns that target minorities. As a matter of fact, a 2013 recidivism report by Pennsylvania (Thanks to Nyssa Taylor at ACLU-PA for finding this) says (page 17) that re-arrest rates are highest for African-Americans, whereas reincarceration rates are more evenly balanced by race.

Notice also that technical violations of parole are included in measurements of recidivism. Setting aside the question of whether any technical violation of parole amounts to a risk to public safety, it’s known, for example from pre-trial risk assessments, that failure to appear in court occurs for many reasons that often correlate more with poverty (and the inability to take time off to appear in court) than with actual flight risk.

It’s not clear what a technical violation of parole might constitute and whether there are race biases in this calculation. Note that since this is aggregated into a predicted value, it doesn’t undergo the more detailed nondiscrimination analysis that other features do.

Separately, I’ll note that the PA SC did discover that, as a feature, prior arrests carry a race bias that is not mitigated by predictive efficacy, and therefore decided to replace it with prior convictions.

How is the predictor being used?

What’s interesting about this tool is that its output is converted (as usual) into a low, medium or high risk label. But the tool is only used when the risk is deemed either low or high. This determination then triggers further reports. In the case when it returns a medium risk, the tool results are not passed on.

What I didn’t see is how the predictor guides a decision towards alternate sentencing, and whether a single predictor for “risk of recidivism” is sufficient to determine the efficacy of alternate interventions (Narrator: it probably isn’t).


There are many interesting aspects of the tool-building process: how they essentially build a different tool for each of 10 different crime categories, how they decided to group categories together, and how they decided to use a different model for crimes against a person. The models used are all logistic regression, and the reports provide the variables that end up in each model, as well as the weights.

But to me, the detailed analysis of the effectiveness of the tool and of which variables don’t carry racial bias misses some of the larger issues with how they even decide what the “labels” are.

Models need doubt: the problematic modeling behind predictive policing

Predictive policing describes a collection of data-driven tools that are used to determine where to send officers on patrol on any given day. The idea behind these tools is that we can use historical data to make predictions about when and where crime will happen on a given day and use that information to allocate officers appropriately.

On the one hand, predictive policing tools are becoming ever more popular in jurisdictions across the country. They represent an argument based on efficiency: why not use data to model crime more effectively and therefore provision officers more usefully where they might be needed?

On the other hand, critiques of predictive policing point out that a) predicting crimes based on arrest data really predicts arrests and not crimes and b) by sending officers out based on predictions from a model and then using the resulting arrest data to update the model, you’re liable to get into a feedback loop where the model results start to diverge from reality.

This was empirically demonstrated quite elegantly by Lum and Isaac in a paper last year, using simulated drug arrest data in the Oakland area as well as an implementation of a predictive policing algorithm developed by PredPol (the implementation was based on a paper published by researchers associated with PredPol). For further discussion on this, it’s worth reading Bärí A. Williams’ op-ed in the New York Times, a response to this op-ed by Andrew Guthrie Ferguson (who’s also written a great book on this topic) and then a response by Isaac and Lum to his response.

Most of the discussion and response has focused on specifics of the kinds of crimes being recorded and modeled and the potential for racial bias in the outcomes.

In our work, we wanted to ask a more basic question: what’s the mechanism that makes feedback affect the predictions a model makes? The top-line ideas emerging from our work (two papers that will be published at the 1st FAT* conference and at ALT 2018) can be summarized as:

Biased observations can cause runaway feedback loops. If police don’t see crime in a neighborhood because the model told them not to go there, the missing observations only reinforce the model’s choices.

Over time, such models can generate predictions of crime rates that (if used to decide officer deployment) will skew the data used to train the next iteration of the model. Since models might be retrained every day (as was done in at least one published work describing PredPol-like algorithms), this skew can take hold quickly.

But this is still speculation. Can we mathematically prove that this will happen? The answer is yes, and this is the main contribution of our paper to appear at FAT*. By modeling the predictive process with a generalization of a Pólya urn, we can mathematically prove that the system will diverge out of control: if two areas have even slightly different crime rates, a system that uses predictive modeling to allocate officers, collects the resulting observational data, and retrains the model will progressively put more and more emphasis on the area with the slightly higher crime rate.
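As a hedged illustration (this is a deterministic mean-field caricature, not the paper’s actual urn model, and the rates are invented), the drift is easy to see in a few lines: allocate attention in proportion to past observed counts, observe only where you look, and even a one-percentage-point gap in underlying rates steadily inflates the hotter area’s share of the data.

```python
# Mean-field sketch of the feedback loop. Two areas; attention is allocated
# in proportion to accumulated observed counts, and expected observations
# accrue only in proportion to that attention. All numbers are made up.
true_rates = [0.11, 0.10]  # area 0 is only slightly "hotter"
counts = [1.0, 1.0]        # accumulated observed incidents per area

for day in range(200_000):
    total = counts[0] + counts[1]
    for i in (0, 1):
        p_visit = counts[i] / total            # chance of patrolling area i
        counts[i] += p_visit * true_rates[i]   # expected observations there

share = counts[0] / (counts[0] + counts[1])
print(f"share of data from area 0: {share:.2f}")  # well above 1/2
```

The share keeps creeping toward 1 the longer the loop runs, even though the true rates differ by a single percentage point.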

Moreover, we can see this effect in simulations of real-world predictive policing deployments using the implementation of PredPol used by Lum and Isaac in their work, providing justification for our mathematical model.

Now let’s take a step back. If we have a model that exhibits runaway feedback loops, then we might try to fix the model to avoid such bad behavior. In our paper, we show how to do that as well. The intuition here is quite simple. Suppose we have an area with a very high crime rate as estimated by our predictive model. Then observing an incident should not surprise us very much: in fact, it’s likely that we shouldn’t even try to update the model from this incident. On the other hand, the less we expect crime to happen, the more we should be surprised by seeing an incident and the more willing we should be to update our model.

This intuition leads to a way in which we can take predictions produced by a black box model and tweak the data that is fed into it so that it only reacts to surprising events. This then provably yields a system that will converge to the observed crime rates. And we can validate this empirically again using the PredPol-inspired implementation. What our experiments show is that such a modified system does not exhibit runaway feedback loops.
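One standard flavor of this “only react to surprise” idea is Horvitz-Thompson-style importance weighting: weight each observation by the inverse of the probability of having looked there at all. To be clear, this is a sketch of the intuition rather than the paper’s exact urn correction, and the numbers are again invented.

```python
# Attention is still allocated in proportion to past counts, and incidents
# are still observed only where attention goes, but each expected observation
# is now weighted by 1 / p_visit: incidents seen where we almost always
# patrol count for less, incidents seen where we rarely patrol count for more.
true_rates = [0.11, 0.10]
counts = [1.0, 1.0]

for day in range(200_000):
    total = counts[0] + counts[1]
    for i in (0, 1):
        p_visit = counts[i] / total
        # (p_visit * rate) expected observations, each weighted by 1/p_visit:
        # the visit probability cancels, so counts grow with the true rates.
        counts[i] += (p_visit * true_rates[i]) / p_visit

share = counts[0] / (counts[0] + counts[1])
print(f"share of data from area 0: {share:.3f}")  # ~0.524 = 0.11 / 0.21
```

With the weighting in place, the two areas’ shares of the data track the true rate ratio instead of running away.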

A disclaimer: in the interest of clarity, I’ve conflated terms that in reality should be distinct: an incident is not an arrest is not a crime. And it can’t always be assumed that just because we don’t send an officer to an area that we don’t get any information about incidents (e.g., via 911 calls). We model these issues more carefully in the paper, and in fact show that as the proportion of “reported incidents” (i.e., those not obtained as a consequence of model-directed officer patrols) increases, model accuracy increases in a predictable and quantifiable way if we assume that those reported incidents accurately reflect crime. This is obviously a big assumption, and the extent to which different types of incidents reflect the underlying ground truth crime rate likely differs by crime and neighborhood – something we don’t investigate in our paper but believe should be a priority for any predictive policing system.

From the perspective of machine learning, the problem here is that the predictive system should be an online learning algorithm, but is actually running in batch mode. That means that it is unable to explore the space of possible models and instead merely exploits what it learns initially.

What if we could redesign the predictive model from scratch? Could we bring in insights from online learning to do a better job? This is the topic of our second paper and the next post. The short summary I’ll leave you with is that by carefully modeling the problem of limited feedback, we can harness powerful reinforcement learning frameworks to design new algorithms with provable bounds for predictive policing.

Being Hopeful about Algorithms

I’ve been attending “think-events” around algorithmic fairness of late, firstly in Philadelphia (courtesy of the folks at UPenn) and then in DC (courtesy of the National Academy of Science and the Royal Society).

At these events, one doesn’t see the kind of knee-jerk reaction to the idea of fairness in learning that I’ve documented before. But there’s a very thoughtful critique that comes from people who’ve spent a lot of time themselves thinking and working on these topics. And it goes something like this.

Do we expect more from algorithms than we expect from people? And is that reasonable?

I first heard this critique much earlier at  a Dagstuhl meeting on this topic, when I was asked this question by H. V. Jagadish (who has a great course on ethics in data mining). It came up indirectly during discussions at the Philadelphia event (about which I hope to say something later) and was phrased in this form by Vint Cerf at the Sackler Forum.

I found myself unable to answer it convincingly. We’ve had thousands of years to set up institutions based on human decision-making. These processes have been flawed and biased. People have made decisions with implicit and explicit bias.

Why then do we demand that algorithms do more? Why do we demand that they account for themselves and explain themselves in ways that we don’t ask human judges to do?

I used to have an answer. I argued that algorithms speak the language of mathematics and so we need to translate all our human ideals – of ethics, fairness and justice – into a form that an algorithm could understand. But then we start talking about accountability, interpretability, how an algorithm might explain itself, and what that might even mean.

Jon Kleinberg has this analogy of a learning algorithm as this incredibly obtuse friend that you bring to a party, that you have to explain EVERYTHING to. Where the food is, what the drinks are, what people are saying, and so on. We don’t have to do this for real people because they have a vast body of prior context to work with. Indeed, this prior context is what decides how they function in the world, and is made up of all kinds of heuristics and “biasing” of the space of possible outcomes (as Joanna Bryson puts it).

So it would seem that asking an algorithm for its “audit trail” is the equivalent of asking (say) a human judge “give me the entire story of your life experiences that explains why you made this decision”.

And of course we never do this. In fact, all we really do is set out a series of guidelines and expect the judges to be more or less consistent with them. Similarly for hiring, or credit decision, or any other kind of decision making. In other words, we expect a certain degree of procedural consistency while accepting that individuals may apply discretion based on their own perspective.

So I return to the question from before. Why do we expect an automated decision making process to be any better?

There’s an optimistic take on this. We can’t expect an audit trail from a human decision maker because we don’t have the capacity to generate one. That my verdict on a dog owner might in part be due to being bitten by a dog as a child is something that I’m unlikely to be able to cogently articulate. But it is at least a little unfair that I sentence dog owners more harshly for this reason.

But if we are able to produce such an audit trail from an algorithmic decision maker we do have the hope of revealing implicit preferences and biases based on the *algorithm’s* “life experiences” aka “training data”. And so we can expect more because we have the ability to do so.

An alternate perspective on this goes as follows. We’ve built up over the decades and centuries a system of checks and balances and accountability procedures for evaluating the quality of human decision making. We have laws that require non-discrimination, we have ways to remove decision-makers who make arbitrary decisions, and we have a social structure that makes decision-makers feel a sense of responsibility for their decisions.

None of these exist for algorithmic decision-making, nor realistically can they. We can’t call an algorithm to account for a bad decision: ultimately all liability rests on legal persons. So the next best thing is to make an algorithm assist in the decision-making process, but require transparency so that the human decision-maker can’t blame the algorithm for bad decisions (“It’s not me, it’s the algorithm!”), a story that played out in Cory Doctorow’s Human Readable.

There’s a tension between “let’s use automation wherever reasonable” and “wait, how are you shifting harm?”. We don’t want to stop the deployment of algorithms in decision-making, and frankly I doubt that one could even if one wanted to. But it’s also not unreasonable to express some caution (and perhaps some humility) when doing this. We’re not expecting perfection from automated decision-making: it’s perfectly reasonable to expect merely that we do better than human decision makers. But why not expect that as well as a decision that we can understand? Why essentially give up by saying “the algorithm cannot be both powerful and understandable”? To me, that’s the real failure of hope.

A funny thing happened on the way to the arXiv….

As I mentioned in the previous post, Sorelle Friedler, Carlos Scheidegger and I just posted a note to the arXiv on worldviews for thinking about fairness and nondiscrimination.

We uploaded the article last Friday, and it appeared on the arXiv on Sunday evening. By Monday late morning (less than 24 hours after the article was posted), we received this email:

I’m a reporter for Motherboard, VICE Media’s technology news site who frequently covers bias in machine learning. I read your paper posted to arXiv and would love to interview one of you for a piece on the work.

I assumed the reporter was referring to one of the two papers we’ve written so far on algorithmic fairness. But no, from the subject line it was clear that the reporter was referring to the article we had just posted! 

I was quite nervous about this: on the one hand it was flattering and rather shocking to get a query that quickly, and on the other hand this was an unreviewed preprint.

In any case, I did the interview. And the article is now out!

On the (im)possibility of fairness…

Ever since we started thinking about algorithmic fairness and the general issue of data-driven decision-making, there’s always been this nagging issue of “well what if there are cues in data that seem racist/sexist/(–)-ist and yet provide a good signal for a decision?”

There’s no shortage of people willing to point this out: see for example my post on the standard tropes that appear whenever someone discovers bias in some algorithmic process. Most of the responses betray an unexamined belief in the truth of what algorithms discover in data, and that is not satisfying either.

So the problem we’ve faced is this. If you examine closely the computer science literature on fairness and bias, it becomes clear that people are talking at cross-purposes: essentially arguing about why your orange is not more like my apple. And it has become clear that this is because of different assumptions about the world (how biased it is, how unbiased certain features are, and so on).

Here’s the pitch:

Can we separate out assumptions and beliefs about fairness from mechanisms that we deploy to ensure it? And in doing so, can we provide a useful vocabulary for talking about these issues within a common framework?

Here’s the result of our two-year long quest:

On the (im)possibility of fairness

What does it mean for an algorithm to be fair? Different papers use different notions of algorithmic fairness, and although these appear internally consistent, they also seem mutually incompatible. We present a mathematical setting in which the distinctions in previous papers can be made formal. In addition to characterizing the spaces of inputs (the “observed” space) and outputs (the “decision” space), we introduce the notion of a construct space: a space that captures unobservable, but meaningful variables for the prediction.
We show that in order to prove desirable properties of the entire decision-making process, different mechanisms for fairness require different assumptions about the nature of the mapping from construct space to decision space. The results in this paper imply that future treatments of algorithmic fairness should more explicitly state assumptions about the relationship between constructs and observations.
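The setup in the abstract can be summarized as a pair of maps between the three spaces. A rough sketch (the symbols here are assumed for illustration; the paper’s own notation may differ):

```latex
% Three spaces and the maps between them (symbols assumed for illustration):
%   C  -- construct space: unobservable but meaningful variables
%   O  -- observed space:  the measured features fed to the algorithm
%   D  -- decision space:  the outcomes of the decision procedure
\[
g : C \to O, \qquad f : O \to D, \qquad f \circ g : C \to D
\]
% Here g models how constructs manifest as observations, and f is the
% decision rule. Fairness properties of the full process concern the
% composite map f \circ g, so what we can guarantee depends on the
% assumptions we are willing to make about g.
```

This is why the abstract stresses assumptions about “the mapping from construct space to decision space”: the decision rule only ever sees observations, so any fairness claim about the end-to-end process must take a stance on how faithfully observations reflect constructs.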

This paper has been a struggle to write. It’s a strange paper in that the main contribution is conceptual: establishing what we think are the right basic primitives that can be used to express (mathematically) concepts like fairness, nondiscrimination, and structural bias.

We owe a great debt to our many friends in the social sciences, as well as to the decades of research on this topic in those fields. Much of the conceptual development we outline has been laid out in prose by the many theories of social justice starting with Rawls, and particularly by Roemer. Our main goal has been to mathematize some of these ideas so that we can apply them to algorithms.


There’s a great deal of trepidation with which we release this: it’s in many ways a preliminary work that raises more questions than it answers. But we’ve benefited from lots of feedback within CS and without, and hope that this might clarify some of the discussions swirling around algorithmic fairness.

White House Report on Algorithmic Fairness

The White House has put out a report on big data and algorithmic fairness (announcement, full report).  From the announcement:

Using case studies on credit lending, employment, higher education, and criminal justice, the report we are releasing today illustrates how big data techniques can be used to detect bias and prevent discrimination. It also demonstrates the risks involved, particularly how technologies can deliberately or inadvertently perpetuate, exacerbate, or mask discrimination.

The table of contents for the report gives a good overview of the issues addressed:

Big Data and Access to Credit
The Problem: Many Americans lack access to affordable credit due to thin or non-existent credit files.
The Big Data Opportunity: Use of big data in lending can increase access to credit for the financially underserved.
The Big Data Challenge: Expanding access to affordable credit while preserving consumer rights that protect against discrimination in credit eligibility decisions

Big Data and Employment
The Problem: Traditional hiring practices may unnecessarily filter out applicants whose skills match the job opening.
The Big Data Opportunity: Big data can be used to uncover or possibly reduce employment discrimination.
The Big Data Challenge: Promoting fairness, ethics, and mechanisms for mitigating discrimination in employment opportunity.

Big Data and Higher Education
The Problem: Students often face challenges accessing higher education, finding information to help choose the right college, and staying enrolled.
The Big Data Opportunity: Using big data can increase educational opportunities for the students who most need them.
The Big Data Challenge: Administrators must be careful to address the possibility of discrimination in higher education admissions decisions.

Big Data and Criminal Justice
The Problem: In a rapidly evolving world, law enforcement officials are looking for smart ways to use new technologies to increase community safety and trust.
The Big Data Opportunity: Data and algorithms can potentially help law enforcement become more transparent, effective, and efficient.
The Big Data Challenge: The law enforcement community can use new technologies to enhance trust and public safety in the community, especially through measures that promote transparency and accountability and mitigate risks of disparities in treatment and outcomes based on individual characteristics.

“Investigating the algorithms that govern our lives”

This is the title of a new Columbia Journalism Review article by Chava Gourarie on the role of journalists in explaining the power of algorithms. She goes on to say

But when it comes to algorithms that can compute what the human mind can’t, that won’t be enough. Journalists who want to report on algorithms must expand their literacy into the areas of computing and data, in order to be equipped to deal with the ever-more-complex algorithms governing our lives.

I’m quoted in this article, as are other researchers, and Moritz Hardt’s Medium article on how big data is unfair is mentioned as well.

As they say, read the rest 🙂