On stochastic parrots

I don’t often write long reviews of single papers. Maybe I should.

Stochastic Parrots have finally launched into mid-air. The paper at the heart of the huge brouhaha involving Google’s ‘resignating’ of Timnit Gebru back in December is now available, and will appear at FAccT 2021.

Reading papers in this space is always a tricky business. The world of algorithmic fairness is much broader than either algorithms or fairness (and stay tuned for my next post on this). Contributions come in many forms, from many different disciplinary and methodological traditions, and are situated in different contexts. Identifying the key contributions of a paper and how they broaden the overall discussion in the community can be tricky, especially if we define contributions based on our own traditions. And then we have to critique a paper on its own terms, rather than in terms of things we want to see. And distinguishing those two is apparently hard, as evidenced by the hours of argument I've been having*.

So what are the contributions of this paper? I view it as making two big arguments (containing smaller arguments) about large language models. Firstly, that the accounting of the costs of building large language models systematically ignores large externalities which, if factored in, would probably change the cost equation. And secondly, that the benefits associated with these models are based on claims of efficacy that don't incorporate many known issues in the model building process. There's a third related point about the way in which research in LLMs itself runs the risk of distracting from NLP research.

Let's take these one by one. The first externality is clear: the now well-documented environmental cost associated with building larger and larger models. The paper doesn't necessarily add new information to this cost, but it points out that in the context of language models, it is unclear whether the added accuracy improvements that might come from large models are worth the environmental cost. And especially since one of the arguments made for language models is the 'democratization of NLP tech', the authors also remind us (Bender Rule alert!) that much of the benefit of work on language models will accrue to English-speaking (and therefore already well-resourced) populations while the environmental harms will accrue to people in small island nations that are unlikely to benefit at all.

Are language models the biggest contributor to climate change? No. But I don't think that's the point here. The point is that pursuing size as a value in and of itself (see the gushing headlines over the new trillion-parameter model that just came out) as a proxy for better conveniently misses out on the harms associated with it, something that everyone who thinks about the calculus of harms and benefits in society knows they shouldn't be doing. Further, in the spirit of 'constraints breeding creativity', why NOT initiate more research into how to build small and effective language models?

The second externality is what the authors partly refer to as the ‘documentation debt’ of large and complex data sets. This is not directly linked to trillion-state TMs per se, but it is clear that the collection of and access to huge text corpora enables (and justifies) the building of LLMs. Many of the problems with large data sets are well documented and are explored in detail here in an NLP context: the shifting nature of context, the problems with bias, the self-selection and non-uniformity in how these sets are collected and curated, and so on.

So much for the costs. What about the benefits? The section that gives the paper its title goes into great detail on the ways in which the perceived benefits of language models come from a deep tension between the weak and strong AI standpoints: does a language model actually understand language or not? The authors go into a number of cases where LLMs have demonstrated what appears to be an understanding of language, particularly in the case of question-answering. They argue that the apparent understanding is illusory and subjective, and this is a real problem because of our human tendency to ascribe meaning and intent to things that look like communication. They document in excruciating detail (yes with lots of references, ahem) cases where this perception of meaning and intent can be and has been weaponized.

A more subtle danger of the focus on LLMs, one that I am ill equipped to comment on since I’m not an NLP researcher, is the argument that the focus on large language models is distorting and takes away from efforts to truly understand and build general language systems. One might very well respond with “well I can chew gum and walk at the same time”, but I think it’s fair game to be concerned about the viability of research that doesn’t solely focus on LLMs, especially as DL continues its Big Chungus-ifying march.

CS papers are considered incomplete if they raise problems and don't present solutions. It's beyond the scope of this post to address the many problems with that take. However, the authors don't want to fight another battle and dutifully present ways to move forward with LLMs, amassing a thoughtful collection of process suggestions that synthesize a lot of thinking in the community and apply them to language model development.

So. How does this paper add to the overall discussion?

The NLP context is very important. It's one thing to write a general articulation of harms and risks. And indeed in a very broad sense the harms and risks identified here have been noted in other contexts. But the contextualization to NLP is where the work is being done. And while the field does have a robust ongoing discussion around ethical issues, thanks in no small part to the authors of this paper and their prior work, this systematic spelling out of the problems is important to do, is not something just anyone could do (I couldn't have done it!), and is valuable as a reference for future work.

Does the paper make sound arguments? Yoav Goldberg has posted a few of his concerns, and I want to address one of them, which comes up a lot in this area. Namely, the idea that taking a “one-sided” position weakens the paper.

In my view, the paper is taking a "position" only in the sense of calling out the implicit positioning of the background discussions. There's a term in media studies coined by Barthes called 'exnomination'. It's a complex notion that I can't do full justice to here, but in its simplest form it describes the act of "setting the default" or "ex-nominating" as an act of power that forces any (now) non-default position to justify itself. In the context of language, you see this in 'person' vs 'black person', or (my go-to example) in Indian food, "food" vs "non-vegetarian food".

What has happened is that the values of largeness, scale, and “accuracy” have been set as the baseline, and now any other concern (diversity, the need to worry about marginalized populations, concern for low-resource language users etc) becomes a “political position” to be justified.

Does this paper indulge in advocacy? Of course it does. Does that make it less “scientific”? Setting aside the normative loadedness of such a question, we should recognize that papers advocate for specific research agendas all the time. The authors have indeed made the case (and you’d have to be hiding under a rock to not know this as well) that the focus on size obscures the needs of populations that aren’t represented by the majority as seen through biased data collection. You might disagree with that case and that’s fine. But it’s well within scope of a paper in this area to make the case for such a focus without being called ‘political’ in a pejorative manner.

And this brings me back to the point I started with. In the space of topics that broadly speak to the social impact of machine learning, we don’t just have a collection of well defined CS problems. We have a bewildering, complex and contested collection of frames with which we think about this landscape. A paper contributes by adding a framing discussion as much as by specific technical contributions. In other words, it remains true that “what is the right problem to solve” is one of the core questions in any research area. And in that respect, this paper provides a strong corrective to the idea of LLMs as an unquestionably valuable object of study.

* My initial knee-jerk reaction to this paper was that it was nice but unexceptional. Many arguments with Sorelle and Carlos have helped move me away from that knee-jerk response towards a more careful articulation of its contributions.

On “Bostock vs Clayton County” and algorithmic discrimination.

https://www.theatlantic.com/ideas/archive/2020/06/what-because-of-sex-really-means/613099/

https://www.stanfordlawreview.org/online/the-many-meanings-of-because-of/

I’ve been reading a number of analyses of the landmark Gorsuch decision in the LGBTQ discrimination case. The articles linked above are very helpful in this regard, but I couldn’t help but also notice a very computational argument in Gorsuch’s reasoning that might be relevant for algorithmic discrimination.

The question at hand was whether firing someone because they were gay or trans could be viewed as being "because of sex" as per the Civil Rights Act. The opposing argument was that they weren't fired because of their sex (or gender to be more precise) but because they were gay or trans, and since sexual orientation/gender identity was not explicitly protected in the Civil Rights Act, it's not a violation.

The argument from the majority, as I understand it, can be translated mathematically as follows. Consider a function WTF ("whether to fire") that seemingly takes two parameters (s, t), where s = X denotes "sex = X" and t = Y denotes either "attraction to people of gender Y" or "presents as gender Y".

(note that for the purpose of mathematical abstraction I’m violently conflating gender, sex and what it means to “present as gender Y”: these distinctions are very material but I can make the mathematical argument without needing to delve into the context).

The question is whether we can express WTF(s, t) = g(t) alone. As Gorsuch argues, this clearly cannot be the case, else (in the case of Bostock) they'd have to also fire all women attracted to men, or (in the case of Stephens) all people presenting as women.

Clearly WTF(s, t) cannot be written as h(s) alone either (in fact, if it were, that would be blatant discrimination). In other words, the variable s contributes to the function outcome without being the sole determiner of it; i.e., in the parlance of explanations, the feature s has influence over WTF(s, t).
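To make the structure of the argument concrete, here is a toy sketch (mine, not the Court's, with a made-up rule) of why such a firing rule cannot be a function of t alone: hold t fixed, flip s, and the outcome changes.

```python
# Toy illustration (not from the opinion): a hypothetical firing rule in the
# Bostock scenario. s = the employee's sex, t = the gender they are attracted to.

def wtf(s: str, t: str) -> bool:
    """Hypothetical 'whether to fire' rule of the kind used against Bostock."""
    return s == "male" and t == "male"   # fires men attracted to men

# If wtf(s, t) were really some g(t) alone, these two calls would have to agree,
# since t is identical in both:
print(wtf("male", "male"))    # True  -> fired
print(wtf("female", "male"))  # False -> not fired

# They differ, so s influences the outcome: wtf cannot be written as g(t).
```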

And in the mind of the majority, this suffices to declare that this is invalid under Title VII.

It should be clear then what the implications for algorithmic discrimination are. On the positive side, it might be sufficient to show that a protected feature has some influence on the outcome (i.e., a more disparate-impact-like analysis). But before we get too excited about this, it's unlikely that we'll get the clear and stark difference between WTF(s, t) and g(t) that was present in this case, so it remains to be seen what kind of 'burden of scrutiny' will come into play. Will it be as simple as a 4/5 rule?

On centering, solutionism, justice and (un)fairness.

Centering

One of the topics of discussion in the broader conversation around algorithmic fairness has been the idea of decentering: that we should move technology away from the center of attention – as the thing we build to apply to people – and towards the sides – as a tool to instead help people.

This idea took me a while to understand, but makes a lot of sense. After all, we indeed wish to use “tech for good” — to help us flourish — the idea of eudaimonia that dates back to Aristotle and the birth of virtue ethics.

We can’t really do that if technology remains at the center. Centering the algorithm reinforces structure; the algorithm becomes a force multiplier to apply uniform solutions for all people. And that kind of flattening – the treatment of all the same way – is what leads to procedural ideas of fairness as consistency, as well as systematically unequal treatment of those that are different.

Centering the algorithm feeds into our worst inclinations towards tech solutionism – the idea that we should find the “one true method” and apply it everywhere.

So what should we, as computer scientists, do instead? How can we avoid centering the algorithm and instead focus on helping people flourish, while at the same time allowing ourselves to be solution-driven? One idea that I'm becoming more and more convinced of is that, as Hutchinson and Mitchell argue in their FAT* 2019 paper, we should make the shift from thinking about fairness to thinking about (un)fairness.

Unfairness

When we study fairness, we are necessarily looking for something universal. It must hold in all circumstances — a process cannot be fair if it only works in some cases. This universality is what leads to the idea of an all-encompassing solution – “Do this one thing and your system will be fair”. It’s what puts the algorithm at the center.

But unfairness comes in many guises, to paraphrase Tolstoy. And it looks different for different people under different circumstances. There may be general patterns of unfairness that we can identify, but they often emerge from the ground up. Indeed, as Hutchinson and Mitchell put it,

Individuals seeking justice do so when they believe that something has been unfair

Hutchinson & Mitchell. 50 Years of Test (Un)fairness: Lessons for Machine Learning. ACM FAT* 2019.

And to the extent that our focus should be on justice rather than fairness, this distinction becomes very important.

How does a study of unfairness center the people affected by algorithmic systems while still satisfying the computer scientist’s need for solutions? Because it aligns nicely with the idea of “threat models” in computer security.

Threat Models

When we say that a system is secure, it is always with respect to a particular collection of threats. We don’t allow a designer to claim that a system is universally secure against threats other than those explicitly accounted for. Similarly, we should think of different kinds of unfairness as attacks on society at large, or even attacks on groups of people. We can design tools to detect these attacks and possibly even protect against them — these are the solutions we seek. But addressing one kind of attack does not mean that we can fix a different “attack” the same way. That might require a different solution.

Identifying these attacks requires the designer to actually pay attention to the subject of the threat — the groups or individuals being targeted. Because if you don’t know their situation, how on earth do you expect to identify where their harms are coming from? This allows us a great deal more nuance in modeling, and I’d even argue that it pushes the level of abstraction for our reasoning down to the “right” level.

This search for nuance in modeling is precisely where I think computer science can excel. Our solutions here would be the conception of different forms of attack, how they relate to each other, and how we might mitigate them.

We’re already beginning to see examples of this way of thinking. One notable example that comes to mind is the set of strategies that fall under what has been termed POTs (“Protective Optimization Technologies”) due to Overdorf, Kulynych, Balsa, Troncoso and Gürses (one, two). They argue that in order to defeat the many problems introduced by optimization systems – a general framework that goes beyond decision-making to things like representations and recommendations – we should design technology that users (or their “protectors”) could use to subvert the behavior of the optimization system.

POTs have challenges of their own – for one thing they can also be gamed by players with access to more resources than others. But they are an example of what decentered solution-focused technology might look like.

I wrote this essay partly to help myself understand what decentering even might mean in a tech context, and why current formulations of fairness might be missing out on novel perspectives. I’ll have more to say on this in a later post.

FAT* Papers: Fairness Methods

The conference is over, and I’m more exhausted than I thought I’d be. It was exhilarating. But the job of a paper summarizer never ends, and I am doing this exercise as much for my own edification as anyone else’s 🙂

The theme of this session is a little more spread out, but all the papers are “tools” papers in the classic ML sense: trying to build widgets that can be useful in a more introspective processing pipeline.

Fairness through Causal Awareness: Learning Causal Latent-Variable Models for Biased Data

The interaction between causality and fairness is getting steadily more interesting. In a sense it’s hard to imagine doing any kind of nondiscriminatory learning without causal models because a) observational data isn’t sufficient to determine bias without some understanding of causality, and b) causal modeling helps us understand where the sources of bias might be coming from (more on this in a later post).

This paper continues a line of thinking that says, “let’s try to posit an explicit causal model that connects protected attributes with outcomes, and learn the parameters of this model from observational data”. The key idea here is: we have observational data X, treatment T and outcome Y. We have a feeling that protected attributes A might be affecting all of these, but it’s not clear how. Let’s assume that the observed data X (and the treatment and outcome) is influenced by A, and some hidden latent features Z that by definition are independent of A. If we can infer Z from observational data then we have a way to measure the effects of A on representation, treatment and outcome.

But how do we learn Z? By using a deep learning architecture that asks that we find Z given X. This is the main technical meat of the paper, and requires some careful assumptions over confounders. But the "trick", as it were, is to replace "parameters of a distribution" by "a neural network", which is a common trick in representation learning.

The upshot of all of this is that we can learn structural equations connecting X, Z, A, T and Y, and then do interventions to determine the degree to which A influences treatments and outcomes.
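As a rough sketch of what this buys you (with invented linear structural equations and coefficients standing in for the neural networks the paper actually learns), the structural-equation view plus an intervention on A might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical structural equations: Z is latent and independent of A, while
# X, T and Y are influenced by both. The linear forms and coefficients are
# made up for illustration; the paper learns these relationships instead.
A = rng.binomial(1, 0.5, n)                                        # protected attribute
Z = rng.normal(0, 1, n)                                            # latent features
X = 1.0 * Z + 0.8 * A + rng.normal(0, 0.1, n)                      # observed features
T = (0.5 * Z + 0.7 * A + rng.normal(0, 0.1, n) > 0.6).astype(int)  # treatment
Y = 1.5 * Z + 0.3 * T + 0.4 * A + rng.normal(0, 0.1, n)            # outcome

def intervene(a: int) -> float:
    """do(A := a): hold Z fixed, set A, regenerate T and Y, return the mean outcome."""
    A_do = np.full(n, a)
    T_do = (0.5 * Z + 0.7 * A_do + rng.normal(0, 0.1, n) > 0.6).astype(int)
    Y_do = 1.5 * Z + 0.3 * T_do + 0.4 * A_do + rng.normal(0, 0.1, n)
    return Y_do.mean()

# Average effect of the protected attribute on the outcome under this toy model.
print(intervene(1) - intervene(0))
```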

To decide or not to decide, that is the question. 

From Soft Classifiers to Hard Decisions: How fair can we be?

Risk assessment tools (and in fact many regression-like systems) output "probabilities": there's an X% chance that this person will fail to appear in court, etc. But a judge has to make a decision: YES or NO. How we go from "soft decisions" (aka probabilities) to "hard decisions" (aka binary outputs) is the topic of this paper.

There are a few obvious ways to do this:

  1. If the probability p > .5, declare YES, else declare NO
  2. Toss a coin that is YES with probability p, and return the value

Either of these “work” in the sense that the accuracy of the resulting “hard” decision system will not be very different from the soft one. But what about the effect on group decisions? It turns out that one has to be far more careful in that case.
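For concreteness, here is a minimal sketch of these two conversion rules, assuming we already have a calibrated score p for each individual:

```python
import random

def threshold_decision(p: float) -> bool:
    """Option 1: declare YES exactly when the soft score exceeds 1/2."""
    return p > 0.5

def randomized_decision(p: float) -> bool:
    """Option 2: declare YES with probability p."""
    return random.random() < p
```

Neither rule looks at group membership at all, which is exactly why their group-level behavior needs separate scrutiny.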

Assume that the soft decision-maker is "well calibrated" – meaning that of all the times it assigns probability p to an event occurring, a p fraction of those times the event actually occurs. And assume that the system is also well-calibrated for each group of people. The bad news is that if the base rates for the groups are different, there's still no way to ensure that any converter from soft to hard classification equalizes error rates across groups.

The paper goes on to discuss scenarios where this “equality of odds” can still be achieved, and it requires stronger conditions on the initial soft classifier. Another interesting trick they make use of is the idea of “deferring” a decision, or basically saying “I DON’T KNOW”. In that case it is in fact possible to equalize all kinds of error rates across groups, with the caveat that the errors are only measured with respect to inputs for which an answer is given.

In practice, a tool that says “I don’t know” a lot will essentially cede control back to a human decision maker, which might be fine, but also makes the claims of balanced error rates questionable because the overall system (including the human) might not have this property (remember the framing trap?). Nevertheless, allowing an “I’m not sure” state in an algorithmic process might be a good way to acknowledge when the algorithm isn’t really sure about what to do.

Which is a convenient segue to the next paper:

Deep Weighted Averaging Classifiers

This paper  is squarely in the explanations realm. The idea is to find a signal to encode the degree of confidence in a prediction and also explain where that confidence comes from, by using a small set of exemplars to say “I classified the point this way because I saw these other close by points classified the same way”.

The trick is to use kernel smoothing ideas. Here's the thought. Suppose I built a classifier for points, and when a new point came along, associated a certainty score with this point. What one might traditionally do is to say that the further away from the classification boundary a point is, the more confident we are in the answer.

We don't want to say that here though. Because suppose we get a point in a region we've never sampled before, but that happens to be far from the current classification boundary. Then one of two things could be happening:

  1. The point is really on the right side with high confidence
  2. We have no idea what the real label should be, because we never sampled points near the query point and so have no idea how the classifier behaves there.

The only way to distinguish between the two cases is to express the uncertainty in terms of nearby points, and that’s what they propose to do. The details involve kernel regression and are not relevant here, but the idea is that a point that is close to many other points with the same label will have a lower uncertainty score than one that’s far from any training points.
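Here is a rough sketch of that idea (my own simplification with a plain Gaussian kernel over raw features; the paper learns the representation with a deep network and uses its own weighting scheme):

```python
import numpy as np

def predict_with_uncertainty(x_query, X_train, y_train, bandwidth=1.0):
    """Kernel-weighted vote over training points, plus an uncertainty score.

    A query far from all training points receives almost no kernel mass, which
    we read as 'I don't know', no matter which side of the decision boundary
    it falls on.
    """
    dists = np.linalg.norm(X_train - x_query, axis=1)
    weights = np.exp(-(dists ** 2) / (2 * bandwidth ** 2))   # Gaussian kernel
    total = weights.sum()
    if total < 1e-6:                        # no nearby evidence at all
        return None, 1.0                    # abstain, maximal uncertainty
    score = (weights * y_train).sum() / total    # weighted vote, labels in {0, 1}
    uncertainty = 1.0 - total / len(X_train)     # less mass nearby -> less sure
    return int(score > 0.5), uncertainty
```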

FAT* Papers: Profiling and Representation

Me (in the hallway at FAT*): Hi

[person]: Oh hi, how’re you doing?

pause…

[person]: So…. when's the next post going to be up?

Which brings us to Session 3.

Kate Crawford gave a talk at NIPS (NeurIPS?) 2017 on harms of representation that has had a profound influence on my thinking about fairness. We’re all familiar with harms that come from biased decision making — harms of allocation — but it’s a little harder to discuss what it means to face harm from a skew in representation. 

A few years ago we saw a series of papers that demonstrated that standard representations of text using methods like Word2Vec and GloVe could encode biases in the training corpora. But can we connect these harms directly to harms of allocation? In other words, to what extent can we attribute a harm of allocation to a skewed representation rather than distributional bias or bad metrics?

Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting

This paper by De-Arteaga et al attempts to explore this question in the context of resume screening. Specifically, suppose we could construct a high dimensional representation of the text in someone's CV. And then suppose we built an occupation predictor that takes this representation as input. Would the behavior of the predictor differ if we scrubbed the CV of obvious gender markers? Specifically, if we defined "behavior of predictor" as the difference in true positive rates between genders, how does this number look with and without scrubbed data?

They do this test with three different representations for CV data scraped from the web and cleaned up. In each instance, the first sentence provides the occupation label and the rest of the text is the source for the derived representation. Gender scrubbing is performed by removing obviously gender-coded pronouns as well as first names.

The main upshot is not too surprising: the worse the gender imbalance in the input training data for an occupation, the more skewed the behavior of the predictor (the TPR difference between male and female). And scrubbing gender markers helps alleviate some of this, but not all. What’s interesting is that using the standard information-theoretic trick of “can I predict gender from the representation” they can show that even the scrubbed representation still has some latent gender coding, which explains why the scrubbed representation doesn’t perfectly eliminate biased error rates.
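In code, the headline measurement is roughly the following (a sketch with assumed array inputs, not the authors' implementation):

```python
import numpy as np

def tpr_gap(y_true, y_pred, gender, occupation):
    """Difference in true positive rates between two gender groups for one occupation.

    Assumes y_true / y_pred hold occupation labels and gender is a 0/1 array;
    both groups must appear among the true holders of this occupation.
    """
    tpr = {}
    for g in (0, 1):
        mask = (y_true == occupation) & (gender == g)
        tpr[g] = (y_pred[mask] == occupation).mean()
    return tpr[0] - tpr[1]
```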

I think this paper is a great addition to the growing study of harms of representation. At the risk of tooting my own (student’s) horn, my student Mohsen Abbasi (together with the invincible monotonically length-increasing triumvirate of Friedler/Scheidegger/Venkatasubramanian) has a paper accepted to the 2019 SIAM Conference on Data Mining that looks at harms of representation more closely from the perspective of stereotype formation.

Those who vote decide nothing. Those who count the vote decide everything. — Joseph Stalin.

Equality of Voice: Towards Fair Representation in Crowdsourced Top-K Recommendations

This paper by Chakraborty et al is not quite about representations: rather it's about diversity of results in recommendations. The problem is thus: if we want to aggregate recommendations from users (say for popular news topics to be fed into a "trending news" timeline), we have to be able to use some kind of voting scheme where users "vote" on news by either retweeting, or sharing, or engaging positively in some way.

But there are very few votes in the system: in the parlance of linear algebra, the matrix of users and news items is very sparse. In such a setting, picking winners is going to be highly unreliable, because most articles will only get a few votes, and a well-organized minority could strategically manipulate the ranking to be non-representative (ed: C'mon Suresh, why you being all doom and gloom – no way this can happen *cough* 4chan *cough*).

Trick #1: use ranked choice voting instead of simply counting. User activity creates a ranked list of preferences and these can be aggregated using ranked choice voting to defeat strategic manipulation. But that requires a dense user-topic matrix!

Trick #2: In order to fill out the matrix, predict  the missing entries for each user using personalized ranking predictors.  For some news sites you can do this by inferring a ranking from how long each user spends on the page.
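A toy stand-in for the two tricks (with made-up numbers, a trivial per-user "predictor", and Borda counting standing in for the paper's ranked-choice scheme) looks roughly like this:

```python
import numpy as np

# Tiny user x item preference matrix with missing entries (np.nan).
prefs = np.array([
    [5.0, np.nan, 2.0],
    [np.nan, 4.0, 1.0],
    [3.0, 2.0, np.nan],
])

# Trick #2 (stand-in): fill each user's missing scores, here with the user's mean.
filled = prefs.copy()
for u in range(filled.shape[0]):
    row = filled[u]
    row[np.isnan(row)] = np.nanmean(prefs[u])

# Trick #1 (stand-in): every user now ranks all items; aggregate the rankings
# (Borda count here, in place of the paper's ranked-choice voting).
n_items = filled.shape[1]
borda = np.zeros(n_items)
for row in filled:
    ranking = np.argsort(-row)                # best item first
    for position, item in enumerate(ranking):
        borda[item] += n_items - 1 - position

print(np.argsort(-borda))                     # aggregated top-K ordering
```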

There are things that make me a little uneasy in the framing and execution of this work. For one, while it's nice to imagine that the top-K recommendations are flawed due to ineffective voting schemes, I suspect the truth — involving a desperate play for engagement — is far more depressing. And so I'm not convinced that fixing the errors in tallying votes really addresses the problems with recommendations. Secondly, it's not clear to me how to generalize Trick #2 to other media: for example the method used to infer interest in Twitter hashtags is rather baroque.

I know what I know. — Paul Simon

Every now and then you have a very bittersweet experience reading a paper. It captures thoughts that have been in your head for ages, and that you should have written down in a paper. But you know that there’s no way you could have written down the thoughts in the way this paper does it.

The Profiling Potential of Computer Vision and the Challenge of Computational Empiricism

Hewing close to the interdisciplinary soul of the conference, this paper by Jake Goldenfein is a rumination on how we know what we know, through the lens of computer vision.

There’s always been a bit of  epistemic jujitsu at the heart of our modern ML-enhanced world. Especially when applied to social phenomena, the promise of ML is that it can make knowable through LEARNING AND DATA things about ourselves that we could not have gleaned from other methods. That is to say, a deep learning system that claims to predict the risk of recidivism carries within itself epistemic content that is irreducible.

This paper brings some much needed context to this idea, pointing out the long history of using measurement and quantification as a privileged way to represent knowledge, and arguing that our current obsession with facial recognition draws on this history, with the extra irony that rather than actually looking at the face, we encode it as a vector before processing it. He argues that in order to think about legal remedies, we have to focus less on the technical limitations of facial recognition, and more on the claim that there are purely computational ways to know. And maybe we should legally disallow such claims.

Let A be the group of all people who are not members of a group…. — Bertrand Russell’s FAT* ghost. 

It's one thing to define fairness with respect to a single protected variable. But what if you don't want to specify the variable? Maybe you're concerned about intersectionality issues and realize that merely protecting against different treatment on race and gender doesn't guarantee protection with respect to race-and-gender. Or maybe you actually don't yet know what groups are likely to be treated unfairly? A series of papers last year explored the idea of defining fairness with respect to all groups that can be expressed through a "simple" predicate (where "simple" has a technical meaning that I won't get into here).

An Empirical Study of Rich Subgroup Fairness for Machine Learning

This paper takes those ideas and empirically evaluates them. And now time for a confession. I find it difficult to read purely empirical papers. I tend to get drowned in seas of charts and tables without a nice juicy lemma to hang my hat on. And so I’ll admit to some difficulty in pulling together the detailed message from this paper. At the highest level, it establishes that trying to optimize fairness with respect to classes of groups is feasible even with heuristic oracles that solve the classification problem unconstrained by fairness constraints, and so this method of building a fair classifier deserves its place in benchmarking tests. Personally I’d be curious to see if this can do as well or possibly even better than a method that targets a particular group. Time for some ‘pip install fairness‘ :). But I’m also curious as to whether the methods described here would work if the classifier and groups were not simple linear functions and thresholds.
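As a crude illustration of what auditing over many subgroups involves (the papers in question search over linear threshold functions with a learning oracle; here I just brute-force conjunctions of attribute values):

```python
from itertools import product
import numpy as np

def worst_subgroup_gap(y_pred, attrs):
    """Largest deviation of any subgroup's positive rate from the overall rate,
    over all conjunctions of attribute values.

    attrs maps attribute name -> array of that attribute's value per individual.
    """
    overall = y_pred.mean()
    worst_gap, worst_group = 0.0, None
    names = list(attrs)
    for combo in product(*(np.unique(attrs[name]) for name in names)):
        mask = np.ones(len(y_pred), dtype=bool)
        for name, value in zip(names, combo):
            mask &= attrs[name] == value
        if not mask.any():
            continue
        gap = abs(y_pred[mask].mean() - overall)
        if gap > worst_gap:
            worst_gap, worst_group = gap, dict(zip(names, combo))
    return worst_gap, worst_group
```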

FAT* Papers: Systems and Measurement

I’ve made it to Session 2 of my series of posts on the FAT* conference.

If you build it they will come. 

How should we build systems that incorporate all that we've learnt about fairness, accountability and transparency? How do we go from saying "this is a problem" to saying "here's a solution"?

Three of the four papers in this session seek (in different ways) to address this question, focusing on both the data and the algorithms that make up an ML model.

Beyond Open vs. Closed: Balancing Individual Privacy and Public Accountability in Data Sharing

The paper by Meg Young and friends from UW makes a strong argument for the idea of a data trust. Recognizing that we need good data to drive good policy and to evaluate technology, and also recognizing that there are numerous challenges — privacy, fairness, and accountability — around providing such data, not to mention issues with private vs public ownership, they present a case study of a data trust built with academics as the liaison between private and public entities that might want to share data and (other) entities that might want to make use of it.

They bring up a number of interesting technology ideas from the world of privacy and fairness: differential privacy to control data releases, causal modeling to generate real-ish data for use in analysis and so on. They argue that this is probably the best way to reconcile legal and commercial hurdles over data sharing and is in fact the responsible way to go.

Takeaway: I think this is an interesting proposal. To some extent the devil is in the details and the paper left me wanting more, but they have a great proof of concept to check out. While I might be quite happy with academics being the “trusted escrow agent”, I wonder if that’s always the best option?

Of course it’s not just data you need governance for. What about models?

Model Cards for Model Reporting

This is one in a series of papers coming up right now that tackle the problem of model portability. How do I know if my model is being applied out of context with unpredictable results?

The solution from Margaret Mitchell et al (9 authors and counting…!) is to attach a model “spec sheet” to a trained model. The spec sheet would give you important documentation about the model — how it was trained, with what training regime, what data, what error rates and so on — in the hope that when the model is applied elsewhere, the spec sheet will prevent you from taking it out of context.

Takeaway: This is again a creative use of the idea of 'user scripts' as a way to carry context. I wondered when I first read it whether it makes sense to talk about a model spec in the abstract without looking at a specific domain like some papers have done. I think the jury is still out (i.e. "more research needed") on whether model spec sheets can be useful in full generality or whether we need the right "level of abstraction" to make them usable, but this is an interesting direction to explore.
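For a flavor of the kind of metadata such a spec sheet might carry, here's a very rough sketch; the field names and values are my own guesses, not the paper's template.

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    # Illustrative fields only -- not the paper's official schema.
    model_name: str
    intended_use: str
    training_data: str
    evaluation_data: str
    metrics: dict = field(default_factory=dict)   # e.g. error rates broken out by group
    caveats: list = field(default_factory=list)   # known out-of-context risks

card = ModelCard(
    model_name="toy-occupation-classifier",
    intended_use="research benchmarking only",
    training_data="hypothetical web biographies corpus",
    evaluation_data="held-out 20% split of the same corpus",
    metrics={"accuracy": 0.81, "tpr_gap_by_gender": 0.12},
    caveats=["trained on English text only", "not validated for hiring decisions"],
)
```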

But instead of building data trusts or model specs, how about instrumenting your code itself with certificates of fairness that can be monitored while the code runs?

Fairness-Aware Programming

This paper by Albarghouthi and Vinitsky takes a programming-language perspective on building fair classifiers. Suppose we could annotate our code with specifications describing what properties a classifier must satisfy and then have tools that ensure that these specifications are satisfied while the code runs[1]. That would be pretty neat!

That's basically what they do in this paper, by taking advantage of Python decorators to encode desired fairness specs. They show how to capture notions like disparate impact and demographic parity and even weak forms of individual fairness. One thought I did have though: they are essentially trying to encode the ability to verify probabilistic statements, and I wonder if it might be easier to do this in one of the new and shiny probabilistic programming languages out there? Granted, Python is a more mainstream language (uh-oh, the PL police will be after me now).
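This is not their annotation syntax, but a minimal sketch of the flavor: a Python decorator that watches a classifier's decisions at run time and flags a demographic parity violation.

```python
import functools

def check_demographic_parity(get_group, tolerance=0.1, window=1000):
    """Wrap a classifier and monitor positive rates per group as it runs.

    get_group extracts the protected group from the classifier's input.
    A sketch of the idea only, not the paper's actual decorator API.
    """
    def decorator(classify):
        counts = {}   # group -> (positives, total)

        @functools.wraps(classify)
        def wrapped(x):
            decision = classify(x)
            g = get_group(x)
            pos, tot = counts.get(g, (0, 0))
            counts[g] = (pos + int(decision), tot + 1)
            seen = sum(t for _, t in counts.values())
            if seen % window == 0 and len(counts) > 1:
                rates = [p / t for p, t in counts.values()]
                if max(rates) - min(rates) > tolerance:
                    print("warning: demographic parity violated:", counts)
            return decision
        return wrapped
    return decorator
```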

I know who you are, but what am I? 

It’s great to build systems that draw on the tech we’ve developed over the last many years. But there’s still more to learn about the ongoing shenanigans happening on the internet.

Who’s the Guinea Pig? Investigating Online A/B/n Tests in-the-Wild

You don’t want to mess with Christo Wilson. He can sue you — literally. He’s part of an ACLU lawsuit against the DoJ regarding the CFAA and its potential misuse to harass researchers trying to audit web systems. In this paper Shan Jiang, John Martin and Christo turn their audit gaze to online A/B testing.

If you've ever compulsively reloaded the New York Times, you'll notice that the headlines of articles will change from time to time. Or you'll go to a website and not see the same layout as someone else. This is because (as the paper illustrates) major websites are running experiments… on you. They are doing A/B testing of various kinds, potentially to experiment with different layouts, or potentially even to show you different kinds of content depending on who you are.

The paper describes an ingenious mechanism to reveal when a website is using A/B testing and determine what factors appear to be going into the decision to show particular content. The experimental methodology is a lot of fun to read.

While the authors are very careful to point out repeatedly that they find no evidence of sinister motives behind the use of A/B testing in the wild, the fact remains that we are being experimented on constantly without any kind of IRB protection (*cough* facebook emotional contagion *cough*). It’s not too far a leap to realize that the quest for “personalization” might mean that we eventually have no shared experience of the internet and that’s truly frightening.

And that’s it for now. Stay tuned for more…

Footnotes:

  1. An earlier version of this note had incorrectly described the paper as doing static verification instead of run-time verification.

FAT* Papers: Framing and Abstraction

The FAT* Conference is almost upon us, and I thought that instead of live-blogging from the conference (which is always exhausting) I’d do a preview of the papers. Thankfully we aren’t (yet) at 1000 papers in the proceedings, and I can hope to read and say something not entirely stupid (ha!) about each one.

I spent a lot of time pondering how to organize my posts, and then realized the PC chairs had already done the work for me, by grouping papers into sessions. So my plan is to do a brief (BRIEF!) review of each session, hoping to draw some general themes. (ed: paper links will yield downloadable PDFs starting Tuesday Jan 29)

And with that, let’s start with Session 1: Framing and abstraction.

Those who do not learn from history are doomed to repeat it. — George Santayana

Those who learn from history are doomed to repeat it. — twitter user, about machine learning.

50 Years of Test (Un)fairness: Lessons for Machine Learning

Ben Hutchinson and Margaret Mitchell have a fascinating paper on the history of discourse on (un)fairness. It turns out that dating back to the 60s, and in the context of standardized testing, researchers have been worried about the same issues of bias in evaluation that we talk about today.

I’m actually understating the parallels. It’s uncanny how the discussion on fairness and unfairness evolved precisely in the way it’s evolving right now. Their paper has a beautiful table that compares measures of fairness then and now, and the paper is littered with quotes from early work in the area that mirror our current discussions about different notions of fairness, the subtleties in error rate management across racial categories, the concerns over construct validity, and so on. It was a humbling experience for me to read this paper and realize that indeed, everything old is new again.

Takeaways: Among the important takeaways from this paper are:

  • The subtle switch from measuring UNfairness to measuring fairness caused the field to eventually wither away. How should we return to studying UNfairness?
  • If we can’t get notions of fairness to align with public perceptions, it’s unlikely that we will be able to get public policy to align with our technical definitions.
  • It’s going to be important to encode values explicitly in our systems.

which is a useful segue to the next paper:

Assume a spherical cow, …. or a rational man.

Problem Formulation and Fairness

Samir Passi and Solon Barocas present the results of an ethnographic study into the (attempted) deployment of a predictive tool to decide which potential car buyers (leads) to send to dealers. The company building the tool sells these leads to dealers, and so wants to make sure that the quality of the leads is high.

This is a gripping read, almost like a thriller. They carefully outline the choices the team (comprising business analysts, data scientists and managers) makes at each stage, and how they go from the high level goal “send high quality leads to dealers” to a goal that is almost ML-ready: “find leads that are likely to have credit scores above 500”. As one likes to say, the journey is more important than the destination, and indeed the way in which the goals get concretized and narrowed based on technical, business and feasibility constraints is both riveting and familiar to anyone working in (or with) corporate data science.

Takeaways: The authors point out that no actual problem in automated decision-making is a pure classification or regression problem. Rather, people (yes, PEOPLE) make a series of choices that narrow the problem space down to something that is computationally tractable. And it's a dynamic process where data constraints as well as logistical challenges constrain the modeling. At no point in time do ethical or normative concerns surface, but the sequence of choices made clearly has an overall effect that could lead to disparate impact of some kind. They argue, correctly, that we spend far too little time paying attention to these choices and the larger role of the pipeline around the (tiny) ML piece.

which is an even nicer segue to the last paper:

Context matters, duh!

Fairness and Abstraction in Sociotechnical Systems

This is one of my papers, together with Andrew Selbst, danah boyd, Sorelle Friedler and Janet Vertesi. And our goal was to understand the failure modes of Fair-ML systems when deployed in a pipeline. The key thesis of our paper is: Abstraction is a core principle in a computer system, but it’s also the key point of failure when dealing with a socio-technical system. 

We outline a series of traps that fair-ML papers fall into even while trying to design what look like enlightened decision systems. You’ll have to read the paper for all the details, but one key trap that I personally struggle with is the formalization trap: the desire to nail down a formal specification that can be then optimized. This is a trap because the nature of the formal specification can be contested and evolve from context to context (even within a single organization, pace the paper above) and a too-hasty formalization can freeze the goals in a way that might not be appropriate for the problem. In other words, don’t fix a fairness definition in stone (this is important: I’m constantly asked by fellow theoryCS people what the one true definition of fairness is — so that they can go away and optimize the heck out of it).

Session Review:

When I read these three papers in sequence, I feel a black box exploding open revealing its messy and dirty inner workings. Ever since Frank Pasquale's The Black Box Society came out, I've felt a constant sentiment from non-technical people (mostly lawyers/policy people) that the goal should be to route policy around the black box "AI" or "ML". My contention has always been that we can't do that: that understanding the inner workings of the black box is crucial to understanding both what works and what fails in automated decision systems. Conversely, technical people have been loath to engage with the world OUTSIDE the black box, preferring to optimize our functions and throw them over the fence, Anathem-style.

I don’t think either approach is viable. Technical designers need to understand (as AOC clearly does!) that design choices that seem innocuous can have major downstream impact and that ML systems are not plug-and-play. But conversely those that wish to regulate and manage such systems need to be willing to go into the nitty gritty of how they are built and think about regulating those processes as well.

P.S

Happy families are all alike; every unhappy family is unhappy in its own way. — Tolstoy

There  is one way to be fair, but many different ways of being unfair.

Every person with good credit looks the same: but people have bad credit for very different reasons.

There might be more to this idea of "looking at unfairness" vs "looking at fairness". As I had remarked to Ben and Margaret a while ago, it has the feel of an NP vs co-NP question 🙂 – and we know that we don't know if they're the same.


On the new PA recidivism risk assessment tool

(Update: apparently as a result of all the pushback from activists, the ACLU and others, the rollout of the new tool has been pushed back at least 6 months)

The Pennsylvania Commission on Sentencing is preparing a new risk assessment tool for recidivism to aid in sentencing. The mandate for the commission (taken from their report — also see the detailed documentation at their site) is to (emphasis all mine):

adopt a Sentence Risk Assessment Instrument for the sentencing court to use to help determine the appropriate sentence within the limits established by law…The risk assessment instrument may be used as an aide in evaluating the relative risk that an offender will reoffend and be a threat to public safety.” (42 Pa.C.S.§2154.7) In addition to considering the risk of re- offense and threat to public safety, Act 2010-95 also permits the risk assessment instrument to be used to determine whether a more thorough assessment is necessary, or as an aid in determining appropriate candidates for alternative sentencing (e.g., County Intermediate Punishment, State Intermediate Punishment, State Motivational Boot Camp, and Recidivism Risk Reduction Incentive).

I was hired by the ACLU of Pennsylvania to look at the documentation provided as part of this new tool and see how they built it. I submitted a report to them a little while ago.

The commission is running public hearings to take comments and I thought I’d highlight some points, especially focusing on what I think are important “FAT*” notions for any data science project of this kind.

What is the goal of the predictor?

When you build any ML system, you have to be very careful about deciding what it is that you want to predict. In PA’s model, the risk assessment tool is to be used (by mandate) for determining

  • reoffense likelihood
  • risk to public safety

Note that these are not the same thing! Using a single tool to predict both, or using its predictions to make assessments about both, is a problem.

How is this goal being measured?

You have to dig into the reports to see this (page 6): they measure recidivism as

re-arrest for a felony or misdemeanor in Pennsylvania within three years of imposition of a sentence to the community or within three years of release from confinement; or, for offenders sentenced to state prison, a recommitment to the Department of Corrections for a technical violation within three years of release from confinement.

How does the predictor goal match the measured goal?

Here’s where it gets interesting. I’m not at all clear how “risk to public safety” is measured by re-arrests. Moreover, using re-arrest as a proxy for reoffense is a big potential red flag, if we are concerned about excessive policing issues as well as patterns that target minorities. As a matter of fact, a 2013 recidivism report by Pennsylvania (Thanks to Nyssa Taylor at ACLU-PA for finding this) says (page 17) that re-arrest rates are highest for African-Americans, whereas reincarceration rates are more evenly balanced by race.

Notice also that technical violations of parole are included in measurements of recidivism. Setting aside the question of whether any technical violation of parole amounts to a risk to public safety, it's known, for example, that in pre-trial risk assessment, failure to appear in court occurs for many reasons that often correlate more with poverty (and the inability to take time off to appear in court) than with actual flight risk.

It’s not clear what a technical violation of parole might constitute and whether there are race biases in this calculation. Note that since this is aggregated into a predicted value, it doesn’t undergo the more detailed nondiscrimination analysis that other features do.

Separately, I'll note that the PA SC did discover that as a feature, prior arrests carry a race bias that is not mitigated by predictive efficacy, and therefore decided to replace it by prior convictions.

How is the predictor being used?

What’s interesting about this tool is that its output is converted (as usual) into a low, medium or high risk label. But the tool is only used when the risk is deemed either low or high. This determination then triggers further reports. In the case when it returns a medium risk, the tool results are not passed on.

What I didn't see is how the predictor guides a decision towards alternate sentencing, and whether a single predictor for "risk of recidivism" is sufficient to determine the efficacy of alternate interventions (Narrator: it probably isn't).

Coda

There are many interesting aspects of the tool building process: how they essentially build a different tool for each of 10 different crime categories, how they decided to group categories together, and how they decided to use a different model for crimes against a person. The models used are all logistic regression, and the reports provide the variables that end up in each model, as well as the weights.

But to me, the detailed analysis of the effectiveness of the tool and which variables don’t carry racial bias miss some of the larger issues with how they even decide what the “labels” are.

Benchmarks and reproducibility in fair ML

These days, there are lots of fairness-aware classification algorithms out there. This is great! It should mean that for any task you want to pursue you can try out a bunch of fair classifiers and pick the one that works best on your dataset under the fairness measure you like most.

Unfortunately, this has not been the case. Even in the cases where code is available, the preprocessing of a specific data set is often wrapped into the algorithm, making it hard to reuse the code and hard to see what the impact of different preprocessing choices is on the algorithm. Many authors have used the same data sets, but preprocessed in different ways and evaluated under different metrics. Which one is the best?

In an effort to address some of these questions, we’ve made a repository and written an accompanying paper detailing what we’ve found.

http://github.com/algofairness/fairness-comparison

We’ve made our best effort to include existing algorithms and represent them correctly, but if you have code that we’ve missed or see something we’ve messed up, we hope you’ll submit a pull request or just shoot us an email.

Some highlights…

Metrics: There are so many fairness metrics! Or are there? We find that a lot of them are correlated on the algorithms and datasets we looked at. In fact, there are two groups: disparate-impact-like measures and class-sensitive error measures. And accuracy measures are not a distinct group! They correlate with the class-sensitive error measures. So perhaps fairness-accuracy tradeoffs are only an issue with disparate-impact-like measures.

[Figure: correlation between fairness measures across algorithms and datasets]
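For concreteness, the two families of measures compute things roughly like the following (a sketch, not the repository's exact implementations):

```python
import numpy as np

def disparate_impact(y_pred, protected):
    """Disparate-impact-like measure: ratio of positive-outcome rates between groups."""
    return y_pred[protected == 1].mean() / y_pred[protected == 0].mean()

def tpr_difference(y_true, y_pred, protected):
    """Class-sensitive error measure: gap in true positive rates between groups."""
    def tpr(group):
        mask = (protected == group) & (y_true == 1)
        return y_pred[mask].mean()
    return tpr(0) - tpr(1)
```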

Stability: We look at the stability of the algorithms and metrics over multiple random splits on a given measure by taking its standard deviation.  Here’s a cool graph based on that analysis showing disparate impact versus accuracy.

[Figure: disparate impact vs. accuracy on the Adult dataset, with race as the sensitive attribute]

We think it’s easier to understand the relative performance of algorithms taking this into account.

Preprocessing: Given the same algorithm on the same data set, you can end up with different — potentially very different — outcomes depending on small preprocessing variations, such as whether a protected race attribute is represented as all the possible values or, e.g., as white and not-white.

[Figure: effect of preprocessing variations on the fairness-accuracy tradeoff]

Tradeoffs: For the measures for which we found a fairness-accuracy tradeoff, different algorithms choose different parts of the tradeoff.

So which algorithm is best? As perhaps is not surprising, no one algorithm dominates over all data sets.

There’s a larger ongoing discussion about reproducibility in machine learning. This is our contribution in the fairness world.


Models need doubt: the problematic modeling behind predictive policing

Predictive policing describes a collection of data-driven tools that are used to determine where to send officers on patrol on any given day. The idea behind these tools is that we can use historical data to make predictions about when and where crime will happen on a given day and use that information to allocate officers appropriately.

On the one hand, predictive policing tools are becoming ever more popular in jurisdictions across the country. They represent an argument based on efficiency: why not use data to model crime more effectively and therefore provision officers more usefully where they might be needed?

On the other hand, critiques of predictive policing point out that a) predicting crimes based on arrest data really predicts arrests and not crimes and b) by sending officers out based on predictions from a model and then using the resulting arrest data to update the model, you’re liable to get into a feedback loop where the model results start to diverge from reality.

This was empirically demonstrated quite elegantly by Lum and Isaac in a paper last year, using simulated drug arrest data in the Oakland area as well as an implementation of a predictive policing algorithm developed by PredPol (the implementation was based on a paper published by researchers associated with PredPol). For further discussion on this, it’s worth reading Bärí A. Williams’ op-ed in the New York Times, a response to this op-ed by Andrew Guthrie Ferguson (who’s also written a great book on this topic) and then a response by Isaac and Lum to his response.

Most of the discussion and response has focused on specifics of the kinds of crimes being recorded and modeled and the potential for racial bias in the outcomes.

In our work, we wanted to ask a more basic question: what’s the mechanism that makes feedback affect the predictions a model makes? The top-line ideas emerging from our work (two papers that will be published at the 1st FAT* conference and at ALT 2018) can be summarized as:

Biased observations can cause runaway feedback loops.  If police don't see crime in a neighborhood because the model told them not to go there, the model never gets the data that would correct its predictions for that neighborhood.

Over time, such models can generate predictions of crime rates that (if used to decide officer deployment) will skew the data used to train the next iteration of the model. Since models might be run every day (and were done so in at least one published work describing PredPol-like algorithms), this skew might take hold quickly.

But this is still speculation. Can we mathematically prove that this will happen? The answer is yes, and this is the main contribution in our paper to appear at FAT*. By modeling the predictive process with a generalization of a Pólya urn, we can mathematically prove that the system will diverge out of control, to the extent that if two areas have even slightly different crime rates, a system that used predictive modeling to allocate officers, collect the resulting observational data and retrain the model will progressively put more and more emphasis on the area with the slightly higher crime rate.
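Here's a toy version of the urn dynamic (not the paper's exact model: I've made up the two crime rates and exaggerated their gap so the effect shows up in a short run; the paper proves divergence for arbitrarily small differences):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two regions; B has more crime, but not dramatically more.
true_rates = {"A": 0.1, "B": 0.2}
observed = {"A": 1, "B": 1}          # urn counts: incidents recorded so far

for day in range(50_000):
    total = observed["A"] + observed["B"]
    # Deploy the (single) patrol in proportion to the incidents recorded so far,
    # i.e., according to the model's current belief about where crime is.
    target = "A" if rng.random() < observed["A"] / total else "B"
    # Crucially, incidents are only recorded where the patrol actually goes.
    if rng.random() < true_rates[target]:
        observed[target] += 1

share_B = observed["B"] / (observed["A"] + observed["B"])
print(f"B's true share of crime: {0.2 / 0.3:.2f}, B's share of the data: {share_B:.2f}")
# In typical runs B's share of the recorded data drifts well past its true share
# and keeps heading toward 1: the model's own deployments manufacture the
# evidence that justifies sending ever more patrols there.
```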

Moreover, we can see this effect in simulations of real-world predictive policing deployments using the implementation of PredPol used by Lum and Isaac in their work, providing justification for our mathematical model.

Now let’s take a step back. If we have a model that exhibits runaway feedback loops, then we might try to fix the model to avoid such bad behavior. In our paper, we show how to do that as well. The intuition here is quite simple. Suppose we have an area with a very high crime rate as estimated by our predictive model. Then observing an incident should not surprise us very much: in fact, it’s likely that we shouldn’t even try to update the model from this incident. On the other hand, the less we expect crime to happen, the more we should be surprised by seeing an incident and the more willing we should be to update our model.

This intuition leads to a way in which we can take predictions produced by a black box model and tweak the data that is fed into it so that it only reacts to surprising events. This then provably yields a system that will converge to the observed crime rates. And we can validate this empirically again using the PredPol-inspired implementation. What our experiments show is that such a modified system does not exhibit runaway feedback loops.
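Continuing the toy simulation above, a sketch of this correction might look like the following: an observed incident only enters the data with a probability that falls as the model's current belief about that region rises (the paper derives the exact reweighting from the urn analysis; this is just the shape of the idea).

```python
import numpy as np

rng = np.random.default_rng(2)
true_rates = {"A": 0.1, "B": 0.2}      # same toy setup as the sketch above
observed = {"A": 1, "B": 1}

for day in range(50_000):
    total = observed["A"] + observed["B"]
    belief_A = observed["A"] / total            # model's current belief about A
    target = "A" if rng.random() < belief_A else "B"
    if rng.random() < true_rates[target]:
        belief = belief_A if target == "A" else 1 - belief_A
        # Only record the incident with probability 1 - belief: the more the model
        # already expected crime here, the less surprising (and the less
        # informative) this observation is.
        if rng.random() < 1 - belief:
            observed[target] += 1

share_B = observed["B"] / (observed["A"] + observed["B"])
print(f"B's share of the data with surprise-filtering: {share_B:.2f}")
# With this damping, B's share should hover near its true share of crime
# (0.2 / 0.3 ≈ 0.67) instead of running off toward 1.
```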

A disclaimer: in the interest of clarity, I’ve conflated terms that in reality should be distinct: an incident is not an arrest is not a crime. And it can’t always be assumed that just because we don’t send an officer to an area that we don’t get any information about incidents (e.g via 911 calls). We model these issues more carefully in the paper, and in fact show that as the proportion of “reported incidents” (i.e those not obtained as a consequence of model-directed officer patrols) increases, model accuracy increases in a predictable and quantifiable way if we assume that those reported incidents accurately reflect crime.  This is obviously a big assumption, and the extent to which different types of incidents reflect the underlying ground truth crime rate likely differs by crime and neighborhood – something we don’t investigate in our paper but believe should be a priority for any predictive policing system.

From the perspective of machine learning, the problem here is that the predictive system should be an online learning algorithm, but is actually running in batch mode. That means that it is unable to explore the space of possible models and instead merely exploits what it learns initially.

What if we could redesign the predictive model from scratch? Could we bring in insights from online learning to do a better job? This is the topic of our second paper and the next post. The short summary I’ll leave you with is that by carefully modeling the problem of limited feedback, we can harness powerful reinforcement learning frameworks to design new algorithms with provable bounds for predictive policing.