Me (in the hallway at FAT*): Hi
[person]: Oh hi, how’re you doing?
[person]: So… when’s the next post going to be up?
Which brings us to Session 3.
Kate Crawford gave a talk at NIPS (NeurIPS?) 2017 on harms of representation that has had a profound influence on my thinking about fairness. We’re all familiar with harms that come from biased decision making — harms of allocation — but it’s a little harder to discuss what it means to face harm from a skew in representation.
A few years ago we saw a series of papers that demonstrated that standard representations of text using methods like Word2Vec and GloVe could encode biases in the training corpora. But can we connect these harms directly to harms of allocation? In other words, to what extent can we attribute a harm of allocation to a skewed representation rather than distributional bias or bad metrics?
Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting
This paper by De-Arteaga et al. explores this question in the context of resume screening. Suppose we could construct a high-dimensional representation of the text in someone’s CV, and suppose we built an occupation predictor that takes this representation as input. Would the behavior of the predictor differ if we scrubbed the CV of obvious gender markers? Specifically, if we define “behavior of the predictor” as the difference in true positive rates between genders, how does this number look with and without scrubbing?
They run this test with three different representations for CV data scraped from the web and cleaned up. In each instance, the first sentence provides the occupation label and the rest of the text is the source for the derived representation. Gender scrubbing is performed by removing obviously gendered pronouns as well as first names.
The main upshot is not too surprising: the worse the gender imbalance in the training data for an occupation, the more skewed the behavior of the predictor (the TPR difference between men and women). Scrubbing gender markers helps alleviate some of this, but not all. What’s interesting is that, using the standard information-theoretic trick of “can I predict gender from the representation”, they can show that even the scrubbed representation still has some latent gender coding, which explains why scrubbing doesn’t perfectly eliminate biased error rates.
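To make the “TPR difference between genders” measurement concrete, here is a minimal sketch of how one might compute it per occupation. The function name, the `(true_occupation, predicted_occupation, gender)` record layout, the "F"/"M" labels and the toy data are all my own assumptions for illustration; none of this is the authors’ code or data.

```python
from collections import defaultdict

def tpr_gap_by_occupation(records):
    """records: iterable of (true_occupation, predicted_occupation, gender).

    Returns {occupation: TPR_female - TPR_male} for every occupation that
    has at least one true example of each gender. (Hypothetical helper.)
    """
    hits = defaultdict(int)    # (occupation, gender) -> correct predictions
    totals = defaultdict(int)  # (occupation, gender) -> true examples
    for true_occ, pred_occ, gender in records:
        totals[(true_occ, gender)] += 1
        if pred_occ == true_occ:
            hits[(true_occ, gender)] += 1

    gaps = {}
    for occ in {o for o, _ in totals}:
        n_f, n_m = totals[(occ, "F")], totals[(occ, "M")]
        if n_f and n_m:
            gaps[occ] = hits[(occ, "F")] / n_f - hits[(occ, "M")] / n_m
    return gaps

# Made-up toy predictions, purely illustrative.
records = [
    ("surgeon", "surgeon", "M"), ("surgeon", "surgeon", "M"),
    ("surgeon", "surgeon", "M"), ("surgeon", "nurse", "M"),
    ("surgeon", "surgeon", "F"), ("surgeon", "nurse", "F"),
    ("nurse", "nurse", "F"), ("nurse", "nurse", "F"),
    ("nurse", "nurse", "M"), ("nurse", "surgeon", "M"),
]
gaps = tpr_gap_by_occupation(records)
print(gaps["surgeon"])  # -0.25: women who are surgeons are recognized less often
print(gaps["nurse"])    # 0.5: the gap flips in the female-dominated occupation
```

Running the same function on predictions made from scrubbed and unscrubbed text, and comparing the two sets of gaps, is the shape of the comparison described above.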
I think this paper is a great addition to the growing study of harms of representation. At the risk of tooting my own (student’s) horn, my student Mohsen Abbasi (together with the invincible monotonically length-increasing triumvirate of Friedler/Scheidegger/Venkatasubramanian) has a paper accepted to the 2019 SIAM Conference on Data Mining that looks at harms of representation more closely from the perspective of stereotype formation.
Those who vote decide nothing. Those who count the vote decide everything. — Joseph Stalin.
Equality of Voice: Towards Fair Representation in Crowdsourced Top-K Recommendations
This paper by Chakraborty et al. is not quite about representations: rather it’s about diversity of results in recommendations. The problem is thus: if we want to aggregate recommendations from users (say for popular news topics to be fed into a “trending news” timeline), we have to be able to use some kind of voting scheme where users “vote” on news by either retweeting, or sharing, or engaging positively in some way.
But there are very few votes in the system: in the parlance of linear algebra, the matrix of users and news items is very sparse. In such a setting, it’s going to be highly unreliable to vote for winners, because most articles will only get a few votes, and a well-organized minority could strategically manipulate the ranking to be non-representative (ed: C’mon Suresh, why you being all doom and gloom – no way this can happen *cough*4chan*cough*).
Trick #1: use ranked choice voting instead of simply counting. User activity creates a ranked list of preferences and these can be aggregated using ranked choice voting to defeat strategic manipulation. But that requires a dense user-topic matrix!
Trick #2: In order to fill out the matrix, predict the missing entries for each user using personalized ranking predictors. For some news sites you can do this by inferring a ranking from how long each user spends on the page.
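For a feel of how aggregating ranked ballots differs from simple counting, here is a rough sketch of one common ranked-choice rule for picking K winners, single transferable vote with a Droop quota. Treat the choice of rule, the tie-breaking (alphabetical) and the toy ballots as my assumptions; the paper’s exact variant may differ, and the sketch assumes every ballot is already a complete ranking, which is what Trick #2 is meant to supply.

```python
from collections import Counter

def stv_top_k(ballots, k):
    """ballots: list of ranked lists, most preferred first. Returns k winners."""
    candidates = {c for b in ballots for c in b}
    quota = len(ballots) // (k + 1) + 1   # Droop quota
    weights = [1.0] * len(ballots)        # remaining weight of each ballot
    elected, eliminated = [], set()

    while len(elected) < k:
        active = candidates - set(elected) - eliminated
        if len(active) <= k - len(elected):
            # Too few candidates left to eliminate anyone: all remaining win.
            elected.extend(sorted(active))
            break
        # Each ballot counts for its highest-ranked candidate still in play.
        tally = {c: 0.0 for c in active}
        top = []
        for ballot, w in zip(ballots, weights):
            choice = next((c for c in ballot if c in active), None)
            top.append(choice)
            if choice is not None:
                tally[choice] += w
        leader = max(sorted(active), key=lambda c: tally[c])
        if tally[leader] >= quota:
            # Elect the leader; their ballots carry on with only the surplus.
            elected.append(leader)
            keep = (tally[leader] - quota) / tally[leader]
            weights = [w * keep if t == leader else w
                       for w, t in zip(weights, top)]
        else:
            # Nobody reaches the quota: drop the weakest candidate.
            eliminated.add(min(sorted(active), key=lambda c: tally[c]))
    return elected

# Toy electorate: a bloc of 6 users likes A, B, C; 4 other users like X, Y.
ballots = [["A", "B", "C"]] * 6 + [["X", "Y"]] * 4

# Simple counting: every mention is a vote, so the bloc takes both slots.
counts = Counter(c for b in ballots for c in b)
naive = sorted(counts, key=lambda c: (-counts[c], c))[:2]
print(naive)                  # ['A', 'B']
print(stv_top_k(ballots, 2))  # ['A', 'X']
```

In this toy example the coordinated bloc sweeps the top-2 under plain counting, while the transferable-vote tally spends the bloc’s votes on A and leaves the second slot to the other group.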
There are things that make me a little uneasy in the framing and execution of this work. For one, while it’s nice to imagine that the top-K recommendations are flawed due to ineffective voting schemes, I suspect the truth (involving a desperate play for engagement) is far more depressing. And so I’m not convinced that fixing the errors in tallying votes really addresses the problems with recommendations. Secondly, it’s not clear to me how to generalize Trick #2 to other media: for example, the method used to infer interest in Twitter hashtags is rather baroque.
I know what I know. — Paul Simon
Every now and then you have a very bittersweet experience reading a paper. It captures thoughts that have been in your head for ages, and that you should have written down in a paper. But you know that there’s no way you could have written down the thoughts in the way this paper does it.
The Profiling Potential of Computer Vision and the Challenge of Computational Empiricism
Hewing close to the interdisciplinary soul of the conference, this paper by Jake Goldenfein is a rumination on how we know what we know, through the lens of computer vision.
There’s always been a bit of epistemic jujitsu at the heart of our modern ML-enhanced world. Especially when applied to social phenomena, the promise of ML is that it can make knowable through LEARNING AND DATA things about ourselves that we could not have gleaned from other methods. That is to say, a deep learning system that claims to predict the risk of recidivism carries within itself epistemic content that is irreducible.
This paper brings some much needed context to this idea, pointing out the long history of using measurement and quantification as a privileged way to represent knowledge, and arguing that our current obsession with facial recognition draws on this history, with the extra irony that rather than actually looking at the face, we encode it as a vector before processing it. He argues that in order to think about legal remedies, we have to focus less on the technical limitations of facial recognition, and more on the claim that there are purely computational ways to know. And maybe we should legally disallow such claims.
Let A be the group of all people who are not members of a group…. — Bertrand Russell’s FAT* ghost.
It’s one thing to define fairness with respect to a single protected variable. But what if you don’t want to specify the variable? Maybe you’re concerned about intersectionality issues and realize that merely protecting against different treatments on race and gender doesn’t guarantee protection with respect to race-and-gender. Or maybe you actually don’t yet know what groups are likely to be treated unfairly? A series of papers last year explored the idea of defining fairness with respect to all groups that can be expressed through a “simple” predicate (where “simple” has a technical meaning that I won’t get into here).
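The race-and-gender point is easy to see with a toy audit. The sketch below computes the rate at which a classifier flags people in every subgroup defined by a conjunction of attribute values. Conjunctions are just my stand-in for “simple” predicates here, and the function name, the `flagged` field and the made-up counts are all assumptions for illustration.

```python
from itertools import combinations, product

def subgroup_rates(rows, attrs):
    """rows: list of dicts holding attribute values plus a boolean 'flagged'.

    Returns {subgroup: fraction flagged} for every conjunction of attribute
    values, e.g. (("race", "a"),) or (("race", "a"), ("gender", "m")).
    """
    rates = {}
    for r in range(1, len(attrs) + 1):
        for chosen in combinations(attrs, r):
            values = [sorted({row[a] for row in rows}) for a in chosen]
            for combo in product(*values):
                members = [row for row in rows
                           if all(row[a] == v for a, v in zip(chosen, combo))]
                if members:
                    key = tuple(zip(chosen, combo))
                    rates[key] = sum(row["flagged"] for row in members) / len(members)
    return rates

# Made-up data: 10 people per (race, gender) cell; the value is how many
# of them the classifier flags.
flag_counts = {("a", "m"): 4, ("a", "f"): 0, ("b", "m"): 0, ("b", "f"): 4}
rows = [{"race": race, "gender": gender, "flagged": i < n_flagged}
        for (race, gender), n_flagged in flag_counts.items()
        for i in range(10)]

rates = subgroup_rates(rows, ["race", "gender"])
print(rates[(("race", "a"),)], rates[(("race", "b"),)])      # 0.2 0.2
print(rates[(("gender", "m"),)], rates[(("gender", "f"),)])  # 0.2 0.2
print(rates[(("race", "a"), ("gender", "m"))])               # 0.4
print(rates[(("race", "a"), ("gender", "f"))])               # 0.0
```

Every single-attribute group is flagged at the same rate, so an audit on race alone or gender alone passes, yet the intersections range from 0.0 to 0.4. With more attributes the number of conjunctions blows up quickly, which is part of why auditing over a whole class of groups is a nontrivial problem.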
An Empirical Study of Rich Subgroup Fairness for Machine Learning
This paper takes those ideas and empirically evaluates them. And now time for a confession. I find it difficult to read purely empirical papers. I tend to get drowned in seas of charts and tables without a nice juicy lemma to hang my hat on. And so I’ll admit to some difficulty in pulling together the detailed message from this paper. At the highest level, it establishes that trying to optimize fairness with respect to classes of groups is feasible even with heuristic oracles that solve the classification problem without fairness constraints, and so this method of building a fair classifier deserves its place in benchmarking tests. Personally, I’d be curious to see if this can do as well as, or possibly even better than, a method that targets a particular group. Time for some ‘pip install fairness’ :). But I’m also curious as to whether the methods described here would work if the classifier and groups were not simple linear functions and thresholds.