The conference is over, and I’m more exhausted than I thought I’d be. It was exhilarating. But the job of a paper summarizer never ends, and I am doing this exercise as much for my own edification as anyone else’s 🙂
The theme of this session is a little more spread out, but all the papers are “tools” papers in the classic ML sense: trying to build widgets that can be useful in a more introspective processing pipeline.
The interaction between causality and fairness is getting steadily more interesting. In a sense it’s hard to imagine doing any kind of nondiscriminatory learning without causal models because a) observational data isn’t sufficient to determine bias without some understanding of causality, and b) causal modeling helps us understand where the sources of bias might be coming from (more on this in a later post).
This paper continues a line of thinking that says, “let’s try to posit an explicit causal model that connects protected attributes with outcomes, and learn the parameters of this model from observational data”. The key idea here is: we have observational data X, treatment T and outcome Y. We have a feeling that protected attributes A might be affecting all of these, but it’s not clear how. Let’s assume that the observed data X (and the treatment and outcome) is influenced by A, and some hidden latent features Z that by definition are independent of A. If we can infer Z from observational data then we have a way to measure the effects of A on representation, treatment and outcome.
But how do we learn Z? By using a deep learning architecture that asks that we find Z given X. This is the main technical meat of the paper, and it requires some careful assumptions about confounders. But the “trick”, as it were, is to replace “parameters of a distribution” with “a neural network”, which is a common move in representation learning.
The upshot of all of this is that we can learn structural equations connecting X, Z, A, T and Y, and then do interventions to determine the degree to which A influences treatments and outcomes.
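To make “do interventions” concrete, here is a minimal sketch of a toy structural model. It is not the paper's model: the equations are hand-written rather than learned by a neural network, and every coefficient, along with the function name `simulate`, is made up for illustration. The only things it borrows from the description above are that Z is independent of A, and that X, T and Y depend on both.

```python
import random

def simulate(n, do_a=None, seed=0):
    """Sample a toy structural model; do_a, if given, forces A to that value."""
    rng = random.Random(seed)
    total_x = total_t = total_y = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                  # latent Z, drawn independently of A
        natural_a = 1 if rng.random() < 0.5 else 0
        a = natural_a if do_a is None else do_a  # do(A = a) overrides the natural value
        x = z + 1.5 * a + rng.gauss(0.0, 0.1)    # observed features depend on Z and A
        t = 1 if z - 0.8 * a > 0 else 0          # treatment depends on Z and A
        y = z + 0.5 * t - 0.3 * a + rng.gauss(0.0, 0.1)  # outcome depends on Z, T and A
        total_x += x
        total_t += t
        total_y += y
    return total_x / n, total_t / n, total_y / n

# Same seed in both worlds, so Z and the noise terms are identical
# and only the intervention on A differs.
x0, t0, y0 = simulate(20000, do_a=0)
x1, t1, y1 = simulate(20000, do_a=1)
effect_on_treatment = t1 - t0
effect_on_outcome = y1 - y0
print(effect_on_treatment, effect_on_outcome)
```

In this toy, the effect of A on the outcome has a direct part (the made-up -0.3 coefficient) and an indirect part that flows through the treatment. That is the kind of decomposition that becomes possible once the structural equations are available.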
To decide or not to decide, that is the question.
Risk assessment tools (and in fact many regression-like systems) output “probabilities”: there’s an X% chance that this person will fail to appear in court, etc. But a judge has to make a decision: YES or NO. How we go from “soft decisions” (aka probabilities) to “hard decisions” (aka binary output) is the topic of this paper.
There are a few obvious ways to do this:
- If the probability p > .5, declare YES, else declare NO
- Toss a coin that is YES with probability p, and return the value
Either of these “work” in the sense that the accuracy of the resulting “hard” decision system will not be very different from the soft one. But what about the effect on group decisions? It turns out that one has to be far more careful in that case.
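The two converters listed above are easy to write down. This is just a sketch; the function names are mine, not the paper's.

```python
import random

def threshold_decision(p, cutoff=0.5):
    """Deterministic rule: YES exactly when p exceeds the cutoff."""
    return p > cutoff

def coin_toss_decision(p, rng):
    """Randomized rule: YES with probability p."""
    return rng.random() < p

rng = random.Random(0)
p = 0.3
tosses = [coin_toss_decision(p, rng) for _ in range(10000)]
yes_rate = sum(tosses) / len(tosses)
print(threshold_decision(p), yes_rate)
```

The threshold rule always gives the same answer for the same p, while the coin toss says YES on roughly a p fraction of repeated calls.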
Assume that the soft decision-maker is “well calibrated”, which means that of all the times it gives a probability p of an event occurring, the event actually occurs a p fraction of the time. And assume that the system is also well calibrated for each group of people. The bad news is that (if the base rates for groups are different) there’s still no way to ensure that any converter from soft to hard classification keeps error rates equal across groups.
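A tiny worked example shows the flavor of the problem (one instance, not the paper's general result). The two score distributions below are invented: each group is calibrated by construction, but the base rates differ (0.4 versus 0.6), and the converter is the simple threshold at 0.5.

```python
def error_rates(score_weights, cutoff=0.5):
    """Expected FPR and FNR when thresholding a calibrated score.

    score_weights maps each score s to the fraction of the group receiving it.
    Calibration is assumed: among people scored s, a fraction s have outcome 1.
    """
    fp = sum(w * (1 - s) for s, w in score_weights.items() if s > cutoff)
    fn = sum(w * s for s, w in score_weights.items() if s <= cutoff)
    negatives = sum(w * (1 - s) for s, w in score_weights.items())
    positives = sum(w * s for s, w in score_weights.items())
    return fp / negatives, fn / positives

group_a = {0.2: 0.5, 0.6: 0.5}  # hypothetical group with base rate 0.4
group_b = {0.4: 0.5, 0.8: 0.5}  # hypothetical group with base rate 0.6
fpr_a, fnr_a = error_rates(group_a)
fpr_b, fnr_b = error_rates(group_b)
print(fpr_a, fnr_a, fpr_b, fnr_b)
```

Working it out by hand, group A gets a false positive rate of 1/3 and a false negative rate of 1/4, while group B gets exactly the reverse. Both groups are perfectly calibrated, yet the hard decisions treat them differently.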
The paper goes on to discuss scenarios where this “equality of odds” can still be achieved, and it requires stronger conditions on the initial soft classifier. Another interesting trick they make use of is the idea of “deferring” a decision, or basically saying “I DON’T KNOW”. In that case it is in fact possible to equalize all kinds of error rates across groups, with the caveat that the errors are only measured with respect to inputs for which an answer is given.
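Here is a rough sketch of what measuring errors “only on inputs for which an answer is given” looks like. The deferral band (0.3 to 0.7), the function names and the six records are all hypothetical; the paper's actual deferral scheme is more careful than a fixed band.

```python
def decide(p, low=0.3, high=0.7):
    """Three-way decision: answer only when the score is outside the band."""
    if p >= high:
        return "YES"
    if p <= low:
        return "NO"
    return "DEFER"

def rates_on_answered(records, low=0.3, high=0.7):
    """records holds (probability, true outcome) pairs.

    Returns the FPR and FNR computed over answered inputs only,
    plus the fraction of inputs that were deferred.
    """
    fp = fn = tp = tn = deferred = 0
    for p, outcome in records:
        decision = decide(p, low, high)
        if decision == "DEFER":
            deferred += 1
        elif decision == "YES":
            if outcome == 1:
                tp += 1
            else:
                fp += 1
        else:
            if outcome == 1:
                fn += 1
            else:
                tn += 1
    # Guard against groups with no answered negatives or positives.
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    fnr = fn / (fn + tp) if fn + tp else float("nan")
    return fpr, fnr, deferred / len(records)

records = [(0.9, 1), (0.8, 0), (0.55, 1), (0.45, 0), (0.2, 0), (0.1, 1)]
fpr, fnr, deferral_rate = rates_on_answered(records)
print(fpr, fnr, deferral_rate)
```

Note that the two deferred records simply vanish from the error rates, which is exactly the caveat: whatever happens to those cases downstream is not counted.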
In practice, a tool that says “I don’t know” a lot will essentially cede control back to a human decision maker, which might be fine, but also makes the claims of balanced error rates questionable because the overall system (including the human) might not have this property (remember the framing trap?). Nevertheless, allowing an “I’m not sure” state in an algorithmic process might be a good way to acknowledge when the algorithm isn’t really sure about what to do.
Which is a convenient segue to the next paper:
This paper is squarely in the explanations realm. The idea is to find a signal to encode the degree of confidence in a prediction and also explain where that confidence comes from, by using a small set of exemplars to say “I classified the point this way because I saw these other close by points classified the same way”.
The trick is to use kernel smoothing ideas. Here’s the thought. Suppose I built a classifier for points and, when a new point came along, associated a certainty score with this point. What one might traditionally do is say that the further away from the classification boundary a point is, the more confident we are in the answer.
We don’t want to say that here though. Suppose we get a point in a region we’ve never sampled before, but one that happens to be far from the current classification boundary. Then one of two things could be happening:
- The point is really on the right side with high confidence
- We have no idea what the real label should be, because we never sampled points near the query point and so can’t tell whether the classifier might have been affected.
The only way to distinguish between the two cases is to express the uncertainty in terms of nearby points, and that’s what they propose to do. The details involve kernel regression and are not relevant here, but the idea is that a point that is close to many other points with the same label will have a lower uncertainty score than one that’s far from any training points.
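For intuition only, here is a small sketch of the “uncertainty from nearby points” idea. It is not the paper's method: it uses a plain Gaussian kernel on raw 2D points, and the uncertainty formula `1 / (1 + support)`, the bandwidth, the function names and the toy data are all my own assumptions.

```python
import math

def gaussian_kernel(u, v, bandwidth=1.0):
    """Similarity that decays with squared distance between two points."""
    return math.exp(-math.dist(u, v) ** 2 / (2 * bandwidth ** 2))

def explain(query, train, bandwidth=1.0, k=3):
    """Return (label, uncertainty, exemplars) for a query point.

    train is a list of (point, label) pairs. The label is the one with the
    most kernel mass near the query; uncertainty shrinks as that mass grows.
    """
    weighted = [(gaussian_kernel(query, x, bandwidth), x, y) for x, y in train]
    mass = {}
    for w, _, y in weighted:
        mass[y] = mass.get(y, 0.0) + w
    label = max(mass, key=mass.get)
    support = mass[label]
    uncertainty = 1.0 / (1.0 + support)  # arbitrary monotone choice, for illustration
    same_label = [item for item in weighted if item[2] == label]
    same_label.sort(key=lambda item: item[0], reverse=True)
    exemplars = [x for _, x, _ in same_label[:k]]  # closest same-label training points
    return label, uncertainty, exemplars

train = [
    ((0.0, 0.0), 1), ((0.5, 0.0), 1), ((0.0, 0.5), 1), ((-0.5, 0.0), 1),
    ((5.0, 5.0), 0), ((5.5, 5.0), 0), ((5.0, 5.5), 0),
]
label_near, unc_near, exemplars_near = explain((0.1, 0.0), train)
label_far, unc_far, exemplars_far = explain((20.0, 20.0), train)
print(label_near, unc_near, exemplars_near)
print(label_far, unc_far)
```

The far query still gets a label (it sits well on one side of any reasonable boundary), but its uncertainty is essentially maximal because no training point is anywhere near it. The near query gets low uncertainty and a short list of exemplars that justify the answer.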