Benchmarks and reproducibility in fair ML

These days, there are lots of fairness-aware classification algorithms out there. This is great! It should mean that for any task you want to pursue you can try out a bunch of fair classifiers and pick the one that works best on your dataset under the fairness measure you like most.

Unfortunately, this has not been the case. Even where code is available, the preprocessing of a specific data set is often wrapped into the algorithm, making it hard to reuse the code and hard to see what impact different preprocessing choices have on the algorithm. Many authors have used the same data sets, but preprocessed them in different ways and evaluated them under different metrics. Which one is the best?

In an effort to address some of these questions, we’ve made a repository and written an accompanying paper detailing what we’ve found.

We’ve made our best effort to include existing algorithms and represent them correctly, but if you have code that we’ve missed or see something we’ve messed up, we hope you’ll submit a pull request or just shoot us an email.

Some highlights…

Metrics: There are so many fairness metrics! Or are there? We find that a lot of them are correlated on the algorithms and datasets we looked at. In fact, there are two groups: disparate-impact-like measures and class-sensitive error measures. And accuracy measures are not a distinct group! They correlate with the class-sensitive error measures. So perhaps fairness-accuracy tradeoffs are only an issue with disparate-impact-like measures.
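To make the two groups concrete, here is a minimal Python sketch of one measure from each family, plus plain accuracy. The function names, the toy data, and the exact definitions (a ratio of positive-prediction rates for disparate impact, a difference in true positive rates as a class-sensitive error measure) are our illustrative assumptions, not necessarily the definitions used in the repository.

```python
def disparate_impact(y_pred, group, protected, positive=1):
    """Positive-prediction rate of the protected group divided by that of everyone else.

    Ignores the true labels entirely. Assumes both groups are non-empty and
    that the non-protected group receives at least one positive prediction.
    """
    prot = [p for p, g in zip(y_pred, group) if g == protected]
    rest = [p for p, g in zip(y_pred, group) if g != protected]
    rate_prot = sum(p == positive for p in prot) / len(prot)
    rate_rest = sum(p == positive for p in rest) / len(rest)
    return rate_prot / rate_rest


def tpr_difference(y_true, y_pred, group, protected, positive=1):
    """True positive rate of the protected group minus that of everyone else.

    A class-sensitive error measure: it only looks at rows whose true label
    is positive. Assumes each group has at least one such row.
    """
    def tpr(in_group):
        hits = [p == positive
                for t, p, g in zip(y_true, y_pred, group)
                if t == positive and (g == protected) == in_group]
        return sum(hits) / len(hits)

    return tpr(True) - tpr(False)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: group "a" plays the role of the protected group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

print("disparate impact:", disparate_impact(y_pred, group, "a"))
print("TPR difference:  ", tpr_difference(y_true, y_pred, group, "a"))
print("accuracy:        ", accuracy(y_true, y_pred))
```

Note that the disparate impact function never looks at the true labels, while the other two do. That is one plausible reason accuracy would track the class-sensitive error measures more closely than the disparate-impact-like ones.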


Stability: We look at the stability of the algorithms and metrics by running each algorithm over multiple random splits of the data and taking the standard deviation of a given measure across those splits. Here’s a cool graph based on that analysis showing disparate impact versus accuracy.


We think it’s easier to understand the relative performance of algorithms when you take this variation into account.
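The idea behind the stability analysis can be sketched in a few lines of Python. This is a rough illustration under our own assumptions, not the repository's code: the helper names, the number of splits, the one-third test fraction, the use of the sample standard deviation, and the toy majority-class "algorithm" are all made up for the example.

```python
import random
import statistics


def evaluate_over_splits(rows, fit_and_score, n_splits=10, test_fraction=0.33, seed=0):
    """Score an algorithm on several random train/test splits.

    `fit_and_score(train, test)` is any callable that trains on `train` and
    returns a single number (accuracy, disparate impact, ...) measured on `test`.
    Returns the mean and the sample standard deviation of those numbers.
    """
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    scores = []
    for _ in range(n_splits):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_fraction))
        train, test = shuffled[:cut], shuffled[cut:]
        scores.append(fit_and_score(train, test))
    return statistics.mean(scores), statistics.stdev(scores)


def majority_accuracy(train, test):
    """Toy stand-in for a real algorithm: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return sum(y == majority for _, y in test) / len(test)


# Hypothetical data: (feature, label) pairs where every third row has label 1.
rows = [(i, int(i % 3 == 0)) for i in range(30)]

mean_acc, std_acc = evaluate_over_splits(rows, majority_accuracy)
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Reporting the spread next to the mean is what lets you tell whether a gap between two algorithms is bigger than the noise from the choice of split.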

Preprocessing: Given the same algorithm on the same data set, you can end up with different, sometimes very different, outcomes depending on small preprocessing variations, such as whether a protected race attribute is represented with all of its possible values or as just, e.g., white and not-white.
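As a small illustration of the two representations mentioned above, here is a Python sketch that encodes the same race column both ways. The function names and the category values are hypothetical and chosen only for the example; real data sets use their own category lists.

```python
def one_hot_race(values):
    """Keep every distinct value: one 0/1 indicator column per category."""
    categories = sorted(set(values))
    encoded = [[int(v == c) for c in categories] for v in values]
    return encoded, categories


def binary_race(values, privileged="White"):
    """Collapse the attribute to 1 for the privileged value and 0 for everything else."""
    return [int(v == privileged) for v in values]


# Hypothetical column of a protected attribute.
races = ["White", "Black", "Asian", "White", "Hispanic"]

full, categories = one_hot_race(races)
print(categories)          # column order of the one-hot encoding
print(full)                # one row of indicators per person
print(binary_race(races))  # a single white / not-white indicator
```

The binary version throws away the distinction between the non-white groups, so an algorithm (and any fairness measure computed afterwards) sees a different problem even though the underlying data is the same.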


Tradeoffs: For the measures for which we found a fairness-accuracy tradeoff, different algorithms choose different parts of the tradeoff.

So which algorithm is best? Perhaps unsurprisingly, no one algorithm dominates over all data sets.

There’s a larger ongoing discussion about reproducibility in machine learning. This is our contribution in the fairness world.




Racist risk assessments, algorithmic fairness, and the issue of harm

By now, you are likely to have heard of the fascinating report (and white paper) released by ProPublica describing the way that risk assessment algorithms in the criminal justice system appear to affect different races differently, and are not particularly accurate in their predictions. Worse still, they are less accurate at predicting outcomes for black subjects than for white subjects. Notice that this is a separate problem from ensuring equal outcomes à la disparate impact: it’s the problem of ensuring equal failure modes as well.


There is much to pick apart in this article, and you should read the whole thing yourself. But from the perspective of research in algorithmic fairness, and how this research is discussed in the media, there’s another very important consequence of this work.

It provides concrete examples of people who have possibly been harmed by algorithmic decision-making. 

We talk to reporters frequently about the larger set of questions surrounding algorithmic accountability and eventually they always ask some version of:

Can you point to anyone who’s actually been harmed by algorithms?

and we’ve never been able to point to specific instances so far. But now, after this article, we can.


White House Report on Algorithmic Fairness

The White House has put out a report on big data and algorithmic fairness (announcement, full report). From the announcement:

Using case studies on credit lending, employment, higher education, and criminal justice, the report we are releasing today illustrates how big data techniques can be used to detect bias and prevent discrimination. It also demonstrates the risks involved, particularly how technologies can deliberately or inadvertently perpetuate, exacerbate, or mask discrimination.

The table of contents for the report gives a good overview of the issues addressed:

Big Data and Access to Credit
The Problem: Many Americans lack access to affordable credit due to thin or non-existent credit files.
The Big Data Opportunity: Use of big data in lending can increase access to credit for the financially underserved.
The Big Data Challenge: Expanding access to affordable credit while preserving consumer rights that protect against discrimination in credit eligibility decisions.

Big Data and Employment
The Problem: Traditional hiring practices may unnecessarily filter out applicants whose skills match the job opening.
The Big Data Opportunity: Big data can be used to uncover or possibly reduce employment discrimination.
The Big Data Challenge: Promoting fairness, ethics, and mechanisms for mitigating discrimination in employment opportunity.

Big Data and Higher Education
The Problem: Students often face challenges accessing higher education, finding information to help choose the right college, and staying enrolled.
The Big Data Opportunity: Using big data can increase educational opportunities for the students who most need them.
The Big Data Challenge: Administrators must be careful to address the possibility of discrimination in higher education admissions decisions.

Big Data and Criminal Justice
The Problem: In a rapidly evolving world, law enforcement officials are looking for smart ways to use new technologies to increase community safety and trust.
The Big Data Opportunity: Data and algorithms can potentially help law enforcement become more transparent, effective, and efficient.
The Big Data Challenge: The law enforcement community can use new technologies to enhance trust and public safety in the community, especially through measures that promote transparency and accountability and mitigate risks of disparities in treatment and outcomes based on individual characteristics.


NPR: Can Computers be Racist?


As will come as no surprise to readers of this blog, algorithms can make biased decisions.  NPR tackles this question in their latest All Tech Considered (which I was interviewed for!).

They start by talking to Jacky Alcine, the software engineer who discovered that Google Photos had tagged his friend as an animal.

As Jacky points out: “One could say, ‘Oh, it’s a computer,’ I’m like OK … a computer built by whom? A computer designed by whom? A computer trained by whom?” It’s a short segment, but we go on to talk a bit about how that bias could come about.

What I want to emphasize here is that, while hiring more Black software engineers would probably help and make it more likely that these issues would be caught quickly, it is not enough. As Jacky implies, the training data itself is biased. In this case, the bias likely comes from including more photos of white people and animals than of Black people. In other cases, it arises because the labels have been created by people whose past racist decisions are being purposefully used to guide future decisions.

Consider the automated hiring algorithms now touted by many startups (Jobaline, Hirevue, Gild, …). If an all-white company attempts to use their current employees as training data, i.e., attempts to find future employees who are like their current employees, then they’re likely to continue being an all-white company. That’s because the data about their current employees encodes systemic racial bias such as differences between white and Black SAT test-takers even when controlling for ability. Algorithmic decisions will find and replicate this bias.

We need to be proactive to keep such biases from influencing algorithmic decisions.