Benchmarks and reproducibility in fair ML

These days, there are lots of fairness-aware classification algorithms out there. This is great! It should mean that for any task you want to pursue, you can try out a bunch of fair classifiers and pick the one that works best on your dataset under the fairness measure you like most.

Unfortunately, this has not been the case. Even when code is available, the preprocessing of a specific data set is often baked into the algorithm, making the code hard to reuse and making it hard to see what impact different preprocessing choices have on the algorithm's results. Many authors have used the same data sets, but preprocessed in different ways and evaluated under different metrics. Which one is the best?

In an effort to address some of these questions, we’ve made a repository and written an accompanying paper detailing what we’ve found.

We’ve made our best effort to include existing algorithms and represent them correctly, but if you have code that we’ve missed or see something we’ve messed up, we hope you’ll submit a pull request or just shoot us an email.

Some highlights…

Metrics: There are so many fairness metrics! Or are there? We find that many of them are correlated on the algorithms and datasets we looked at. In fact, they fall into two groups: disparate-impact-like measures and class-sensitive error measures. And accuracy measures are not a distinct group! They correlate with the class-sensitive error measures. So perhaps fairness-accuracy tradeoffs are only an issue with disparate-impact-like measures.
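To make the two families concrete, here is an illustrative sketch of one measure from each group, for binary predictions and a binary protected attribute. This is not the repository's actual code, and the function names are our own:

```python
import numpy as np

def disparate_impact(y_pred, protected):
    """Ratio of positive-prediction rates between groups:
    P(y_pred=1 | protected=0) / P(y_pred=1 | protected=1).
    Values near 1 mean the two groups receive positive outcomes at similar rates."""
    return y_pred[protected == 0].mean() / y_pred[protected == 1].mean()

def class_sensitive_error(y_true, y_pred, protected):
    """A class-sensitive error measure: false positive and false negative
    rates computed separately for each group, as {group: (fpr, fnr)}."""
    rates = {}
    for g in (0, 1):
        mask = protected == g
        negatives = max(((y_true == 0) & mask).sum(), 1)
        positives = max(((y_true == 1) & mask).sum(), 1)
        fpr = ((y_pred == 1) & (y_true == 0) & mask).sum() / negatives
        fnr = ((y_pred == 0) & (y_true == 1) & mask).sum() / positives
        rates[g] = (fpr, fnr)
    return rates
```

The first rewards equal outcome rates regardless of labels; the second conditions on the true class, which is why the two families can disagree.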


Stability: We look at the stability of the algorithms and metrics by taking the standard deviation of each measure over multiple random splits of a given data set.  Here’s a cool graph based on that analysis showing disparate impact versus accuracy.


We think it’s easier to understand the relative performance of algorithms taking this into account.
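As a rough sketch of this kind of analysis (a hypothetical helper, not the repository's pipeline; assumes scikit-learn and a binary protected attribute), one can retrain on several random splits and report each measure's mean and standard deviation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def stability(X, y, protected, n_splits=10, seed=0):
    """Mean and standard deviation of accuracy and disparate impact
    over repeated random train/test splits of the same data set."""
    rng = np.random.RandomState(seed)
    accs, dis = [], []
    for _ in range(n_splits):
        tr, te = train_test_split(np.arange(len(y)), test_size=0.3,
                                  random_state=rng.randint(1 << 30))
        pred = LogisticRegression().fit(X[tr], y[tr]).predict(X[te])
        prot = protected[te]
        accs.append((pred == y[te]).mean())
        # positive-prediction rate of group 0 over that of group 1
        r0, r1 = pred[prot == 0].mean(), pred[prot == 1].mean()
        dis.append(r0 / r1 if r1 > 0 else np.nan)
    return {"accuracy": (np.mean(accs), np.std(accs)),
            "disparate_impact": (np.nanmean(dis), np.nanstd(dis))}
```

A measure with a large standard deviation across splits tells you less about an algorithm than its mean alone would suggest.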

Preprocessing: Running the same algorithm on the same data set can produce different, potentially very different, outcomes depending on small preprocessing variations, such as whether a protected race attribute is represented with all of its possible values or collapsed to, e.g., white and not-white.
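As an illustration (a toy pandas sketch, not the repository's code), the same race column admits at least two reasonable encodings, and an algorithm may behave quite differently on each:

```python
import pandas as pd

df = pd.DataFrame({"race": ["White", "Black", "Asian", "White", "Hispanic"]})

# Variant 1: keep every value, one-hot encoded -- four columns.
all_values = pd.get_dummies(df["race"], prefix="race")

# Variant 2: collapse to a single binary white / not-white column.
white_not_white = (df["race"] == "White").astype(int)
```

Variant 2 erases the distinctions among non-white groups, so an algorithm that looks fair under one encoding need not be fair under the other.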


Tradeoffs: For the measures for which we found a fairness-accuracy tradeoff, different algorithms choose different parts of the tradeoff.

So which algorithm is best? Perhaps unsurprisingly, no one algorithm dominates across all data sets.

There’s a larger ongoing discussion about reproducibility in machine learning. This is our contribution in the fairness world.



Racist risk assessments, algorithmic fairness, and the issue of harm

By now, you are likely to have heard of the fascinating report (and white paper) released by ProPublica describing how risk assessment algorithms in the criminal justice system appear to affect different races differently, and are not particularly accurate in their predictions. Worse, they are less accurate at predicting outcomes for Black defendants than for white ones. Notice that this is a separate problem from ensuring equal outcomes in the disparate impact sense: it’s the problem of ensuring equal failure modes as well.


There is much to pick apart in this article, and you should read the whole thing yourself. But from the perspective of research in algorithmic fairness, and how this research is discussed in the media, there’s another very important consequence of this work.

It provides concrete examples of people who have possibly been harmed by algorithmic decision-making. 

We talk to reporters frequently about the larger set of questions surrounding algorithmic accountability and eventually they always ask some version of:

Can you point to anyone who’s actually been harmed by algorithms?

and we’ve never been able to point to specific instances so far. But now, after this article, we can.


White House Report on Algorithmic Fairness

The White House has put out a report on big data and algorithmic fairness (announcement, full report).  From the announcement:

Using case studies on credit lending, employment, higher education, and criminal justice, the report we are releasing today illustrates how big data techniques can be used to detect bias and prevent discrimination. It also demonstrates the risks involved, particularly how technologies can deliberately or inadvertently perpetuate, exacerbate, or mask discrimination.

The table of contents for the report gives a good overview of the issues addressed:

Big Data and Access to Credit
The Problem: Many Americans lack access to affordable credit due to thin or non-existent credit files.
The Big Data Opportunity: Use of big data in lending can increase access to credit for the financially underserved.
The Big Data Challenge: Expanding access to affordable credit while preserving consumer rights that protect against discrimination in credit eligibility decisions.

Big Data and Employment
The Problem: Traditional hiring practices may unnecessarily filter out applicants whose skills match the job opening.
The Big Data Opportunity: Big data can be used to uncover or possibly reduce employment discrimination.
The Big Data Challenge: Promoting fairness, ethics, and mechanisms for mitigating discrimination in employment opportunity.

Big Data and Higher Education
The Problem: Students often face challenges accessing higher education, finding information to help choose the right college, and staying enrolled.
The Big Data Opportunity: Using big data can increase educational opportunities for the students who most need them.
The Big Data Challenge: Administrators must be careful to address the possibility of discrimination in higher education admissions decisions.

Big Data and Criminal Justice
The Problem: In a rapidly evolving world, law enforcement officials are looking for smart ways to use new technologies to increase community safety and trust.
The Big Data Opportunity: Data and algorithms can potentially help law enforcement become more transparent, effective, and efficient.
The Big Data Challenge: The law enforcement community can use new technologies to enhance trust and public safety in the community, especially through measures that promote transparency and accountability and mitigate risks of disparities in treatment and outcomes based on individual characteristics.


NPR: Can Computers be Racist?


As will come as no surprise to readers of this blog, algorithms can make biased decisions.  NPR tackles this question in their latest All Tech Considered (which I was interviewed for!).

They start by talking to Jacky Alcine, the software engineer who discovered that Google Photos had tagged his friend as an animal:

As Jacky points out: “One could say, ‘Oh, it’s a computer,’ I’m like OK … a computer built by whom? A computer designed by whom? A computer trained by whom?” It’s a short segment, but we go on to talk a bit about how that bias could come about.

What I want to emphasize here is that, while hiring more Black software engineers would likely help and make it more likely that these issues would be caught quickly, it is not enough. As Jacky implies, the training data itself is biased. In this case, likely by including more photos of white people and animals than of Black people. In other cases, because the labels have been created by people whose past racist decisions are being purposefully used to guide future decisions.

Consider the automated hiring algorithms now touted by many startups (Jobaline, Hirevue, Gild, …). If an all-white company attempts to use their current employees as training data, i.e., attempts to find future employees who are like their current employees, then they’re likely to continue being an all-white company. That’s because the data about their current employees encodes systemic racial bias such as differences between white and Black SAT test-takers even when controlling for ability. Algorithmic decisions will find and replicate this bias.
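To see how "find people like our current employees" replicates bias, here is a toy simulation. The data, model, and numbers are entirely hypothetical; the point is only to illustrate the mechanism:

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical history: ability is what should matter, but past hiring
# only ever hired white applicants (an all-white company).
n = 1000
white = rng.randint(0, 2, n)
ability = rng.randn(n)
hired = ((ability > 0) & (white == 1)).astype(int)  # biased historical labels

# A 1-nearest-neighbour "people like our current employees" model,
# trained on (ability, race) features with the biased labels.
X = np.column_stack([ability, white])

def predict(applicant):
    distances = ((X - applicant) ** 2).sum(axis=1)
    return hired[distances.argmin()]

# Two equally able new applicants, differing only in race:
white_applicant = predict(np.array([1.0, 1]))
black_applicant = predict(np.array([1.0, 0]))
# The model recommends hiring only the white applicant.
```

Nothing in the algorithm mentions race as a criterion; the bias arrives entirely through the training labels.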

We need to be proactive to keep such biases from influencing algorithmic decisions.

Disparate impact ruled discriminatory in housing

Symposium: Supreme Court’s victory for disparate impact includes a cautionary tale

The Supreme Court decided a huge case today in favor of civil rights… and in favor of statistics! More nuance on the legal opinion can be found above and at many other posts at SCOTUSblog. But for computational purposes, I find these sections interesting:

the high court cabined disparate impact liability to those policies that pose “artificial, arbitrary, and unnecessary barriers.” That important qualifier may ultimately determine the outcome of this case on remand.

Does this mean that if we develop a model that has high utility and decreases disparate impact that the previous decision will be ruled unnecessary?

Most importantly, the full majority cautions that disparate impact liability poses “special dangers” that must be limited to avoid serious constitutional questions that might arise under the FHA if, for example, such liability were imposed based solely on a showing of a statistical disparity. This requires giving housing authorities and private developers adequate leeway to explain the valid interests their policies serve, an analysis that is analogous to Title VII’s business necessity defense. The Court emphasizes that policies are not contrary to the disparate impact requirement unless they are “artificial, arbitrary, and unnecessary barriers.” And the Court confirms that a disparate impact claim relying on a statistical disparity must fail if the plaintiff cannot point to a defendant’s policy causing that disparity. The Court views this crucial causality requirement as necessary to ensure that defendants (and courts) do not resort to the use of racial quotas.

There’s a lot here, but I think the big point is the importance of interpretable models. It sounds like the plaintiff must be able to interpret the defendant’s model in order to bring a valid claim. So we also need interpretability of models even without the right to examine or build the model ourselves! (Though I would argue that this is an unreasonable standard for the law to expect and that the burden should be on the defendant to provide an interpretable version of their policy. Perhaps it is?)

As today’s decision presages, the next challenge to “disparate impact” theory the Court will undoubtedly be forced to consider may prove to be a far more difficult one. As Justice Scalia noted in his concurrence in Ricci v. DeStefano, whether any statute that affirmatively requires race-based actions to remedy “disparate impacts” can be harmonized with the Fourteenth Amendment’s guarantee of equal protection is not an easy question to answer.

This is the question of whether it’s possible to repair disparate impact without first observing its effect. The assumption seems to be “no”, but we’ve actually shown a way of doing this (see paper here).

Another inspiring interpretation of the ruling can be found here. It argues that the ruling acknowledges “unconscious prejudice”: the statistics might show that a decision was discriminatory even if you didn’t mean it to be!

Everything you need to know about evidence-based sentencing

Moneyballing Justice: “Evidence-Based” Criminal Reforms Ignore Real Evidence

There are many issues with so-called evidence-based sentencing reforms – from the lack of basic statistical validity, to the lack of transparency, to their discriminatory impact – and this article surveys all of them, with detailed links for much more information. Here’s some of the high profile criticism these methods have received:

Attorney General Eric Holder has warned that use of predictive data in sentencing is likely to adversely affect communities of color. University of Michigan legal scholar Sonja Starr explains that risk scores are based primarily or wholly on an individual’s prior characteristics, including criminal history – some instruments include not only convictions, but arrests and failure to appear in court. Other allegedly criminogenic factors “unrelated to conduct” often include homelessness, “unemployment, marital status, age, education, finances, neighborhood, and family background, including family members’ criminal history.” Starr asserts that because poor people and people of color bear the brunt of mass incarceration, “[p]unishment profiling will exacerbate these disparities.”

Yet some proponents seem to still have missed the point, including the folks behind the Public Safety Assessment – Court from the Laura and John Arnold Foundation:

Importantly, because it does not rely on factors like neighborhood or income, the PSA-Court is helping deliver these results without discriminating on the basis of race or gender.

That might be true, or it might not be. Neighborhood and income certainly aren’t the only possible attributes correlated with race and gender. The group wouldn’t release the list of 9 attributes when the authors of this article asked, so it’s not verifiable. Yet the title of the article about their tool is “Data and Research for Increased Safety and Fairness.” If they really care about fairness, I hope they release their data or at least the methodology by which they determined it wasn’t discriminatory.

More Facebook nymwars

Say my name: Facebook’s unfair “real names” policy continues to harm vulnerable users

Another community is hit by the Facebook nymwars. We don’t know for sure that Facebook is using an algorithm to evaluate whether users’ names are real or fake, but given the large number of users involved it seems likely. So:

Dear Facebook – please evaluate your “fake” name recognition algorithm using class-conditioned error measures. Finding that it works well on most users may still mean that it works terribly for most users who are drag queens.
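By a class-conditioned error measure we mean splitting the error rate by subpopulation instead of reporting one overall number. A toy sketch (entirely hypothetical numbers) of why the overall rate can hide the problem:

```python
import numpy as np

def error_rates(y_true, y_pred, group):
    """Overall error rate, plus the error rate conditioned on each group."""
    overall = (y_true != y_pred).mean()
    per_group = {g: (y_true[group == g] != y_pred[group == g]).mean()
                 for g in np.unique(group)}
    return overall, per_group

# 1000 users, 10 of whom are drag queens; all names are actually real,
# but the filter flags every drag queen's name as fake.
group = np.array(["other"] * 990 + ["drag_queen"] * 10)
y_true = np.zeros(1000)                           # 0 = real name
y_pred = np.where(group == "drag_queen", 1, 0)    # 1 = flagged as fake

overall, per_group = error_rates(y_true, y_pred, group)
# overall error is 1%, but the error rate for drag queens is 100%
```

A 99%-accurate classifier can still be wrong every single time for a small subpopulation, which is exactly the failure mode the overall number conceals.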