Benchmarks and reproducibility in fair ML

These days, there are lots of fairness-aware classification algorithms out there. This is great! It should mean that for any task you want to pursue you can try out a bunch of fair classifiers and pick the one that works best on your dataset under the fairness measure you like most.

Unfortunately, this has not been the case. Even in the cases where code is available, the preprocessing of a specific data set is often wrapped into the algorithm, making it hard to reuse the code and hard to see what the impact of different preprocessing choices are on the algorithm. Many authors have used the same data sets, but preprocessed different ways and evaluated under different metrics. Which one is the best?

In an effort to address some of these questions, we’ve made a repository and written an accompanying paper detailing what we’ve found.

We’ve made our best effort to include existing algorithms and represent them correctly, but if you have code that we’ve missed or see something we’ve messed up, we hope you’ll submit a pull request or just shoot us an email.

Some highlights…

Metrics: There are so many fairness metrics! Or are there? We find that a lot of them are correlated on the algorithms and datasets we looked at. In fact, there are two groups: disparate impact like measures and class-sensitive error measures. And accuracy measures are not a district group! They correlate with the class-sensitive error measures. So perhaps fairness-accuracy tradeoffs are only an issue with disparate impact like measures.


Stability: We look at the stability of the algorithms and metrics over multiple random splits on a given measure by taking its standard deviation.  Here’s a cool graph based on that analysis showing disparate impact versus accuracy.


We think it’s easier to understand the relative performance of algorithms taking this into account.

Preprocessing: Given the same algorithm on the same data set you can end up with different, potentially largely different, outcomes depending on small preprocessing variations, such as whether a protected race attribute is represented as all the possible values or, e.g., white and not-white.


Tradeoffs: For the measures for which we found a fairness-accuracy tradeoff, different algorithms choose different parts of the tradeoff.

So which algorithm is best? As perhaps is not surprising, no one algorithm dominates over all data sets.

There’s a larger ongoing discussion about reproducibility in machine learning. This is our contribution in the fairness world.





Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s