On the new PA recidivism risk assessment tool

(Update: apparently as a result of all the pushback from activists, the ACLU and others, the rollout of the new tool has been pushed back at least 6 months)

The Pennsylvania Commission on Sentencing is preparing a new risk assessment tool for recidivism to aid in sentencing. The mandate for the commission (taken from their report — also see the detailed documentation at their site) is to (emphasis all mine):

adopt a Sentence Risk Assessment Instrument for the sentencing court to use to help determine the appropriate sentence within the limits established by law…The risk assessment instrument may be used as an aide in evaluating the relative risk that an offender will reoffend and be a threat to public safety. (42 Pa.C.S.§2154.7) In addition to considering the risk of re-offense and threat to public safety, Act 2010-95 also permits the risk assessment instrument to be used to determine whether a more thorough assessment is necessary, or as an aid in determining appropriate candidates for alternative sentencing (e.g., County Intermediate Punishment, State Intermediate Punishment, State Motivational Boot Camp, and Recidivism Risk Reduction Incentive).

I was hired by the ACLU of Pennsylvania to look at the documentation provided as part of this new tool and see how they built it. I submitted a report to them a little while ago.

The commission is running public hearings to take comments and I thought I’d highlight some points, especially focusing on what I think are important “FAT*” notions for any data science project of this kind.

What is the goal of the predictor?

When you build any ML system, you have to be very careful about deciding what it is that you want to predict. In PA’s model, the risk assessment tool is to be used (by mandate) for determining

  • reoffense likelihood
  • risk to public safety

Note that these are not the same thing! Using a single tool to predict both, or using its predictions to make assessments about both, is a problem.

How is this goal being measured?

You have to dig into the reports to see this (page 6): they measure recidivism as

re-arrest for a felony or misdemeanor in Pennsylvania within three years of imposition of a sentence to the community or within three years of release from confinement; or, for offenders sentenced to state prison, a recommitment to the Department of Corrections for a technical violation within three years of release from confinement.
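This definition can be sketched as a labeling function. Everything below is my own illustration: the field names, the record layout, and the day-counting convention are assumptions, not the commission’s actual data dictionary.

```python
from datetime import date, timedelta

# "within three years" approximated as 3 * 365 days (an assumption)
THREE_YEARS = timedelta(days=3 * 365)

def is_recidivist(event_type: str, event_date: date,
                  window_start: date, state_prison: bool) -> bool:
    """Label an event under the report's recidivism definition.

    window_start is when the three-year window opens: imposition of a
    community sentence, or release from confinement. Field names are
    hypothetical.
    """
    if event_date - window_start > THREE_YEARS:
        return False
    if event_type in ("felony_rearrest", "misdemeanor_rearrest"):
        return True
    # Technical violations count only for offenders sentenced to state prison.
    return state_prison and event_type == "technical_violation"
```

Note that the label mixes re-arrests with technical violations into a single outcome, which matters for the discussion that follows.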

How does the predictor goal match the measured goal?

Here’s where it gets interesting. I’m not at all clear how “risk to public safety” is measured by re-arrests. Moreover, using re-arrest as a proxy for reoffense is a big potential red flag, if we are concerned about excessive policing issues as well as patterns that target minorities. As a matter of fact, a 2013 recidivism report by Pennsylvania (Thanks to Nyssa Taylor at ACLU-PA for finding this) says (page 17) that re-arrest rates are highest for African-Americans, whereas reincarceration rates are more evenly balanced by race.

Notice also that technical violations of parole are included in the measurement of recidivism. Setting aside the question of whether any technical violation of parole amounts to a risk to public safety, it’s known from pre-trial risk assessment research, for example, that failure to appear in court occurs for many reasons that often correlate more with poverty (and the inability to take time off to appear in court) than with actual flight risk.

It’s not clear what a technical violation of parole might constitute and whether there are race biases in this calculation. Note that since this is aggregated into a predicted value, it doesn’t undergo the more detailed nondiscrimination analysis that other features do.

Separately, I’ll note that the PA SC did discover that, as a feature, prior arrests carry a race bias that is not mitigated by predictive efficacy, and therefore decided to replace it with prior convictions.

How is the predictor being used?

What’s interesting about this tool is that its output is converted (as usual) into a low, medium or high risk label. But the tool is only used when the risk is deemed either low or high. This determination then triggers further reports. In the case when it returns a medium risk, the tool results are not passed on.

What I didn’t see is how the predictor guides a decision towards alternate sentencing, and whether a single predictor for “risk of recidivism” is sufficient to determine the efficacy of alternate interventions (Narrator: it probably isn’t).
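The triage behavior described above can be sketched as follows; the numeric cutoffs are placeholders I made up, not the commission’s actual thresholds.

```python
def triage(score: float, low_cut: float = 0.3, high_cut: float = 0.7):
    """Map a risk score to a label, and report only low/high findings.

    Cutoffs are hypothetical placeholders. Per the design described
    above, a 'medium' result is not passed on.
    """
    if score < low_cut:
        label = "low"
    elif score >= high_cut:
        label = "high"
    else:
        label = "medium"
    return label, label != "medium"  # (label, report_triggered)
```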

Coda

There are many interesting aspects of the tool building process: how they essentially build a separate tool for each of 10 crime categories, how they decided to group categories together, and how they decided to use a different model for crimes against a person. The models used are all logistic regression, and the reports provide the variables that end up in each model, as well as the weights.

But to me, the detailed analysis of the effectiveness of the tool and which variables don’t carry racial bias miss some of the larger issues with how they even decide what the “labels” are.


Models need doubt: the problematic modeling behind predictive policing

Predictive policing describes a collection of data-driven tools that are used to determine where to send officers on patrol on any given day. The idea behind these tools is that we can use historical data to make predictions about when and where crime will happen on a given day and use that information to allocate officers appropriately.

On the one hand, predictive policing tools are becoming ever more popular in jurisdictions across the country. They represent an argument based on efficiency: why not use data to model crime more effectively and therefore provision officers more usefully where they might be needed?

On the other hand, critiques of predictive policing point out that a) predicting crimes based on arrest data really predicts arrests and not crimes and b) by sending officers out based on predictions from a model and then using the resulting arrest data to update the model, you’re liable to get into a feedback loop where the model results start to diverge from reality.

This was empirically demonstrated quite elegantly by Lum and Isaac in a paper last year, using simulated drug arrest data in the Oakland area as well as an implementation of a predictive policing algorithm developed by PredPol (the implementation was based on a paper published by researchers associated with PredPol). For further discussion on this, it’s worth reading Bärí A. Williams’ op-ed in the New York Times, a response to this op-ed by Andrew Guthrie Ferguson (who’s also written a great book on this topic) and then a response by Isaac and Lum to his response.

Most of the discussion and response has focused on specifics of the kinds of crimes being recorded and modeled and the potential for racial bias in the outcomes.

In our work, we wanted to ask a more basic question: what’s the mechanism that makes feedback affect the predictions a model makes? The top-line ideas emerging from our work (two papers that will be published at the 1st FAT* conference and at ALT 2018) can be summarized as:

Biased observations can cause runaway feedback loops.  If police don’t see crime in a neighborhood because the model told them not to go there, this can cause a feedback loop.

Over time, such models can generate predictions of crime rates that (if used to decide officer deployment) will skew the data used to train the next iteration of the model. Since models might be run every day (and were done so in at least one published work describing PredPol-like algorithms), this skew might take hold quickly.

But this is still speculation. Can we mathematically prove that this will happen? The answer is yes, and this is the main contribution in our paper to appear at FAT*. By modeling the predictive process with a generalization of a Pólya urn, we can mathematically prove that the system will diverge out of control, to the extent that if two areas have even slightly different crime rates, a system that used predictive modeling to allocate officers, collect the resulting observational data and retrain the model will progressively put more and more emphasis on the area with the slightly higher crime rate.
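To see the flavor of the divergence result, here is a deterministic, mean-field caricature of the process (my own toy construction, not the urn analysis from the paper): patrols are allocated in proportion to previously recorded incidents, and incidents are recorded only where patrols go.

```python
def feedback_share(rate_a=0.6, rate_b=0.4, days=50_000):
    """Expected share of recorded incidents attributed to region A.

    a and b accumulate recorded incident counts for two regions with
    true crime rates rate_a and rate_b. Each day, patrol effort is
    split in proportion to recorded counts, and each region records
    incidents only in proportion to the patrol effort it receives.
    """
    a, b = 1.0, 1.0  # prior counts
    for _ in range(days):
        x = a / (a + b)          # share of patrols sent to region A
        a += x * rate_a          # incidents recorded in A
        b += (1 - x) * rate_b    # incidents recorded in B
    return a / (a + b)
```

Region A’s true share of crime is 0.6, but its recorded share drifts well past that and keeps growing with more days: the small initial gap is amplified by the allocate-observe-retrain loop.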

Moreover, we can see this effect in simulations of real-world predictive policing deployments using the implementation of PredPol used by Lum and Isaac in their work, providing justification for our mathematical model.

Now let’s take a step back. If we have a model that exhibits runaway feedback loops, then we might try to fix the model to avoid such bad behavior. In our paper, we show how to do that as well. The intuition here is quite simple. Suppose we have an area with a very high crime rate as estimated by our predictive model. Then observing an incident should not surprise us very much: in fact, it’s likely that we shouldn’t even try to update the model from this incident. On the other hand, the less we expect crime to happen, the more we should be surprised by seeing an incident and the more willing we should be to update our model.

This intuition leads to a way in which we can take predictions produced by a black box model and tweak the data that is fed into it so that it only reacts to surprising events. This then provably yields a system that will converge to the observed crime rates. And we can validate this empirically again using the PredPol-inspired implementation. What our experiments show is that such a modified system does not exhibit runaway feedback loops.
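The intuition above can be rendered in a small deterministic toy model. Here I stand in for the paper’s surprise-based filtering with importance weighting by the patrol share; that substitution is my own simplification, and the paper’s actual mechanism differs in its details.

```python
def corrected_share(rate_a=0.6, rate_b=0.4, days=50_000):
    """Two regions with true crime rates rate_a and rate_b. Patrols
    are allocated in proportion to accumulated weights, but each
    recorded incident is weighted by the inverse of the patrol share
    that produced it, so heavily patrolled regions do not accumulate
    extra weight just from being watched more."""
    a, b = 1.0, 1.0
    for _ in range(days):
        x = a / (a + b)                  # patrol share for region A
        recorded_a = x * rate_a          # what patrols in A observe
        recorded_b = (1 - x) * rate_b
        a += recorded_a / x              # importance weight 1/x
        b += recorded_b / (1 - x)        # importance weight 1/(1-x)
    return a / (a + b)
```

The weighted share converges to the true crime share (0.6 here) instead of running away.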

A disclaimer: in the interest of clarity, I’ve conflated terms that in reality should be distinct: an incident is not an arrest is not a crime. And it can’t always be assumed that just because we don’t send an officer to an area that we don’t get any information about incidents (e.g., via 911 calls). We model these issues more carefully in the paper, and in fact show that as the proportion of “reported incidents” (i.e., those not obtained as a consequence of model-directed officer patrols) increases, model accuracy increases in a predictable and quantifiable way if we assume that those reported incidents accurately reflect crime. This is obviously a big assumption, and the extent to which different types of incidents reflect the underlying ground truth crime rate likely differs by crime and neighborhood – something we don’t investigate in our paper but believe should be a priority for any predictive policing system.

From the perspective of machine learning, the problem here is that the predictive system should be an online learning algorithm, but is actually running in batch mode. That means that it is unable to explore the space of possible models and instead merely exploits what it learns initially.

What if we could redesign the predictive model from scratch? Could we bring in insights from online learning to do a better job? This is the topic of our second paper and the next post. The short summary I’ll leave you with is that by carefully modeling the problem of limited feedback, we can harness powerful reinforcement learning frameworks to design new algorithms with provable bounds for predictive policing.
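For a flavor of what exploration buys here, a minimal epsilon-greedy sketch (my own illustration, not the algorithm from our paper): with some small probability, patrol a random precinct, so that no precinct’s data ever stops being collected.

```python
import random

def epsilon_greedy_patrol(est_rates, epsilon=0.1, rng=None):
    """Choose a precinct to patrol given estimated crime rates.

    Exploit the current best estimate most of the time, but with
    probability epsilon explore uniformly at random, which keeps
    every precinct under observation -- the hallmark of an online
    learner, unlike the batch-mode systems described above.
    """
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return rng.randrange(len(est_rates))                 # explore
    return max(range(len(est_rates)), key=est_rates.__getitem__)  # exploit
```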

Bloomberg profile of Richard Berk

Richard Berk is one of the founding fathers of automated risk assessment, and systems based on his work are being deployed in Pennsylvania and other locations. This Bloomberg profile of him has many interesting (and terrifying) nuggets. As always, you should read the whole thing (if Bloomberg’s horrible page rendering doesn’t trigger a headache), but here are some highlights.

What’s interesting in the system he designed is how it’s optimized for cost of incarceration, rather than for accuracy. In the particular case described in the article, this actually makes the system less harsh, because a finding of a problem triggers expensive therapy. On the other side though, there’s a political component: it’s far riskier to release someone who might commit a crime than it is to keep incarcerated someone who might be reformed. As Berk puts it:

The policy position that is taken is that it’s much more dangerous to release Darth Vader than it is to incarcerate Luke Skywalker

The problem of course is that incarcerating Luke Skywalker could turn him into a new Darth Vader, and I don’t know if this is factored into the analysis.

He also says later

Berk argues that eliminating sensitive factors weakens the predictive power of the algorithms. “If you want me to do a totally race-neutral forecast, you’ve got to tell me what variables you’re going to allow me to use, and nobody can, because everything is confounded with race and gender,” he said.

This seems a little binary to me. It’s not an either-or where you either have to keep all sensitive attributes or throw them all out. There are ways to quantify and even subtract out the influence of certain problematic attributes without having to throw out all the information: in fact, we have a paper on this!
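As a minimal illustration of the idea, here is within-group centering of a feature — one simple way to remove first-order group influence, and not necessarily the method from our paper:

```python
from collections import defaultdict

def center_within_groups(values, groups):
    """Subtract each group's mean from a feature, so the feature's
    average no longer differs across protected groups. This removes
    only the first-moment association -- a deliberate simplification."""
    totals, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    means = {g: totals[g] / counts[g] for g in totals}
    return [v - means[g] for v, g in zip(values, groups)]
```

After centering, group membership can no longer be inferred from the feature’s mean, while within-group ordering (and hence much of the predictive signal) is preserved.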

As the article notes, Berk is heading to Norway:

Berk wants to predict at the moment of birth whether people will commit a crime by their 18th birthday, based on factors such as environment and the history of a new child’s parents. This would be almost impossible in the U.S., given that much of a person’s biographical information is spread out across many agencies and subject to many restrictions. He’s not sure if it’s possible in Norway, either, and he acknowledges he also hasn’t completely thought through how best to use such information.

The idea that data can be collected to make such predictions is certainly alluring and tempting. But everything we’re beginning to understand about predictions based on algorithms suggests that making such predictions in the absence of any understanding of the model behavior and why it’s making its decisions is a recipe for disaster.

I’ll note that the recidivism predictions typically work 6 months to 2 years out, and are not particularly accurate! Trying to predict 18 years out is rather scary.

Wisconsin Supreme Court decision on COMPAS

We finally have the first legal ruling on algorithmic decision making. This case comes from Wisconsin, where Eric Loomis challenged the use of COMPAS for sentencing him.

While the Supreme Court denied the appeal, it made a number of interesting observations and recommendations:

  • “risk scores may not be considered as the determinative factor in deciding whether the offender can be supervised safely and effectively in the community.”
  • “the following warning must be given to sentencing judges: “(1) the proprietary nature of COMPAS has been invoked to prevent disclosure of information relating to how factors are weighed or how risk scores are to be determined; (2) risk assessment compares defendants to a national sample, but no cross-validation study for a Wisconsin population has yet been completed; (3) some studies of COMPAS risk assessment scores have raised questions about whether they disproportionately classify minority offenders as having a higher risk of recidivism; and (4) risk assessment tools must be constantly monitored and re-normed for accuracy due to changing populations and subpopulations.”

Like Danielle Citron (the author of the Forbes article) I’m a little skeptical that this will be enough. Warning labels on cigarette boxes didn’t really stop people smoking. But I think as part of a larger effort to increase awareness of the risks, and to make people even stop and think a little before blindly forging ahead with algorithms, this is a decent first step.

At the AINow Symposium in New York (that I’ll say more about later), one proposed extreme along the policy spectrum regarding algorithmic decision-making was to place a moratorium on the use of algorithms entirely. I don’t know if that makes complete sense. But a heavy, heavy dose of caution is definitely warranted, and rulings like this might lead to a patchwork of caveats and speedbumps that help us flesh out exactly where algorithmic decision making makes more or less sense.

 

Testing algorithmic decision-making in court.

Well that was quick!

On the heels of the ProPublica article about bias in algorithmic decision-making in the criminal justice system, a lawsuit now before the Wisconsin Supreme Court could mark the first legal determination about the use of algorithmic methods in sentencing.

The first few paragraphs of the article summarize the issue at hand:

When Eric L. Loomis was sentenced for eluding the police in La Crosse, Wis., the judge told him he presented a “high risk” to the community and handed down a six-year prison term.

The judge said he had arrived at his sentencing decision in part because of Mr. Loomis’s rating on the Compas assessment, a secret algorithm used in the Wisconsin justice system to calculate the likelihood that someone will commit another crime.

Mr. Loomis has challenged the judge’s reliance on the Compas score, and the Wisconsin Supreme Court, which heard arguments on his appeal in April, could rule in the coming days or weeks. Mr. Loomis’s appeal centers on the criteria used by the Compas algorithm, which is proprietary and as a result is protected, and on the differences in its application for men and women.

Racist risk assessments, algorithmic fairness, and the issue of harm

By now, you are likely to have heard of the fascinating report (and white paper) released by ProPublica describing the way that risk assessment algorithms in the criminal justice system appear to affect different races differently, and are not particularly accurate in their predictions. Worse, they are less accurate at predicting outcomes for black subjects than for white ones. Notice that this is a separate problem from ensuring equal outcomes (à la disparate impact): it’s the problem of ensuring equal failure modes as well.


There is much to pick apart in this article, and you should read the whole thing yourself. But from the perspective of research in algorithmic fairness, and how this research is discussed in the media, there’s another very important consequence of this work.

It provides concrete examples of people who have possibly been harmed by algorithmic decision-making. 

We talk to reporters frequently about the larger set of questions surrounding algorithmic accountability and eventually they always ask some version of:

Can you point to anyone who’s actually been harmed by algorithms?

and we’ve never been able to point to specific instances so far. But now, after this article, we can.

 

White House Report on Algorithmic Fairness

The White House has put out a report on big data and algorithmic fairness (announcement, full report).  From the announcement:

Using case studies on credit lending, employment, higher education, and criminal justice, the report we are releasing today illustrates how big data techniques can be used to detect bias and prevent discrimination. It also demonstrates the risks involved, particularly how technologies can deliberately or inadvertently perpetuate, exacerbate, or mask discrimination.

The table of contents for the report gives a good overview of the issues addressed:

Big Data and Access to Credit
The Problem: Many Americans lack access to affordable credit due to thin or non-existent credit files.
The Big Data Opportunity: Use of big data in lending can increase access to credit for the financially underserved.
The Big Data Challenge: Expanding access to affordable credit while preserving consumer rights that protect against discrimination in credit eligibility decisions

Big Data and Employment
The Problem: Traditional hiring practices may unnecessarily filter out applicants whose skills match the job opening.
The Big Data Opportunity: Big data can be used to uncover or possibly reduce employment discrimination.
The Big Data Challenge: Promoting fairness, ethics, and mechanisms for mitigating discrimination in employment opportunity.

Big Data and Higher Education
The Problem: Students often face challenges accessing higher education, finding information to help choose the right college, and staying enrolled.
The Big Data Opportunity: Using big data can increase educational opportunities for the students who most need them.
The Big Data Challenge: Administrators must be careful to address the possibility of discrimination in higher education admissions decisions.

Big Data and Criminal Justice
The Problem: In a rapidly evolving world, law enforcement officials are looking for smart ways to use new technologies to increase community safety and trust.
The Big Data Opportunity: Data and algorithms can potentially help law enforcement become more transparent, effective, and efficient.
The Big Data Challenge: The law enforcement community can use new technologies to enhance trust and public safety in the community, especially through measures that promote transparency and accountability and mitigate risks of disparities in treatment and outcomes based on individual characteristics.