(Update: apparently as a result of all the pushback from activists, the ACLU and others, the rollout of the new tool has been pushed back at least 6 months)
The Pennsylvania Commission on Sentencing is preparing a new risk assessment tool for recidivism to aid in sentencing. The mandate for the commission (taken from their report — also see the detailed documentation at their site) is to (emphasis all mine):
“adopt a Sentence Risk Assessment Instrument for the sentencing court to use to help determine the appropriate sentence within the limits established by law…The risk assessment instrument may be used as an aide in evaluating the relative risk that an offender will reoffend and be a threat to public safety.” (42 Pa.C.S.§2154.7) In addition to considering the risk of re-offense and threat to public safety, Act 2010-95 also permits the risk assessment instrument to be used to determine whether a more thorough assessment is necessary, or as an aid in determining appropriate candidates for alternative sentencing (e.g., County Intermediate Punishment, State Intermediate Punishment, State Motivational Boot Camp, and Recidivism Risk Reduction Incentive).
I was hired by the ACLU of Pennsylvania to look at the documentation provided as part of this new tool and see how they built it. I submitted a report to them a little while ago.
The commission is running public hearings to take comments and I thought I’d highlight some points, especially focusing on what I think are important “FAT*” notions for any data science project of this kind.
What is the goal of the predictor?
When you build any ML system, you have to be very careful about deciding what it is that you want to predict. In PA’s model, the risk assessment tool is to be used (by mandate) for determining
- reoffense likelihood
- risk to public safety
Note that these are not the same thing! Using a single tool to predict both, or using its predictions to make assessments about both, is a problem.
How is this goal being measured?
You have to dig into the reports to see this (page 6): they measure recidivism as
re-arrest for a felony or misdemeanor in Pennsylvania within three years of imposition of a sentence to the community or within three years of release from confinement; or, for offenders sentenced to state prison, a recommitment to the Department of Corrections for a technical violation within three years of release from confinement.
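To make that definition concrete, here is a minimal sketch of how such a label could be computed for one case. This is my own reading of the quoted text, not the commission's code: the `Case` fields, the function names, and the approximation of three years as 1095 days are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumption: "within three years" approximated as 3 * 365 days.
FOLLOW_UP_DAYS = 3 * 365


@dataclass
class Case:
    # Hypothetical record layout, not the commission's schema.
    start: date                               # sentence to the community, or release from confinement
    state_prison: bool                        # was the person sentenced to state prison?
    rearrest: Optional[date] = None           # first re-arrest (felony or misdemeanor) in PA
    technical_recommit: Optional[date] = None # recommitment to DOC for a technical violation


def within_window(start: date, event: Optional[date]) -> bool:
    """True if the event happened inside the follow-up window."""
    return event is not None and 0 <= (event - start).days <= FOLLOW_UP_DAYS


def recidivism_label(case: Case) -> bool:
    """Label a case as 'recidivist' under the quoted definition."""
    if within_window(case.start, case.rearrest):
        return True
    # Technical violations only count for people sentenced to state prison.
    return case.state_prison and within_window(case.start, case.technical_recommit)
```

Note what the sketch makes visible: the label can be positive without any new conviction, and for people sentenced to state prison it can be positive without even a new arrest.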
How does the predictor goal match the measured goal?
Here’s where it gets interesting. I’m not at all clear how “risk to public safety” is measured by re-arrests. Moreover, using re-arrest as a proxy for reoffense is a big potential red flag, if we are concerned about excessive policing issues as well as patterns that target minorities. As a matter of fact, a 2013 recidivism report by Pennsylvania (Thanks to Nyssa Taylor at ACLU-PA for finding this) says (page 17) that re-arrest rates are highest for African-Americans, whereas reincarceration rates are more evenly balanced by race.
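The proxy concern can be illustrated with a bit of arithmetic. The sketch below uses entirely made-up numbers (they are not taken from any Pennsylvania report): two hypothetical groups reoffend at the same rate, but one is policed more heavily, so an actual offense is more likely to lead to an arrest and an arrest without an offense is also more likely.

```python
def observed_rearrest_rate(reoffense_rate: float,
                           p_arrest_if_reoffend: float,
                           p_arrest_if_not: float) -> float:
    """Re-arrest rate implied by a true reoffense rate and arrest probabilities."""
    return (reoffense_rate * p_arrest_if_reoffend
            + (1 - reoffense_rate) * p_arrest_if_not)


# Hypothetical numbers for illustration only.
TRUE_REOFFENSE_RATE = 0.30  # identical for both groups by construction

# Group A: lighter policing. Group B: heavier policing.
group_a = observed_rearrest_rate(TRUE_REOFFENSE_RATE, 0.50, 0.05)
group_b = observed_rearrest_rate(TRUE_REOFFENSE_RATE, 0.80, 0.10)

print(f"group A re-arrest rate: {group_a:.3f}")
print(f"group B re-arrest rate: {group_b:.3f}")
```

Under these assumed numbers, a model trained on re-arrest would see group B as riskier even though the underlying reoffense rates are identical by construction.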
Notice also that technical violations of parole are included in measurements of recidivism. Setting aside the question of whether any technical violation of parole amounts to a risk to public safety, there is a known parallel in pre-trial risk assessment: failure to appear in court occurs for many reasons that often correlate more with poverty (and the inability to take time off to appear in court) than with actual flight risk.
It’s not clear what a technical violation of parole might constitute and whether there are race biases in this calculation. Note that since this is aggregated into a predicted value, it doesn’t undergo the more detailed nondiscrimination analysis that other features do.
Separately, I’ll note that the PA SC did discover that, as a feature, prior arrests carry a race bias that is not mitigated by predictive efficacy, and therefore decided to replace it with prior convictions.
How is the predictor being used?
What’s interesting about this tool is that its output is converted (as usual) into a low, medium or high risk label. But the tool is only used when the risk is deemed either low or high. This determination then triggers further reports. In the case when it returns a medium risk, the tool results are not passed on.
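As a rough sketch of that usage pattern: a score is bucketed into three labels, and only the two extreme labels are passed on. The cutoff values and function names below are hypothetical; the commission's actual thresholds are in their reports and are not reproduced here.

```python
from typing import Optional

# Hypothetical cutoffs on a predicted probability, for illustration only.
LOW_CUTOFF = 0.2
HIGH_CUTOFF = 0.6


def risk_label(p: float) -> str:
    """Bucket a predicted probability into low / medium / high."""
    if p < LOW_CUTOFF:
        return "low"
    if p >= HIGH_CUTOFF:
        return "high"
    return "medium"


def passed_to_court(p: float) -> Optional[str]:
    """Return the label only when it is low or high; medium results are withheld."""
    label = risk_label(p)
    return label if label in ("low", "high") else None
```

One consequence worth noticing is that the cutoffs do a lot of work: a small shift in either one changes who gets flagged for further reports and who is never flagged at all.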
What I didn’t see is how the predictor guides a decision towards alternate sentencing, and whether a single predictor for “risk of recidivism” is sufficient to determine the efficacy of alternate interventions (Narrator: it probably isn’t).
There are many interesting aspects of the tool building process: how they essentially build a different tool for each of 10 crime categories, how they decided to group categories together, and how they decided to use a different model for crimes against a person. The models used are all logistic regression, and the reports provide the variables that end up in each model, as well as the weights.
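For readers unfamiliar with the setup, here is a minimal sketch of what "one logistic regression per crime category" looks like at prediction time. The category names, variables, and weights are invented for illustration (only "prior convictions" is a feature mentioned above); the real ones are in the commission's reports.

```python
import math

# Hypothetical per-category models: an intercept plus a weight per variable.
MODELS = {
    "category_a": {
        "intercept": -1.0,
        "weights": {"prior_convictions": 0.3, "age_under_25": 0.5},
    },
    "category_b": {
        "intercept": -1.5,
        "weights": {"prior_convictions": 0.4},
    },
}


def predict(category: str, features: dict) -> float:
    """Logistic regression score: sigmoid of intercept + weighted feature sum."""
    model = MODELS[category]
    z = model["intercept"] + sum(
        weight * features.get(name, 0.0)
        for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# Same person, scored under two different category models.
person = {"prior_convictions": 2, "age_under_25": 1}
print(predict("category_a", person))
print(predict("category_b", person))
```

Whatever the weights, the output is a predicted probability of the training label, which is why the definition of that label matters so much.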
But to me, the detailed analysis of the effectiveness of the tool and of which variables don’t carry racial bias misses some of the larger issues with how they even decide what the “labels” are.