Lawyer-turned-data-scientist David Colarusso recently came out with a very interesting and important analysis highlighting the effects of race, sex, and (imputed) income on criminal sentencing – it’s called “Uncovering Big Bias with Big Data”. (I came across this via MetaFilter via mathbabe.)
Colarusso’s findings are that defendants who are black, poor, or male can expect longer sentences for the same charges than defendants who are, respectively, white, rich, or female – a finding that is both a sad comment on our justice system and an unsurprising one.
But before taking his quantitative assertions as empirically valid, I wanted to look at the analysis a little deeper. Here is what I found:
- For a model to show merit, it’s crucial that it perform predictably on unseen data. Colarusso pays lip service to testing a model with held-out data (which he inaccurately calls “cross-validation”), but that’s pretty much it. The main post linked above doesn’t actually present any details on it at all. When I dug deeper into the supporting IPython notebook, things got even weirder. Instead of using coefficients derived from the training data to make predictions on the held-out data and then assessing the validity of those predictions, he simply fits the regression a second time on the held-out data, producing a new set of coefficients. What?! He says “the code below doesn’t really capture how I go about cross-validation”, but there is no other description of how he did go about testing with held-out data.
- Using a single predictor – charge seriousness – the R² score drops by half when the log function is applied to the outcome. Thereafter, it does not rise when more predictors are added. So from an explanation-of-variance standpoint, the very first simplistic model is better than the final one.
- Speaking of predictors, race and income were treated as independent covariates, when they are obviously correlated. Regularization could help with this problem, but was not considered. Interactions weren’t considered either – why not?
- Finally, despite all the significant issues I mention above, this is perhaps the most worthy and important piece of analysis I’ve seen recently. Why do I say this? We have a glut of data scientists doing analyses on things that simply do not matter. Meanwhile, Colarusso has taken incarceration, something that deeply and destructively impacts the lives of not just individuals but entire communities, and scrutinized the notion many take for granted: that convicts deserve the sentences they get, and the oft-repeated (and racist) lie that disproportions in the justice system merely reflect the demographics of those who commit crimes. For this he deserves commendation. Since both data and his code are freely available, I’d encourage those who find fault with his analysis (and I include myself in this group) to not merely criticize, but try to do better.
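
To make the held-out-data point in the first bullet concrete, here is a minimal sketch of the difference between scoring training-set coefficients on held-out data and simply refitting on the held-out data. It is an illustration under stated assumptions, not Colarusso's code: the data are synthetic, and the helper names (`fit_ols`, `predict`, `r_squared`) are hypothetical ones introduced only for this example.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares; returns coefficients with the intercept first."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply previously fitted coefficients to new rows."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ coef

def r_squared(y_true, y_pred):
    """Fraction of variance in y_true explained by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data (an assumption; NOT the sentencing data).
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([0.8, -0.3, 0.1]) + rng.normal(size=n)

# Random 80/20 split into training and held-out rows.
idx = rng.permutation(n)
train, test = idx[:800], idx[800:]

# Proper check: fit on the training rows, then score those same
# coefficients on rows the model has never seen.
coef_train = fit_ols(X[train], y[train])
r2_held_out = r_squared(y[test], predict(coef_train, X[test]))

# What the critique describes: refit on the held-out rows. This yields
# a second in-sample fit and says nothing about generalization.
coef_refit = fit_ols(X[test], y[test])
r2_refit = r_squared(y[test], predict(coef_refit, X[test]))

print(f"held-out R^2 using training coefficients: {r2_held_out:.3f}")
print(f"R^2 after refitting on held-out rows:     {r2_refit:.3f}")
```

The refit score can never be lower than the held-out score on the same rows, because least squares picks the coefficients that minimize squared error on whatever data it is given. That is why refitting on the held-out set cannot tell you how well the original model generalizes.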