Raw predictions considered harmful?

Last week I went to a really interesting talk by Bill Howe called “Raw Data Considered Harmful,” which presented a strong case for handing ML researchers and data scientists semisynthetic data in sensitive settings. Some of his work, currently under review, proposes methods for scrubbing raw data of “biases”: signals that are unwanted because they should not, or cannot, be legitimate representations of the relationships between variables in a dataset.

The idea for the solution comes from the way many fairness metrics fail under Simpson’s paradox, where a signal can look different, or disappear entirely, under different grouping conditions. Basically, given a sensitive input X and an outcome of interest Y, along with “admissible” inputs S (determined, perhaps, by a social scientist or domain expert familiar with how the data was collected), X should not “affect” (don’t ask me to define this) Y no matter which subset of the features in S you condition on. Once you have minimally modified the dataset so that it satisfies this property (think: adding and removing some tuples), you have something that should resemble your original dataset, but without those pesky “encoded historical biases” everyone has been so worried about.
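To make the Simpson’s paradox point concrete, here is a minimal sketch with made-up admissions-style counts (the variables and every number are hypothetical, not from the talk): the marginal association between X and the outcome points one way, while the association within every level of S points the other.

```python
import pandas as pd

# Hypothetical counts: X is the sensitive attribute, Y the outcome (success),
# S the "admissible" grouping variable (say, which department someone applied to).
data = pd.DataFrame([
    # S,      X, applicants, successes
    ("easy",  0, 80, 60),
    ("easy",  1, 20, 16),
    ("hard",  0, 20,  4),
    ("hard",  1, 80, 20),
], columns=["S", "X", "n", "y"])

# Marginal success rate by X: group X=1 looks much worse overall...
overall = data.groupby("X")[["n", "y"]].sum()
print(overall["y"] / overall["n"])    # X=0: 0.64, X=1: 0.36

# ...but conditioned on S, X=1 does slightly *better* in every group.
by_group = data.groupby(["S", "X"])[["n", "y"]].sum()
print(by_group["y"] / by_group["n"])  # easy: 0.75 vs 0.80, hard: 0.20 vs 0.25
```

A fairness metric computed only on the marginal rates would tell a very different story than one computed within groups, which is exactly why the choice of what to condition on matters.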

I think this work is great for lots of reasons. I wonder, though, how people should interpret their analyses of the new dataset. If we know that the historical outcome Y is biased in a certain way, what does it mean to build a model that accurately recovers Y values from the admissible features in this “corrected” setting? Let’s say we’re interested in evaluating job aptitude, and we’re using historical promotion data to decide whether to hire people. The fundamental problem here is that promotion != aptitude, because, perhaps among other reasons, women are promoted less frequently by sexist superiors, or don’t ask for raises because of social conditioning. So we correct the data for historical biases against women. Our new target is still promotion, but a model trained on the admissible features should be less sexist (by a number of metrics, I think; see the forthcoming papers), because the data now satisfies the assertion that gender has nothing to do with aptitude.
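As a rough illustration of the “less sexist by a number of metrics” claim, here is the kind of check one might run on both the original and the repaired dataset. The column names, the use of logistic regression, and the choice of a demographic-parity-style gap are all my assumptions for the sake of the sketch, not anything from the talk or the papers.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def demographic_parity_gap(df, admissible_cols, target_col, sensitive_col):
    """Fit a model on admissible features only, then report the spread in
    predicted positive rates across groups of the sensitive attribute.
    (Hypothetical helper; 'df' could be the raw or the repaired dataset.)"""
    model = LogisticRegression(max_iter=1000)
    model.fit(df[admissible_cols], df[target_col])
    preds = model.predict(df[admissible_cols])
    rates = {
        g: preds[(df[sensitive_col] == g).to_numpy()].mean()
        for g in df[sensitive_col].unique()
    }
    return max(rates.values()) - min(rates.values())

# Hypothetical usage, with made-up column names:
# gap_raw      = demographic_parity_gap(raw_df,      ["experience", "reviews"], "promoted", "gender")
# gap_repaired = demographic_parity_gap(repaired_df, ["experience", "reviews"], "promoted", "gender")
```

One would expect the gap to shrink on the repaired data, since the repair is designed to break the dependence between the sensitive attribute and the target given the admissible features; whether that makes the predictions a better proxy for aptitude is exactly the question below.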

But how do we know the signal in this new data is actually a better proxy for aptitude? If optimizing for promotion was such a terrible idea in the first place, what exactly is left that would make it a good idea now? Even when domain experts are telling you what to control for, who makes the call on whether the data could possibly contain any meaningful information? A related question I didn’t get to ask Bill is whether this data would be useful for scientific inquiry / causal inference in the many settings where a prediction isn’t actually what you want, but I don’t know anything about causal inference anyway, so maybe someone who does can chat with him about it.