High-stakes decision-making in areas like healthcare, finance, and governance requires accountability for decisions and for how data is used to make them. Many concerns have been raised about whether machine learning (ML) models can meet these expectations. In many cases, ML model predictions have turned out to be objectionable, or to violate the expectations set for them, only after deployment.
A key reason is that ML models are often complex black boxes and thus have varied, unknown failure modes that are revealed only after deployment: models fail to achieve their reported high accuracies, lead to unfair decisions, and sometimes produce predictions that are plainly unacceptable given basic domain knowledge. To address these problems, there has been work on enhancing fairness, improving generalization to new data domains, and building explanations for ML models. However, these three goals of fairness, stability, and explanation are often studied relatively independently of one another.
The Reliable ML project addresses questions of model stability, fairness, and explanation in a unified way. We believe there are fundamental connections between the stability (generalization), fairness, and explainability of an ML model. Having one without the other two is not useful: all three must hold for an ML model to deliver its stated objective in a high-stakes application. If a fair and explainable model is not stable across data distributions, its stated properties can vary over time and across domains. Similarly, stable and fair models that cannot be explained are difficult to debug or improve. And a stable and explainable model without fairness guarantees may be unacceptable for many applications.
As a concrete example, consider adversarial examples: small perturbations of an input that cause even a highly accurate ML model to produce an incorrect prediction.
- Adversarial examples can be used to regularize the training procedure and make a model robust to small perturbations of data (which is a special case of stability).
- Adversarial examples can serve as explanations by exhibiting the minimal change to an input that would alter the model's prediction (counterfactual explanations).
- Adversarial examples that change only certain protected attributes, such as gender or race, can be used to verify and optimize for fairness (fairness audit), as sketched after this list.
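To make the last two uses concrete, here is a minimal sketch on a toy logistic-regression model: a gradient search for a counterfactual (the smallest perturbation that flips the prediction) and a fairness audit that flips a binary protected attribute. The model, the `counterfactual` and `fairness_audit` helpers, and all parameter values are illustrative assumptions, not code from the project.

```python
import numpy as np

# Toy linear model standing in for a trained classifier (illustrative only).
rng = np.random.default_rng(0)
weights = rng.normal(size=5)   # five input features
bias = 0.1

def predict_proba(x):
    """Probability of the positive class under the toy logistic model."""
    return 1.0 / (1.0 + np.exp(-(x @ weights + bias)))

def counterfactual(x, threshold=0.5, step=0.05, max_iters=200):
    """Search for a small perturbation that flips the prediction:
    an adversarial example used as a counterfactual explanation."""
    x_cf = x.copy()
    original_class = predict_proba(x) > threshold
    for _ in range(max_iters):
        if (predict_proba(x_cf) > threshold) != original_class:
            break  # prediction flipped: x_cf - x is the counterfactual change
        # For a linear model the gradient of the logit w.r.t. the input is
        # `weights`; move against the current class to cross the boundary.
        direction = -weights if original_class else weights
        x_cf += step * direction / np.linalg.norm(weights)
    return x_cf

def fairness_audit(x, protected_idx):
    """Flip only a binary protected attribute and compare predictions;
    a large change suggests the model relies on that attribute."""
    x_flip = x.copy()
    x_flip[protected_idx] = 1.0 - x_flip[protected_idx]
    return predict_proba(x), predict_proba(x_flip)

x0 = rng.normal(size=5)
x0[4] = 1.0  # treat the last feature as a binary protected attribute
x_cf = counterfactual(x0)
print("original prediction:", predict_proba(x0))
print("counterfactual change:", x_cf - x0)
print("audit (original vs. flipped protected attribute):",
      fairness_audit(x0, protected_idx=4))
```

In a real setting the model would come from the trained pipeline and the protected attribute from the dataset schema; the point of the sketch is that counterfactual explanations and fairness audits reduce to the same search for input perturbations that change the prediction.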
Browse our publications for more details.