Upstream Mitigation Is Not All You Need

Ryan Steed, Michael Wick, Ari Kobren, Swetasudha Panda

22 May 2022

A few large, homogeneous pre-trained models undergird many machine learning systems, and these models often contain harmful stereotypes learned from the internet. We investigate the bias transfer hypothesis, the possibility that social biases (such as stereotypes) internalized by large language models during pre-training could also affect task-specific behavior after fine-tuning. For two classification tasks, we find that reducing intrinsic bias with controlled interventions before fine-tuning does little to mitigate the classifier’s discriminatory behavior after fine-tuning. Regression analysis suggests that downstream disparities are better explained by biases in the fine-tuning dataset. Still, pre-training plays a role: simple alterations to co-occurrence rates in the fine-tuning dataset are ineffective when the model has been pre-trained. Our results encourage practitioners to focus more on dataset quality and context-specific harms.


Venue: ACL 2022