A few large, homogenous pre-trained models
undergird many machine learning systems — and
often, these models contain harmful stereotypes
learned from the internet. We investigate the bias
transfer hypothesis, the possibility that social biases (such as stereotypes) internalized by large language models during pre-training could also affect task-specific behavior after fine-tuning. For two classification tasks, we find that reducing intrinsic bias with controlled interventions before fine-tuning does little to mitigate the classifier’s discriminatory behavior after fine-tuning. Regression analysis suggests that downstream disparities are better explained by biases in the fine-tuning dataset. Still, pre-training plays a role: simple alterations to co-occurrence rates in the fine-tuning dataset are ineffective when the model has been pre-trained. Our results encourage practitioners to focus more on dataset quality and context-specific harms.