Searching Near and Far for Examples in Data Augmentation

Ari Kobren, Naveen Jafer Nizar, Michael Wick, Swetasudha Panda

07 November 2021

In this work, we demonstrate that augmenting a dataset with examples that are far from the initial training set can lead to significant improvements in test set accuracy. Our approach draws on the similarity between deep neural networks and nearest neighbor models: as with a nearest neighbor classifier, we show that, for any test example, augmentation with a single nearby training example of the same label, followed by retraining, is often sufficient for a BERT-based model to classify the test example correctly. In light of this result, we devise FRaNN, an algorithm that attempts to cover the embedding space defined by the trained model with training examples. Empirically, we show that FRaNN and its variant FRaNNK construct augmented datasets that lead to models with higher test set accuracy than either uncertainty sampling or a random augmentation baseline.
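The abstract does not spell out how FRaNN selects examples, but the stated goal, covering the embedding space defined by the trained model with training examples, resembles a greedy farthest-first (k-center) traversal. The sketch below is an assumption-laden illustration of that coverage idea, not the authors' actual algorithm; the function name and the toy embeddings are hypothetical.

```python
import numpy as np

def farthest_first_selection(embeddings, k, seed_idx=0):
    """Greedy farthest-first traversal (k-center heuristic).

    Hypothetical sketch of the coverage idea: repeatedly pick the
    candidate whose embedding is farthest from every example
    selected so far, so the chosen set spreads over the space.
    """
    selected = [seed_idx]
    # Distance from each point to its nearest selected point.
    dists = np.linalg.norm(embeddings - embeddings[seed_idx], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(dists))  # farthest remaining point
        selected.append(nxt)
        new_d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        dists = np.minimum(dists, new_d)  # update nearest-selected distances
    return selected

# Toy 1-D "embeddings": three well-separated pairs. With k=3 the
# selection lands in each pair rather than clustering near the seed.
emb = np.array([[0.0], [0.1], [5.0], [5.1], [10.0], [10.1]])
chosen = farthest_first_selection(emb, k=3)
```

In this toy run the three chosen indices fall one per cluster, illustrating why far-away examples, rather than only nearby or uncertain ones, can improve coverage of the space.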


Venue: EMNLP 2021