Searching Near and Far for Examples in Data Augmentation

Ari Kobren, Naveen Jafer Nizar, Michael Wick, Swetasudha Panda

15 September 2021

In this work, we demonstrate that augmenting a dataset with examples that are far from the initial training set can lead to significant improvements in test set accuracy. We draw on the similarity between deep neural networks and nearest neighbor models. As with a nearest neighbor classifier, we show that, for any test example, augmentation with a single nearby training example of the same label, followed by retraining, is often sufficient for a BERT-based model to correctly classify the test example. In light of this result, we devise FRaNN, an algorithm that attempts to cover the embedding space defined by the trained model with training examples. Empirically, we show that FRaNN, and its variant FRaNNk, construct augmented datasets that lead to models with higher test set accuracy than either uncertainty sampling or a random augmentation baseline.
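The abstract describes selecting training examples that "cover" the model's embedding space. One standard way to realize such coverage is greedy farthest-first (k-center) selection; the sketch below illustrates that idea and is only an assumption about how FRaNN-style selection might look, not the paper's actual algorithm. The function name `farthest_first_selection` and all parameters are hypothetical.

```python
import numpy as np

def farthest_first_selection(embeddings, k, seed=0):
    """Greedy farthest-first (k-center) selection over a matrix of
    example embeddings. Illustrative sketch of embedding-space coverage;
    the selection rule in the FRaNN paper may differ.

    embeddings: (n, d) array of example embeddings
    k: number of examples to select
    returns: list of k distinct row indices
    """
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]
    # Distance from every point to its nearest selected center so far.
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(dists))  # point farthest from the current cover
        selected.append(nxt)
        new_d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        dists = np.minimum(dists, new_d)  # update nearest-center distances
    return selected
```

Each iteration adds the point farthest from everything already selected, so the chosen set spreads across the embedding space rather than clustering near the initial training data, which matches the abstract's emphasis on examples "far from the initial training set".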


Venue: BlackboxNLP Workshop (at EMNLP 2021)

File Name: emnlp_2021_short_frann.pdf