Diverse Data Augmentation via Unscrambling Text with Missing Words

Ari Kobren, Naveen Jafer Nizar, Michael Wick, Swetasudha Panda

10 November 2021

We present the Diverse Augmentation using Scrambled Seq2Seq (DAugSS) algorithm, a fully automated data augmentation mechanism that uses a generative model to produce examples in a semi-controllable fashion. The main component of DAugSS is a training procedure in which the generative model learns to transform a class label and a sequence of tokens into a well-formed sentence of the specified class that contains the specified tokens. Empirically, we show that DAugSS is competitive with or outperforms state-of-the-art generative models for data augmentation in terms of test set accuracy on four datasets. We also show that the flexibility of our approach yields datasets with an expansive vocabulary, and that models trained on these datasets are more resilient to adversarial attacks than models trained on datasets augmented by competing methods.
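To make the training setup concrete, the following is a minimal sketch of how one source/target pair might be built from a labeled sentence: some words are dropped, the remainder is scrambled, and the class label is prepended, with the original sentence as the target. The function name `make_training_pair`, the `keep_prob` parameter, whitespace tokenization, and the `<label>` tag format are all illustrative assumptions; the abstract does not specify these details.

```python
import random


def make_training_pair(sentence, label, keep_prob=0.5, rng=None):
    """Build one (source, target) pair for a seq2seq model.

    Hypothetical helper: the dropout rate, tokenization, and label
    format are assumptions, not the authors' exact procedure.
    """
    rng = rng or random.Random()
    tokens = sentence.split()  # assumed: simple whitespace tokenization

    # Drop words at random so the model must fill in missing ones.
    kept = [tok for tok in tokens if rng.random() < keep_prob]

    # Scramble the surviving words so the model must restore word order.
    rng.shuffle(kept)

    # Source: class label followed by the scrambled, incomplete tokens.
    source = " ".join([f"<{label}>"] + kept)

    # Target: the original well-formed sentence.
    return source, sentence


if __name__ == "__main__":
    src, tgt = make_training_pair(
        "the movie was surprisingly good", "positive",
        keep_prob=0.6, rng=random.Random(0),
    )
    print(src)
    print(tgt)
```

Under this sketch, augmentation at inference time would amount to feeding the trained model a label together with a chosen set of tokens and decoding a new sentence, which is one way to read the "semi-controllable" generation described above.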


Venue: NLP for Conversational AI