Privacy Audits for Large Language Models



Principal Investigator

University of Virginia

Oracle Principal Investigator

Pallika Kanani, Research Director
Virendra Marathe, Consulting Member Technical Staff


Large language models (LLMs) provide tremendous opportunities, and there is substantial interest in building LLM applications that can leverage proprietary and sensitive data in domains such as healthcare, finance, and education. LLMs are prone, however, to memorizing specific information in their training data and to revealing it, either directly or when a sophisticated adversary interacts with the LLM in a purposeful way. The goal of this project is to develop effective and practical tools for conducting privacy audits on LLMs and LLM-based applications.

The purpose of a privacy audit is to understand and measure the disclosure risks associated with a model release or deployment. The state of the art in both academic research and industrial practice today is to use membership inference attacks to measure privacy risks, but these attacks capture only a narrow notion of privacy and are largely ineffective on LLMs. For LLMs, the relevant risks include both dataset disclosure risks, where sensitive text in the training data is revealed to an adversary, and distribution disclosure risks, where an adversary learns some targeted property of the training data distribution. This project will develop new methods for general-purpose privacy audits that provide a comprehensive and actionable understanding of these disclosure risks. Our goal is to go beyond the current approach, which evaluates only the effectiveness of a handful of known attacks, and build a tool that measures the underlying information disclosure risks of a model in a way that predicts vulnerability not just to the known and tested attacks, but to all potential attacks in a large class. The resulting insights and measurement tools will enable effective mitigations that control disclosure risks without sacrificing model utility.