Automatic construction of meta-evaluation benchmarks for evaluation metrics
01 August 2025
The evaluation of long-form text, particularly in domains requiring factual accuracy, increasingly relies on automated metrics. However, the reliability of these metrics is often assumed rather than rigorously tested, especially when long-form generations are the expected output. This paper addresses this gap by proposing a framework for evaluating the quality of reference-based evaluation metrics. We introduce a methodology that iteratively perturbs candidate texts to assess the sensitivity and discrimination power of these metrics. Our experiments on the ACI-Bench medical dataset demonstrate the importance of evaluating evaluation metrics for long-form text and highlight the need for robust validation methodologies.
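The perturbation-based check described above can be illustrated with a minimal sketch. This is not the paper's implementation: `perturb`, `sensitivity_curve`, and the stand-in `token_f1` metric are hypothetical names chosen for illustration; in the paper's setting, candidate/reference pairs would come from ACI-Bench and the metric under test would be an actual reference-based evaluation metric.

```python
import random

def perturb(text: str, n_deletions: int, seed: int = 0) -> str:
    """Apply one simple perturbation type: delete n_deletions random words from the candidate."""
    rng = random.Random(seed)
    words = text.split()
    for _ in range(min(n_deletions, len(words) - 1)):
        words.pop(rng.randrange(len(words)))
    return " ".join(words)

def sensitivity_curve(metric, candidate: str, reference: str, steps=(0, 2, 4, 8)):
    """Score progressively perturbed candidates against the reference.
    A sensitive metric should assign lower scores as more content is removed."""
    return [metric(perturb(candidate, k), reference) for k in steps]

# Trivial token-overlap F1 standing in for a real reference-based metric under test.
def token_f1(candidate: str, reference: str) -> float:
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    if not c or not r:
        return 0.0
    overlap = len(c & r)
    p, rec = overlap / len(c), overlap / len(r)
    return 0.0 if p + rec == 0 else 2 * p * rec / (p + rec)

# Illustrative (invented) clinical-note style example, not taken from ACI-Bench.
reference = "the patient reports chest pain radiating to the left arm since yesterday"
candidate = "patient has chest pain radiating to the left arm starting yesterday"
print(sensitivity_curve(token_f1, candidate, reference))
```

A metric whose scores fail to decrease monotonically across such perturbation steps would show weak sensitivity and discrimination power under this kind of meta-evaluation.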
Venue: The 63rd Annual Meeting of the Association for Computational Linguistics