Automatic construction of meta-evaluation benchmarks for evaluation metrics
01 August 2025
The evaluation of long-form text, particularly in domains requiring factual accuracy, increasingly relies on automated metrics. However, the reliability of these metrics is often assumed rather than rigorously tested, especially when long-form generations are the expected output. This paper addresses this gap by proposing a framework for evaluating the quality of reference-based evaluation metrics. We introduce a methodology that iteratively perturbs candidate texts to assess the sensitivity and discrimination power of these metrics. Our experiments on the ACI-Bench medical dataset demonstrate the importance of evaluating evaluation metrics for long-form text and highlight the need for robust validation methodologies.
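The perturbation-based check described above can be illustrated with a minimal sketch. This is not the paper's implementation: `perturb`, `sensitivity_curve`, and the stand-in `token_f1` metric are hypothetical names chosen for illustration; in the paper's setting, candidate/reference pairs would come from ACI-Bench and the metric under test would be an actual reference-based evaluation metric.

```python
import random

def perturb(text: str, n_deletions: int, seed: int = 0) -> str:
    """Apply one simple perturbation type: delete n_deletions random words from the candidate."""
    rng = random.Random(seed)
    words = text.split()
    for _ in range(min(n_deletions, len(words) - 1)):
        words.pop(rng.randrange(len(words)))
    return " ".join(words)

def sensitivity_curve(metric, candidate: str, reference: str, steps=(0, 2, 4, 8)):
    """Score progressively perturbed candidates against the reference.
    A sensitive metric should assign lower scores as more content is removed."""
    return [metric(perturb(candidate, k), reference) for k in steps]

# Trivial token-overlap F1 standing in for a real reference-based metric under test.
def token_f1(candidate: str, reference: str) -> float:
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    if not c or not r:
        return 0.0
    overlap = len(c & r)
    p, rec = overlap / len(c), overlap / len(r)
    return 0.0 if p + rec == 0 else 2 * p * rec / (p + rec)

# Illustrative (invented) clinical-note style example, not taken from ACI-Bench.
reference = "the patient reports chest pain radiating to the left arm since yesterday"
candidate = "patient has chest pain radiating to the left arm starting yesterday"
print(sensitivity_curve(token_f1, candidate, reference))
```

A metric whose scores fail to decrease monotonically across such perturbation steps would show weak sensitivity and discrimination power under this kind of meta-evaluation.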
Venue: The 63rd Annual Meeting of the Association for Computational Linguistics