Deep learning has revolutionized Artificial Intelligence (AI), with most AI applications now leveraging deep learning techniques to achieve state-of-the-art performance. One of the most prominent applications is Machine Translation (MT), which automatically translates text from one language to another. Over the past few years, the quality of automatic translation has improved dramatically, with current neural translation systems now approaching human performance on standard benchmarks. As a result, the MT field faces new challenges: as top systems produce translations that differ only in minor nuances rather than in major errors, improvements have become increasingly difficult to measure. Unlike many other AI applications, MT is particularly challenging to evaluate. Given a source text, there is rarely a single correct translation, and many outputs may be valid yet differ in form. Similarly, many outputs can be incorrect for different reasons. Due to this intrinsic complexity, MT evaluation has emerged as a dedicated research area focused on developing better evaluation techniques. This area has also advanced significantly in recent years, primarily due to the replacement of earlier heuristic approaches with neural evaluation metrics. However, while improving evaluation accuracy, neural metrics have sacrificed interpretability. Indeed, the most widely used neural metrics are black-box models that reduce translation quality to a single scalar score, making the evaluation process opaque and providing limited insight into the reasons behind a particular assessment. If MT evaluation yields unreliable assessments, it risks steering system development in the wrong direction. This highlights the importance of another research area in MT: MT meta-evaluation, which assesses the effectiveness of MT evaluation metrics and checks their alignment with human evaluation by measuring their correlation with human judgments.
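The meta-evaluation idea described above, measuring how well a metric's scores correlate with human judgments, can be sketched in a few lines. This is a minimal illustration rather than the procedure used in the dissertation: the function names `pearson` and `kendall_tau` and the toy scores are assumptions made for this example, and the Kendall variant shown is the simple tau-a, which does not correct for ties.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists.

    Assumes both lists have non-zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy data: one human score and one metric score per translation.
human_scores = [90.0, 75.0, 60.0, 40.0]
metric_scores = [0.85, 0.80, 0.50, 0.55]

print(f"Pearson: {pearson(human_scores, metric_scores):.3f}")
print(f"Kendall tau-a: {kendall_tau(human_scores, metric_scores):.3f}")
```

A metric that ranks translations in the same order as the human annotators would obtain a Kendall value of 1; the single swapped pair in the toy data lowers it.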
In this dissertation, we advance the ability to measure progress in the MT field by addressing key challenges in both MT evaluation and meta-evaluation, with a particular emphasis on enhancing evaluation interpretability. To mitigate the lack of interpretability in MT evaluation, we make two main contributions: (i) we introduce neural metrics capable of identifying error spans within translations and assigning severity levels to them, thereby enabling finer-grained and more interpretable feedback than neural metrics that output only a scalar assessment; (ii) we propose an interpretable meta-evaluation framework that assesses metric performance using Precision, Recall, F-score, and Re-Ranking Precision. While standard correlation with human judgments is primarily useful for ranking metrics, our meta-evaluation framework offers more intuitive insights into their absolute evaluation performance. Our contributions to MT meta-evaluation extend beyond improving interpretability. We identify fundamental flaws in widely used meta-evaluation strategies, which can inadvertently reward metrics for the wrong reasons. To reveal and quantify these weaknesses, we introduce the concept of sentinel metrics -- intentionally incomplete metrics designed to highlight shortcomings in meta-evaluation methods -- and incorporate them into the meta-evaluation process. Finally, we shift focus from improving evaluation techniques to facilitating better test data selection. As MT systems achieve increasingly higher performance, existing benchmarks have become too easy, making it difficult to distinguish among top-performing systems or identify areas requiring further improvement. We address this issue by introducing Translation Difficulty Estimation -- the task of identifying texts that are difficult to translate -- and show how difficulty estimators can be used to construct more challenging MT benchmarks.
We also develop state-of-the-art difficulty estimators and use them to build the test set for the General Machine Translation Shared Task at the 2025 edition of the Conference on Machine Translation (WMT). Taken together, our contributions advance the field of Machine Translation by providing interpretable evaluation metrics and meta-metrics, more reliable meta-evaluation, and more informative benchmarks -- ultimately improving how progress in MT is measured and understood.
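One way to read the Precision, Recall, and F-score mentioned above is to treat a metric as a binary classifier that separates good translations from bad ones. The sketch below follows that reading under stated assumptions: the threshold, the labels, and the function name `precision_recall_f1` are hypothetical, and the abstract does not state that the dissertation's framework is defined this way.

```python
def precision_recall_f1(metric_scores, human_is_good, threshold):
    """Score a metric as a binary classifier of translation quality.

    A translation is predicted "good" when its metric score reaches the
    (hypothetical) threshold; human_is_good holds the reference labels.
    """
    predicted_good = [score >= threshold for score in metric_scores]
    pairs = list(zip(predicted_good, human_is_good))
    true_pos = sum(1 for pred, gold in pairs if pred and gold)
    false_pos = sum(1 for pred, gold in pairs if pred and not gold)
    false_neg = sum(1 for pred, gold in pairs if not pred and gold)

    predicted_total = true_pos + false_pos
    gold_total = true_pos + false_neg
    precision = true_pos / predicted_total if predicted_total else 0.0
    recall = true_pos / gold_total if gold_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: metric scores and human good/bad labels.
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.2]
toy_labels = [True, False, False, True, True]

precision, recall, f1 = precision_recall_f1(toy_scores, toy_labels, threshold=0.5)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Unlike a correlation coefficient, these values can be read directly: a precision of 0.67 here means that two out of three translations the metric accepted were also judged good by the annotators.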
Towards accurate and interpretable machine translation evaluation / Perrella, Stefano. - (2026 Jan 29).
Towards accurate and interpretable machine translation evaluation
PERRELLA, STEFANO
29/01/2026
| File | Size | Format |
|---|---|---|
| Tesi_dottorato_Perrella.pdf (open access; note: full thesis; type: doctoral thesis; license: Creative Commons) | 3.44 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


