Sentence alignment – establishing links between corresponding sentences in two related documents – is an important NLP task with several downstream applications, such as machine translation (MT). Although existing sentence alignment systems have achieved promising results, their effectiveness relies on auxiliary information such as document metadata or machine-generated translations, as well as on hyperparameter-sensitive techniques. Moreover, these systems often overlook the crucial role that context plays in the alignment process. In this paper, we address these issues and propose CroCoAlign: the first context-aware, end-to-end and fully neural architecture for sentence alignment. Our system aligns source and target sentences in long documents by contextualizing their sentence embeddings with respect to the other sentences in the document. We extensively evaluate CroCoAlign on a multilingual dataset of 20 language pairs derived from the OPUS project and demonstrate that our model achieves state-of-the-art performance. To ensure reproducibility, we release our code and model checkpoints at https://github.com/Babelscape/CroCoAlign.
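
As a rough illustration of why context can matter for alignment, the following toy sketch compares nearest-neighbour matching on isolated sentence vectors with matching on vectors that have been mixed with their neighbours. It is not the CroCoAlign architecture: the vectors are hand-made, the neighbour averaging is only a stand-in for a learned contextualization step, and the function names (`cosine`, `contextualize`, `align`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contextualize(embeddings, mix=0.5):
    """Blend each sentence vector with the mean of its immediate neighbours.

    Assumption: a fixed average replaces whatever learned mechanism a real
    system would use to inject document context.
    """
    out = []
    for i, vec in enumerate(embeddings):
        neighbours = [embeddings[j] for j in (i - 1, i + 1)
                      if 0 <= j < len(embeddings)]
        if not neighbours:
            out.append(list(vec))
            continue
        context = [sum(col) / len(neighbours) for col in zip(*neighbours)]
        out.append([(1 - mix) * a + mix * c for a, c in zip(vec, context)])
    return out


def align(source, target):
    """For each source sentence, return the index of the most similar target."""
    return [max(range(len(target)), key=lambda j: cosine(s, target[j]))
            for s in source]


# Toy document of five sentences; sentences 1 and 3 are identical
# (think of a repeated short reply), so their isolated vectors coincide.
doc = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
# Assumption: the target document has the same vectors in the same order,
# as an idealised cross-lingual encoder would produce.
raw_alignment = align(doc, doc)
contextual_alignment = align(contextualize(doc), contextualize(doc))
print(raw_alignment)         # [0, 1, 2, 1, 4]
print(contextual_alignment)  # [0, 1, 2, 3, 4]
```

Without context, the repeated sentence at position 3 is matched to the first identical target at position 1. Once each vector carries information about its neighbours, the two occurrences become distinguishable and the alignment is monotone.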

CroCoAlign: A Cross-Lingual, Context-Aware and Fully-Neural Sentence Alignment System for Long Texts / Molfese, FRANCESCO MARIA; Bejgu, ANDREI STEFAN; Tedeschi, Simone; Conia, Simone; Navigli, Roberto. - 1:(2024), pp. 2209-2220. (Paper presented at the European Association for Computational Linguistics conference held in St. Julian's, Malta).

CroCoAlign: A Cross-Lingual, Context-Aware and Fully-Neural Sentence Alignment System for Long Texts

Molfese Francesco (first author); Bejgu Andrei; Tedeschi Simone; Conia Simone; Navigli Roberto
2024

European Association for Computational Linguistics
sentence alignment; sentence embeddings; natural language processing; multilinguality; contextual information
04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this item
File: Molfese_CroCoAlign_2024.pdf (open access)
Note: https://aclanthology.org/2024.eacl-long.135/
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 442.44 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1713460