Translated Texts Under the Lens: From Machine Translation Detection to Source Language Identification

La Morgia, Massimo; Mei, Alessandro; Nemmi, Eugenio Nerio; Sabatini, Luca; Sassi, Francesco

doi:10.1007/978-3-031-30047-9_18

Machine translation systems are now widely used to break down linguistic barriers. People from different countries who speak different languages can interact with each other thanks to state-of-the-art translators from prominent software companies such as Google and Microsoft. However, these tools are also used to expand the audience of phishing attacks and scam emails, or to generate fake reviews that promote a product on different e-commerce platforms. In all these cases, knowing whether a text has been translated can be crucial. In this work, we tackle the detection of translated texts from different angles. Besides addressing the classic task of machine translation detection, we investigate and find common patterns across different machine translation systems that are unrelated to the source language of the original text. We then show that it is possible to identify the machine translation system used to generate a translated text with high performance (F1-score 88.5%), and that it is also possible to identify the source language of the original text. We evaluate our models on two datasets: Books, a new dataset we built from scratch from excerpts of novels, and the well-known Europarl dataset, based on the proceedings of the European Parliament.

Translated Texts Under the Lens: From Machine Translation Detection to Source Language Identification / La Morgia, M., Mei, A., Nemmi, E.N., Sabatini, L., Sassi, F. - LNCS 13876 (2023), pp. 222-235. (Intelligent Data Analysis, Louvain-la-Neuve, Belgium) [10.1007/978-3-031-30047-9_18].



Publication year: 2023

Conference name: Intelligent Data Analysis

Keywords: machine translation systems; machine learning; natural language processing

Type: 04 Publication in conference proceedings::04b Conference paper in volume


Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1678345
