The effects of lemmatization on textual analysis conducted with IRaMuTeQ: results in comparison / Sarrica, Mauro; Mingo, Isabella; Mazzara, Bruno Maria; Leone, Giovanna. - PRINT. - 1:(2016), pp. 249-260. (Paper presented at the 13th International Conference on Statistical Analysis of Textual Data, held in Nice (France), 7-10 June 2016).

The effects of lemmatization on textual analysis conducted with IRaMuTeQ: results in comparison

SARRICA, Mauro; MINGO, Isabella; MAZZARA, Bruno Maria; LEONE, Giovanna
2016

Abstract

The software IRaMuTeQ is gaining ground in social and psychological research. It is free and easy to use, it provides quality outputs, and it fits theoretical perspectives interested in communication and the social construction of knowledge. As in other forms of automated analysis of large textual corpora, its use involves pre-treatment and modification of the original text in order to reduce complexity. Lemmatization is a particularly delicate phase, which affects the whole strategy of analysis (from the selection of lemmas according to statistical or substantive criteria to the extraction of organising factors). However, the lemmatization algorithms implemented in commercial or free software are often run after grammatical tagging, using reference dictionaries, in an automatic way that is not transparent to end users. Apart from anecdotal evidence, it is therefore often difficult to evaluate the reliability of the automated procedures. The aim of this paper is to compare the outcomes of the procedures implemented in IRaMuTeQ with the output obtained with other well-established software. We used a large Italian-language corpus on the issue of the "fiscal compact", consisting of over one million occurrences, drawn from over 3000 newspaper articles published from 2012 to 2015. The same corpus was lemmatised using the procedures available in IRaMuTeQ (list based) and those implemented in Taltac, Tlab, and Tree Tagger. The proximity between the lists of lemmas produced by each software package is compared using intertextual distance. Finally, in order to examine the effects of the different procedures on the textual analysis, we took the two most distant lists of lemmas and applied a correspondence analysis to the two lemmas/newspapers matrices.
Conference: 13th International Conference on Statistical Analysis of Textual Data
Keywords: Lemmatization; Software for Textual Analysis; Intertextual distance
Type: 04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this item
Sarrica_effects-of-lemmatization_2016.pdf

Access: archive managers only

Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 404.88 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/890627