Research products catalogue

Markov Chain Monte Carlo for generating ranked textual data

Cerqueti, Roy; Ficcadenti, Valerio; Dhesi, Gurjeet; Ausloos, Marcel

2022

Abstract

This paper addresses a central theme in applied statistics and information science: the assessment of the stochastic structure of rank-size laws in text analysis. We consider the words in a corpus and rank them by frequency in descending order. The starting point is that ranked data generated in linguistic contexts can be viewed as realisations of a discrete-state Markov chain whose stationary distribution follows a discretisation of the best-fitting rank-size law. The methodological toolkit is Markov Chain Monte Carlo, specifically the Metropolis–Hastings algorithm. The theoretical framework is applied to the rank-size analysis of the hapax legomena occurring in the speeches of the US Presidents. We report a large number of statistical tests supporting the consistency of our methodological proposal. To this end, we also offer arguments that hapaxes are rare (“extreme”) events resulting from memoryless-like processes. Moreover, we show that the considered sample has the stochastic structure of a Markov chain of order one. Importantly, we discuss the versatility of the method, which is considered suitable for deriving similar results in other applied science contexts.
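As a rough illustration of the idea described in the abstract, the sketch below runs a Metropolis–Hastings chain over a finite set of ranks whose stationary distribution is proportional to a Zipf-Mandelbrot law, the rank-size law named in the keywords of this record. It is not the authors' implementation. The function names, the symmetric nearest-neighbour proposal, and the parameter values `alpha` and `beta` are all assumptions made for illustration only.

```python
import random


def zipf_mandelbrot_weights(n_ranks, alpha, beta):
    """Unnormalised weights (r + beta) ** (-alpha) for ranks 1..n_ranks.

    alpha and beta are illustrative parameters, not fitted values.
    """
    return [(r + beta) ** (-alpha) for r in range(1, n_ranks + 1)]


def metropolis_hastings_ranks(weights, n_steps, seed=0):
    """Sample a chain of ranks whose stationary law is proportional to weights.

    Assumed proposal: move one rank up or down with equal probability.
    Proposals outside 1..len(weights) have zero target mass and are rejected.
    """
    rng = random.Random(seed)
    n_ranks = len(weights)
    state = 1  # arbitrary starting rank
    chain = []
    for _ in range(n_steps):
        proposal = state + rng.choice((-1, 1))
        if 1 <= proposal <= n_ranks:
            # Symmetric proposal, so the acceptance ratio is the target ratio.
            accept_prob = min(1.0, weights[proposal - 1] / weights[state - 1])
            if rng.random() < accept_prob:
                state = proposal
        chain.append(state)
    return chain


if __name__ == "__main__":
    weights = zipf_mandelbrot_weights(n_ranks=10, alpha=1.0, beta=2.0)
    chain = metropolis_hastings_ranks(weights, n_steps=20000, seed=0)
    total = sum(weights)
    for rank in (1, 5, 10):
        empirical = chain.count(rank) / len(chain)
        target = weights[rank - 1] / total
        print(rank, round(empirical, 3), round(target, 3))
```

Because each state depends only on the previous one, the generated sequence is a Markov chain of order one by construction. Only ratios of weights enter the acceptance step, so the normalising constant of the rank-size law is never needed.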

Full record

Year of publication	2022
Keywords	Markov Chain Monte Carlo; Zipf-Mandelbrot law; Ranked data; Text analysis; Hapax Legomena
Type	01 Journal publication::01a Journal article
Citation	Markov Chain Monte Carlo for generating ranked textual data / Cerqueti, Roy; Ficcadenti, Valerio; Dhesi, Gurjeet; Ausloos, Marcel. - In: INFORMATION SCIENCES. - ISSN 0020-0255. - 610:(2022), pp. 425-439. [10.1016/j.ins.2022.07.137]
Belongs to type	01a Journal article

Files attached to this product

File	Access	Type	Licence	Size	Format
INS_MCMC_FiccadentiAusloosDhesiCerqueti.pdf	Archive managers only	Publisher's version (published version with the publisher's layout)	All rights reserved	1.67 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1653006
