On the power laws of language: word frequency distributions

Chierichetti, Flavio; Kumar, Ravi; Pang, Bo
2017

Abstract

About eight decades ago, Zipf postulated that the word frequency distribution of languages is a power law, i.e., it is a straight line on a log-log plot. Over the years, this phenomenon has been documented and studied extensively. For many corpora, however, the empirical distribution barely resembles a power law: when plotted on a log-log scale, the distribution is concave and appears to be composed of two differently sloped straight lines joined by a smooth curve. A simple generative model is proposed to capture this phenomenon. The word frequency distributions produced by this model are shown to match the observations both analytically and empirically. © 2017 Copyright held by the owner/author(s).
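
The "straight line on a log-log plot" statement can be made concrete with a small sketch. The Python below is an illustration only, not the generative model proposed in the paper: the function names `rank_frequency` and `loglog_slope` and the synthetic data are assumptions introduced here. It ranks words by frequency and estimates the slope of log(frequency) against log(rank) by ordinary least squares; frequencies exactly proportional to 1/rank should give a slope of about -1.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Expects at least two positive frequencies, ordered by rank.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic (made-up) data: frequencies proportional to 1 / rank,
# which lie on a straight log-log line of slope -1.
ideal = [1000.0 / rank for rank in range(1, 201)]
slope = loglog_slope(ideal)
```

On a real corpus, one rough way to look for the two differently sloped regimes mentioned in the abstract would be to call `loglog_slope` separately on the head and the tail of the ranked counts and compare the two estimates; where to split is a choice the sketch leaves open.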
40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2017
data mining; information retrieval
04 Publication in conference proceedings::04b Conference paper in volume
On the power laws of language: word frequency distributions / Chierichetti, Flavio; Kumar, Ravi; Pang, Bo. - (2017), pp. 385-394. (Paper presented at the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2017, held in Tokyo, Shinjuku; Japan) [10.1145/3077136.3080821].
Files attached to this item
File: Chierichetti_Power_2017.pdf
Access: open access
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 2.32 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1002054
Citations
  • Scopus 14
  • Web of Science (ISI) 9