About eight decades ago, Zipf postulated that the word frequency distribution of languages is a power law, i.e., it is a straight line on a log-log plot. Over the years, this phenomenon has been documented and studied extensively. For many corpora, however, the empirical distribution barely resembles a power law: when plotted on a log-log scale, the distribution is concave and appears to be composed of two differently sloped straight lines joined by a smooth curve. A simple generative model is proposed to capture this phenomenon. The word frequency distributions produced by this model are shown to match the observations both analytically and empirically. © 2017 Copyright held by the owner/author(s).
On the power laws of language: word frequency distributions / Chierichetti, Flavio; Kumar, Ravi; Pang, Bo. - (2017), pp. 385-394. (Paper presented at the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2017, held in Shinjuku, Tokyo, Japan) [10.1145/3077136.3080821].
On the power laws of language: word frequency distributions
CHIERICHETTI, FLAVIO;
2017
File | Access | Type | License | Size | Format
---|---|---|---|---|---
Chierichetti_Power_2017.pdf | Open access | Publisher's version (published version with the publisher's layout) | All rights reserved | 2.32 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.