Language trees and zipping / Benedetto, Dario; Caglioti, Emanuele; Loreto, Vittorio. - In: PHYSICAL REVIEW LETTERS. - ISSN 0031-9007. - 88:4(2002), pp. 048702:1-048702:4. [10.1103/physrevlett.88.048702]
Language trees and zipping
BENEDETTO, Dario;CAGLIOTI, Emanuele;LORETO, Vittorio
2002
Abstract
In this Letter we present a very general method for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series. The method is based on data-compression techniques, and its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We apply the method to linguistically motivated problems and obtain highly accurate results for language recognition, authorship attribution, and language classification.

File | Size | Format |
---|---|---|
Benedetto_Language Trees and Zipping_2002.pdf (open access; type: publisher's version, i.e. the published version with the publisher's layout; license: all rights reserved) | 102.68 kB | Adobe PDF |
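
The abstract describes a compression-based measure of the remoteness of two character strings but gives no formula. The following is a minimal sketch of one way such a measure could be computed, not the authors' implementation. It assumes Python's standard `zlib` module as the compressor, and the function names `compressed_size`, `remoteness`, and `closest` are hypothetical. The idea illustrated: append a short probe to a reference text and count how many extra compressed bytes the probe costs. The fewer extra bytes, the more the probe resembles the reference.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def remoteness(reference: str, probe: str) -> int:
    """Extra compressed bytes needed to encode `probe` when appended to `reference`.

    Assumption: this difference of compressed lengths stands in for the
    "measure of remoteness" mentioned in the abstract.
    """
    a = reference.encode("utf-8")
    b = probe.encode("utf-8")
    return compressed_size(a + b) - compressed_size(a)


def closest(references: Mapping[str, str], probe: str) -> str:
    """Label of the reference text against which the probe compresses best."""
    return min(references, key=lambda label: remoteness(references[label], probe))


# Toy usage with made-up reference texts (far too short for reliable results).
references = {
    "english": "the quick brown fox jumps over the lazy dog while the old "
               "sailor watches the harbour lights fade into the evening mist",
    "italian": "nel mezzo del cammin di nostra vita mi ritrovai per una "
               "selva oscura che la diritta via era smarrita",
}
print(closest(references, "the old sailor watches the harbour lights"))
```

In practice the reference texts would need to be much longer than these toy strings, and the probe short relative to them, for the comparison to be meaningful.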
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.