We investigate the use of topic models such as latent Dirichlet allocation (LDA) on two real-world problems: classifying documents and retrieving documents similar to a given document, with reference to a corpus of Italian Supreme Court decisions. A topic model is a generative model that specifies a simple probabilistic procedure by which the documents in a corpus can be generated. In the LDA approach, a topic is a probability distribution over a fixed vocabulary of terms, each document is modeled as a mixture of K topics, and the mixing coefficients can be used to represent documents as points on the (K-1)-dimensional simplex spanned by the topics. Approximate posterior inference is performed to learn the hidden topical structure from the observed data, i.e., the words in the documents.
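The generative procedure and the simplex representation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the six-term vocabulary, the topic distributions, the Dirichlet parameter `alpha`, and the choice of Hellinger distance for retrieval are all assumptions made for illustration. The sketch also ranks documents by their true mixing coefficients, whereas in practice these would be estimated by approximate posterior inference.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex from a Dirichlet(alpha) distribution."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, vocabulary, alpha, n_words, rng):
    """Generate one document by the LDA generative procedure."""
    theta = sample_dirichlet(alpha, rng)  # mixing coefficients over the K topics
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # topic of this word
        words.append(rng.choices(vocabulary, weights=topics[z])[0])  # word from topic z
    return theta, words


def hellinger(p, q):
    """Hellinger distance between two points on the simplex (assumed metric)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def most_similar(query_index, thetas):
    """Rank the other documents by increasing distance from the query's mixture."""
    others = [i for i in range(len(thetas)) if i != query_index]
    return sorted(others, key=lambda i: hellinger(thetas[query_index], thetas[i]))


# Hypothetical toy setup: K = 3 topics over a 6-term vocabulary.
vocabulary = ["contract", "damages", "appeal", "sentence", "tax", "penalty"]
topics = [
    [0.45, 0.45, 0.04, 0.02, 0.02, 0.02],
    [0.02, 0.02, 0.45, 0.45, 0.03, 0.03],
    [0.02, 0.03, 0.02, 0.03, 0.45, 0.45],
]
alpha = [0.5, 0.5, 0.5]  # assumed symmetric Dirichlet prior

rng = random.Random(0)
corpus = [generate_document(topics, vocabulary, alpha, 50, rng) for _ in range(5)]
thetas = [theta for theta, _ in corpus]
ranking = most_similar(0, thetas)  # indices of documents closest to document 0
```

Each `theta` is a point on the 2-dimensional simplex (K = 3), so retrieval reduces to a nearest-neighbour search among these points. The same vectors could serve as features for a document classifier.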
Applying LDA topic model to a corpus of Italian Supreme Court decisions / Paolo, Fantini; Brutti, Pierpaolo. - ELECTRONIC. - (2014). (Paper presented at the Conference of European Statistics Stakeholders, held in Rome, 24-25 November 2014).
Applying LDA topic model to a corpus of Italian Supreme Court decisions
BRUTTI, Pierpaolo
2014
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.