Telegram is a widely adopted instant messaging platform. It has become popular worldwide because of its emphasis on privacy and its social network features, such as channels: virtual rooms in which only the admins can post and broadcast messages to all subscribers. Channels are used to deliver live updates (e.g., weather alerts) and content to a large audience (e.g., COVID-19 announcements), but unfortunately also to disseminate radical ideologies and coordinate attacks such as the Capitol Hill riot. This paper introduces the TGDataset, the most extensive publicly available collection of Telegram channels, comprising 120,979 channels and over 400 million messages. We outline the data collection process and provide a comprehensive overview of the dataset. Using language detection, we identify the predominant languages within the dataset. We then focus on English channels, employing topic modeling to analyze the subjects they cover. Finally, we discuss some use cases in which our dataset can be instrumental in understanding the Telegram ecosystem and studying the diffusion of questionable news. Alongside the raw dataset, we release the scripts used in our analysis, as well as a list of channels associated with a novel conspiracy theory known as Sabmyk.

TGDataset: Collecting and Exploring the Largest Telegram Channels Dataset / La Morgia, Massimo; Mei, Alessandro; Mongardini, Alberto Maria. - (2025), pp. 2325-2334. (ACM International Conference on Knowledge Discovery and Data Mining, Toronto) [10.1145/3690624.3709397].

TGDataset: Collecting and Exploring the Largest Telegram Channels Dataset

La Morgia, Massimo; Mei, Alessandro; Mongardini, Alberto Maria
2025

ACM International Conference on Knowledge Discovery and Data Mining
Dataset, Telegram, Conspiracy Theories, Copyright Infringement
04 Publication in conference proceedings::04b Conference paper in volume
Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1741966